Analyzable Publish-Subscribe Communication through a
Wait-Free FIFO Channel for MPSoC Real-Time Applications
Saeid Dehnavi, Dip Goswami, Kees Goossens
Department of Electrical Engineering, Eindhoven University of Technology, The Netherlands
{S.Dehnavi, D.Goswami, K.G.W.Goossens}
Abstract—As a transparent communication protocol for con-
current distributed applications, the Publish-Subscribe (Pub-
Sub) paradigm is a trending programming model in many
recent industrial use-cases in robotics and avionics. To apply the
Pub-Sub programming model for safety-critical concurrent real-
time applications in Multi-Processor Systems on Chip (MPSoC)
environments, a non-blocking wait-free First-In-First-Out (FIFO)
channel is a fundamental requirement. However, the proposed
approaches in the literature have no proven real-time guarantees.
In this paper, we propose a novel wait-free FIFO approach
for single-producer-single-consumer core-to-core communication
through shared memory. By analyzing the execution paths of
each involved process, we prove that the execution time of each
read/write operation is bounded by a Worst Case Execution Time
(WCET). Moreover, we define a Timed Automata model of our
approach. Using the UPPAAL model checker, we prove freedom
of deadlock and starvation. For the performance evaluation of the
proposed approach, we apply a stochastic analysis technique on
the defined UPPAAL model. Finally, we implement the proposed
approach on the CompSOC platform as the underlying real-
time MPSoC to show that the implementation conforms to
the proposed formal model and to experimentally validate the
formal properties. The experimental evaluation on an instance
of CompSOC that works at 40 MHz has a throughput of 109K
tokens per second.
Keywords-Multi-Processor Systems on Chip; Real-Time
Systems; Wait-Free Asynchronous Communication; Publish-
Subscribe Communication;
I. Introduction
The Pub-Sub paradigm [1] is a design pattern for transparent
communication in many recent distributed applications. In
this paradigm, the communication between different entities is
done through a decentralized whiteboard concept. This means
that every entity publishes/subscribes to topics of interest,
ignoring the identity of the consumer/producer of that topic.
This paradigm is standardized by Data Distribution Service
(DDS) [2] as a machine to machine communication standard
for distributed systems. In addition to many industrial use-
cases in Internet of Things (IoT) and Cyber-Physical Systems
(CPS) [3], DDS is the underlying communication layer for
Robot Operating System-2 (ROS2) [4]. However, analyzable
and predictable Pub-Sub communication for safety-critical
real-time applications in the MPSoC domain is a gap in
the literature. To address this gap, we first need to have an
analyzable and predictable queue approach that is the funda-
mental requirement for shared-memory based asynchronous
communication between concurrent applications in an MPSoC.
As a baseline classification, FIFO approaches are classified
into Non-Blocking Queues (NBQ), and Blocking Queues
(BQ). In the former class, the producer/consumer is able to
perform write/read operation even in case of a full/empty
queue, while in the latter class the entity blocks until the
queue is not full/empty. The write operation on a full buffer
in NBQ is lossy/overwriting, and the read operation on an
empty buffer is either polling (i.e. return NULL) or returns
the latest token (sample & hold). A NBQ algorithm is called
wait-free, or predictable in the real-time systems terminology,
if every producer/consumer process completes its operation in
a finite number of steps [5], assuming that each step has a
WCET. The guaranteed WCET of each read/write operation
is a critical prerequisite to provide an analyzable formal
model for the safety-critical concurrent real-time applications.
To the best of our knowledge, the reported wait-free NBQ
approaches in the literature are based on the Compare-And-
Swap (CAS) operation and require much memory. Moreover,
CAS and similar instructions affect the predictability of the
system [6] [7], and are not supported by certain hardware
architectures such as RISC-V [8]. In this paper, we provide a
Pub-Sub programming model for MPSoC real-time platforms
through a novel wait-free NBQ approach using only standard
load/store instructions. More specifically, our contributions are:
• A novel wait-free approach to provide Pub-Sub communi-
cation in MPSoC platforms. We use an overwriting strategy
for the write operation, and a polling strategy for the read
operation.
• Analysis and guarantee of the WCET for both producer
and consumer. By exploring the execution paths, we show
that the WCET is guaranteed for each read/write operation.
This helps to formally model and verify the correctness and
performance of parallel applications. The WCET relies
on the execution platform, such as CompSOC [9].
• Formal verification. We define a timed-automaton model
of our wait-free FIFO and, using the UPPAAL [10] model
checker, verify the deadlock-free and starvation-free properties.
We use the UPPAAL Stochastic Model Checker (UPPAAL-
SMC) [11] to evaluate the expected throughput and loss-rate
in the long term. Although the WCET is guaranteed, the
performance is stochastically affected by variation in the
execution time and the number of overwritten tokens.
• Implementation and validation of the proposed approach on
CompSOC as the underlying MPSoC platform. We show
that the implementation results conform to the simulation
results and the verified model properties.
II. Related Work
The works related to the scope of this paper are categorized
into the following three classes:
Non-blocking Wait-free Asynchronous Communication: A
wait-free and predictable NBQ approach is required as the
underlying communication protocol for real-time Pub-
Sub communication. Clark et al. [12] made a comparison
between the well-known asynchronous communication mech-
anisms (ACM) through the PetriNet modeling and analysis.
They showed that Simpson's algorithm [13] is the state-of-
the-art approach for fully asynchronous and wait-free core-
to-core communication through shared memory. However,
these approaches only provide asynchronous communication
to access the latest token instead of being a queue. In the
Pub-Sub systems, there is a Quality of Service (QoS) named
History that indicates the number of latest published tokens
that the consumer node should see at any point of time. This
QoS is only achievable through a FIFO queue that is not
provided by the classic ACM approaches. There are some
approaches [14]–[17] that provide asynchronous FIFO com-
munication, but through atomic Compare-And-Swap (CAS)
operation, or through adaptive dynamic memory allocation
mechanisms. However, both the CAS operation and dynamic
memory allocation affect the predictability of the system [6],
which makes these approaches unsuitable for the safety-critical
real-time systems considered in this paper.
Pub-Sub Communication for real-time embedded MPSoC:
Perez et al. [18] proposed a system architecture that integrates
the DDS standard into the XtratuM hypervisor for transparent
communication in partitioned real-time systems. However,
timing requirements (guaranteed WCET) and resource limita-
tions are ignored in their proposed approach. embeddedRTPS
[19] is a lightweight implementation of DDS that involves
small embedded devices as the first-class DDS participants.
The implementation is based on FreeRTOS and lightweightIP
on a single STM32 micro-controller. However, no analysis
model is presented. Hamerski et al. [20] proposed a Pub-Sub
communication approach to decouple, in time and space, the
application development for MPSoC platforms. However, their
approach requires support from FreeRTOS as the underlying
operating system. Overall, the proposed approaches in this
category require support from an underlying operating system,
and lack formal performance analysis.
Modeling, Verification, and Analysis of Wait-free Com-
munication Protocols: A model checking approach for the
four-slot Simpson algorithm is presented in [21]. They use
the SAL environment to formally prove the correctness of the
algorithm, and interference-free property of both the write and
read operations. However, the proposed formal model does not
address the performance analysis of the algorithm. Correctness
verification of the one-place communication mechanisms is
addressed by [12] through Petri-Nets. However, the discussed
Petri-Net models are not presented in the paper. In general,
the proposed verification and performance analysis approaches
focus on correctness and absence of deadlock and livelock,
but do not consider the timing properties of the model, or only
verify some model properties without proposing a performance
analysis approach. In this paper, we use Timed Automata (TA)
theory [22] and the UPPAAL model checker to address both
property verification and performance analysis.
Fig. 1: System model and application model
III. System and Application Model
At a high level, applications consist of concurrent tasks
that communicate (reading or writing tokens) interspersed with
computation, as captured by models of computation such as
Dataflow [23], Kahn process networks [24], TTA [25], etc. For
real-time applications both computation and communication
must have a worst-case execution time (WCET) and pos-
sibly a best-case execution time (BCET) ([prod_comp_bcet,
prod_comp_wcet], etc. in Fig. 1). In this paper we focus
on the design and performance analysis of the communica-
tion, which is a prerequisite for application-level performance
analysis in various models of computation. In particular, we
propose a wait-free FIFO with bounded capacity as a basis for
pub-sub communication.
An MPSoC consists of multiple processors that commu-
nicate through interconnect and distributed shared memories.
To be able to offer guaranteed minimum performance, all
processor instructions, interconnect and memory transactions
must have a WCET. Architecture choices such as speculation,
caching, and unpredictable arbitration may prohibit this. MP-
SoCs that offer WCET include CompSOC, T-CREST [26], and
TTA [25]. The experiments in this paper use CompSOC, as
described in Section VII. The system model and application
model in this paper are shown in Fig. 1.
IV. The Proposed Wait-Free FIFO Channel
In this section, we present a FIFO approach that provides
the required properties for Pub-Sub communication in MPSoC
platforms. The following requirements need to be addressed:
1) tokens are of a fixed user-defined data type (which may be
different per FIFO); 2) the FIFO has capacity for k tokens;
3) FIFO read/write operations lead to consistent tokens in the
FIFO; 4) FIFO operations are non-blocking even in case of
a read on an empty FIFO, or a write on a full FIFO; 5)
FIFO read and write operations have a WCET; 6) the FIFO is
free of deadlock and starvation. The proposed algorithm for
the producer (write) and the consumer (read) are presented
in Algorithm 1 and Algorithm 2 respectively. At a high level,
the FIFO is implemented using a circular buffer by using write
counter (wc) and read counter (rc). The FIFO is empty when
wc = rc, and it is full when (wc + 1) % k = rc. Initially,
wc = rc = 0 and both claim flags are unset.
Algorithm 1 Write to the wait-free channel
1: while ((wc+1) % k) == rc do  /* FIFO full */
2:   if cclaim == 0 then
3:     pclaim = 1
4:     if cclaim == 0 then
5:       rc = (rc+1) % k  /* remove oldest token */
6:     pclaim = 0
7: Write token
8: wc = (wc+1) % k

Algorithm 2 Read from the wait-free channel
0: if wc == rc then return Null  /* FIFO empty */
1: cclaim = 1
2: while (pclaim == 1) do ;  /* wait for producer */
3: Read token
4: rc = (rc+1) % k
5: cclaim = 0
In the traditional FIFO approaches, variables are updated
exclusively by either producer (wc) or consumer (rc). Also in
our case, the write counter is updated only by the producer.
However, to be wait-free the oldest token must be overwritten
when the FIFO is full. This is implemented by the producer
removing the oldest token before writing a new token. The
read counter is therefore written by the producer as well as
the consumer. The potential conflict on the read counter is
managed through locking. We introduce two flags that indicate
a consumer claim (cclaim) and producer claim (pclaim). To
avoid deadlock the producer backs off in case both producer
and consumer claim exclusive access to the read counter (since
the consumer will create space the producer avoids removing
a token needlessly). The consumer sets cclaim at the start of
each read operation, and it is unset at the end. At the start of
the read operation, the consumer checks if the read counter
is locked by the producer to remove a token and if so, waits
until the producer is finished. In the write operation, if the
buffer is not full, the producer writes into the FIFO as usual,
without the need to (check the) lock. However, if the buffer is
full the producer claims the lock only when the consumer has
not done so. More specifically, it checks if cclaim is not set,
then sets pclaim, and then backs off in case the consumer set
cclaim in the mean time. If the producer passes this condition
it removes the oldest token by updating the read counter and
releasing the lock.
The producer critical section, i.e. the instructions from
setting pclaim to unsetting it, is very short. The subsequent
(normal) writing of the token may take more time, depending
on the size of the token. The consumer critical section is
longer and entails claiming the lock, copying the token from
the FIFO into a local variable, and then releasing the lock. It
should be mentioned that the critical section is implemented
without disabling interrupts, hardware locks, etc. but purely on
top of distributed shared memory and our proposed protocol.
It is important to note that the write operation copies a
complete token into the FIFO and the read operation copies
a complete token from the FIFO. In other words, the com-
putation on tokens is strictly separated from communication,
unlike protocols such as C-HEAP [27], where mixing the two
could lead to unbounded communication times. Specifically,
the WCET of a read operation could otherwise depend on
another task's (possibly unbounded) computation time, which
is prohibited for a wait-free FIFO.
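The two algorithms can be rendered as a plain-C sketch for a single-producer-single-consumer channel. This is an illustrative version, not the authors' CompSOC implementation: names are ours, tokens are fixed at sizeof(int) bytes, and the volatile qualifiers merely stand in for uncached shared-memory accesses (on architectures with weak memory ordering, additional fences would be needed).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define K 10                    /* FIFO capacity (k tokens) */
#define TOKEN_SIZE sizeof(int)  /* fixed user-defined token size */

/* All fields live in the shared memory between the two cores. */
typedef struct {
    volatile unsigned wc, rc;    /* write / read counter */
    volatile int cclaim, pclaim; /* consumer / producer claim flags */
    char buf[K][TOKEN_SIZE];     /* circular token buffer */
} fifo_t;

/* Producer: non-blocking, overwriting write (Algorithm 1). */
void fifo_write(fifo_t *f, const void *token) {
    while (((f->wc + 1) % K) == f->rc) {     /* 1: FIFO full */
        if (f->cclaim == 0) {                /* 2: consumer not reading */
            f->pclaim = 1;                   /* 3: claim the read counter */
            if (f->cclaim == 0)              /* 4: back off if consumer raced in */
                f->rc = (f->rc + 1) % K;     /* 5: drop the oldest token */
            f->pclaim = 0;                   /* 6: release the claim */
        }
    }
    memcpy(f->buf[f->wc], token, TOKEN_SIZE); /* 7: write token */
    f->wc = (f->wc + 1) % K;                  /* 8: publish it */
}

/* Consumer: non-blocking, polling read (Algorithm 2); false = Null. */
bool fifo_read(fifo_t *f, void *token) {
    if (f->wc == f->rc)                       /* 0: FIFO empty */
        return false;
    f->cclaim = 1;                            /* 1: claim the read counter */
    while (f->pclaim == 1) { }                /* 2: wait out producer's short CS */
    memcpy(token, f->buf[f->rc], TOKEN_SIZE); /* 3: read token */
    f->rc = (f->rc + 1) % K;                  /* 4: consume it */
    f->cclaim = 0;                            /* 5: release the claim */
    return true;
}
```

In a single-core test, issuing k = 10 writes with no reads exercises the overwrite path: one slot is sacrificed to distinguish full from empty, so the buffer fills after nine tokens, the tenth write drops the oldest token, and subsequent reads return the remaining nine tokens in order.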
V. Worst Case Execution Time Analysis
In the previous section, the proposed algorithms for the
producer and the consumer were presented. As mentioned,
there are some scenarios in which one process
(producer or consumer) waits until the other process exits the
critical region. One may argue that the proposed algorithm is
not wait-free for this reason. In this section, we show that this
property is satisfied by proving a WCET for each read/write
operation. For every involved process, producer and consumer,
we explore the execution path of that process to show that it
takes a finite number of steps.
Producer Worst Case Execution Time: In the execution of
a producer process, the following scenarios may happen:
S1 (Available space): In this scenario, the process writes
into the buffer without going into the while loop.
S2 (Full buffer, consumer reading): In this case, the buffer
is full and the producer enters the while loop. However, since
the consumer is reading (cclaim=1), the producer waits until
the consumer is finished. When the consumer is done, it has
made a space available. Therefore, in the next iteration of
the while loop, the producer will be in the first scenario
with a bounded WCET. In the next section, we show that
the WCET of the consumer is bounded. Thus, the waiting
time for the producer in this scenario is also bounded.
S3 (Full buffer, pclaim=1, consumer active): This is a
special scenario in which the consumer is activated just
before the producer enters the critical region. In this case,
we give priority to the consumer, since it makes a space
available by reading a token, and we proceed to scenario S2.
S4 (Full buffer, consumer inactive): In this scenario, the
producer process enters the critical region to make a space
by removing the oldest token. This operation is done in a
finite time by updating the rc variable, and after that, the
process continues as in scenario S1.
Consumer Worst Case Execution Time: In the execution of
the consumer process, the following scenarios may happen:
S1 (Available token, available space): In this case, the
buffer is not full nor empty. Therefore, the producer is not
Fig. 2: Execution Paths of Producer and Consumer
removing the oldest token (pclaim=0) and the consumer
doesn’t wait at the while loop. The process reads the token
and updates the rc variable in a finite time.
S2 (Available token, producer overwriting): In this case,
pclaim=1 and the consumer should wait until removing
the oldest token is done by the producer (pclaim=0). This
waiting time is equal to the execution of the producer
process in scenario S4, which is bounded by a WCET.
The execution paths of both the producer and the consumer
processes are presented in Fig. 2. As shown in Fig. 2.B, since
P5 and P6 are a finite number of instructions, the two scenarios
of the consumer process are bounded by a WCET. Therefore,
the execution time for the consumer process is finite. This
implies that the producer execution time is also bounded since
the execution time of S2 in the producer process depends on
the execution time of the consumer. This is shown in Fig. 2.A.
VI. Formal Modeling and Analysis
In this section, we formally model the behaviour of the
proposed algorithm at both the producer and consumer levels.
We use this model to verify the required properties such
as deadlock-freedom and starvation-freedom through model
checking. Moreover, we use stochastic model checking to
evaluate the performance. A Timed Automaton (TA) [22] is
a finite automaton that is extended by clock variables. Clock
variables increase synchronously at the same speed and they
can be compared/reset to integer variables on the transitions
of the automaton. As an integrated environment, UPPAAL is
used for modeling, validation and verification of real-time
systems through timed automata.
A. Modeling and Verification of Properties by UPPAAL
The timed automata of the producer and consumer are
presented in Fig. 3 and Fig. 4 respectively. For the consumer
model with 4 different states (State0 - State3), we use the
clock variable t2 to keep track of the elapsed time in every
state of the model. To model the time being in each state,
the clock variable needs to be reset on every transition. Each
state has an invariant condition (colored in purple) that shows
the maximum time that we can stay in that state. If the guard
condition over a transition (colored in green) is satisfied, the
transition takes place by executing the transition operation
(colored in blue) and moving to the new state. With the
same definitions, the timed automaton of the producer, has
6 different states (State0 - State5) with the clock variable t.
Moreover, some global variables are defined to be used in
both producer and consumer models. The global variables rc,
wc, cclaim and pclaim have the same definition as in Section
IV. To keep track of the number of available tokens and
spaces in the buffer, and to improve the readability of the
model, two global variables tokens and spaces are defined.
These variables could be derived from rc and wc, but it would
reduce the efficiency and readability of the model. It should
be mentioned that tC[x] and tP[x] indicate the execution
times for line x of the consumer and producer algorithms
respectively. These values follow from the predictability of the
MPSoC platform either analytically or by execution (if the
platform is predictable and composable) [28]. The real values
in this paper are presented in Section VII-A.
UPPAAL utilizes efficient model checking to provide sys-
tem verification through property evaluation on the defined
timed automaton. In this section, we define and verify two im-
portant properties for the proposed wait-free FIFO channel. 1)
Deadlock-freedom: this property guarantees that a producer
and consumer are never blocked on each other. 2) Starvation-
freedom: this property guarantees that every process finishes
its operation in a finite number of steps. In other words,
a producer and consumer are never busy waiting on each
other for an unbounded time. It should be mentioned that
the starvation-free property is also a proof for the correctness
of our WCET analysis presented in Section V. To check the
defined properties in UPPAAL, we express them as Eq. 1-3.
All the defined properties are verified in less than one second.
Deadlock Free: A[] not deadlock (1)
Prod Starvation Free: (P.State1) --> (P.State2) (2)
Cons Starvation Free: (C.State1) --> (C.State3) (3)
Fig. 3: Producer Timed Automaton
Fig. 4: Consumer Timed Automaton
B. Stochastic Performance Analysis by UPPAAL-SMC
As mentioned in Section III, the execution time for each
component of the proposed approach is bounded between
BCET and WCET of that component. This time variation is
due to variation in the tokens and the complexity of processing
them. Although the execution time variation affects neither
the correctness of the approach nor the discussed guaran-
teed properties, it affects some other measurements such as
throughput and overwrite-rate/loss-rate. UPPAAL-SMC is a
fork of UPPAAL that provides stochastic model checking
and simulation on the defined timed automaton models of
UPPAAL. We evaluate Throughput (TP): the number of
successfully published tokens per time unit, and Overwrite-
Rate/Loss-Rate (LR): the rate of overwriting tokens by the
producer in the full buffer scenario. This measurement is
important in control applications that need a tolerable loss-
rate for the published tokens. As it can be seen in the
producer timed automaton in Fig. 3, we have introduced some
global variables such as counter, lost, and lossrate to keep
track of these values during the execution. We express these
measurements in UPPAAL-SMC as follows:
TP: E[<=40000000; 1000](max: counter - lost) (4)
LR: E[<=40000000; 1000](max: lossrate) (5)
To evaluate the expected throughput in the long-run, we find
the expected maximum number of published tokens in 40 mil-
lion time units (1 second assuming the processor frequency
of 40 MHz). The maximum value is calculated among 1000
different random walks over the model. To find the expected
loss-rate in the long-run, we create the same expression but
over the lossrate variable. Moreover, UPPAAL-SMC provides
a simulation feature on the defined timed automata. We
use this feature to observe the convergence of the evaluated
metrics (TP and LR) over time. The parameter values of
the model, such as the execution time of each component,
are based on Tbl. II presented in Section VII. An important
point to mention is that we evaluate throughput and loss-
rate at two levels: the channel level and the application level.
At the channel level, we ignore the execution time of the
Computation components on both the producer and consumer
sides by setting the related parameters ([prod_comp_bcet,
prod_comp_wcet] and [cons_comp_bcet, cons_comp_wcet]
respectively) to zero. This is because we want to evaluate the
proposed underlying wait-free FIFO approach at the lowest
level (read and write), ignoring the upper application level.
By setting the parameters in this way and applying Eq. 4 and
Eq. 5, we found an expected throughput of 108110 tokens per
second, and a loss-rate of zero. The loss-rate is zero because
both producer and consumer work at the same frequency
([prod_comp_bcet, prod_comp_wcet] = [cons_comp_bcet,
cons_comp_wcet] = 0) and the same average execution time,
so the buffer is never full.
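The channel-level figure can be sanity-checked by hand: with the Computation components set to zero, successive writes follow the producer's fast path S1, whose cost we approximate (our simplification, ignoring loop overhead) by tP1 + tP7P8 = 370 cycles from Tbl. II:

```c
/* Back-of-envelope channel throughput: core frequency divided by the
   cycles spent per write in the producer's fast path (scenario S1). */
double est_throughput(double cycles_per_write, double freq_hz) {
    return freq_hz / cycles_per_write;
}

/* With tP1 = 120 and tP7P8 = 250 from Tbl. II:
   est_throughput(120 + 250, 40e6) is roughly 108108 tokens/s,
   in line with the 108110 tokens/s reported by UPPAAL-SMC. */
```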
To measure throughput and loss-rate at the application
level, we involve the Computation component on both the pro-
ducer and consumer sides by setting the related parameters
([prod_comp_bcet, prod_comp_wcet] and [cons_comp_bcet,
cons_comp_wcet] respectively) based on the reported values
in Tbl. II. The expected application level throughput and loss-
rate of the proposed approach in the long-run are presented
in Fig. 5 and Fig. 6 respectively. In these plots, we present
the cumulative function, average, and probability distribution
for the analyzed metrics. To check the validity of the reported
expected values, we used the simulation feature of UPPAAL-
SMC to observe the metrics of interest over 2 million time
units. This experiment is done 1000 times for each metric. As
can be seen in Fig. 7, the simulation result of the loss-rate
metric in the long-run conforms to the reported expected result
in Fig. 6. In the next section, we show that the experimental
results of the implementation conform to the presented stochastic results
in this section. Therefore, the sensitivity analysis (expected
results for different parameter values) is performed only on
the implementation instead of the stochastic performance model.
VII. Implementation and Evaluation
In this section, we implement the proposed approach on
the CompSOC platform as the underlying MPSoC. There
are various implementations of CompSOC on
FPGA and on ASIC. We use the Verintec [29] implementation
Fig. 5: Stochastic Expected Throughput at Application Level
Fig. 6: Stochastic Expected Loss Rate at Application Level
Fig. 7: Simulation Result for Loss-Rate at Application Level
of CompSOC on the programmable logic (PL, FPGA) side
of a Xilinx PYNQ-Z2 board. In this instance, CompSOC
contains 3 predictable RISC-V processors (40MHz, 128KB
memory) and a 16KB dual-ported shared memory between
any pair of RISC-V processors. On top of each RISC-V
processor, there is a virtualization kernel (VKERNEL) [30]
that creates predictable and composable Virtual Execution
Platforms (VEP) [9] on that processor through spatial and
cycle-accurate temporal partitioning. Scheduling the created
VEPs on the RISC-V processor is done through a static
Time Division Multiplexing (TDM) table. This allows us to
compute the Worst Case Response Time (WCRT) that takes
into account the preemption and TDM arbitration, due to
the presence of other applications. To find the WCRT,
techniques such as those of [28] can be used. In this paper,
we configure VKERNEL to have only one VEP on each RISC-
V processor. This is to assign the whole processor to a single
task. The defined FIFO channel is placed in the shared memory
between the two processors that host producer and consumer.
A. Parameter Measurement
To measure the required parameters, we use the partition
timer of the CompSOC platform. Each partition has its own
local timer that counts only in those clock cycles that the
processor is assigned to the partition. Since in this paper
we use only one partition on a processor, with no system
application running on the processor, the local timer of that
partition counts every clock cycle of the processor. The
measured parameters for a Pub-Sub communication using
a channel with k=10, and token size 1 Byte is presented
in Tbl. II. In this table, the execution time for the re-
quired parameters in the timed automaton models of the
producer and consumer (Fig. 3 and Fig. 4 respectively) are
reported. As shown in Fig. 1, the execution time of the
Computation components in both producer and consumer sides
([prod_comp_bcet, prod_comp_wcet] and [cons_comp_bcet,
cons_comp_wcet] respectively) are bounded by a best-case
and a worst-case execution time (reported values in Tbl. II).
We implement the execution of these tasks by a waiting
loop whose execution time is a random value in the re-
ported range for these parameters. The BCET and WCET of
the Communication components for the producer and con-
sumer processes ([prod_comm_bcet, prod_comm_wcet] and
[cons_comm_bcet, cons_comm_wcet] respectively) are dis-
cussed in Section VII-D.
B. Channel Throughput for Pure Write & Read
Fig. 8: Throughput per [cons_comp_bcet, cons_comp_wcet]
Fig. 9: Loss-Rate per [cons_comp_bcet, cons_comp_wcet]
Fig. 10: Producer Actual Execution Time and WCET
Fig. 11: Consumer Actual Execution Time and WCET
To evaluate the channel throughput, we set the
[prod_comp_bcet, prod_comp_wcet] and [cons_comp_bcet,
cons_comp_wcet] parameters (in Fig. 1) to zero. This is
due to the fact that these parameters indicate the execution
time for the upper level Computation components in the
producer and consumer tasks respectively, while in this
section we evaluate the throughput of the pure write/read
operations over the channel. As can be seen in Tbl. I, the
throughput decreases linearly with increasing message size.
This was expected, since an important part of the write/read
operation is to copy the tokens to/from the shared memory
from/to the local memory. As expected, the loss-rate
in these experiments is zero, since both the producer and
consumer work at the same frequency without any delay at
the application level. Therefore, the buffer is never full and
no token is overwritten/lost.
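The linear trend is easier to see when the measured throughput is converted back into cycles per token; the per-byte copy cost below is our own estimate derived from the Tbl. I data, not a number reported in the paper.

```c
/* Convert a measured throughput (msg/s) back into cycles per token
   on a core running at freq_hz. */
double cycles_per_token(double throughput, double freq_hz) {
    return freq_hz / throughput;
}

/* Extra cycles per payload byte, estimated from two Tbl. I points. */
double cycles_per_byte(double tp_a, double size_a,
                       double tp_b, double size_b, double freq_hz) {
    return (cycles_per_token(tp_b, freq_hz) -
            cycles_per_token(tp_a, freq_hz)) / (size_b - size_a);
}

/* Tbl. I extremes at 40 MHz: 1 B @ 109351 msg/s and 50 B @ 35638 msg/s
   give roughly 366 and 1122 cycles per token, i.e. about 15 extra
   cycles of copying per payload byte. */
```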
C. Throughput and Loss-Rate at the Application Level
In the previous experiment, we evaluated throughput and
loss-rate at the channel level. In this section, we involve the
[prod_comp_bcet, prod_comp_wcet] and [cons_comp_bcet,
cons_comp_wcet] parameters, which represent the execution
time of the Computation components in the producer and
consumer sides respectively, to evaluate throughput and loss-
rate at the application level. It is important to notice that
the throughput and loss-rate at the application level heavily
depend on the execution frequency of the producer and
consumer tasks. In the ideal case, the consumer task runs
at an equal or larger frequency than the producer. In this
case, the throughput is maximal, and the loss-rate is zero
since there is always an empty space for the producer and no
token is overwritten. However, a larger frequency for
the producer compared to the consumer (fast producer, slow
consumer) results in more overwritten tokens (loss-rate) and
decreases the final throughput. In the conducted experiments
of this section, we change the frequency of the consumer
task by slowing down the Computation (Process & Control)
component. In these experiments, the execution frequency
of the producer task is fixed by setting [prod_comp_bcet,
prod_comp_wcet] = [1000, 2000] as defined in Tbl. II. As
shown in Fig. 8 and Fig. 9 respectively, by increasing the
frequency difference (consumer slower), the throughput (the
number of messages successfully received by the consumer)
decreases, and the loss-rate increases.
Tbl. I: Channel throughput per message size
Msg Size (Byte):   1      5     10     20     50
TP (Msg/Sec):  109351  98182  82576  62853  35638
D. Predictability Evaluation
To validate that the WCET of each operation is respected, we
conducted an experiment of 20K iterations of each write and
read operation. As discussed in Sec. V, the BCET and WCET
of the Communication components of the producer and con-
sumer processes ([prod comm bcet,prod comm wcet] and
[cons comm bcet,cons comm wcet] respectively) are equal
to the execution time of the shortest and longest scenarios
([S1,S4] for the producer and [S1,S2] for the consumer pre-
sented in Fig. 2) of that process. Therefore, the WCET of
the write and read operations are equal to the execution time
of Scenario 4 (S4), and Scenario 2 (S2) in the producer and
consumer execution paths respectively. The WCET values are
calculated based on the reported values in Tbl. II. For example,
the WCET of the read operation (S2 in the consumer execution
path) is equal to tC0+tC1+tC2+tP5+tP6+tC3+tC4+tC5=490.
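As a sanity check, the read-operation WCET arithmetic can be reproduced from the Tbl. II values (note that tC3, tC4, and tC5 are reported together as a single value, tC3C4C5, in the table):

```c
/* Segment execution times taken from Tbl. II. */
enum {
    tP1 = 120, tP2 = 30, tP3P4 = 120, tP5 = 80, tP6 = 25, tP7P8 = 250,
    tC0 = 60,  tC1 = 25, tC2 = 50,  tC3C4C5 = 250
};

/* WCET of the read operation: scenario S2 of the consumer path,
 * tC0 + tC1 + tC2 + tP5 + tP6 + tC3 + tC4 + tC5, where the last three
 * segments are measured together as tC3C4C5. */
int read_wcet(void) {
    return tC0 + tC1 + tC2 + tP5 + tP6 + tC3C4C5;
}
```

Evaluating the sum gives 60 + 25 + 50 + 80 + 25 + 250 = 490, matching the value stated above.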
Both the calculated WCET and the measured execution time
of each operation are reliable for three reasons: 1) the
underlying processor is a predictable RISC-V processor with a
proven deterministic instruction set; 2) deterministic memory
access is provided by a predictable peripheral bus in CompSOC;
and 3) a cycle-accurate timer with a fixed read time is
provided by CompSOC. These three properties make both
the computation and communication parts of an application
predictable on the CompSOC platform.

Tbl. II: Timing parameters
Parameter  tP1  tP2  tP3P4  tP5  tP6  tP7P8  tC0  tC1  tC2  tC3C4C5  [prod bc, prod wc]  [cons bc, cons wc]
Value      120  30   120    80   25   250    60   25   50   250      [1000,2000]         [1500,2500]
(prod bc = prod comp bcet; prod wc = prod comp wcet; cons bc = cons comp bcet; cons wc = cons comp wcet)

As can be seen in
Fig. 10 and Fig. 11, for the write and read operations
respectively, the calculated WCET of each operation is
respected in every iteration. The variation across iterations
is caused by the operation executing in the different scenarios
mentioned above. The maximum value of the write operation
corresponds to S4 of the producer, and the minimum value of
the read operation occurs when the channel is empty and the
consumer returns a NULL value.
Conclusion

In this paper, we presented a novel wait-free FIFO approach
for core-to-core real-time Pub-Sub communication in MPSoC
platforms. We proposed a timed-automaton UPPAAL model
of the approach to verify the deadlock-freedom and starvation-
freedom properties using the UPPAAL model checker, and
performed stochastic performance analysis using UPPAAL-SMC.
Moreover, the proposed approach was implemented and
evaluated on the CompSOC platform as a hard real-time
MPSoC. The experimental results show a throughput of 109K
tokens per second on an instance of CompSOC that works at
40 MHz. The conducted experiments also validated that the
time predictability of the approach is guaranteed for both
read and write operations, and the measured throughput and
loss rate confirm the usability of the proposed approach
for safety-critical real-time MPSoC applications. As future
work, a multi-producer-multi-consumer version of the
approach can be considered. This work was supported by
ECSEL JU grant agreement No. 826610 (COMP4DRONES).