Real-Time Systems manuscript No.
(will be inserted by the editor)
Mixed-Criticality Scheduling on Cluster-Based Manycores
with Shared Communication and Storage Resources
Georgia Giannopoulou · Nikolay Stoimenov · Pengcheng Huang · Lothar Thiele · Benoît Dupont de Dinechin
Received: date / Accepted: date
Abstract The embedded system industry is facing an increasing pressure for migrating
from single-core to multi- and many-core platforms for size, performance
and cost purposes. Real-time embedded system design follows this trend by integrating
multiple applications with different safety criticality levels into a common
platform. Scheduling mixed-criticality applications on today’s multi/many-core platforms
and providing safe worst-case response time bounds for the real-time applications
is challenging given the shared platform resources. For instance, sharing of
memory buses introduces delays due to contention, which are non-negligible. Bounding
these delays is not trivial, as one needs to model all possible interference scenarios.
In this work, we introduce a combined analysis of computing, memory and
communication scheduling in a mixed-criticality setting. In particular, we propose:
(1) a mixed-criticality scheduling policy for cluster-based many-core systems with
two shared resource classes, i.e., a shared multi-bank memory within each cluster,
and a network-on-chip for inter-cluster communication and access to external memories;
(2) a response time analysis for the proposed scheduling policy, which takes into
account the interferences from the two classes of shared resources; and (3) a design
exploration framework and algorithms for optimizing the resource utilizations under
mixed-criticality timing constraints. The considered cluster-based architecture model
describes closely state-of-the-art many-core platforms, such as the Kalray MPPA®-256.
The applicability of the approach is demonstrated with a real-world avionics
application. Also, the scheduling policy is compared against state-of-the-art scheduling
policies based on extensive simulations with synthetic task sets.
Keywords mixed criticality scheduling · resource contention · shared memory ·
NoC · multi-core / many-core systems
Georgia Giannopoulou · Nikolay Stoimenov · Pengcheng Huang · Lothar Thiele
Computer Engineering and Communication Networks Laboratory, ETH Zurich, 8092 Zurich, Switzerland
E-mail: {firstname.lastname}@tik.ee.ethz.ch
Benoît Dupont de Dinechin
Kalray S.A., F38330 Montbonnot Saint Martin, France
E-mail: benoit.dinechin@kalray.eu
2 Georgia Giannopoulou et al.
1 Introduction
Following the prevalence of multi-core systems in the electronics market, the field
of embedded systems has experienced an unprecedented trend towards integrating
multiple applications into a single platform. This applies even to real-time embedded
systems for safety-critical domains, such as avionics and automotive. Applications in
these domains are usually characterized by several criticality levels, known as Safety
Integrity Levels or Design Assurance Levels, which express the required protection
against failure.
For the integration of mixed-criticality applications into a common platform,
many scheduling approaches have been proposed in the research literature. How-
ever, most of them do not explicitly address the timing effects of resource sharing.
Moreover, existing industrial certification standards require complete timing isolation
among applications of different criticalities. For this, system designers rely mainly
on operating system and hardware-level partitioning mechanisms, e.g., based on the
ARINC-653 standard [6]. No existing standards, however, specify how isolation is
achieved when several cores share platform resources. Obviously, if several cores
access a memory bus concurrently, timing interference among applications of different
criticalities cannot be avoided. Also, the resulting blocking delays cannot
always be bounded, since the certification authorities for applications of a particular
criticality level typically do not possess any information about applications of lower
criticality that are co-hosted on the platform. But even if this information is avail-
able, the problem of bounding the contention-induced delays is notoriously difficult
since micro-architectural features, such as the resource arbitration policy, the access
overheads, the memory sub-system organisation, have to be known and accounted for
under all interference scenarios.
Several commercial many-core platforms are cluster-based, e.g., the Kalray
MPPA®-256 [15] and the STHorm/P2012 [31]. A cluster usually consists of several
cores with a local shared memory and a private address space, while clusters are
connected by specialized networks-on-chip (NoC). On such architectures, tasks can
be delayed when accessing local cluster shared resources not only because of con-
currently executing tasks mapped to the same cluster, but also because of data being
received/sent from/to other clusters. Typically, the data arriving from other clusters
are written to the local cluster memory with the highest priority, thus introducing
timing delays to all tasks that try to access the local memory at the same time. Cur-
rently, no mixed-criticality scheduling and analysis methods exist that address such
interference effects present in modern cluster-based architectures.
This article extends the state-of-the-art by proposing a combined computa-
tion, memory, and communication analysis and optimization framework for mixed-
criticality systems deployed on cluster-based platforms. In particular, the main con-
tributions of the article can be summarized as follows:
– An architecture abstraction for cluster-based manycores with shared computing
(processing cores), storage (local cluster memory, external DDR memory), and
communication (NoC) resources is proposed.
– A mixed-criticality multicore scheduling policy is proposed. It follows a flexible
time-triggered criticality-monotonic scheme, which enforces isolation between
applications of different criticality levels and allows interference on the shared
Mixed-Criticality Scheduling on Cluster-Based Resource-Sharing Manycores 3
communication and storage infrastructure only among applications of the same
criticality that run in parallel in a cluster.
– A worst-case response time (WCRT) analysis for the flexible time-triggered
scheduling policy is proposed. It accounts for the blocking delays not only due
to concurrently executed tasks within the same cluster, but also due to NoC data
transfers. We assume that NoC flows are statically routed and regulated at the
source node [30]. The NoC is modeled and analysed using network and real-time
calculus [27,41].
– A heuristic approach for finding an optimized mapping of mixed-criticality task
sets to the cores of a cluster with the flexible time-triggered scheduling is proposed.
It accounts for the interferences on the shared cluster memory due to both
concurrently executed tasks and inter-cluster NoC communication.
– A heuristic approach for finding an optimized partitioning of task data to the
memory banks of a cluster is proposed. It also accounts for the interferences on
the shared cluster memory banks due to both concurrently executed tasks and
inter-cluster NoC communication.
– The two inter-dependent heuristic approaches are efficiently combined, which allows
finding optimized mappings of tasks to cores and data to memory banks such
that the workload distribution is balanced among cores and interference effects
are minimized within a cluster.
– We demonstrate the applicability and efficiency of the design optimization approaches
as well as the effect of memory sharing and inter-cluster communication
on mixed-criticality schedulability using a real-world avionics application.
We also compare the efficiency of the proposed scheduling policy against
state-of-the-art policies based on extensive simulations with synthetic task sets. The
proposed policy can outperform the compared policies for harmonic workloads.
The article is organised as follows. Sec. 2 provides an overview of recent pub-
lications concerning mixed-criticality scheduling and resource interference analysis.
Sec. 3 introduces the considered mixed-criticality task model and the many-core ar-
chitecture abstraction. Sec. 4 describes the flexible time-triggered scheduling policy
(FTTS) for multicore mixed-criticality systems. Sec. 5 presents a method for response
time analysis under FTTS, which considers explicitly the delays that each task suf-
fers due to contention on the shared memory path by concurrently executed tasks
within a cluster and by incoming NoC traffic. Sec. 6 suggests heuristic optimiza-
tion approaches for (i) mapping tasks to the cores of a cluster, (ii) mapping data to
memory banks, and (iii) an integrated approach for solving these two inter-dependent
problems. In Sec. 7 we apply the developed optimization methods to a real-world case
study and in Sec. 8 we compare the FTTS with other state-of-the-art mixed-criticality
scheduling policies. Sec. 9 concludes the article. Note that to facilitate reading, a
summary of the most important notations is included in the Appendix. Also, Fig.
11 presents an overview of the design optimization flow and how the results of the
individual sections are integrated into it.
2 Related Work
Mixed-Criticality Scheduling. Scheduling of mixed-criticality applications is a re-
search field attracting increasing attention nowadays. After the original work of
Vestal [44], which introduced the currently dominating mixed-criticality task model,
several scheduling policies were proposed for both single-core and multi-core sys-
tems, e.g., [10,7,9,8,35]. For an up-to-date compilation and review of these policies,
we refer the interested reader to the study of Burns and Davis [11].
Among the policies for multicores, we highlight the ones that were designed to
target strict global timing isolation among criticality levels, which is a common re-
quirement of certification authorities. Anderson et al. proposed scheduling mixed-
criticality tasks on multicores, by adopting different strategies (partitioned EDF,
global EDF, cyclic executive) for different criticality levels and utilizing a bandwidth
reservation server for timing isolation [5,33]. Tamas-Selicean and Pop presented an
optimization method for task partitioning and time-triggered scheduling on multicores [40],
complying with the ARINC-653 standard [6], the objective being the minimization
of the certification cost. These works, along with most existing multicore
scheduling policies, however, do not explicitly address the interference that arises when tasks
concurrently access shared platform resources, nor its effect on schedulability. We
believe that this can be dangerous, since Pellizzoni et al. have shown empirically that
traffic on the memory bus in commercial-off-the-shelf systems can increase the response
time of a real-time task by up to 44% [36]. Also, in Sec. 7, we show that platform
parameters, such as the memory or NoC latency or the internal memory organisation,
have a significant effect on the schedulability of mixed-criticality task sets.
The scheduling policy of Giannopoulou et al. [20] seems to be one of the first
to explicitly consider the effect of resource contention on the memory bus. Based
on a flexible time-triggered scheduling scheme and barrier synchronization among
cores (FTTS), this policy enables only a statically known set of tasks of the same
criticality level to be executed in parallel and interfere on the shared memory bus.
The required timing isolation between tasks of different criticality is achieved despite
resource sharing, without any need for hardware support. In a subsequent work, Gi-
annopoulou et al. [21] present a design optimization framework for the deployment of
mixed-criticality applications under the FTTS policy. The target platforms are mul-
ticores with private caches and shared access to a global multi-banked memory. On
such platforms, interference occurs on bank level. The optimization regards the static
mapping of tasks to computation cores and the mapping of the tasks’ private data and
communication buffers to memory banks, such that core utilizations are balanced and
minimized.
The work in this article is based upon FTTS. The presented results extend and
unify the response time analysis and the design optimization methods presented
in [20] and [21]. The methods are extended towards cluster-based architectures
where a task can experience interference not only from tasks executing concurrently
in the same cluster, but also from inter-cluster NoC communication of data being
read/written from/to the cluster. Therefore, real-time properties of the NoC flows
such as delay and burst characteristics are computed using network calculus, and integrated
into the response time analysis and design optimization methods.
Mixed-Criticality Resource Sharing. Recent works have aimed at bounding or
minimizing the delay that high-criticality tasks suffer due to contention on shared
resources, while assuming partitioned task scheduling under traditional (single-
criticality) policies, e.g., fixed priority. These methods differ from FTTS in nature,
since they accept interference among tasks of different criticality levels as long as it
is bounded. FTTS, on the contrary, allows interference on the shared resources among
tasks of equal criticality without any form of budgeting. Through dynamic inter-core
synchronization, it enables timing isolation among different criticalities, hence the
delay imposed to high-criticality tasks by lower-criticality ones is invariably zero.
With regard to shared memory, Yun et al. [50] and Flodin et al. [18] proposed
different software-based memory throttling mechanisms (with predefined [50] or dy-
namically allocated [18] per-core budgets) to explicitly control the inter-core interfer-
ence. Hardware solutions have also been suggested. Paolieri et al. [34] and Goossens
et al. [22] suggested novel memory controller designs for mixed hard real-time and
soft real-time systems. These methods require special hardware support as opposed
to memory throttling and FTTS. With regard to the NoC infrastructure, Tobuschat
et al. [43] implement virtualization and monitoring mechanisms to provide independence
among flows of different criticality. In particular, using back suction [17], they
aim to maximize the allocated bandwidth for lower criticality flows, while providing
guaranteed service to the higher criticality flows. In our work, we assume
real-time guarantees for all NoC flows. However, our work can be also combined
with mixed-criticality NoC guarantees, as in the work of Tobuschat et al. [43].
Data Partitioning. In this work, we also address the problem of mapping task data
to the banks of a shared memory in order to optimize the memory utilization and
minimize the interference among tasks of the same criticality. In the same line, Kim
et al. [25] and Mi et al. [32] proposed heuristics for mapping data of different appli-
cation threads to DRAM banks to reduce the average thread execution times. Con-
trary to our work, these methods do not provide any real-time guarantees. Liu et al.
implemented in [29] a bank partitioning mechanism by modifying the memory man-
agement of the operating system to adopt a custom page-coloring algorithm for the
data allocation to banks, the objective being throughput maximization. Closer to our
objective lies the work of Yun et al. [49], where the authors implemented a DRAM
bank-aware memory allocator, using the page-based virtual memory system of the
operating system to allocate memory pages of different applications / cores to private
banks. The target is performance isolation in real-time systems, since partitioning
DRAM banks among cores eliminates bank conflicts. In our work, we assume that
bank sharing is inevitable, since tasks share access to buffers for communication pur-
poses. We try to minimize the bank conflicts, though, through a combination of task
scheduling and data mapping optimization. Note that mechanisms such as those in [29,49]
can be used to implement the memory mapping decisions of our design optimization
method. Finally, the works of Reineke et al. [39] and Wu et al. [48] rely on DRAM
controllers to implement bank privatization schemes, where each core accesses its
own banks. Similar to the work of Yun et al. [49], such controllers can ensure perfor-
mance isolation, however they do not consider data sharing among tasks running on
different cores. Additionally, they are hardware solutions, not applicable to commer-
cial off-the-shelf platforms.
3 System Model
This section defines the task and platform model as well as the requirements that a
mixed-criticality scheduling strategy must fulfill for certifiability. The task model and
requirements are based on the established mixed-criticality assumptions in literature,
but also on an avionics case study we addressed in the context of an industrial col-
laboration. A description of the avionics application is presented later in Sec. 7. The
platform model is inspired mainly by the Kalray MPPA®-256 architecture [16]. An
overview of this architecture is provided in Sec. 3.2.
3.1 Mixed-Criticality Task Model
We consider mixed-criticality periodic task sets $\tau = \{\tau_1, \ldots, \tau_n\}$ with criticality
levels between 1 (lowest) and $L$ (highest). A task is characterized by a 5-tuple
$\tau_i = \{W_i, \chi_i, C_i, C_{i,deg}, Dep\}$, where:
– $W_i \in \mathbb{R}^{+}$ is the task period.
– $\chi_i \in \{1, \ldots, L\}$ is the criticality level.
– $C_i$ is a size-$L$ vector of execution profiles, where
$C_i(\ell) = (e_i^{min}(\ell), e_i^{max}(\ell), \mu_i^{min}(\ell), \mu_i^{max}(\ell))$
represents a lower and an upper bound
on the execution time ($e_i$) and number of memory accesses ($\mu_i$) of $\tau_i$ at level of
assurance $\ell \le \chi_i$. Note that execution time $e_i$ denotes the computation or CPU
time of $\tau_i$, without considering the time spent on fetching data from the memory.
Such decoupling of the execution (computation) time and the memory accessing
time is feasible on fully timing compositional platforms [47].
– $C_{i,deg}$ is a special execution profile for the cases when $\tau_i$ ($\chi_i < L$) runs in degraded
mode. This profile corresponds to the minimum required functionality of
$\tau_i$ so that no catastrophic effect occurs in the system. If execution of $\tau_i$ can be
aborted without catastrophic effects, $C_{i,deg} = (0, 0, 0, 0)$.
– $Dep(\mathcal{V}, \mathcal{E})$ is a directed acyclic graph representing dependencies among tasks with
equal periods. Each node $\tau_i \in \mathcal{V}$ represents a task of $\tau$. A weighted edge $e \in \mathcal{E}$
from $\tau_i$ to $\tau_k$ implies that within a period the job of $\tau_i$ must precede that of $\tau_k$.
The weight $w(e)$ denotes the minimum time that must elapse from the completion
of $\tau_i$’s execution until the activation of $\tau_k$. If $w(e) = 0$, $\tau_k$ can be scheduled at
the earliest right after $\tau_i$. In the following, we refer to $w(e)$ as the minimum distance
constraint.
For simplicity, we assume that the first job of all tasks is released at time 0 and that
the relative deadline $D_i$ of $\tau_i$ is equal to its period, i.e., $D_i = W_i$. Furthermore,
the worst-case parameters of $C_i(\ell)$ are monotonically increasing for increasing $\ell$,
whereas the best-case parameters are monotonically decreasing. Namely, the
min./max. interval of execution times and memory accesses in $C_i(\ell)$ is included in
the corresponding interval of $C_i(\ell + 1)$. Note that the best-case parameters are only
needed (i) to ensure that the minimum distance constraint of dependent tasks is not
violated and (ii) to obtain more accurate results from the response time analysis, as
discussed in Sec. 5.
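To make the task model concrete, the following is a minimal Python sketch of the per-task parameters and of the interval-nesting assumption on the execution profiles. The class and field names (`ExecutionProfile`, `Task`, `is_well_formed`) are illustrative assumptions and do not appear in the article; the dependency graph is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ExecutionProfile:
    """One entry C_i(l): bounds on CPU time and on the number of memory accesses."""
    e_min: float
    e_max: float
    mu_min: int
    mu_max: int


@dataclass
class Task:
    """Hypothetical container for the task parameters (names are assumptions)."""
    name: str
    period: float                          # W_i, with relative deadline D_i = W_i
    criticality: int                       # chi_i in {1, ..., L}
    profiles: Dict[int, ExecutionProfile]  # C_i(l), defined only for l <= chi_i
    degraded: ExecutionProfile = field(    # C_i,deg; all zeros if the task may be aborted
        default_factory=lambda: ExecutionProfile(0.0, 0.0, 0, 0))

    def is_well_formed(self) -> bool:
        """Check that profiles exist exactly for levels 1..chi_i and are nested."""
        levels = sorted(self.profiles)
        if levels != list(range(1, self.criticality + 1)):
            return False
        for p in self.profiles.values():
            if not (0 <= p.e_min <= p.e_max and 0 <= p.mu_min <= p.mu_max):
                return False
        # The interval at level l must be contained in the interval at level l + 1.
        for lo, hi in zip(levels, levels[1:]):
            a, b = self.profiles[lo], self.profiles[hi]
            if not (b.e_min <= a.e_min and a.e_max <= b.e_max
                    and b.mu_min <= a.mu_min and a.mu_max <= b.mu_max):
                return False
        return True
```

A task of criticality 2 would carry two profiles, with the level-2 bounds at least as wide as the level-1 bounds; a task whose level-2 best-case execution time exceeds its level-1 value would be rejected by the check.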
The bounds for the execution times and accesses can be obtained by different
tools. For instance, at the lowest level of assurance ($\ell = 1$), the system designer may
extract them by profiling and measurement, as in [36]. At higher levels, certification
authorities may use static analysis tools with more and more conservative assumptions
as the required confidence increases. Note that the execution profile $C_i(\ell)$ for
each task $\tau_i$ is derived only for $\ell \le \chi_i$. For $\ell > \chi_i$, there is no valid execution profile
since certification at level $\ell$ ignores all tasks with a lower criticality level. At runtime,
if a task with criticality level greater than $\chi_i$ requires more resources than initially
expected, then $\tau_i$ may run in degraded mode with execution profile $C_{i,deg}$.
The motivation behind defining a degraded execution profile, Ci,deg, is that in
safety-critical applications, tasks typically cannot be aborted due to safety reasons.
Several prior mixed-criticality scheduling policies in the literature assume, nonethe-
less, that if a higher criticality task requires at runtime more resources than initially
assigned to it (according to some optimistic resource allocation), then all lower crit-
icality tasks can be aborted from that point on, permanently or temporarily (see
study [11]). In our work, we assume that each task has a minimal functionality that
must be executed under all circumstances so that no catastrophic effect occurs in the
system. The corresponding execution requirements (Ci,deg) must be fulfilled by any
mixed-criticality scheduling policy.
Finally, the dependency graph $Dep(\mathcal{V}, \mathcal{E})$ is a common consideration in scheduling
to model, e.g., data dependencies among tasks. In our work, we introduce the
minimum distance constraint for any dependent task pair. This is necessary to model
scheduling constraints that stem from inter-cluster communication through a NoC in
cluster-based architectures. Such constraints are discussed later in Sec. 3.2 and 6.1.
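The minimum distance constraint can be illustrated with a short Python sketch. It assumes that edges are given as hypothetical `(predecessor, successor, w)` triples and that activation and completion times within one period are already known; none of these names come from the article.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, float]  # (predecessor, successor, minimum distance w(e))


def precedence_order(edges: List[Edge]) -> List[str]:
    """Return one task order that respects all precedence edges.

    Raises graphlib.CycleError if the edges do not form an acyclic graph.
    """
    graph: Dict[str, Set[str]] = {}
    for pred, succ, _ in edges:
        graph.setdefault(pred, set())
        graph.setdefault(succ, set()).add(pred)
    return list(TopologicalSorter(graph).static_order())


def violated_edges(edges: List[Edge],
                   activation: Dict[str, float],
                   completion: Dict[str, float]) -> List[Edge]:
    """Edges whose successor is activated earlier than completion + w(e)."""
    return [(pred, succ, w) for pred, succ, w in edges
            if activation[succ] < completion[pred] + w]
```

For an edge with $w(e) = 5$ whose predecessor completes at time 10, an activation of the successor at time 14 is reported as a violation, whereas an activation at time 15 is accepted.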
3.2 Resource-Sharing Platform Model
We consider a cluster $\mathcal{P}$ of $m$ processing cores, $\mathcal{P} = \{p_1, \ldots, p_m\}$. Here, the cores
are identical, but nothing prevents extending our approach to heterogeneous
platforms. The mapping of the task set $\tau$ to the cores in $\mathcal{P}$ is defined by the function
$M_\tau : \tau \to \mathcal{P}$. Note that $M_\tau$ is not given, but is determined by our approach
in Sec. 6.
Each core in $\mathcal{P}$ has access to a private cache memory (we restrict our interest to
data caches, denoted by ‘D’ in Fig. 1), to a shared RAM memory and to an external
DDR memory, which is local to another cluster. The shared cluster memory is organized
in several banks. Each bank must have a sequential address space (not interleaved
among banks) to limit potential inter-task interference. Under this assumption,
two concurrently executed tasks on different cores can perform parallel accesses to
the shared memory without delaying each other provided that they access different
banks. We assume that each memory bank has a dedicated request arbiter. Also, each
core has a private path (bus) to the shared memory. The private paths of the cores
are connected to all bank arbiters, as depicted abstractly (for Arb1) in Fig. 1.
Fig. 1 Shared memory architecture
For the bank arbitration, we consider the class of round-robin-based policies, potentially with
higher priority for some bank masters other than the cores in $\mathcal{P}$ (if such exist), e.g.,
the Rx interface in Fig. 1. We assume that only one core can access a bank at a time
and that once granted, a bank access is completed within a fixed time interval, $T_{acc}$
(same for read/write operations and for all banks). In the meantime, pending requests
to the same bank from other cores stall execution on their cores until they are served.
Additionally, the cores in $\mathcal{P}$ have access to external memories of remote clusters
through a Network-on-Chip (NoC). Contention may occur in the
NoC at router level every time two flows (virtual channels) need to be routed to the
same outgoing link. Again, the assumed arbitration policy is round-robin at packet
level.
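As a rough illustration of why these assumptions make bank contention boundable, the sketch below computes a coarse per-access bound under plain round-robin arbitration. It is not the response time analysis of Sec. 5: it assumes that every contending core has at most one pending request (cores stall until served), that all contenders target the same bank, and that no higher-priority master such as the Rx interface is active. The function names are hypothetical.

```python
def rr_access_time_bound(num_contenders: int, t_acc: float) -> float:
    """Coarse worst-case time to complete one bank access under round-robin.

    num_contenders: cores that may access the same bank concurrently,
                    including the core under analysis (assumption).
    t_acc:          fixed duration T_acc of one granted bank access.
    """
    if num_contenders < 1 or t_acc <= 0:
        raise ValueError("need at least one contender and a positive T_acc")
    # Up to (num_contenders - 1) other requests are served first, then our own.
    return num_contenders * t_acc


def rr_memory_time_bound(mu_max: int, num_contenders: int, t_acc: float) -> float:
    """Coarse bound on the total memory access time of a task with mu_max accesses."""
    return mu_max * rr_access_time_bound(num_contenders, t_acc)
```

With four contending cores and $T_{acc} = 2$ time units, a task issuing at most 100 accesses would spend at most 800 time units on memory accesses under these simplifying assumptions.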
The discussed abstract architecture model closely matches commercial manycore
platforms, such as the Kalray MPPA®-256 [16] and the STHorm/P2012 [31]. In the
remainder of the paper, we motivate our models and methods based on the former
architecture; therefore, we examine it in greater detail below. Note that
our response time analysis in Sec. 5 is valid for hardware platforms without tim-
ing anomalies, such as the fully timing compositional architecture which is defined
in [47]. On such architectures, a locally worse behavior (e.g., a cache miss instead of
a cache hit) cannot lead to a globally better effect for a task (e.g., reduced worst-case
response time). The absence of timing anomalies allows the decoupling of execution
and communication times during timing analysis. For a more detailed discussion on
the property of timing compositionality, the interested reader is referred to [47] and
for a rigorous definition of the term in resource-sharing systems to [23], respectively.
The MPPA®-256 cores have been shown to be fully timing compositional [16].
Our response time analysis and optimization methods could also be extended to
cover other arbitration policies, e.g., first-ready first-come-first-serve, which seems
to be a common policy on COTS multicores [24], provided that the assumption
of timing compositionality still holds.
Fig. 2 MPPA®-256 D-NoC topology and router model: (a) MPPA®-256 D-NoC topology; (b) MPPA®-256 D-NoC router model
Kalray MPPA® Architecture. The Kalray MPPA®-256 Andey processor integrates
256 processing cores and 32 resource management cores (denoted by ‘RM’ in Fig. 1
and 3), which are distributed across 16 compute clusters and four I/O sub-systems.
Each compute cluster includes 16 processing cores and one resource management
core, each with private instruction and data caches. The processing cores and the
management core implement the same VLIW architecture. However, the manage-
ment core is distinguished by its connection to the NoC interfaces. Each I/O sub-
system includes four resource management cores that share a single data cache, and
no processing cores. Application code is executed on the compute clusters (process-
ing cores), whereas the I/O sub-systems are dedicated to the management of exter-
nal DDR memories, Ethernet I/O devices, etc. Each compute cluster and I/O sub-
system owns a private address space. The DDR memory is only visible in the ad-
dress space of the resource management cores of the I/O sub-system. Communica-
tion and synchronization among compute clusters and I/O sub-systems is supported
by two explicitly routed, parallel networks-on-chip, the data (D-NoC) and the con-
trol (C-NoC) network-on-chip. Here, we consider only the D-NoC. This is dedicated
to high-bandwidth data transfers and may operate with guaranteed services, thanks
to non-blocking routers with flow regulation at the source nodes [30], which is an
important feature for the deployment of safety-critical applications.
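As a quick consistency check on the figures quoted above, the core counts follow directly from the per-cluster and per-sub-system numbers:

```latex
\begin{align*}
\text{processing cores} &= 16 \text{ clusters} \times 16 = 256,\\
\text{resource management cores} &= 16 \times 1 + 4 \times 4 = 32.
\end{align*}
```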
Fig. 2(a) presents an overview of the MPPA®-256 architecture and the D-NoC
topology. Each square in the figure corresponds to a switching node and interface of
the D-NoC, for a total of 32 nodes: one per compute cluster (16 internal nodes) and
four per I/O sub-system (16 external nodes). The I/O sub-systems are depicted on
the four sides. The NoC topology is based on a 2D torus augmented with direct links
between I/O sub-systems. Fig. 2(b) depicts the internal structure of a D-NoC router.
The MPPA®-256 routers multiplex flows originating from different directions (input
ports). Each originating direction (north N, south S, west W, east E, local node L)
has its own FIFO queue at the output interface, so flows interfere on a node only
if they share a link (output port) to the next node. This interface performs a round-
robin arbitration at the packet granularity between the FIFOs that contain data to
send on a link. NoC traffic through a router interferes with the memory buses of the
underlying I/O sub-system or compute cluster only if the NoC node is a destination
for the transfer.
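The packet-level round-robin arbitration at one output link can be sketched as follows. The sketch assumes that all packets are already queued, that packets have equal length, and that the FIFOs are visited in the fixed order N, S, W, E, L; the visiting order and all names are assumptions made for illustration only.

```python
from collections import deque
from typing import Deque, Dict, List

DIRECTIONS = ("N", "S", "W", "E", "L")  # assumed visiting order of the input FIFOs


def drain_output_link(fifos: Dict[str, Deque[str]]) -> List[str]:
    """Return the order in which queued packets leave one output link.

    One packet is taken from each non-empty FIFO per round; empty FIFOs
    are skipped, so they do not consume any link bandwidth.
    """
    sent: List[str] = []
    while any(fifos.get(d) for d in DIRECTIONS):
        for d in DIRECTIONS:
            queue = fifos.get(d)
            if queue:
                sent.append(queue.popleft())
    return sent
```

With three packets queued from the north, one from the west and two from the local node, the packets leave in the order n1, w1, l1, n2, l2, n3, which shows that a flow is delayed by at most one packet of every other active direction per round.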
Each of the compute clusters and the I/O sub-systems has a local on-chip memory,
which is organized in 16 independent banks with a dedicated access arbiter for
each bank. In the compute clusters, this arbiter always grants access to data received
(Rx) from the D-NoC. That is, if an access request from D-NoC Rx arrives, it will be
immediately served after the current access to the memory bank is completed. The
remaining bandwidth is allocated to two groups of bus masters in round-robin fash-
ion. The first group comprises the resource management core, the debug support unit,
and a DMA engine dedicated to data transmission (Tx) over the D-NoC. The second
group is composed of the processing cores. Inside each group, the allocation policy
is also round-robin. In practice, one may abstract the arbitration policy of memory
banks as illustrated in Fig. 3, where the debug support unit is omitted for simplicity.
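The abstracted arbitration policy can be sketched in Python as a two-level round-robin with a fixed-priority Rx master. The master names (`"Rx"`, `"RM"`, `"Tx"`, `"PE0"`, ...), the choice of which group is served first, and the class names are assumptions; the debug support unit is omitted, as in the abstraction above.

```python
from typing import List, Optional, Set


class RoundRobin:
    """Round-robin selection among a fixed list of members."""

    def __init__(self, members: List[str]) -> None:
        self.members = list(members)
        self.next_idx = 0

    def pick(self, pending: Set[str]) -> Optional[str]:
        n = len(self.members)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            if self.members[idx] in pending:
                self.next_idx = (idx + 1) % n  # the winner goes to the back
                return self.members[idx]
        return None


class BankArbiter:
    """Sketch of one memory bank arbiter: Rx first, then two round-robin groups."""

    def __init__(self, num_cores: int = 16) -> None:
        self.groups = [
            RoundRobin(["RM", "Tx"]),                          # group 1
            RoundRobin([f"PE{i}" for i in range(num_cores)]),  # group 2
        ]
        self.next_group = 0

    def grant(self, pending: Set[str]) -> Optional[str]:
        """Select the master served once the current access has completed."""
        if "Rx" in pending:
            return "Rx"  # incoming D-NoC data always wins
        for offset in range(len(self.groups)):
            g = (self.next_group + offset) % len(self.groups)
            winner = self.groups[g].pick(pending)
            if winner is not None:
                self.next_group = (g + 1) % len(self.groups)
                return winner
        return None
```

With pending requests from Rx, RM, Tx and two processing cores, this sketch serves Rx first and then alternates between the two groups, yielding the order Rx, RM, PE0, Tx, PE1.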
Within a compute cluster, the memory address mapping can be configured either
as interleaved or as sequential. In the sequential address configuration, each bank
spans 128 KB consecutive addresses. This is the assumed configuration in our work.
By using linker scripts, one can statically map the private code and data of each processing
core onto different memory banks. This guarantees that no interference between
processing cores occurs on the memory buses or the arbiters of the memory banks.
Such memory mapping optimization to eliminate inter-core interference will be considered
in Sec. 6.
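Under the sequential configuration, the bank holding a given byte follows from integer division by the bank size. The sketch below assumes offsets measured from the start of the cluster memory (the actual base address is not given here) and uses the 16 banks of 128 KB stated above; the function names are hypothetical.

```python
BANK_SIZE = 128 * 1024  # bytes per bank in the sequential configuration
NUM_BANKS = 16          # banks per cluster, i.e., 2 MB of cluster memory in total


def bank_of(offset: int) -> int:
    """Bank index for a byte offset from the start of the cluster memory."""
    if not 0 <= offset < BANK_SIZE * NUM_BANKS:
        raise ValueError("offset outside the cluster memory")
    return offset // BANK_SIZE


def banks_touched(offset: int, length: int) -> range:
    """Indices of all banks covered by a buffer of `length` bytes at `offset`."""
    if length <= 0:
        raise ValueError("length must be positive")
    return range(bank_of(offset), bank_of(offset + length - 1) + 1)
```

A buffer that straddles a 128 KB boundary touches two banks, and can therefore suffer interference from the tasks mapped to either bank, which is one reason to keep buffers bank-aligned when deciding the data mapping.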
Fig. 3 Memory bank request arbitration in an MPPA®-256 cluster
When a processing core from a compute cluster requires access to an external
DDR memory, this is achieved through the I/O sub-systems, since compute clusters
do not have direct access to the external memories. Each I/O sub-system includes
a DDR memory controller, which arbitrates the access among different initiators
according to a round-robin policy. The initiators include, among others, the D-NoC
interfaces and the resource management cores of the I/O sub-systems.
Communication Protocol between a Compute Cluster and an I/O sub-system. In
the remainder of this paper, we consider the processing cores of one MPPA®-256
compute cluster as the core set P, introduced earlier in the abstract model. The cores
in P share access to the SRAM memory banks of the cluster and can transfer data
from/to an external DDR memory over the D-NoC. For the data transfer, we consider
a specific protocol. This is illustrated in Fig. 4 and described below:
For each periodic task τi ∈ τ that requires data from an external memory, there
is a preceding task τinit,i with the same period and criticality level, which initiates
the data transfer. Specifically, τinit,i communicates with a dedicated listener task
which is executed in an I/O sub-system with access to the target DDR. τinit,i sends
a notification to the listener, including all relevant information for the DDR access,
e.g., base address in DDR, data length, base address of allocated space in cluster
memory. To send the notification, τinit,i activates the cluster’s NoC Tx interface.
The transfer is asynchronous, i.e., τinit,i can complete its execution after sending
the notification, without expecting any acknowledgement of its reception. We de-
note the maximum time required for the transmission of the notification packet(s)
over the D-NoC as worst-case notification time (WCNT). This time is computed
by our response time analysis method in Sec. 5.2.
Upon reception of the notification, the remote listener (i) decodes the request, (ii)
allocates a D-NoC Tx DMA channel and a flow regulator, (iii) sets up the D-NoC
Tx DMA engine with the transfer parameters, and (iv) initiates the transfer. From
there the DMA engine will transmit the data through the D-NoC to the target com-
pute cluster. We denote the maximum required time interval for actions (i)-(iv)
as worst-case remote set-up time (WCRST). We assume that the WCRST can be
derived based on measurements on the target platform.
The transmitted packets follow a pre-defined route on the D-NoC before they are
written to the local cluster memory by the cluster’s NoC Rx interface. We denote
the maximum time required for the data transmission over the D-NoC as worst-
case data fetch time (WCDFT). We show how to compute this time for pre-routed
and regulated flows in Sec. 5.2.
Fig. 4 Communication protocol for reading data from external DDR memory (timeline between the compute cluster (CC) and the I/O subsystem over the D-NoC: notification transfer from CC to I/O, bounded by WCNT; listener steps (i)-(iv), bounded by WCRST; data transfer from I/O to CC, bounded by WCDFT; together these form the minimum distance constraint)
Note that for the considered protocol, τinit,i should be scheduled early enough so
that the required data are already in the cluster memory when τi is activated. This
implies that for every pair (τinit,i, τi) in τ, where τinit,i initiates a data transfer
from a remote memory for τi to use these data, an edge between τinit,i and τi must
exist in the dependency graph Dep. The edge is weighted by the minimum distance
constraint, which in this case equals the sum of the worst-case notification time,
WCNT, the worst-case remote set-up time, WCRST, and the worst-case data fetch
time, WCDFT.
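The edge weight described above is a plain sum of three bounds. The following sketch, with hypothetical names and made-up example values, shows how such a weighted dependency edge could be recorded. It is an illustration under these assumptions, not the representation used by the authors.

```python
def min_distance_constraint(wcnt: float, wcrst: float, wcdft: float) -> float:
    """Weight of the edge (tau_init_i -> tau_i): WCNT + WCRST + WCDFT.

    All three arguments must use the same time unit.
    """
    return wcnt + wcrst + wcdft


# Hypothetical bounds in microseconds (illustrative values, not measurements).
weight = min_distance_constraint(wcnt=5.0, wcrst=20.0, wcdft=100.0)

# Hypothetical encoding of the dependency graph Dep as a map from edges to weights.
dep = {("tau_init_i", "tau_i"): weight}
```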
The communication protocol has been described for data transfer between a com-
pute cluster and an I/O sub-system. Note, however, that a similar procedure takes
place also for inter-cluster data exchange. Note also that for sending data to a re-
mote memory, a task can directly initiate the transfer through the cluster’s NoC Tx
interface or by setting up a DMA transfer over the D-NoC. The transfer is assumed
asynchronous and no handshaking with the remote cluster is required.
3.3 Mixed-Criticality Scheduling Requirements
Under the above system assumptions, we seek a correct scheduling strategy for the
mixed-criticality task set τ on P, which will enable composable and incremental
certifiability. We define below the properties of correctness, composable and incremental
certifiability, which are crucial for a successful and economical certification process.
Definition 1 A scheduling strategy is correct if it schedules a task set τ such that
the provided schedule is admissible at all levels of assurance. A schedule of τ is
admissible at level ℓ if and only if:
the jobs of each task τi, satisfying χi ≥ ℓ, receive enough resources between
their release time and deadline to meet their real-time requirements according to
execution profile Ci(ℓ),
the jobs of each task τi, satisfying χi < ℓ, receive enough resources between
their release time and deadline to meet their real-time requirements according to
execution profile Ci,deg. □
The term resources, in this context, refers to both processing time and communication
time for accessing the shared memory and NoC.
Definition 2 A scheduling strategy enables composable certifiability if all tasks of
a criticality level ℓ are temporally isolated from tasks with lower criticality, for all
ℓ ∈ {1, . . . , L}. Namely, the execution and access activities of a task τi must not
delay in any way any task with criticality level greater than χi. □
The requirement for composability enables different certification authorities to certify
task subsets of a particular criticality level ℓ even without any knowledge of the
tasks with lower criticality in τ. This is important when several certification authorities
need to certify not the whole system, but individual parts of it. Each authority
needs information on the scheduling of tasks with higher criticality level than the one
considered. Such information can be provided by the responsible authorities for the
higher-criticality task subsets.
Definition 3 A scheduling strategy enables incremental certifiability if the real-time
properties of the tasks at all criticality levels ℓ ∈ {1, . . . , L} are preserved when new
tasks are added to the system. □
This property implies that if the schedule of a task set τ is certified as admissible,
the certification process will not need to be repeated if new tasks are added later to
the system. This is reasonable, since repeating the certification process of already
certified tasks if the system is designed incrementally results in excessive costs.
Note that the above notion of correctness (Definition 1) is not new in mixed-
criticality scheduling theory. On the other hand, the requirements for composable and
incremental certifiability seem to be crucial in safety-critical domains, e.g., avion-
ics, for reducing the effort and cost of certification. Nonetheless, they are usually
not considered explicitly in mixed-criticality scheduling literature (see study [11]).
Exceptions are the works that are based on temporal partitioning as defined in the
ARINC-653 standard [6], such as the approach of Tamas-Selicean and Pop [40], and
the works that use servers for performance isolation among applications of different
criticality levels, such as the approach of Yun et al. [50].
4 Flexible Time-Triggered Scheduling
This section discusses briefly the Flexible Time-Triggered and Synchronisation-
based (FTTS) mixed-criticality scheduling policy for multicores. The reader is re-
ferred to [20] for a more detailed presentation. In this section, we assume that an
FTTS schedule for a particular task set and platform is given. For the given schedule,
we describe the runtime behavior of the scheduler and introduce useful notation. We
show how to determine an FTTS schedule (when it is not given) later, in Sec. 6.1.
Fig. 5 Global FTTS schedule for 2 cycles (dark annotation: criticality level 2, light: criticality level 1)
The non-preemptive FTTS scheduling policy combines time and event-triggered
task activation. A global FTTS schedule repeats over a scheduling cycle equal to
the hyper-period H of the tasks in τ. The scheduling cycle consists of fixed-size
frames (set F). Each frame is divided further into L flexible-length sub-frames. The
beginning of frames and sub-frames is synchronized among all cores in a cluster. The
frame lengths can differ, but they are upper bounded by the minimum period in τ.
Each sub-frame (except the first of a frame) starts once all tasks of the previous sub-
frame complete execution across all cores. Synchronisation is achieved dynamically
via a barrier mechanism, for the sake of efficient resource utilization. Each sub-frame
contains only tasks of the same criticality level. Note that the sub-frames within a
frame are ordered in decreasing order of their criticality and that within a sub-frame,
tasks are scheduled sequentially on each core following a predefined order, namely
every task is triggered upon completion of the previous one.
An illustration of an FTTS schedule is given in Fig. 5 for seven tasks with
hyper-period H = 200 ms. Fig. 5 depicts two consecutive scheduling cycles. The solid lines
define the frames and the dashed lines the sub-frames, i.e., potential points where
barrier synchronisation is performed. The FTTS schedule has a cycle of H = 200 ms
and is divided into four frames of equal length (50 ms), each with L = 2 sub-frames:
the first for criticality 2 (high) and the second for criticality 1 (low). A
scheduling cycle includes H/Wi invocations of each task τi, i.e., the number of jobs
of τi that arrive within a hyper-period.
At runtime, the length of each sub-frame varies based on the different execution
times and accessing patterns that the concurrently executed tasks exhibit. For example,
in Fig. 5, the first sub-frame of f1 finishes earlier when τ1, τ2 run w.r.t. their
level-1, i.e., low-criticality profiles (cycle 1) than when at least one task runs w.r.t. its
level-2, i.e., high-criticality profile (cycle 2). Despite this dynamic behavior, the
sub-frame worst-case lengths can be computed offline for a given FTTS schedule by
applying worst-case response time analysis under memory contention (Sec. 5).
Function barriers: F × {1, . . . , L} → ℝ^L defines the worst-case length of all
sub-frames in a frame, for a particular level of assurance. We denote the worst-case
length of the k-th sub-frame of frame f at level ℓ as barriers(f, ℓ)k. Note that the
k-th sub-frame of f contains tasks of criticality level (L − k + 1). Also, ℓ corresponds
to the highest level execution profile that the tasks of f exhibit at runtime. For ℓ > 1,
execution in certain sub-frames of f (with index k > 1) may be degraded.
Runtime behavior. Given an admissible FTTS schedule and the barriers function, the
scheduler manages task execution on each core within each frame f ∈ F as follows
(init., ℓmax = 1):
For the k-th sub-frame, the scheduler triggers sequentially the corresponding jobs.
Upon completion of the jobs’ execution, it signals the event and waits until the
remaining cores reach the barrier.
Let the elapsed time from the beginning of the k-th sub-frame until the barrier
synchronisation be t. Given ℓmax:
\[ \ell_{\max} = \max\Big( \arg\min_{\ell \in \{1,\dots,L\}} \big\{ t \le \mathit{barriers}(f,\ell)_k \big\},\ \ell_{\max} \Big), \tag{1} \]
the scheduler will trigger jobs in all next sub-frames such that tasks with criticality
level lower than ℓmax run in degraded mode.
The two previous steps are repeated for each sub-frame, until the next frame is
reached.
Note that the decision on whether a task will run in degraded mode affects only the
current frame.
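The update of ℓmax at a barrier (Eq. 1) can be sketched as follows. The sketch reads the argmin as "the lowest level of assurance whose worst-case sub-frame bound covers the observed time t". The data layout, the function name and the numbers are assumptions for illustration only.

```python
def update_l_max(t, barriers_f, k, l_max):
    """Return the new l_max after sub-frame k of a frame finished at elapsed time t.

    barriers_f maps each level of assurance (1..L) to the list of worst-case
    sub-frame lengths of this frame at that level (hypothetical layout);
    k is 1-based, as in the text.
    """
    for level in sorted(barriers_f):
        if t <= barriers_f[level][k - 1]:
            # Lowest level whose bound is not exceeded; l_max never decreases
            # within a frame.
            return max(level, l_max)
    # Assumption: an overrun of all bounds lies outside the analysed model.
    raise ValueError("elapsed time exceeds the worst-case bound at every level")


# Hypothetical bounds (ms) for one frame with L = 2 levels and 2 sub-frames.
barriers_f1 = {1: [20, 25], 2: [30, 10]}
```

With these values, a first sub-frame that ends after 18 ms keeps ℓmax at 1, whereas one that ends after 27 ms raises it to 2, so the level-1 tasks of the second sub-frame run in degraded mode.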
Admissibility. Let an FTTS schedule be constructed such that all H/Wi jobs of each
task τi ∈ τ are scheduled on the same core within their release times and deadlines
and all dependency constraints hold. The FTTS schedule is ℓ-admissible if and only
if it fulfills the following condition:
\[ \sum_{k=1}^{L} \mathit{barriers}(f,\ell)_k \le L_f, \quad \forall f \in \mathcal{F}, \tag{2} \]
where Lf denotes the length of frame f. If the condition holds for all frames f ∈ F,
all scheduled jobs in S can meet their deadlines at level of assurance ℓ. If it holds for
all levels ℓ ∈ {1, . . . , L}, it follows that the FTTS schedule is admissible according
to Definition 1 of Sec. 3.3. That is, it can be accepted by any certification authority at
any level of assurance and the scheduling strategy is correct.
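The admissibility condition of Eq. 2 amounts to a per-frame sum check. Below is a minimal sketch, assuming the barriers function is available as a nested dictionary. The layout, names and numbers are hypothetical.

```python
def is_level_admissible(barriers, frame_len, level):
    """Eq. 2: at the given level, the worst-case sub-frame lengths of every
    frame must sum to at most the frame length."""
    return all(sum(barriers[f][level]) <= frame_len[f] for f in frame_len)


def is_admissible(barriers, frame_len, levels):
    """Admissible in the sense of Definition 1: Eq. 2 holds at every level."""
    return all(is_level_admissible(barriers, frame_len, l) for l in levels)


# Hypothetical data: two 50 ms frames, L = 2 levels, 2 sub-frames per frame.
frame_len = {"f1": 50, "f2": 50}
barriers = {
    "f1": {1: [20, 25], 2: [30, 10]},
    "f2": {1: [15, 30], 2: [35, 20]},  # level 2 needs 55 ms, more than 50 ms
}
```

In this made-up instance the schedule is 1-admissible but not 2-admissible, so it would not be admissible overall.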
If different certification authorities certify task subsets of different criticality
levels, then for composable certifiability (Definition 2), the authorities of lower-
criticality subsets need information on the resource allocation for the higher-
criticality subsets. For the FTTS scheduling strategy, this information is fully rep-
resented by function barriers. Therefore, global FTTS enables composable certifi-
ability. Similarly, it enables incremental certifiability (Definition 3), since new tasks
can be inserted into their respective criticality level sub-frame if there is sufficient
slack time in the frame.
It follows from the above that the computation of function barriers is neces-
sary to evaluate if an FTTS schedule is admissible, but also as an interface among
certification authorities and system designers. In the next section, we show how to
compute this function step-by-step, considering all possible task interferences for a
given FTTS schedule.
5 Response Time Analysis for the Computation of barriers
This section describes how to compute function barriers for a given FTTS schedule.
For the computation of barriers, we need to bound the worst-case length of each
sub-frame of the FTTS schedule at every level of assurance ℓ ∈ {1, . . . , L}. For this,
first we perform worst-case response time (WCRT) analysis for every single task that
is scheduled within the sub-frame. Second, based on the results of the first step, we
derive the worst-case response time of the sequence of tasks which is executed on
every core within the sub-frame (per-core WCRT, CWCRT). Once the last value is
computed for all cores in P, the worst-case sub-frame length follows trivially as the
maximum among all per-core WCRTs.
The challenge in the above procedure lies in the computation of an upper bound
for the response time of a task in a specific sub-frame. Note that for the timing
compositional architectures which we consider, such as the MPPA®-256, it is safe to
bound the WCRT of a task by the sum of its worst-case execution (CPU) time and the
worst-case delay it experiences due to memory accessing and communication [47].
The worst-case execution time of each task is known at different levels of assurance
as part of its execution profile C. However, to bound the second WCRT component,
one needs to account for the interference on the shared cluster memory, i.e., interfer-
ence among tasks running in parallel and from the NoC interface, when some of them
try to access the same memory bank simultaneously. Therefore, to derive the WCRT
of a task in a specific sub-frame, we need to model the possible interference scenar-
ios based on the tasks that are concurrently executed and the NoC traffic patterns and
then, to analyse the worst-case delay that such interference can incur to the execution
of the task under analysis.
Sec. 5.1 shows how to bound the worst-case delay a task can experience on the
shared memory path. For this, we consider as given: the mapping of tasks to the
cores of P, Mτ, the mapping of the tasks’ data to memory banks, Mmem, the incoming
NoC traffic patterns, and the memory access latency, Tacc. Sec. 5.2 describes
a method for NoC analysis, based on the network and real-time calculus [27,41],
which enables us to compute a bound on all NoC incoming traffic patterns at the
shared memory of a cluster. This bound is used in the analysis of Sec. 5.1.
5.1 Bounding Delay on Cluster Memory Path
Within an FTTS sub-frame, we identify two sources of delay that a task may experi-
ence on the memory path based on the platform model of Sec. 3.2:
I. Blocking on a memory bank arbiter due to contention from other access requesters,
specifically any other processing core or the NoC Tx DMA interface. Since con-
tention is resolved among these requesters in a round-robin fashion, the task under
analysis will have to wait for its turn in the round-robin cycle to be granted access
to the memory bank.
II. Blocking on a memory bank arbiter due to contention from the NoC Rx interface.
This requester has higher priority when accessing the memory, so in the worst case,
the task under analysis will have to wait for all accesses of the NoC Rx interface
to be served before it can gain access to the memory bank.
In the following, we model interference on the shared memory in the form of a
memory interference graph. Based on this graph, we compute the maximal delay that
each task can cause to another when they are executed in parallel and in the presence
of incoming traffic from the NoC Rx interface. This way, we are able to estimate the
WCRT of each task within the FTTS sub-frame and subsequently, the global worst-
case sub-frame length.
5.1.1 Memory Interference among Requesters with Equal Priorities
To model the inter-task interferences due to round-robin-arbitrated memory contention,
we introduce a graph representation, called the memory interference graph
I(V, E). We define V = Vτ ∪ VBL ∪ VB, where Vτ represents all tasks in τ (running
on processing cores and NoC Tx), VBL represents all memory blocks BL accessed
by τ, i.e., the tasks’ instructions, data and communication buffers, and VB represents
all banks B of the shared memory. Each memory block (bank) node is annotated with
a corresponding size (capacity) in bytes. I is composed of two sub-graphs: (i) the
bipartite graph I1(Vτ ∪ VBL, E1), where an edge e ∈ E1 from τi ∈ Vτ to blj ∈ VBL
with weight w(e) implies that task τi performs at most w(e) accesses to memory
block blj per execution, and (ii) the bipartite graph I2(VBL ∪ VB, E2), where an
edge e ∈ E2 from blj ∈ VBL to bk ∈ VB denotes the allocation of memory block
blj in exactly one memory bank bk. Note that the weighted sum over all outgoing
edges of a task τi equals the memory access bound of its execution profile at its own
criticality level, i.e., μ_i^max(χi). The weights can, however, be refined (reduced) if
WCRT analysis is performed for lower levels of assurance, ℓ < χi. In this case, the
weighted sum over the outgoing edges of τi equals the (more optimistic) μ_i^max(ℓ),
which enables tighter WCRT analysis for the specific level of assurance.
Definition 4 Tasks τi and τj are interfering if and only if ∃ k, l, r ∈ ℕ+: (τi, blk) ∈ E1,
(τj, bll) ∈ E1 and (blk, br) ∈ E2, (bll, br) ∈ E2, i.e., they access blocks in the
same bank. □
Fig. 6 presents a possible memory interference graph for a set of five tasks, ac-
cessing in total five memory blocks. The memory blocks can be allocated to two
banks. Ellipsoid, rectangular and diamond nodes denote tasks, memory blocks, and
banks, respectively. Note that for the depicted mapping of memory blocks to banks,
tasks τ1 and τ2 are interfering, whereas τ1 and τ3 or τ4 or τ5 are not. Interfering tasks
can delay each other when executed in parallel.
In the general problem setting, the mapping of memory blocks to banks,
Mmem: BL → B (E2 of I), is not known, but derived by our optimization approach (see
Sec. 6). Here, however, for the WCRT analysis we assume that it is fixed. Based on
it, we introduce the mutual delay matrix, D. D is a two-dimensional matrix (n × n),
where Di,j specifies the maximum delay that task τi can suffer when executed
concurrently with τj. Di,j is positive if τi and τj (i ≠ j) are (i) of the same criticality
level, i.e., potentially concurrently executing in an FTTS schedule, and (ii) interfering,
i.e., accessing memory blocks in at least one common bank.
Fig. 6 Memory Interference Graph I for a dual-bank memory (tasks τ1 to τ5, memory blocks bl1 to bl5, banks bank1 and bank2; edges from tasks to blocks are weighted with access counts, and block and bank nodes are annotated with sizes in bytes)

Table 1 Mutual delay matrix D for round-robin arbitration policy for Fig. 6

      τ1        τ2        τ3        τ4        τ5
τ1    0         10·Tacc   0         0         0
τ2    10·Tacc   0         10·Tacc   0         0
τ3    0         10·Tacc   0         0         0
τ4    0         0         0         0         5·Tacc
τ5    0         0         0         5·Tacc    0
For the computation of D, we need to consider the bank arbitration policy, which
in the case of MPPA®-256 and for the memory requesters that we consider is
round-robin. For the round-robin policy, each access request from a task τi can be delayed
by at most one access from any other concurrently executed task that can read/write
from/to the same memory bank which τi is targeting. That is because we assume
that each core has at most one pending request at a time (Sec. 3). In other words, τi
can be maximally delayed by a concurrently executed task τj for the duration of τj’s
accesses to a shared memory bank, provided that τj’s accesses are not more than τi’s
accesses to this bank. If τi and τj share access to more than one memory bank, the
sum of potential delays across the banks has to be considered. This yields:
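\[ D_{i,j} = \sum_{\substack{b,\, bl:\ (\tau_i, bl) \in \mathcal{E}_1 \\ (bl, b) \in \mathcal{E}_2}} \ \sum_{\substack{bl':\ (\tau_j, bl') \in \mathcal{E}_1 \\ (bl', b) \in \mathcal{E}_2}} \min\{ w((\tau_i, bl)),\ w((\tau_j, bl')) \} \cdot T_{acc}. \tag{3} \]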
As an example, the mutual delay matrix D is given in Table 1 for the memory
interference graph of Fig. 6. We assume that tasks τ1, τ2, τ3 are of criticality level 2,
whereas τ4 and τ5 are of criticality level 1. D represents the worst-case mutual delays
when τ1, τ2, τ3 are executed in parallel in the same FTTS sub-frame. Tasks τ4 and τ5
are assumed to run in parallel, too, in a different sub-frame compared to τ1-τ3.
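The computation of D from a memory interference graph (Eq. 3, restricted to task pairs of equal criticality) can be sketched as below. The edge weights and block-to-bank mapping are assumptions chosen so that the result matches Table 1; they are not necessarily the exact annotations of Fig. 6. Delays are expressed in multiples of Tacc.

```python
# E1: task -> {memory block: maximum number of accesses} (hypothetical weights).
E1 = {
    1: {"bl1": 10},
    2: {"bl2": 20, "bl3": 10},
    3: {"bl4": 10},
    4: {"bl5": 5},
    5: {"bl5": 1024},
}
# E2: memory block -> bank (hypothetical mapping Mmem).
E2 = {"bl1": "bank1", "bl2": "bank1", "bl3": "bank2", "bl4": "bank2", "bl5": "bank2"}
# Criticality levels as assumed in the text for Table 1.
CRIT = {1: 2, 2: 2, 3: 2, 4: 1, 5: 1}


def mutual_delay(i, j, e1, e2, crit):
    """D[i][j] in multiples of Tacc for round-robin arbitration (Eq. 3)."""
    if i == j or crit[i] != crit[j]:
        return 0  # only distinct tasks of equal criticality run concurrently
    total = 0
    for bl_i, w_i in e1[i].items():
        for bl_j, w_j in e1[j].items():
            if e2[bl_i] == e2[bl_j]:      # both blocks reside in the same bank
                total += min(w_i, w_j)    # each access delayed at most once
    return total


D = {i: {j: mutual_delay(i, j, E1, E2, CRIT) for j in E1} for i in E1}
```

Under these assumptions, `D[1][2]` and `D[2][3]` evaluate to 10 and `D[4][5]` to 5, in line with Table 1. Note that τ2 and τ4 share bank2 in this hypothetical mapping, yet `D[2][4]` is 0 because the two tasks have different criticality levels.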
We use matrix D to compute the worst-case length of the k-th sub-frame of an
FTTS frame f at level of assurance ℓ. According to the notation introduced in Sec. 4,
the computed length corresponds to barriers(f, ℓ)k.
First, we compute the WCRT of every task τi executed in the k-th sub-frame of
frame f as
\[ \mathit{WCRT}_i(f,\ell) = e_i^{\max}(\ell) + \mu_i^{\max}(\ell) \cdot T_{acc} + d_i(f,\ell), \tag{4} \]
namely, as the sum of its worst-case execution time, the total access time of its
memory accesses under no contention, and the worst-case delay it encounters due to
contention. This last term, di(f, ℓ), is defined as:
\[ d_i(f,\ell) = \min\Big\{ \sum_{\tau_j \in \mathit{parallel}(\tau_i, f)} D_{i,j},\ \ \mu_i^{\max}(\ell) \cdot (m-1) \cdot T_{acc} \Big\}, \tag{5} \]
where function parallel: τ × F → Sτ defines a set of tasks Sτ ⊆ τ that are
executed in parallel to a task τi (on different cores) in frame f, and m is the number
of interfering requesters with equal priorities. Note that μ_i^max(ℓ) · (m − 1) · Tacc
is a safe upper bound on the delay that a task can suffer due to contention under
round-robin arbitration. In Eq. 5, we take the minimum of the two terms to achieve
a more accurate estimation. This is useful, e.g., in cases where (some of) the tasks
executed in parallel with τi are scheduled sequentially on a single core. It is then possible
that not all of them can delay τi on the memory path. In such cases, the second bound
may be tighter than the one based on matrix D.
Example. For a demonstration of the use of the above equations, let us consider
the FTTS schedule of Fig. 5. In the 1st sub-frame of frame f1, the tasks τ1 and
τ2 with criticality level χ1 = χ2 = 2 are executed in parallel on m = 2 processing
cores. Therefore, according to the definition of function parallel, it holds that
parallel(τ1, f1) = {τ2}, and parallel(τ2, f1) = {τ1}. The accessing behavior of
tasks τ1 and τ2 is described by the memory interference graph of Fig. 6. The
depicted weights on the edges of the memory interference graph, i.e., the number of
accesses that each task performs to the respective memory blocks, are derived at
level of assurance ℓ = 2. Based on the graph of Fig. 6, tasks τ1 and τ2 can interfere
only on bank1 because τ1 accesses block bl1 and τ2 accesses block bl2, with both
blocks being mapped to the same memory bank bank1. Given this, Eq. 3 yields that
D1,2 = D2,1 = 10 · Tacc. In other words, task τ1 can delay τ2 at most 10 times when
accessing the shared memory, by issuing interfering access requests to the same bank.
The same also holds for the maximal delay that τ2 can cause to τ1. These results can
be seen in the mutual delay matrix D, which is already given in Table 1. As a next
step, by applying Eq. 5, we compute the worst-case delay that task τ1 can experience
due to memory contention in the first frame of the FTTS schedule of Fig. 5, at level
of assurance ℓ = 2:
\[ d_1(f_1, 2) = \min\{ D_{1,2},\ \mu_1^{\max}(2) \cdot (2-1) \cdot T_{acc} \} = \min\{ 10 \cdot T_{acc},\ 10 \cdot T_{acc} \}. \]
Similarly for task τ2,
\[ d_2(f_1, 2) = \min\{ D_{2,1},\ \mu_2^{\max}(2) \cdot (2-1) \cdot T_{acc} \} = \min\{ 10 \cdot T_{acc},\ 30 \cdot T_{acc} \}. \]
Hence, d1(f1, 2) = d2(f1, 2) = 10 · Tacc. Namely, the WCRT of both tasks is
augmented by 10 · Tacc as a result of the inter-core interference on the shared memory.
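The two bounds of Eq. 5 and the per-task WCRT of Eq. 4 can be written compactly. The following sketch replays the example above with Tacc normalised to 1. The function names and the execution time of 100 units are hypothetical; the access counts 10 and 30 are the ones used in the example.

```python
def contention_delay(parallel_delays, mu_max, m, t_acc):
    """Eq. 5: minimum of the matrix-based bound and the round-robin bound.

    parallel_delays: the entries D[i][j] (in time units) for all tasks tau_j
    executed in parallel with tau_i in the frame.
    """
    return min(sum(parallel_delays), mu_max * (m - 1) * t_acc)


def wcrt(e_max, mu_max, t_acc, delay):
    """Eq. 4: execution time + uncontended access time + contention delay."""
    return e_max + mu_max * t_acc + delay


T_ACC = 1                                        # normalised access latency
d1 = contention_delay([10 * T_ACC], 10, 2, T_ACC)  # tau_1: min{10, 10}
d2 = contention_delay([10 * T_ACC], 30, 2, T_ACC)  # tau_2: min{10, 30}
wcrt1 = wcrt(100, 10, T_ACC, d1)                 # hypothetical e_max = 100
```

Both delays evaluate to 10 · Tacc, as in the text. The second bound only becomes the tighter one when the summed matrix entries exceed μ_i^max(ℓ) · (m − 1) · Tacc, for instance when several parallel tasks share a single core.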
Once the WCRTs of all tasks in the k-th sub-frame of frame f are computed at all
levels of assurance ℓ ∈ {1, . . . , L}, we derive the WCRT of the task sequence on
each core p, CWCRTp,k(f, ℓ), by summing up the WCRTs of the tasks that are
mapped on p in the particular sub-frame. For an illustration of the notation used,
please refer to Fig. 7. It follows trivially that:
\[ \mathit{barriers}(f,\ell)_k = \max_{1 \le p \le m} \mathit{CWCRT}_{p,k}(f,\ell). \tag{6} \]
Fig. 7 Computation of barriers(f1, ℓ)k for ℓ ∈ {1, 2} and k ∈ {1, 2} for the FTTS schedule of Fig. 5
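The aggregation from task WCRTs to a worst-case sub-frame length (Eq. 6) can be sketched in a few lines. The core names and WCRT values are made up for illustration.

```python
def cwcrt(task_wcrts):
    """Per-core WCRT: tasks on one core run sequentially, so their WCRTs add up."""
    return sum(task_wcrts)


def sub_frame_length(per_core_task_wcrts):
    """Eq. 6: the sub-frame ends when the slowest core reaches the barrier."""
    return max((cwcrt(seq) for seq in per_core_task_wcrts.values()), default=0)


# Hypothetical WCRTs (time units) of the tasks mapped to each core in one
# sub-frame at one level of assurance.
example = {"p1": [120], "p2": [140, 30]}
length = sub_frame_length(example)
```

Here core p2 dominates with 170 time units, so this value would be the entry barriers(f, ℓ)k for the sub-frame and level in question.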
5.1.2 Memory Interference from Requesters with Higher Priority
When introducing the memory interference graph in Sec. 5.1.1, we implicitly as-
sumed that all memory blocks (tasks’ instructions, private data and communication
buffers) fit into the memory banks of the shared cluster memory so that the mapping
of blocks to banks can be decided offline and no remote access to other compute clus-
ters or the I/O sub-systems is required. However, in realistic applications, such as the
flight management system which is used for evaluation in Sec. 7, there may be tasks
which need access to databases or generally, complex data structures that do not fit
in the shared memory of a compute cluster. We assume that these data structures are
stored in the external DDR memories and parts of them (e.g., some database entries),
which can fit into the cluster memory together with the data of the remaining tasks,
are fetched whenever required.
One possible implementation of a remote data fetch protocol has been described
for the MPPA®-256 platform and similar manycore architectures in Sec. 3.2. Here,
we focus on its last step, namely the actual transfer of data packets from the I/O
subsystem to the cluster over the NoC. During the data transfer, the NoC Rx interface
tries to write to the shared cluster memory every time a new packet arrives at the
cluster. Therefore, any task in the cluster attempting to access the memory bank,
where the remote data are stored, will experience blocking due to the higher priority
of the NoC Rx interface (see Fig. 3). In the worst case, it will have to wait for the
whole remote transfer to complete before it is granted access to the memory bank.
To model the interference from the NoC Rx interface, first, we extend the mem-
ory interference graph, as shown in Fig. 8. The task with the bold outline represents
the DMA transfer from the I/O sub-system (resp. another compute cluster) to the
compute cluster. There can be arbitrarily many tasks representing DMA transfers.
The weight δ of the newly added edge from Rx to the target memory block can be
derived as the minimum between (i) the total number of fetched packets and (ii) the
maximum number of packets that can be fetched over the NoC in the time interval
of one frame. To compute δ for a particular data flow, we need information about the
flow regulation, the flow route, and the NoC configuration. We show how to use this
information to derive δ in Sec. 5.2 (Eq. 13).
Second, we update the mutual delay matrix D by setting entries Dj,j to δ · Tacc for
all tasks τj that are interfering with Rx, as shown in Table 2. This applies to all tasks
independently of their criticality level or whether they run in parallel to Rx, and it
expresses that an interfering task τj can be delayed by δ higher-priority requests any
time it executes. The update helps during memory mapping optimization to distribute
the memory block(s) to which Rx writes in disjoint banks compared to all remaining
blocks.

Fig. 8 Memory Interference Graph I for a dual-bank memory with higher-priority interference from the NoC Rx requester

Table 2 Mutual delay matrix D for round-robin arbitration with higher priority for Rx for Fig. 8

      τ1        τ2        τ3        τ4        τ5
τ1    0         10·Tacc   0         0         0
τ2    10·Tacc   δ·Tacc    10·Tacc   0         0
τ3    0         10·Tacc   δ·Tacc    0         0
τ4    0         0         0         δ·Tacc    5·Tacc
τ5    0         0         0         5·Tacc    δ·Tacc
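The second step, i.e., the diagonal update of D for tasks that share a bank with the Rx target block, can be sketched as follows. The block-to-bank mapping and the value of δ are assumptions made for illustration (δ itself is derived in Sec. 5.2). Delays are again in multiples of Tacc.

```python
# Task -> memory blocks it accesses, and block -> bank (hypothetical mapping).
TASK_BLOCKS = {1: ["bl1"], 2: ["bl2", "bl3"], 3: ["bl4"], 4: ["bl5"], 5: ["bl5"]}
BLOCK_BANK = {"bl1": "bank1", "bl2": "bank1", "bl3": "bank2",
              "bl4": "bank2", "bl5": "bank2", "bl_rx": "bank2"}

# Table 1 in multiples of Tacc (round-robin interference only).
D_RR = {
    1: {1: 0, 2: 10, 3: 0, 4: 0, 5: 0},
    2: {1: 10, 2: 0, 3: 10, 4: 0, 5: 0},
    3: {1: 0, 2: 10, 3: 0, 4: 0, 5: 0},
    4: {1: 0, 2: 0, 3: 0, 4: 0, 5: 5},
    5: {1: 0, 2: 0, 3: 0, 4: 5, 5: 0},
}


def tasks_interfering_with_rx(task_blocks, block_bank, rx_block):
    """Tasks that access at least one block in the bank the Rx block is mapped to."""
    rx_bank = block_bank[rx_block]
    return {t for t, blocks in task_blocks.items()
            if any(block_bank[bl] == rx_bank for bl in blocks)}


def add_rx_interference(d, rx_tasks, delta):
    """Return a copy of d with D[j][j] = delta for every task interfering with Rx."""
    updated = {i: dict(row) for i, row in d.items()}
    for j in rx_tasks:
        updated[j][j] = delta
    return updated


DELTA = 8  # hypothetical number of higher-priority Rx accesses
rx_tasks = tasks_interfering_with_rx(TASK_BLOCKS, BLOCK_BANK, "bl_rx")
D_RX = add_rx_interference(D_RR, rx_tasks, DELTA)
```

With the Rx block placed in bank2, the tasks τ2 to τ5 receive the diagonal entry δ, which reproduces the pattern of Table 2, while τ1 is unaffected.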
Third, we update the per-core WCRTs computed in Sec. 5.1.1, CWCRTp,k(f, ℓ),
for all p ∈ P. Recall that CWCRTp,k(f, ℓ) in a given FTTS schedule denotes the
sum of WCRTs of the tasks that are mapped on core p in the k-th sub-frame of frame
f, given the task execution profiles at level of assurance ℓ. In the following, we
consider the communication protocol between a compute cluster and an I/O subsystem,
as described in Sec. 3.2. For every task τi ∈ τ, which uses remote data from the DDR,
and its preceding task τinit,i, which initiates the transfer from the I/O sub-system,
let fprec and fsucc be the FTTS frames where τinit,i and τi are scheduled, respectively.
Since tasks τinit,i and τi have the same criticality level, χi, it follows that they
are scheduled in the k-th sub-frame of frames fprec and fsucc, respectively, where
k = L − χi + 1. For instance, in a dual-criticality system (L = 2), if τi and τinit,i
have criticality level χi = 2, they will be scheduled in the 1st sub-frame of their
corresponding FTTS frames. For the update of the per-core WCRTs CWCRTp,k(f, ℓ),
we distinguish two cases, depending on whether frames fprec and fsucc are equal or
not. Particularly, for every core p ∈ P:
If fprec = fsucc and in the k-th sub-frame of fprec there are tasks on p scheduled
between τinit,i and τi, which are interfering with Rx, then CWCRTp,k(fprec, ℓ)
is increased by δ · Tacc, for all ℓ ∈ {1, . . . , L}. In other words, if there is at least one
task executing between τinit,i and τi, which can access a common memory bank with
Rx, this (these) task(s) can be delayed by the higher-priority NoC Rx data transfer
by up to δ · Tacc. This is accounted for by increasing CWCRTp,k(fprec, ℓ)
accordingly.
If fprec ≠ fsucc, then for each sub-frame k′ from the k-th sub-frame of fprec up to
and including the k-th sub-frame of fsucc: if there are tasks on p other than τinit,i,
τi, which are interfering with Rx, then CWCRTp,k′(f′, ℓ) for the enclosing FTTS
frame f′: fprec ≤ f′ ≤ fsucc is increased by δ · Tacc, for all ℓ ∈ {1, . . . , L}.
The intuition is similar to the previous case. If any sub-frame between the one
where τinit,i is scheduled and the one where τi is scheduled includes tasks that
are accessing a common bank with Rx, then this (these) task(s) can be delayed by the
higher-priority NoC Rx data transfer by up to δ · Tacc. Note, however, that in every
frame f′: fprec ≤ f′ ≤ fsucc, such an increase to CWCRTp,k′(f′, ℓ) happens
only once, for the first sub-frame k′ that includes interfering tasks with Rx. This is
done for tighter analysis¹.
Example. As an illustration of the per-core WCRT updates of the third step above,
consider again the FTTS schedule of Fig. 5. Suppose that $\tau_4$ initiates a remote
data transfer for $\tau_5$, which reads the fetched data. Both tasks have criticality level
$\chi_4 = 1$. For the first instance of the dependent tasks, $f_{prec} = f_1$ (frame where $\tau_4$
is scheduled), $f_{succ} = f_2$ (frame where $\tau_5$ is scheduled), and $k = 2$ (corresponding
sub-frame within the above frames). Since $f_{prec} \neq f_{succ}$, for the per-core WCRT
updates we have to consider all FTTS sub-frames starting from the 2nd sub-frame
of $f_1$ up to and including the 2nd sub-frame of $f_2$. In this case, there are 3 such
sub-frames to be considered. We assume that the memory accessing behavior of
tasks $\tau_1$ to $\tau_5$ and the DMA transfer Rx are modelled by the memory interference
graph of Fig. 8. Tasks $\tau_6$ and $\tau_7$ of the FTTS schedule are not modelled in this graph
because they perform no accesses to the shared memory. Given these assumptions,
it follows that $CWCRT_{p_1,2}(f_1, \ell)$ and $CWCRT_{p_2,2}(f_1, \ell)$ for the 2nd sub-frame
of $f_1$ and $\ell \in \{1, 2\}$ remain unchanged. This is because on core $p_1$ no task other
than $\tau_4$ is scheduled, so the DMA transfer cannot cause any delay in this sub-frame
on this core. Also, on core $p_2$, although there is a scheduled task, $\tau_6$, it does not
interfere with Rx. So, again, the DMA transfer cannot delay its execution. In
contrast, $CWCRT_{p_1,1}(f_2, \ell)$ and $CWCRT_{p_2,1}(f_2, \ell)$ for the 1st sub-frame of
$f_2$ and $\ell \in \{1, 2\}$ are increased by $\delta \cdot T_{acc}$. This is because the scheduled tasks
$\tau_3$ (on core $p_1$) and $\tau_2$ (on core $p_2$) interfere with Rx (they access the same
memory bank, bank2). Finally, $CWCRT_{p_1,2}(f_2, \ell)$ and $CWCRT_{p_2,2}(f_2, \ell)$ for the 2nd
sub-frame of $f_2$ remain unchanged, since no task other than $\tau_5$ is scheduled on $p_1$
and the only task on $p_2$, $\tau_6$, does not interfere with Rx.
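The update rule of this step can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work. The schedule layout, the function name and the numeric values of $\delta$ and $T_{acc}$ are assumptions made for the example. Two simplifications are also assumed: tasks "between" $\tau_{init,i}$ and $\tau_i$ are approximated as all tasks other than these two, and the "first sub-frame with interfering tasks" is determined per frame across all cores. Only the facts stated in the example above are encoded ($\tau_2$ and $\tau_3$ interfere with Rx, $\tau_6$ does not).

```python
def rx_wcrt_increases(schedule, interferes_with_rx, excluded,
                      f_prec, f_succ, k, num_subframes, delta, t_acc):
    """Return {(frame, subframe, core): increase} for one NoC Rx transfer.

    schedule maps (frame, subframe) -> {core: [task, ...]} (hypothetical layout).
    excluded holds tau_init,i and tau_i, which do not count as interfered tasks.
    """
    increases = {}
    for f in range(f_prec, f_succ + 1):
        # Sub-frame range: from the k-th sub-frame of f_prec up to and
        # including the k-th sub-frame of f_succ.
        first = k if f == f_prec else 1
        last = k if f == f_succ else num_subframes
        for sub in range(first, last + 1):
            hit_cores = [
                core
                for core, tasks in schedule.get((f, sub), {}).items()
                if any(t in interferes_with_rx and t not in excluded for t in tasks)
            ]
            if hit_cores:
                for core in hit_cores:
                    increases[(f, sub, core)] = delta * t_acc
                break  # at most one increase per frame (tighter analysis)
    return increases


# Sub-frames named in the example (frame 1 = f1, frame 2 = f2).
schedule = {
    (1, 2): {"p1": ["t4"], "p2": ["t6"]},
    (2, 1): {"p1": ["t3"], "p2": ["t2"]},
    (2, 2): {"p1": ["t5"], "p2": ["t6"]},
}
result = rx_wcrt_increases(schedule, {"t2", "t3"}, {"t4", "t5"},
                           f_prec=1, f_succ=2, k=2, num_subframes=2,
                           delta=4, t_acc=1)
```

With these inputs, only the two per-core WCRTs of the 1st sub-frame of $f_2$ receive an increase, in line with the example.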
Finally, after the updates discussed in the above three steps are completed,
the computation of function barriers follows from Eq. 6 for the updated
$CWCRT_{p,k}(f, \ell)$ values.
5.1.3 Tighter Response Time Analysis
In Sec. 5.1.1 and 5.1.2, we derived closed-form expressions for the worst-case
sub-frame lengths of the FTTS schedule (function barriers). Although several sources
of pessimism were avoided, there may still be cases where the computed bounds are
not tight. For example, if the memory accesses of the tasks follow certain patterns
(dedicated access phases, non-overlapping in time) such that two tasks executed in
parallel cannot interfere on the memory path, the given bounds do not reflect
this knowledge. If a more accurate response time analysis is required, the method
of [19], which uses a model of the system with timed automata [4] and model checking
to derive the tasks' WCRT, is suggested. The system model in [19] specifies
shared-memory multicores; however, it can be extended to also model the incoming
traffic from the NoC, as computed in Sec. 5.2 (Eq. 11).

Footnote 1: Given the definition of $\delta$, Rx cannot perform more than $\delta$ high-priority memory accesses within one
single frame. Therefore, it is too pessimistic to increase $CWCRT_{p,k'}(f', \ell)$ for several sub-frames $k'$ of
the same frame $f'$. This would lead to a potential increase of $barriers(f', \ell)_L$ by multiples of $\delta \cdot T_{acc}$.
Since a model checker explores exhaustively all feasible resource interference
scenarios, the above method is highly complex. Therefore, during design optimiza-
tion (Sec. 6), where we need to compute function barriers for often thousands of
potential FTTS schedules, it is preferred to use the WCRT bounds as derived earlier.
We can then apply the method of [19] to the optimized FTTS solution to refine the
computation of barriers. Also, if no admissible FTTS schedule can be found during
optimization, the same method can be applied to the best encountered solutions, as
the more accurate computation of barriers may reveal admissible schedules.
5.2 Bounding Delay for Data Transfers over NoC
This section shows how to characterize the incoming NoC traffic at a shared cluster
memory. Based on the incoming traffic model, we compute upper bounds on (i) the
delay for transferring a given amount of packets over a NoC and (ii) the number of
accesses that the NoC Rx interface performs to the shared cluster memory in a given
time interval. The first result (Eq. 12) is used for defining the minimum distance
constraint between a task initiating a remote data transfer, τinit,i, and a task using the
fetched data, τi, in the remote fetch protocol of Sec. 3.2. The second result (Eq. 13)
is used as a parameter of the memory interference graph, representing the maximal
interference from the NoC Rx interface, as discussed in Sec. 5.1.2.
We consider an explicitly routed NoC with wormhole switching and assume that
each traffic stream uses a dedicated predetermined virtual channel throughout the
NoC which is in line with previous approaches, see [51]. Each NoC node is a router
and also a flow source / sink. Routers contain only FIFO queues, with one set of
queues per outgoing link, with round-robin arbitration. Routers are work-conserving,
i.e., no idling if data is ready to be transmitted.
All flows are $(\sigma, \rho)$-regulated at the sources [30]. The parameters of the
regulators are selected such that performance guarantees are provided for all flows and no
FIFO queue in a router can overflow; therefore, stalls due to backpressure flow control
are not present. However, we assume that stalls due to switch contention are present,
i.e., when packets from different input ports or virtual channels compete for the same
output port. These assumptions simplify the presented NoC analysis. If necessary,
backpressure stalls can be integrated by using existing results, see [13,
42,51].
Since we consider hard real-time guarantees on the NoC, we have to show that
each network packet in a given flow is delivered to its destination within
a fixed deadline. For the analysis, we use the theory of Network and Real-time
calculus [14,27,41]. It is a theory of deterministic queuing systems for communication
networks and scheduling of real-time systems. Network calculus has been applied in
recent works and its effectiveness has been validated for the analysis of NoC-based
systems, see [37,38,51].

Fig. 9 Modeling the packet flow $i$ through a $(\sigma, \rho)$-regulator and a sequence of routers $j$.
[Figure labels: input arrival curve $\alpha_i$; regulator output $\alpha_i \otimes \alpha^{(\sigma,\rho)}$; NoC service $\beta^{NOC}_i$; delays $del^{REG}$, $del^{NOC}$ and WCDFT.]
The theory analyzes the flow of packet streams through a network of processing
and communication resources in order to compute worst-case backlogs, end-to-end
delays, and throughput. The overall modelling approach is shown in Fig. 9.
A General Packet Stream Model. Packet streams are abstracted by the function $\alpha(\Delta)$,
i.e., an upper arrival curve which provides an upper bound on the number of packets
in any time interval of length $\Delta$, where $\alpha(\Delta) = 0$ for all $\Delta \le 0$. Arrival curves
substantially generalize conventional stream models such as sporadic, periodic or
periodic with jitter. Note that a packet here is defined as a fixed-length basic unit of
network traffic. Variable-length packets can be viewed as a sequence of fixed-length
packets.
A General Resource Model. The availability of processing or communication
resources is described by the function $\beta(\Delta)$, a lower service curve which provides
a lower bound on the available service in any time interval of length $\Delta$, where
$\beta(\Delta) = 0$ for all $\Delta \le 0$. The service is expressed in an appropriate workload unit
compatible with that of the arrival curve, e.g., packets.
Packet stalls at routers can happen when packets from different input ports or
virtual channels compete for the same output port. We assume a round-robin arbiter
for every router $j$ where each flow $i$ is given a fixed slot of size $s_i$, e.g., proportional
to the maximum packet size for this flow. The packet size is defined as the number of
(fixed-length) packets of a flow. The accumulated sizes of all flows going through a
router $j$ can be expressed as $s^j = \sum_{i \text{ flows through } j} s_i$. If the lower service curve that the
router can provide is denoted as $\beta^j$, then the lower service curve under round-robin
arbitration for a flow with slot size $s_i$ is [51]:

$$\beta^j_i(\Delta) = \frac{s_i}{s^j} \, \beta^j\big(\Delta - (s^j - s_i)\big), \qquad (7)$$

which expresses the fact that in the worst case a packet may always have to wait for
$s^j - s_i$ time units before its slot becomes available.
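As an illustration of Eq. 7, the following Python sketch applies the round-robin rule to a router whose own service curve is of rate-latency type. The rate-latency form of $\beta^j$, the class name `RateLatency` and the numeric values are assumptions made for the example. Under this assumption, the per-flow curve is again a rate-latency curve, with the rate scaled by $s_i / s^j$ and the latency increased by $s^j - s_i$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLatency:
    """Rate-latency service curve: beta(d) = rate * max(0, d - latency)."""
    rate: float     # packets per cycle
    latency: float  # cycles

    def __call__(self, delta: float) -> float:
        return self.rate * max(0.0, delta - self.latency)


def round_robin_share(router: RateLatency, s_i: float, s_total: float) -> RateLatency:
    """Per-flow service curve of Eq. 7 for a rate-latency router.

    s_i is the slot size of flow i, s_total the sum of the slot sizes of
    all flows through the router (s^j in the text).
    """
    if not 0 < s_i <= s_total:
        raise ValueError("slot size must be positive and at most the total")
    return RateLatency(rate=router.rate * s_i / s_total,
                       latency=router.latency + (s_total - s_i))


# Hypothetical router: 1 packet/cycle, 2 cycles latency, three flows of slot size 2.
router = RateLatency(rate=1.0, latency=2.0)
share = round_robin_share(router, s_i=2.0, s_total=6.0)
```

Here `share` has rate 1/3 packets/cycle and a latency of 6 cycles, i.e., the original latency plus the worst-case wait of $s^j - s_i = 4$ time units.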
A Resource Model for a Network. When a packet flow traverses a system of multiple
interconnected components, one needs to consider the service curve provided by the
system as a whole, i.e., the system service curve is a concatenation of the individual
service curves [27,42]. For example, the concatenation of two routers 1 and 2 with
lower service curves $\beta^1_i$ and $\beta^2_i$ for a flow $i$ can be obtained as:

$$\beta^{1,2}_i(\Delta) = (\beta^1_i \otimes \beta^2_i)(\Delta),$$

where $\otimes$ is the min-plus algebra convolution operator that is defined as:

$$(f \otimes g)(\Delta) = \inf_{0 \le \lambda \le \Delta} \{ f(\Delta - \lambda) + g(\lambda) \}.$$

As a result, the cumulative service $\beta^{NOC}_i$ for a flow $i$ is the convolution of all
individual router service curves on the path of the flow through the network:

$$\beta^{NOC}_i(\Delta) = \Big( \bigotimes_{j \,:\, i \text{ flows through } j} \beta^j_i \Big)(\Delta). \qquad (8)$$
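For rate-latency curves, the convolution of Eq. 8 has a well-known closed form in network calculus: the rates combine by minimum and the latencies add up. The following Python sketch states this closed form and compares it against a brute-force evaluation of the min-plus convolution on a grid. The function names, the grid resolution and the numeric values are assumptions made for the example.

```python
def rate_latency(rate, latency):
    """Return the curve beta(d) = rate * max(0, d - latency)."""
    return lambda d: rate * max(0.0, d - latency)


def min_plus_conv(f, g, delta, steps=1000):
    """Grid approximation of (f (x) g)(delta) = inf_{0<=lam<=delta} f(delta-lam) + g(lam)."""
    return min(f(delta - delta * n / steps) + g(delta * n / steps)
               for n in range(steps + 1))


def concat_rate_latency(hops):
    """Closed form of Eq. 8 for rate-latency hops given as (rate, latency) pairs."""
    return min(r for r, _ in hops), sum(t for _, t in hops)


# Two hypothetical routers on the path of a flow.
hops = [(2.0, 3.0), (1.0, 4.0)]
rate, latency = concat_rate_latency(hops)
closed_form = rate_latency(rate, latency)(10.0)
brute_force = min_plus_conv(rate_latency(*hops[0]), rate_latency(*hops[1]), 10.0)
```

Both evaluations give the same service at $\Delta = 10$, which corresponds to a cumulative curve with rate 1 and latency 7.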
Flow Regulator. A flow regulator with a $(\sigma, \rho)$ shaping curve delays packets of an
input flow such that the output flow has the upper arrival curve $\alpha^{(\sigma,\rho)}(\Delta) = \rho \cdot \Delta + \sigma$
for all $\Delta > 0$, independent of the timing characteristics of the input flow, and it outputs
packets as soon as possible without violating the upper bound $\alpha^{(\sigma,\rho)}$.
Delay Bounds. A packet stream constrained by an upper arrival curve $\alpha_i$ is first
regulated by a $(\sigma, \rho)$ regulator and then traverses a network that offers a cumulative lower
service curve $\beta^{NOC}_i$ of the routers on the packet path. As is well known from network
and real-time calculus, the maximum packet delay is related to the maximal
horizontal distance between functions (see Fig. 10), which is defined as:

$$del(f, g) = \sup_{\lambda \ge 0} \big\{ \inf \{ \tau \ge 0 : f(\lambda) \le g(\lambda + \tau) \} \big\}.$$
Now, the worst-case delay at the regulator, $del^{REG}$, experienced by a packet from
a flow $i$ constrained by an arrival curve $\alpha_i$ and regulated by a $(\sigma, \rho)$ regulator can be
computed as follows [45]:

$$del^{REG} = del(\alpha_i, \alpha^{(\sigma,\rho)}). \qquad (9)$$

The output of the regulator is constrained by $(\alpha_i \otimes \alpha^{(\sigma,\rho)})(\Delta)$ and therefore, the
worst-case packet delay $del^{NOC}$ for flow $i$ within the NOC that has a cumulative
lower service $\beta^{NOC}_i$ can be determined as:

$$del^{NOC} = del(\alpha_i \otimes \alpha^{(\sigma,\rho)}, \beta^{NOC}_i). \qquad (10)$$
An example of worst-case delay computation is shown in Fig. 10. We consider a
single router which serves a single flow constrained by an upper arrival curve $\alpha^{(\sigma,\rho)}$.
The NOC consists of a single router that provides to the flow a lower rate-latency
service curve $\beta_{r,T}(\Delta) = r(\Delta - T)$ for $\Delta > T$, and $\beta_{r,T}(\Delta) = 0$ otherwise. The
arrival curve implies that the source can send at most $\sigma$ packets at once, but not more
than $\rho$ packets/cycle in the long run, while the service curve implies a pipeline delay
of $T$ for a packet to traverse the router, and an average service rate of $r$ packets/cycle.
As shown in Fig. 10, the worst-case delay bound corresponds to the maximum
horizontal distance between the upper output arrival curve of the flow regulator and the
lower service curve of the NOC.

Fig. 10 Delay bound defined as the maximum horizontal distance, illustrated for an arrival curve of a $(\sigma, \rho)$
regulated flow and a single router providing a rate-latency service curve $\beta^l_{r,T}$.
[Left panel: generic functions $f$ and $g$ over $\Delta$ with $del(f, g)$ marked. Right panel: $\alpha^{(\sigma,\rho)}$ with burst $\sigma$ and slope $\rho$, $\beta_{r,T}$ with latency $T$ and slope $r$, and $del^{NOC}$ marked.]
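For the setting of Fig. 10, the horizontal distance has a standard closed form. For any $\lambda > 0$, the condition $\rho \lambda + \sigma \le r(\lambda + \tau - T)$ gives $\tau \ge T + (\sigma + \rho \lambda)/r - \lambda$, which is non-increasing in $\lambda$ when $\rho \le r$, so the supremum is approached for $\lambda \to 0$ and equals $T + \sigma / r$. The Python sketch below encodes this result. The function name and the numeric values are assumptions made for the example, and the formula holds only under the stated stability condition $\rho \le r$.

```python
def delay_bound(sigma, rho, rate, latency):
    """Delay bound of a (sigma, rho)-constrained flow served by a
    rate-latency curve (rate r, latency T): T + sigma / r, for rho <= r."""
    if rho > rate:
        raise ValueError("unbounded delay: long-run arrival rate exceeds service rate")
    return latency + sigma / rate


# Hypothetical values: burst of 4 packets, 0.25 packets/cycle,
# router with 1 packet/cycle and a pipeline delay of 6 cycles.
d = delay_bound(sigma=4, rho=0.25, rate=1.0, latency=6)
```

With these values the bound is 10 cycles: 6 cycles of pipeline delay plus 4 cycles to serve the initial burst.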
Output Flow Bounds. When a packet stream constrained by arrival curve $\alpha_i$ is
regulated by a $(\sigma, \rho)$ regulator and traverses a network that offers a cumulative lower
service curve $\beta^{NOC}_i$, the processed output flow is bounded by $\alpha'_i$, computed as
follows [46]:

$$\alpha'_i(\Delta) = \big( (\alpha_i \otimes \alpha^{(\sigma,\rho)}) \oslash \beta^{NOC}_i \big)(\Delta), \qquad (11)$$

where $\oslash$ is the min-plus algebra deconvolution operator that is defined as:

$$(f \oslash g)(\Delta) = \sup_{\lambda \ge 0} \{ f(\Delta + \lambda) - g(\lambda) \}.$$
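For a flow bounded by $\alpha^{(\sigma,\rho)}$ and a rate-latency service curve with rate $r \ge \rho$ and latency $T$, the deconvolution is known to yield again a $(\sigma, \rho)$-type bound with the burst increased to $\sigma + \rho T$. The Python sketch below compares this closed form with a brute-force evaluation of the deconvolution on a grid. Function names, the search horizon `lam_max` and the numeric values are assumptions made for the example.

```python
def token_bucket(sigma, rho):
    """Arrival curve alpha(d) = rho * d + sigma for d > 0, and 0 otherwise."""
    return lambda d: rho * d + sigma if d > 0 else 0.0


def rate_latency(rate, latency):
    """Service curve beta(d) = rate * max(0, d - latency)."""
    return lambda d: rate * max(0.0, d - latency)


def min_plus_deconv(f, g, delta, lam_max, steps=1000):
    """Grid approximation of (f (/) g)(delta) = sup_{lam>=0} f(delta+lam) - g(lam)."""
    return max(f(delta + lam_max * n / steps) - g(lam_max * n / steps)
               for n in range(steps + 1))


def output_bound(sigma, rho, latency):
    """Closed form for rho <= r: output is (sigma + rho * T, rho)-constrained."""
    return sigma + rho * latency, rho


sigma_out, rho_out = output_bound(sigma=4, rho=0.25, latency=6)
closed_form = token_bucket(sigma_out, rho_out)(10.0)
brute_force = min_plus_deconv(token_bucket(4, 0.25), rate_latency(1.0, 6), 10.0,
                              lam_max=20.0)
```

Both evaluations agree at $\Delta = 10$. The supremum is attained at $\lambda = T$, i.e., when the service curve is about to leave zero.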
Data Transfer Delay. Finally, we compute an upper bound on the total delay for
transferring a buffer of data using multiple packets, e.g., the maximum delay for
transferring 4 KB of data from external memory over a NoC. The availability of data
can be modeled as an upper arrival curve which has the form of a step function:
$\alpha_i(\Delta) = B$ for $\Delta > 0$, where $B$ is the total amount of data measured as the number
of packets. Then an upper bound on the total delay can be computed as the sum of the
delays computed with Eq. 9 and Eq. 10. The first one gives a bound on
the delay experienced by all of the data at the regulator, i.e., the bound on the delay
for the regulator to transmit all of the data, while the second one gives a bound on the
delay for transferring the last packet of the data through the NoC.
In other words, a bound on the total delay for transferring a buffer of $B$ packets
regulated by a $(\sigma, \rho)$ regulator over a network NOC providing a cumulative lower
service curve of $\beta^{NOC}_i$ can be computed as:

$$WCDFT = del^{REG} + del^{NOC}, \qquad (12)$$

where $\alpha_i(\Delta) = B$ for $\Delta > 0$, $del^{REG}$ and $del^{NOC}$ are defined in Eq. 9 and Eq. 10,
respectively, and WCDFT denotes the worst-case data fetch time, as originally defined
in the remote fetch protocol of Sec. 3.2.
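To make Eq. 12 concrete, the sketch below evaluates it in closed form for one special case: a step arrival curve of $B$ packets, a $(\sigma, \rho)$ regulator, and a cumulative NoC service curve of rate-latency type with rate $r \ge \rho$ and latency $T$. Under these assumptions, which are ours and not part of the general model, Eq. 9 reduces to $del^{REG} = \max(0, (B - \sigma)/\rho)$ and Eq. 10 to $del^{NOC} = T + \min(B, \sigma)/r$, since the regulator output is bounded by $\min(B, \rho \Delta + \sigma)$. The function name and the numeric values are hypothetical.

```python
def wcdft(buffer_packets, sigma, rho, rate, latency):
    """Worst-case data fetch time (Eq. 12) for a step arrival curve of
    buffer_packets packets, assuming a rate-latency NoC service curve."""
    if not 0 < rho <= rate:
        raise ValueError("closed forms below assume 0 < rho <= rate")
    # Eq. 9: time for the regulator to release all B packets.
    del_reg = max(0.0, (buffer_packets - sigma) / rho)
    # Eq. 10: delay of a packet through the NoC; the largest burst that can
    # enter the NoC at once is min(B, sigma).
    del_noc = latency + min(buffer_packets, sigma) / rate
    return del_reg + del_noc


# Hypothetical transfer: 64 packets, sigma = 4, rho = 0.25 packets/cycle,
# NoC with 1 packet/cycle and 6 cycles latency.
total = wcdft(buffer_packets=64, sigma=4, rho=0.25, rate=1.0, latency=6)
```

In this example the regulator dominates the result: 240 cycles are spent releasing the buffer and 10 cycles traversing the NoC.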
Note that the above model is valid under the assumption that, during the data
transfer, the regulator does not stall because no packets are available for
transmission. For example, in the case of an external memory data fetch, the memory
controller should be able to put packets into the buffer of the regulator fast enough.
Moreover, the regulator should not experience stalls due to backpressure.
Bound on NoC Rx Memory Accesses within an FTTS Frame. For representing
interference from the NoC Rx interface within a compute cluster, in Sec. 5.1.2 we
introduced the value $\delta$, which bounds the number of memory accesses that the NoC
Rx can perform within a time frame $f$. For the time interval of the frame $f$, denoted
as $L_f$, $\delta$ cannot be greater than $\alpha'(L_f)$ (Eq. 11), and also not greater than the total
number of packets in the transmitted buffer, $B$. Therefore,

$$\delta = \min \{ \alpha'(L_f), B \}. \qquad (13)$$
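Eq. 13 can be evaluated as in the following Python sketch. It assumes, for illustration only, a rate-latency NoC service curve with rate $r \ge \rho$ and latency $T$, in which case $\alpha'(L_f)$ can be bounded by $\sigma + \rho (L_f + T)$. Since $\delta$ counts accesses, the bound is rounded down to an integer. The function name and the numeric values are hypothetical.

```python
import math


def max_rx_accesses(frame_length, buffer_packets, sigma, rho, latency):
    """delta of Eq. 13: accesses of the NoC Rx interface within one frame."""
    # Output bound alpha'(L_f) of a (sigma, rho) flow after a rate-latency
    # NoC with latency T (valid for rho <= service rate).
    alpha_out = sigma + rho * (frame_length + latency)
    return min(math.floor(alpha_out), buffer_packets)


# Short frame: the traffic bound is the limiting factor.
delta_short = max_rx_accesses(frame_length=100, buffer_packets=64,
                              sigma=4, rho=0.25, latency=6)
# Long frame: the buffer size B is the limiting factor.
delta_long = max_rx_accesses(frame_length=1000, buffer_packets=64,
                             sigma=4, rho=0.25, latency=6)
```

The two calls show both branches of the minimum: for the short frame, $\delta$ follows from the traffic bound, and for the long frame it is capped by $B$.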
The D-NoC in the Kalray Platform. The models and methods described above are
compatible with the manycore MPPA®-256 platform (Sec. 3.2). In the following,
we give a short summary of flow regulation for the MPPA® NoC, which is based
on $(\sigma, \rho)$ regulators. Precisely, in the MPPA®-256 processor, each connection is
regulated at the source node by a packet shaper and a traffic limiter in tandem. This
regulator can be configured via two parameters, both defined in units of 32-bit flits:
(i) a window length ($T_w$), which is set globally for the NoC node, and (ii) the
bandwidth quota ($N_{max}$), which is set separately for each regulator. At each cycle, the
regulator compares the length of a packet scheduled for injection plus the number of
flits sent within the previous $T_w$ cycles to $N_{max}$. If the sum is not greater than
$N_{max}$, the packet is injected at the rate of one flit per cycle.
The $(\sigma, \rho)$ parameters can be set at the source node through $T_w$ and $N_{max}$ (all
measured in units of 32-bit flits, including header flits). We link these parameters
with the $(\sigma, \rho)$ model by observing that $\rho = N_{max}/T_w$. This corresponds to the fact
that no regulator may let more than $N_{max}$ flits pass over any duration $T_w$. On the
other hand, the regulator is allowed to emit continuously until having sent $N_{max}$ flits
within exactly $N_{max}$ cycles. This defines a point on the linear function $\rho \cdot \Delta + \sigma$, and
by extrapolating this line back to time $t = 0$, the value of the function at that point
(corresponding to $\sigma$) is found to be $\sigma = N_{max}(1 - N_{max}/T_w)$. Note that $\sigma \ge 0$.
For a more detailed presentation of the MPPA® NoC flow regulation, the interested
readers are referred to [16].
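The mapping from the regulator configuration to the $(\sigma, \rho)$ model can be written down directly. The Python sketch below is an illustration only: the function name and the numeric values are assumptions, and it is not part of any Kalray API. It also checks the defining property stated above, namely that the line $\rho \cdot \Delta + \sigma$ passes through the point ($N_{max}$ cycles, $N_{max}$ flits).

```python
def sigma_rho_from_quota(n_max, t_w):
    """Derive (sigma, rho) from the bandwidth quota N_max and the window
    length T_w, both in units of 32-bit flits."""
    if not 0 < n_max <= t_w:
        raise ValueError("expected 0 < N_max <= T_w")
    rho = n_max / t_w                    # long-run rate in flits per cycle
    sigma = n_max * (1 - n_max / t_w)    # burst term, non-negative
    return sigma, rho


# Hypothetical configuration: quota of 10 flits per window of 40 cycles.
sigma, rho = sigma_rho_from_quota(n_max=10, t_w=40)
flits_after_burst = rho * 10 + sigma  # value of the line at N_max cycles
```

For this configuration, $\rho = 0.25$ flits/cycle and $\sigma = 7.5$ flits, and the line reaches exactly $N_{max} = 10$ flits after 10 cycles.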
The MPPA® NoC routers multiplex flows originating from different directions.
Each originating direction has its own FIFO queue at the output interface, so flows
interfere on a node only if they share a link to the next node. This interface performs
a round-robin arbitration at the packet granularity between the FIFOs that contain
data to send on a link. The NoC routers have thus been designed for simplicity while
inducing a minimal amount of perturbations on (σ, ρ)flows. An additional benefit of
this router design is that eliminating backpressure from every single queue through
(σ, ρ)flow regulation at the source effectively prevents deadlocks. It is thus not nec-
essary to resort to specific routing techniques such as turn-models. Selecting (σ, ρ)
parameters for all flows is treated as an optimization step during design time, which
however is outside of the scope of this article.
Fig. 11 Design Flow
[Figure blocks: inputs are the Mixed-Criticality Task Set (Sec. 3.1), the Dependency Graph Dep (Sec. 3.1), Cores and Memory Latency (Sec. 3.2), and the NoC Flows with their routes and regulation parameters. The NoC Analysis (Sec. 5.2) covers the computation of WCDFT, WCNT (Eq. 12), which provides the Minimum Distance Constraint for Dependent Tasks, and the computation of maximum NoC Rx Accesses (Eq. 13). The Memory Interference Graph (Sec. 5.1) contains NoC accesses, memory blocks, memory banks and task accesses to banks. Design Optimization comprises Task Mapping to Cores (Sec. 6.1), Data Mapping to Banks (Sec. 6.2), and Task and Data Mapping (Sec. 6.3), and uses the Response Time Analysis (Sec. 5.1, computation of barriers, Eq. 6). Outputs are the Task Mapping (FTTS Schedule) and the Data Mapping.]
6 Design Optimization
Sec. 4 presented the runtime behavior of the FTTS scheduler and Sec. 5 the response
time analysis for tasks that are scheduled under FTTS and can experience blocking
delays on the shared memory path of a cluster, either by other tasks executing con-
currently in the cluster or by the incoming traffic from the NoC. For both parts, we
assumed a given FTTS schedule, with known task mapping to cores and data map-
ping to memory banks. In this section, we discuss the problem of actually finding an
FTTS schedule while optimizing resource utilization in our system.
The problem can be formulated as follows. Given: (i) a periodic mixed-criticality
task set $\tau$, (ii) a cluster consisting of processing cores $P$ with access to a banked
memory, (iii) the memory interference graph $I$ with undefined edge set $E_2$, (iv) the
memory access latency $T_{acc}$; Determine: the mapping $M_\tau : \tau \to P$ of tasks to
processing cores and the mapping $M_{mem} : BL \to B$ ($E_2$ of $I$) of memory blocks to
banks, such that:
- all tasks meet their mixed-criticality real-time requirements at all levels of
  assurance,
- the workload is balanced among the cores,
- the minimum distance constraints of the dependency graph Dep hold, and
- the memory bank capacities are not surpassed.
The mapping $M_\tau$ defines both the spatial partitioning of tasks among the cores
in $P$ as well as the timing partitioning into frames and the execution order on each
individual core. These three aspects (spatial partitioning, timing partitioning, relative
execution order) fully determine an FTTS schedule for task set $\tau$.
For each of the two considered optimization problems ($M_\tau$ in Sec. 6.1 and
$M_{mem}$ in Sec. 6.2), we assume an existing solution to the other one. Finally, we
show how to solve both optimization problems in an integrated manner (Sec. 6.3).
To facilitate reading, please refer to Fig. 11, which depicts the inputs and outputs of
the optimization procedure, as well as the flow of analyses (NoC analysis, response
time analysis) and information (task set, platform model, etc.) which enable us to de-
termine some of the inputs, e.g., the memory interference graph, and to evaluate the
visited solutions during optimization.
6.1 Task Mapping $M_\tau$ Optimization
The problem of optimal task mapping on multiple cores is known to be NP-hard,
resembling the combinatorial bin-packing problem. A possible approach to optimizing
$M_\tau$ for FTTS was suggested in [20] and is summarized below.
The approach implements a heuristic method based on simulated annealing [26].
Initially, it determines a random task mapping solution, i.e., an FTTS schedule for the
given task set $\tau$. Specifically, it selects the FTTS cycle as the hyper-period $H$ of tasks
in $\tau$ and also the FTTS frame lengths depending on the task periods (the greatest
common divisor of the periods is used unless otherwise specified by the system
designer). For every task $\tau_i \in \tau$, it selects arbitrarily a core on which the task will be
mapped. Then, it computes the number of jobs that are released by task $\tau_i$ within a
hyper-period $H$ and the range of FTTS frames in which each job can be scheduled,
such that it is executed between its release time and absolute deadline. For every job
of $\tau_i$, it selects arbitrarily an FTTS frame from the allowed range. This procedure is
repeated for all tasks in $\tau$. The constraints that must be respected during the
generation of the FTTS schedule are:
- All jobs of the same task are scheduled on the same core.
- For every dependency $\tau_i \to \tau_j$ in the dependency graph Dep, the jobs of the two
  tasks are scheduled on the same core, with a job of $\tau_j$ being scheduled in the same
  or a later frame than the corresponding job of $\tau_i$ within their common period. If
  they are scheduled in the same frame, the job of $\tau_j$ must succeed that of $\tau_i$. In
  any case, the sum of the best-case execution times of the jobs that are scheduled
  in between $\tau_i$ and $\tau_j$ and the lengths of the intermediate frame(s) (if any exist
  between the two jobs) must be no lower than their minimum distance constraint,
  i.e., the weight of the corresponding edge in Dep. Note that this is an extension to
  the original method of [20].
Note that this is also the procedure that a system designer would follow to generate a
random FTTS schedule for a given task set $\tau$. It provides no guarantee on the schedule
admissibility.
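The first part of this procedure (FTTS cycle, frame length, allowed frame range per job, random choice of core and frames) can be sketched in Python as below. The sketch is illustrative only: it assumes implicit deadlines (deadline equal to period), jobs released at multiples of the period, and frames indexed from 0. It does not enforce the dependency constraints of Dep. All function names and the task names are hypothetical.

```python
import random
from functools import reduce
from math import gcd


def lcm(a, b):
    return a * b // gcd(a, b)


def ftts_skeleton(periods):
    """Return (cycle H, frame length L_f, number of frames per cycle)."""
    cycle = reduce(lcm, periods)
    frame_length = reduce(gcd, periods)
    return cycle, frame_length, cycle // frame_length


def allowed_frames(period, job_index, frame_length):
    """Frames lying between the release and the deadline of a job."""
    first = job_index * period // frame_length
    last = (job_index + 1) * period // frame_length - 1
    return range(first, last + 1)


def random_initial_mapping(task_periods, cores, rng):
    """Pick one core per task and one allowed frame per job, at random."""
    cycle, frame_length, _ = ftts_skeleton(list(task_periods.values()))
    mapping = {}
    for task, period in task_periods.items():
        core = rng.choice(cores)  # all jobs of a task share the same core
        frames = [rng.choice(allowed_frames(period, j, frame_length))
                  for j in range(cycle // period)]
        mapping[task] = (core, frames)
    return mapping


# Periods in ms; three hypothetical tasks on two cores.
periods = {"tA": 200, "tB": 1000, "tC": 5000}
skeleton = ftts_skeleton(list(periods.values()))
mapping = random_initial_mapping(periods, ["p1", "p2"], random.Random(0))
```

For periods of 200 ms, 1 s and 5 s, the sketch yields a cycle of 5000 ms divided into 25 frames of 200 ms each.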
Once a random initial mapping solution is determined, the optimizer applies sim-
ulated annealing [26] to explore the design space for task mapping. Particularly, new
solutions are found by applying two possible variations with given probabilities: (i)
re-mapping all jobs of a randomly selected task (and its dependent tasks) to a differ-
ent core or (ii) re-allocating one job of a randomly selected task to a different FTTS
sub-frame or to a different position within the same sub-frame. Design space explo-
ration is restricted to solutions that satisfy the dependency constraints in Dep. The
exploration terminates when it converges to a solution or a computational budget is
exhausted.
A task mapping solution is considered optimal if all jobs meet their deadlines at
all levels of assurance, i.e., the schedule is admissible, and the worst-case sub-frame
lengths are minimized, implying a balanced workload distribution. Based on these
requirements, we define the cost function of the optimization problem as:

$$Cost(S) = \begin{cases} c_1 = \max_{f \in \mathcal{F}} \max_{\ell \in \{1, \ldots, L\}} late(f, \ell) & \text{if } c_1 > 0 \\ c_2 = \|barriers\|_3 & \text{if } c_1 \le 0, \end{cases} \qquad (14)$$

where $late(f, \ell)$ expresses the difference between the worst-case completion time of
the last sub-frame of $f$ and the length of $f$:

$$late(f, \ell) = \sum_{i=1}^{L} barriers(f, \ell)_i - L_f. \qquad (15)$$
If $late(f, \ell) > 0$, the tasks in $f$ cannot complete execution by the end of the frame
for their $\ell$-level execution profiles. Therefore, with this cost function, we initially
guide design space exploration towards finding an admissible solution. When such a
solution is found, cost $c_1$ becomes negative or 0. Then, $c_2$, i.e., the 3-norm of all
sub-frame lengths, $\forall f \in \mathcal{F}$, $\forall \ell \in \{1, \ldots, L\}$, is used to minimize the worst-case lengths
of all sub-frames. The 3-norm of a vector $x$ with $n$ elements (here, positive real
numbers) is defined as $\|x\|_3 := \big( \sum_{i=1}^{n} |x_i|^3 \big)^{1/3}$. We selected this particular norm to
represent the vector with the barriers values for all $f \in \mathcal{F}$ and $\ell \in \{1, \ldots, L\}$,
as we empirically found it to be the best among the considered norms, such as the
average, the maximum, the sum or the Euclidean norm. Namely, the selected norm
provides a trade-off between reducing the worst-case sub-frame lengths (to ensure
schedulability) and enabling progress in the optimization via improving the
average-case lengths. However, alternative norms, such as the ones mentioned above, can
also be used in our cost function (14).
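A direct transcription of Eq. 14 and Eq. 15 into Python may look as follows. The data layout (a dictionary from (frame, level) to the list of worst-case sub-frame lengths) and the numeric values are assumptions made for the example. In the actual flow, the barriers values come from the response time analysis of Sec. 5.

```python
def late(subframe_lengths, frame_length):
    """Eq. 15: worst-case completion of the last sub-frame minus L_f."""
    return sum(subframe_lengths) - frame_length


def cost(barriers, frame_lengths):
    """Eq. 14.

    barriers: {(frame, level): [sub-frame lengths]} (hypothetical layout)
    frame_lengths: {frame: L_f}
    """
    c1 = max(late(subs, frame_lengths[f]) for (f, _level), subs in barriers.items())
    if c1 > 0:
        return c1  # inadmissible: guide the search by the largest lateness
    # Admissible: 3-norm over all sub-frame lengths of all frames and levels.
    return sum(abs(x) ** 3 for subs in barriers.values() for x in subs) ** (1 / 3)


frame_lengths = {1: 10}
admissible = cost({(1, 1): [3, 4], (1, 2): [5, 4]}, frame_lengths)
inadmissible = cost({(1, 1): [6, 7], (1, 2): [5, 4]}, frame_lengths)
```

In the first call both profiles fit into the frame, so the cost is the 3-norm of (3, 4, 5, 4). In the second call the level-1 profile overruns the frame by 3 time units, which becomes the cost.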
Note that during exploration, the barriers function is computed for each
visited solution, as discussed in Sec. 5. For the WCRT analysis, the memory mapping
$M_{mem}$ is assumed to be known.
The task mapping optimization method can be easily extended to account for
fixed task preemption points, mapping constraints, solution ranking, among others.
Please refer to [20] for a more detailed discussion.
6.2 Memory Mapping $M_{mem}$ Optimization
The goal of memory mapping optimization is to determine a static allocation of the
tasks' instructions, private data and communication buffers (memory blocks) $BL$ to
banks $B$ of the shared memory ($E_2$ of memory interference graph $I$), so that the
timing interferences of tasks when accessing the memory are minimized. Also, the total
size of the allocated memory blocks in a bank should not surpass the bank capacity.
This constraint holds, e.g., for the memory mapping in Fig. 6.
For this problem, we adopt a heuristic method based on simulated annealing,
similar to the task mapping optimization. The method is presented in the form of
pseudocode in Algorithm 1. It receives as inputs an initial temperature $T_0$, a temperature
decreasing factor $a \in (0, 1)$, the maximum number of consecutive variations with
no cost improvement that can be checked for a particular temperature, $Fail_{max}$, a
stopping criterion in terms of the final temperature, $T_{final}$, and a stopping criterion in
terms of search time (computational budget), $time_{max}$. It returns the best encountered
solution(s) in the given time.
The algorithm starts with an arbitrary solution $S$ satisfying the bank capacity
constraints. If GenerateInitialSolution can provide no such solution, exploration
is aborted. Otherwise, design space exploration is performed by examining random
variations of the memory mapping. Particularly, Variate selects non-deterministically
a memory block and remaps it to a different memory bank such that no bank capacity
constraint is violated. The new solution $S'$ is accepted if $e^{-(Cost(S') - Cost(S))/T}$ is no
less than a randomly selected real value in (0,1). The cost of $S'$ is also compared to
the minimum observed cost, $Cost_{min}$. If it is lower than $Cost_{min}$, the new solution
and its cost are stored, even if the transition to $S'$ was not accepted. The temperature $T$ of
the simulated annealing procedure is reduced geometrically with factor $a$. Reduction
takes place every time a sequence of $Fail_{max}$ consecutive solutions has been checked,
none of which improves $Cost_{min}$. After temperature reduction, exploration continues
from the best solution found so far ($S_{cur\_best}$). Design space exploration terminates
when the lowest temperature $T_{final}$ is reached or the computational budget $time_{max}$
is exhausted.

Algorithm 1 Modified Simulated Annealing for Memory Mapping $M_{mem}$
Input: $T_0$, $a$, $Fail_{max}$, $T_{final}$, $time_{max}$
Output: $\bar{S}_{best}$
  $S \leftarrow$ GenerateInitialSolution()
  if $S == \emptyset$ then
    return null
  end if
  $\bar{S}_{best} \leftarrow \{S\}$, $S_{cur\_best} \leftarrow S$, $Cost_{min} \leftarrow Cost(S)$
  $T \leftarrow T_0$
  $FailCount \leftarrow 0$
  $time \leftarrow$ StartTimer()
  while $time < time_{max}$ and $T > T_{final}$ do
    $S' \leftarrow$ Variate($S$)
    if $e^{-(Cost(S') - Cost(S))/T} \ge$ Random(0,1) then
      $S \leftarrow S'$
    end if
    UpdateBestSolutions($S'$)
    if $Cost(S') < Cost_{min}$ then
      $S_{cur\_best} \leftarrow S'$
      $Cost_{min} \leftarrow Cost(S')$
      $FailCount \leftarrow 0$
    else
      $FailCount \leftarrow FailCount + 1$
    end if
    if $FailCount == Fail_{max}$ then
      $T \leftarrow a \cdot T$
      $S \leftarrow S_{cur\_best}$
      $FailCount \leftarrow 0$
    end if
  end while
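The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified variant under stated assumptions: it tracks a single best solution for a scalar cost and omits the set of non-dominated solutions, the acceptance test uses the standard Metropolis form with a negative exponent, and improvements are accepted directly to avoid overflow in the exponential at low temperatures. The toy problem (block-to-bank mapping with a conflict list) and all names are hypothetical.

```python
import math
import random
import time


def anneal(initial, cost, variate, t0, a, fail_max, t_final, time_max, rng):
    """Modified simulated annealing, single-best-solution variant."""
    if initial is None:
        return None, None
    s, c_s = initial, cost(initial)
    best, cost_min = s, c_s
    t, fail_count = t0, 0
    start = time.monotonic()
    while time.monotonic() - start < time_max and t > t_final:
        s_new = variate(s, rng)
        c_new = cost(s_new)
        # Metropolis acceptance; worse solutions pass with probability e^(-dC/T).
        if c_new <= c_s or math.exp(-(c_new - c_s) / t) >= rng.random():
            s, c_s = s_new, c_new
        # The best solution is stored even if the transition was rejected.
        if c_new < cost_min:
            best, cost_min, fail_count = s_new, c_new, 0
        else:
            fail_count += 1
        if fail_count == fail_max:
            t *= a                   # geometric cooling
            s, c_s = best, cost_min  # continue from the best solution so far
            fail_count = 0
    return best, cost_min


# Toy instance: 6 memory blocks, 3 banks holding at most 3 blocks each.
BANKS, CAPACITY = 3, 3
CONFLICTS = [(0, 1), (2, 3), (4, 5), (0, 2)]  # block pairs that should not share a bank


def toy_cost(mapping):
    return sum(mapping[x] == mapping[y] for x, y in CONFLICTS)


def toy_variate(mapping, rng):
    block = rng.randrange(len(mapping))
    free = [b for b in range(BANKS)
            if b != mapping[block] and mapping.count(b) < CAPACITY]
    if not free:
        return mapping
    moved = list(mapping)
    moved[block] = rng.choice(free)
    return tuple(moved)


start_mapping = (0, 0, 0, 1, 1, 1)
best, best_cost = anneal(start_mapping, toy_cost, toy_variate, t0=2.0, a=0.8,
                         fail_max=30, t_final=0.05, time_max=2.0,
                         rng=random.Random(1))
```

The initial mapping places three conflicting pairs in the same bank (cost 3). The returned solution never has a higher cost than the initial one, and every visited mapping respects the capacity limit because `toy_variate` only moves blocks into banks with free space.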
Memory mapping affects the WCRT of a task $\tau_i$ by defining which of the tasks
that can be executed in parallel with it also interfere with it. The fewer interfering
tasks there are, the lower the delay $\tau_i$ experiences when accessing the shared memory.
Therefore, to evaluate a memory mapping solution we select a cost function which reflects
the increase in task WCRT due to interference on the shared memory banks. The cost
function is based on the mutual delay matrix $D$, which was introduced in Sec. 5.1.1,
and has two alternative definitions.
One alternative to solve the optimization problem is to compute (part of) the
Pareto set of memory mapping solutions with minimal interference between any two
tasks of the same criticality level. The intuition behind this approach is that we try
to minimize simultaneously all elements of the mutual delay matrix $D$, namely all
blocking delays that a task can cause to any other task (of the same criticality). This
problem can be seen as a multi-objective optimization problem, with the $n^2$ elements
of matrix $D$ as individual cost functions. For this set of objectives, we compute the
Pareto set of memory mapping solutions. Algorithm 1 maintains such a set $\bar{S}_{best}$
of non-dominated solutions. In particular, a newly visited mapping solution $S'$ with
matrix $D'$ is inserted into the set $\bar{S}_{best}$ if it has a lower value for at least one element of
$D'$ than the corresponding element of any solution in $\bar{S}_{best}$. If a solution $S \in \bar{S}_{best}$
is dominated by $S'$, i.e., $S'$ has lower or equal values for all elements of $D$, then $S$ is
removed from the set. This update is performed by UpdateBestSolutions.
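The update of the set of non-dominated solutions can be sketched in Python as below. The sketch follows the two rules stated above, with the mutual delay matrix flattened into a tuple of its elements. The function name, the data layout and the example values are assumptions made for illustration.

```python
def update_best_solutions(front, candidate):
    """Return the updated list of non-dominated (solution, delays) pairs.

    delays is the mutual delay matrix D flattened into a tuple.
    """
    _, d_new = candidate
    # Insert only if the candidate is strictly lower in at least one element
    # when compared with every stored solution.
    for _, d_old in front:
        if all(n >= o for n, o in zip(d_new, d_old)):
            return front
    # Drop stored solutions that the candidate dominates (lower or equal everywhere).
    kept = [(s, d) for s, d in front
            if not all(n <= o for n, o in zip(d_new, d))]
    kept.append(candidate)
    return kept


front = []
front = update_best_solutions(front, ("A", (3, 5)))
front = update_best_solutions(front, ("B", (4, 6)))   # worse than A everywhere
after_b = list(front)
front = update_best_solutions(front, ("C", (5, 2)))   # trade-off with A
after_c = list(front)
front = update_best_solutions(front, ("D", (2, 2)))   # dominates A and C
```

In this example, B is rejected, C joins A as an incomparable trade-off, and D finally replaces both.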
Another alternative is to define the scalar cost function $D_{avg}$ as the average over
all elements of matrix $D$, i.e., the average delay that tasks of the same criticality level
cause to each other when interfering on shared banks. Then we can find the best
solution in terms of $D_{avg}$. In this case, $\bar{S}_{best}$ contains only one solution, characterized
by the minimum encountered $D_{avg}$.
6.3 Integrated Task and Memory Mapping Optimization
The problems of optimizing $M_\tau$ and $M_{mem}$ are inter-dependent. Namely, design
space exploration for the optimization of the task mapping requires information on
the memory mapping for computing function barriers. Similarly, matrix $D$, which
defines the cost of a memory mapping solution, can be refined for a particular task
mapping, depending on the tasks that can be executed in parallel. In the following,
we outline two alternative approaches towards an integrated optimization solution.
I. Task mapping optimization for each memory mapping in Pareto set. As
discussed previously, one can compute using Algorithm 1 part of the Pareto set $\bar{S}_{best}$
of memory mapping solutions that minimize the interference between any two
tasks of the same criticality level. These solutions consider that all tasks of the
same criticality level are potentially executed in parallel (worst-case task
mapping). The next step is to solve the task mapping optimization problem for each
memory mapping in the set $\bar{S}_{best}$. Finally, the combination of solutions which
minimizes $\|barriers\|_3$ is selected.
II. Iterative task and memory mapping optimization. Since the complexity of
computing the Pareto set solutions for the memory mapping optimization problem
can be prohibitive, one can select an iterative solution to the two problems.
Then, for each visited solution during design space exploration for $M_\tau$, a memory
mapping optimization is also performed to find the solution with minimized cost
$\|barriers\|_3$. Here, we use the cost function $D_{avg}$.
It cannot be said that one method clearly outperforms the other in terms of effi-
ciency. However, depending on the sizes of the search spaces of the task mapping
and memory mapping optimization problems, one algorithm can perform faster than
the other [21].
7 A Case Study
To evaluate the proposed design optimization approaches, we use an industrial
implementation of a flight management system. This application has been a major use-case
of the European Certainty project [2]. The purpose of the evaluation is twofold. First, we
show the applicability of our optimization methods and demonstrate the results of the
response time analysis for the optimized task and memory mapping solution. Second, we
investigate the effect of various platform parameters, such as the memory access latency
and the number of memory banks, or design choices, such as the selection of routes
and $(\sigma, \rho)$ parameters for the NoC flows, on the schedulability of the application under
FTTS. The optimization framework has been implemented in Java and the evaluation
was performed on a laptop with a 4-core Intel i7 CPU at 2.7 GHz and 8 GB of RAM.
The source code of the experiments is available for downloading at [1].
Flight Management System. The flight management system (FMS) from the avionics
domain is responsible for functionalities such as the localization of the aircraft,
the computation of the flight plan that guides the auto-pilot, the detection of the
nearest airport, etc. We look into a subset of the application, consisting of 14 periodic
tasks for sensor reading, localization, and computation of the nearest airport. Seven
tasks are characterized by safety level DAL-B (we map it to criticality level 2, i.e., high)
and seven by safety level DAL-C (we map it to criticality level 1, i.e., low) based on
the DO-178B standard for certification of airborne systems [3]. The periods of the
tasks vary among 200 ms, 1 sec, and 5 sec, as shown in Table 3. Based on these,
we select the cycle and the frame length of the FTTS as H = 5 sec and Lf = 200
ms, respectively. The worst-case execution times of the tasks were derived through
measurements on a real system or, for the few tasks for which the code was not available
(e.g., τinit,13), based on conservative estimations. A discussion on how the worst-case
execution time parameters of the FMS tasks can be derived can be found in the
technical report [12] of the Certainty project [2]. For the level-2 profiles Ci(2) of
tasks τi with χi = 2, we augment the worst observed execution times by a factor of
5. Similarly, for the memory accesses, we consider conservative bounds based on the
known memory footprints of the tasks and derive the Ci(2) parameters by multiplying
these bounds by 5. The factor 5 is selected arbitrarily to augment the worst-case task
parameters and thus increase the safety margins. Such augmentation of the worst-case
parameters seems to be common industrial practice for safety-critical applications.
The best-case execution time and access parameters are set to 0 due to lack of
more accurate information. Last, the degraded profiles Ci,deg of tasks τi with χi = 1
correspond to no execution, i.e., Ci,deg = (0, 0, 0, 0). The task periods, criticality
levels, level-1 and level-2 worst-case execution times and memory accesses, as well
as the memory blocks that they access and the maximum number of accesses to each
block at the tasks' criticality levels, are shown in Table 3.
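The choice of H = 5 sec and Lf = 200 ms can be reproduced with a short calculation. The sketch below assumes that the cycle is taken as the least common multiple of the task periods and the frame length as their greatest common divisor; this rule is an assumption that happens to match the values above, and the helper name ftts_dimensions is hypothetical.

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def ftts_dimensions(periods_ms):
    """Return (cycle, frame length, number of frames) for a set of periods.

    Assumption: cycle = LCM of the periods, frame length = GCD of the periods.
    """
    cycle = reduce(lcm, periods_ms)
    frame = reduce(gcd, periods_ms)
    return cycle, frame, cycle // frame


# The three distinct task periods of the FMS subset (Table 3), in ms.
cycle_ms, frame_ms, n_frames = ftts_dimensions([200, 1000, 5000])
print(cycle_ms, frame_ms, n_frames)
```

Under this assumption the cycle contains 25 frames of 200 ms each.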
To model the memory accessing behavior of the tasks according to Sec. 5.1, we
define a memory interference graph I with the following memory blocks: one block
per task with size equal to the size of its data as measured on the deployed system,
and one block per communication buffer, whose size is also known. This yields 27
memory blocks in total.
| Purpose | Task | CL | Period (ms) | Level-1 (Level-2) Exec. Time (ms) | Level-1 (Level-2) Memory Accesses | Accessed Mem. Blocks (Max. Accesses) |
|---|---|---|---|---|---|---|
| Sensor data acquisition | τ1 | 2 | 200 | 11 (55) | 213 (1065) | b1(100), b2(425), b4(70), b6(190), b8(190), b10(90) |
| Sensor data acquisition | τ2 | 1 | 200 | 20 (0) | 117 (0) | b3(10), b4(107) |
| Sensor data acquisition | τ3 | 1 | 200 | 18 (0) | 129 (0) | b5(10), b6(119) |
| Sensor data acquisition | τ4 | 1 | 200 | 18 (0) | 129 (0) | b7(10), b8(119) |
| Sensor data acquisition | τ5 | 1 | 200 | 20 (0) | 129 (0) | b9(10), b10(119) |
| Localization | τ6 | 2 | 200 | 7 (35) | 145 (725) | b2(425), b11(100), b12(100), b13(90), b4(10) |
| Localization | τ7 | 2 | 1000 | 6 (30) | 56 (280) | b13(90), b14(100), b15(90) |
| Localization | τ8 | 2 | 5000 | 6 (30) | 57 (285) | b15(17), b16(100), b17(90), b19(78) |
| Localization | τ9 | 2 | 1000 | 6 (30) | 57 (285) | b17(90), b18(100), b21(78), b22(17) |
| Localization | τ10 | 1 | 200 | 20 (0) | 130 (0) | b12(120), b24(10) |
| Localization | τ11 | 1 | 1000 | 20 (0) | 113 (0) | b19(103), b25(10) |
| Localization | τ12 | 1 | 200 | 20 (0) | 113 (0) | b21(103), b26(10) |
| Nearest Airport | τ13 | 2 | 1000 | 48 (192) | 1384 (6920) | b17(90), b20(100), b23(1610), b27(5120) |
| Nearest Airport | τinit,13 | 2 | 1000 | 2 (10) | 18 (90) | b17(90); triggers Rx to b27(403) |

Table 3 Flight Management System
The FMS requires access to a navigation database with a memory footprint of
several tens of MB. In particular, task τ13, which is responsible for the computation
of the nearest airport, needs read access to certain entries (up to 4 KB of data). In the
following, we assume that the database is maintained in an external DDR memory
and that the required data are fetched to the local memory of the cluster where the FMS
application is executed. The data transfer is initiated by the preceding task τinit,13,
which has the same criticality level and period as τ13. The memory block corresponding
to the database data is b27. We add a high-priority task Rx13 to I to indicate the
remote data transfer. Rx13 is connected to the memory block b27 via an edge with
weight δ.
Note that the FMS contains no task dependencies other than that between the task
requesting the database entries for the computation of the nearest airport, τinit,13,
and the task that performs the computation, τ13. In the following, we show how to
compute the minimum distance constraint between τinit,13 and τ13 in the dependency
graph Dep as well as the weight δ for the edge from the node Rx13 to b27 in the
memory interference graph I.
NoC Flow Routing and Regulation. Upon each activation, task τ13 reads 4 KB of
data from the database. The data are transferred from the remote cluster with access
to the DDR over the D-NoC in packets of 4 Bytes. Namely, a transfer of 1024 packets
must be executed between any two successive executions of τ13. We assume that this
flow is (σ, ρ)-regulated at the I/O sub-system, with σ = 10 packets and ρ = 2000
packets/sec. The (σ, ρ) parameters are selected arbitrarily here, such that they are
reasonable for the required amount of transferred data and allow the transfer over
the NoC to be completed within a period of task τ13. The flow routing is fixed and
passes through two D-NoC routers. On the first router, the flow can encounter
interference from one more flow on the output link. On the second router,
the flow interferes with three more flows in the output direction. The clock frequency
on the chip is 400 MHz. The routers forward the packets over the D-NoC links at a
rate of 1 packet/cycle, i.e., 400,000,000 packets/sec. Given the above assumptions
and by applying Eq. 12 and 13 of Sec. 5.2, we derive the worst-case data fetch time,
WCDFT = 511.4 ms, and the maximum number of packets fetched during 200 ms
(the duration of an FTTS frame), δ = 403, respectively.
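As a plausibility check on the chosen regulation parameters, the sketch below uses only the defining property of a (σ, ρ) regulator, namely that at most σ + ρ·t packets may be injected in any window of length t. It does not reproduce Eq. 12 or Eq. 13 of Sec. 5.2; the function name and the comparison against WCDFT are illustrative assumptions.

```python
def min_injection_time_s(n_packets: int, sigma: float, rho: float) -> float:
    """Shortest time in which a (sigma, rho)-regulated source can inject
    n_packets, given that at most sigma + rho * t packets fit in a window t."""
    return max(0.0, (n_packets - sigma) / rho)


SIGMA = 10                    # burst size in packets (from the text)
RHO = 2000.0                  # sustained rate in packets/sec (from the text)
PACKETS = (4 * 1024) // 4     # 4 KB transferred in 4-Byte packets
WCDFT_S = 0.5114              # worst-case data fetch time reported in the text
PERIOD_S = 1.0                # period of task tau_13

t_min = min_injection_time_s(PACKETS, SIGMA, RHO)
print(f"regulator-limited injection time: {t_min * 1000:.1f} ms")
print("consistent with WCDFT:", t_min <= WCDFT_S)
print("WCDFT fits in one period:", WCDFT_S <= PERIOD_S)
```

The regulator alone accounts for most of the reported fetch time, which suggests that the router delays on the two-hop route contribute only a few milliseconds under these parameters.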
Based on the remote fetch protocol of Sec. 3.2, to specify the minimum distance
constraint between tasks τinit,13 and τ13, we need to know, besides WCDFT, the
worst-case notification time, WCNT, for the transfer of the notification from τinit,13 to
the remote listener task in the I/O sub-system, as well as the worst-case remote
set-up time for the data transfer, WCRST. WCNT can be derived in a similar way as
WCDFT. Assuming that τinit,13 sends only one packet (4 Bytes) to the remote cluster,
following the exact same route as the flow from the I/O sub-system to the cluster, it
follows from Eq. 12 that WCNT = 0.4 ms. Regarding WCRST, a conservative bound
is assumed to be given, WCRST = 25 ms (an arbitrary selection here). Summarizing
the above results, the dependency between τinit,13 and τ13 in the dependency graph
Dep is weighted with the minimum distance constraint w = (0.4 + 25 + 511.4) ms =
536.8 ms.
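The arithmetic behind the minimum distance constraint, and what it implies for frame placement, can be sketched as follows. Counting the distance in whole frames from the start of the frame of τinit,13 is an assumption made here for illustration; the text only states the constraint in milliseconds.

```python
import math

WCNT_MS = 0.4      # worst-case notification time
WCRST_MS = 25.0    # worst-case remote set-up time (arbitrary bound in the text)
WCDFT_MS = 511.4   # worst-case data fetch time
FRAME_MS = 200     # FTTS frame length
PERIOD_MS = 1000   # period of tau_init,13 and tau_13

# Minimum distance constraint between tau_init,13 and tau_13.
w_ms = WCNT_MS + WCRST_MS + WCDFT_MS

# Assumption: distance is measured between frame start times, so tau_13
# must be placed at least this many frames after tau_init,13.
min_frame_gap = math.ceil(w_ms / FRAME_MS)

print(f"w = {w_ms:.1f} ms, minimum gap = {min_frame_gap} frames")
print("gap fits within one period:", min_frame_gap * FRAME_MS <= PERIOD_MS)
```

Under this reading, a gap of three frames (600 ms) is the smallest placement that respects the 536.8 ms constraint.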
Platform Parameters. For the deployment of the FMS, we consider a target platform
resembling a cluster of the MPPA®-256 platform. In particular, P includes 8
processing cores with shared access to 8 memory banks of 128 KB each. We consider
round-robin arbitration on the memory arbiters with higher priority for the NoC Rx
interface, according to the description in Sec. 3.2. Once a memory access is granted,
the fixed memory latency is Tacc = 55 ns. The memory latency bound has been
empirically estimated on the MPPA®-256 platform, using benchmark applications.
Note, however, that it is not necessarily a safe bound for the MPPA®-256 memory
controller.
7.1 Design Optimization and Response Time Analysis
With the first experiment, we evaluate the applicability and efficiency of
the optimization framework developed in Sec. 6 w.r.t. the deployment
of the FMS on a compute cluster. The scheduling policy in the cluster is FTTS, with
a cycle of H = 5000 ms consisting of 25 frames of length 200 ms each, based
on the periods of our task set. Each FTTS frame is divided into two sub-frames,
since the FMS has L = 2 criticality levels. We configure the simulated-annealing
algorithm (Listing 1 of Sec. 6) for both task and memory mapping optimization with
parameters: a = 0.8, Failmax = 100, T0 based on the cost of 100 random solutions,
Tfinal = 0.1, timemax = 30 min. For the task mapping (Mτ) optimization, the
probabilities of selecting a sub-frame or core variation are 0.85 and 0.15, respectively.
The memory mapping (Mmem) optimization uses as objective function the average
value of the delay matrix, Dmax, and is performed for each task mapping
solution visited during design space exploration, i.e., according to the integrated solution II
of Sec. 6.3. The overall optimization goal is to maximize the slack time at the end of
the frames (equivalently, to minimize ||barriers||_3), which indicates a maximal exploitation
of computation parallelism and memory accessing parallelism.
| Frame | Interval (ms) | p1, Sub-frame 1 | p1, Sub-frame 2 | p2, Sub-frame 1 | p2, Sub-frame 2 |
|---|---|---|---|---|---|
| f1 | [0,200] | τinit,13 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f2 | [200,400] | - | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f3 | [400,600] | - | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5, τ11 |
| f4 | [600,800] | τ13 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f5 | [800,1000] | τ7, τ9 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ5, τ12 |
| f6 | [1000,1200] | τinit,13 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ5, τ12 |
| f7 | [1200,1400] | τ7 | τ2, τ10, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f8 | [1400,1600] | τ9 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ5, τ11, τ12 |
| f9 | [1600,1800] | - | τ10, τ3, τ2 | τ6, τ1 | τ4, τ12, τ5 |
| f10 | [1800,2000] | τ13 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f11 | [2000,2200] | τinit,13 | τ10, τ2, τ3 | τ6, τ1 | τ12, τ5, τ9 |
| f12 | [2200,2400] | - | τ10, τ2, τ3 | τ6, τ1 | τ4, τ11, τ12, τ5 |
| f13 | [2400,2600] | τ7 | τ10, τ2, τ3 | τ6, τ1 | τ12, τ4, τ5 |
| f14 | [2600,2800] | τ13 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f15 | [2800,3000] | τ9 | τ10, τ2, τ3 | τ1, τ6 | τ4, τ12, τ5 |
| f16 | [3000,3200] | - | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f17 | [3200,3400] | τinit,13, τ9 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5, τ11 |
| f18 | [3400,3600] | τ7, τ8 | τ10, τ2, τ3 | τ6, τ1 | τ12, τ4, τ5 |
| f19 | [3600,3800] | - | τ10, τ2, τ3 | τ6, τ1 | τ4, τ5, τ12 |
| f20 | [3800,4000] | τ13 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f21 | [4000,4200] | - | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f22 | [4200,4400] | τinit,13, τ9 | τ10, τ2, τ3 | τ6, τ1 | τ13, τ4, τ5, τ11 |
| f23 | [4400,4600] | - | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |
| f24 | [4600,4800] | τ7 | τ10, τ2, τ3 | τ6, τ1 | τ5, τ12, τ4 |
| f25 | [4800,5000] | τ13 | τ10, τ2, τ3 | τ6, τ1 | τ4, τ12, τ5 |

Table 4 Optimized task mapping Mτ for FMS on a 2-core, 2-bank subset of a compute cluster
We consider all possible configurations with one to eight processing cores and
one to eight memory banks for the deployment of the FMS. The optimized task and
memory mapping solution, which yields the minimum value for the objective function
||barriers||_3, is found for the configuration with two processing cores and two
memory banks. Selecting more memory banks or more cores is not beneficial since
it does not lead to a solution of lower cost, namely a solution with more slack time at
the end of the frames. This is important information for a system designer, who aims
not only to design a safe system, but also to allocate the minimal amount of resources.
For the configuration with two cores and two memory banks, the optimization
framework returns an admissible FTTS schedule after evaluating 4919 task and memory
mapping combinations and converging within 4.3 minutes. The optimized
task and memory mappings are shown in Tables 4 and 5, respectively. Note that the
dependency between tasks τinit,13 and τ13 is respected: they are scheduled in
different frames on the same core such that the minimum distance constraint (536.8
ms) is not violated. For the optimized solution, the function barriers can be computed
based on the memory interference graph I and the memory latency Tacc, as described
in Sec. 5.1. The values of barriers, i.e., the worst-case sub-frame lengths for every
FTTS sub-frame and level of assurance, as computed by our optimization framework,
are shown in Table 6. Note that for every FTTS frame, the sum of barriers for its two
sub-frames, under both levels of assurance, is not greater than the frame length,
i.e., 200 ms. Hence, the admissibility condition of Eq. 2 holds and
the FTTS schedule is admissible.
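The check described in the preceding paragraph can be written down directly. The sketch below uses the three distinct rows that occur in Table 6 and tests, per frame and per level of assurance, whether the two sub-frame barriers sum to at most the frame length. It mirrors only the condition as stated in the text, not the full formulation of Eq. 2, and the helper names are hypothetical.

```python
FRAME_MS = 200.0

# (level-1 sub-frame 1, level-1 sub-frame 2, level-2 sub-frame 1, level-2 sub-frame 2)
# Distinct rows of Table 6, in ms.
BARRIERS = {
    "f1": (18.0, 58.1, 90.1, 0.0),
    "f3": (18.0, 78.1, 90.1, 0.0),
    "f4": (48.1, 57.9, 192.0, 0.0),
}


def worst_frame_length(row):
    """Largest total sub-frame length over both levels of assurance."""
    level1 = row[0] + row[1]
    level2 = row[2] + row[3]
    return max(level1, level2)


def is_admissible(row, frame_ms=FRAME_MS):
    """True if the barriers of both levels fit within one frame."""
    return worst_frame_length(row) <= frame_ms


for frame, row in BARRIERS.items():
    slack = FRAME_MS - worst_frame_length(row)
    print(f"{frame}: admissible={is_admissible(row)}, slack={slack:.1f} ms")
```

The frames that host τ13 are the tightest ones, since its level-2 sub-frame alone takes 192 ms of the 200 ms frame.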
| Bank | Mapped memory blocks |
|---|---|
| 1 | b1 to b15, b24 |
| 2 | b16 to b23, b25 to b27 |

Table 5 Optimized memory mapping Mmem for FMS on a 2-core, 2-bank subset of a compute cluster
| Frame f | Level-1 barriers(f, 1), Sub-frame 1 (ms) | Level-1 barriers(f, 1), Sub-frame 2 (ms) | Level-2 barriers(f, 2), Sub-frame 1 (ms) | Level-2 barriers(f, 2), Sub-frame 2 (ms) |
|---|---|---|---|---|
| f1 | 18 | 58.1 | 90.1 | 0 |
| f2 | 18 | 58.1 | 90.1 | 0 |
| f3 | 18 | 78.1 | 90.1 | 0 |
| f4 | 48.1 | 57.9 | 192 | 0 |
| f5 | 18 | 58.1 | 90.1 | 0 |
| f6 | 18 | 58.1 | 90.1 | 0 |
| f7 | 18 | 58.1 | 90.1 | 0 |
| f8 | 18 | 78.1 | 90.1 | 0 |
| f9 | 18 | 58.1 | 90.1 | 0 |
| f10 | 48.1 | 57.9 | 192 | 0 |
| f11 | 18 | 58.1 | 90.1 | 0 |
| f12 | 18 | 78.1 | 90.1 | 0 |
| f13 | 18 | 58.1 | 90.1 | 0 |
| f14 | 48.1 | 57.9 | 192 | 0 |
| f15 | 18 | 58.1 | 90.1 | 0 |
| f16 | 18 | 58.1 | 90.1 | 0 |
| f17 | 18 | 78.1 | 90.1 | 0 |
| f18 | 18 | 58.1 | 90.1 | 0 |
| f19 | 18 | 58.1 | 90.1 | 0 |
| f20 | 48.1 | 57.9 | 192 | 0 |
| f21 | 18 | 58.1 | 90.1 | 0 |
| f22 | 18 | 78.1 | 90.1 | 0 |
| f23 | 18 | 58.1 | 90.1 | 0 |
| f24 | 48.1 | 57.9 | 192 | 0 |
| f25 | 18 | 58.1 | 90.1 | 0 |

Table 6 Computation of barriers for Mτ (Table 4), Mmem (Table 5), Tacc = 55 ns, memory
interference graph I
7.2 Effect of Platform Parameters and Design Choices on FTTS Schedulability
With the second experiment, we evaluate the sensitivity of our optimization
approach to variations of certain parameters, e.g., the number of memory banks, the
memory access latency Tacc, the incoming traffic from the NoC, and the number of available
processing cores. We evaluate schedulability of the FMS application for the
alternative configurations (combinations of the above parameters), based on the cost
||barriers||_3 of the optimized task and memory mapping solution in each case. For
the definition of the metric ||barriers||_3, see the discussion on the cost function (14) in
Sec. 6.1. The lower the cost of the optimized solution, the higher the probability that
an admissible FTTS schedule for the considered configuration exists. In all following
scenarios, the optimizer converges to a task and memory mapping solution in less
than 7 minutes. For the simulated-annealing algorithm, we use the same parameters
as in the previous section.
Fig. 12 Effect of platform and design parameters on FMS schedulability under the FTTS policy.
(a) FMS schedulability for variable Tacc; (b) FMS schedulability for variable number of
banks; (c) FMS schedulability for variable number of NoC Rx memory accesses in an FTTS frame, δ.

First, we evaluate the effect of the memory access latency Tacc on the FMS
schedulability. We assume that the value of Tacc varies within {55 ns, 550 ns, 5.5
µs, 55 µs}. We perform design space exploration after fixing the number of memory
banks to two. Fig. 12(a) shows how the schedulability metric, ||barriers||_3, changes
for the FMS as the number of available cores increases from one to eight, for different
Tacc values. For each combination of Tacc and number of cores, the depicted point
in Fig. 12(a) corresponds to the best solution found by our optimization framework.
The value on the y-axis represents the 3-norm ||barriers||_3 for the optimized task
and memory mapping solution. The points within the dashed rectangle correspond to
schedulable implementations, namely to combinations of Tacc and number of cores
m for which the optimized mapping solution is admissible according to Definition 1
of Sec. 3.3. The points that do not fall into this rectangle correspond to
implementations for which the optimizer could not converge to any admissible solution within
the given time budget of 30 minutes. We observe that, as in the previous section, the
FMS schedulability under FTTS increases or remains stable as the number of cores
increases. This is partly explained by the low task set utilization of the FMS. Two
processing cores suffice to find an admissible FTTS schedule, whereas more cores
bring no further benefit. Moreover, the effect of Tacc on schedulability is significant. For
Tacc ∈ {5.5 µs, 55 µs}, no admissible schedule can be found even when all cores of the
cluster are utilized. This is an important indicator that in shared-memory multi-core
and many-core platforms, an increase in the number of cores must be accompanied by a
simultaneous increase in the memory bandwidth (reduction of Tacc) or a reduction of the memory
contention in order to achieve a substantial gain.
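A rough calculation illustrates why the larger Tacc values cannot be schedulable regardless of the number of cores. The sketch assumes that the demand of a task is its execution time plus its number of memory accesses multiplied by Tacc, with no contention at all; this additive, contention-free model is a simplifying assumption and not the analysis of Sec. 5.1. It is applied to the level-2 profile of τ13 from Table 3.

```python
FRAME_MS = 200.0
EXEC_L2_MS = 192.0    # level-2 execution time of tau_13 (Table 3)
ACCESSES_L2 = 6920    # level-2 memory accesses of tau_13 (Table 3)

# Memory access latencies considered in the experiment, converted to ms.
T_ACC_MS = {
    "55 ns": 55e-6,
    "550 ns": 550e-6,
    "5.5 us": 5.5e-3,
    "55 us": 55e-3,
}


def demand_ms(exec_ms: float, accesses: int, t_acc_ms: float) -> float:
    """Contention-free demand: execution time plus serialized access time."""
    return exec_ms + accesses * t_acc_ms


fits = {
    label: demand_ms(EXEC_L2_MS, ACCESSES_L2, t_acc) <= FRAME_MS
    for label, t_acc in T_ACC_MS.items()
}
for label, ok in fits.items():
    print(f"Tacc = {label}: tau_13 fits in one frame = {ok}")
```

Under this simplified model, τ13 alone exceeds the 200 ms frame for Tacc of 5.5 µs and 55 µs, which agrees with the observation that adding cores does not help in these cases.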
Second, we evaluate the effect of data partitioning on the FMS schedulability.
We fix Tacc to 550 ns and perform design space exploration for all cluster
configurations with one to eight memory banks and one to eight cores. Fig. 12(b) shows
the schedulability metric (cost ||barriers||_3 of the optimized task and memory
mapping solution) as the number of cores increases, for the configurations with one bank
(blue line) or more than one bank (magenta line). Deploying more than two memory
banks does not improve the optimized solutions. This can be partly explained by the
low memory utilization of the FMS (fraction of cluster memory required for task data
and communication buffers). Also, note that the cost of the solutions for one bank is
only marginally worse than that for several banks. By carefully examining the
optimized solutions, we conclude that in cases where no flexibility exists w.r.t. memory
mapping, the optimizer tends to select the task mapping by maximally distributing
the tasks across the FTTS frames (leaving no frame empty), such that a minimal
set of tasks are executed in parallel in the same frame and hence interfere on the
memory bank. The periods of the FMS tasks and the considered dimensioning of the
FTTS schedule (25 frames over H = 5 sec) help in this direction, since several tasks
have a high degree of freedom in the range of frames to which they can be mapped.
We conclude that for the FMS, the combined task and memory mapping optimization
performs efficiently, in the sense that the optimizer maximally exploits the flexibility
in solving one problem (task mapping) when the flexibility of the second problem
(memory mapping) is limited.
Finally, we evaluate the effect of the incoming NoC traffic at the cluster memory
on the FMS schedulability. We assume that by selecting different regulation
parameters and/or NoC routes for the data flow that is requested by task τinit,13, we can
affect the maximum number of NoC Rx accesses to the local memory, δ, such that
it varies within {403, 803, 1024, 4096, 8192}. We fix Tacc to 550 ns and the number
of memory banks to 2. The FMS schedulability metric (cost ||barriers||_3 of the
optimized task and memory mapping solution) for an increasing number of cores and for
the different δ values is shown in Fig. 12(c). Again, schedulability is not severely
affected by increased incoming NoC traffic. The reason is that the optimizer
isolates the memory block corresponding to the fetched database entries (b27), so that
no or very few FMS tasks interfere with the higher-priority Rx13 requester, thus
exploiting the memory accessing parallelism that the two memory banks enable. This
way, the WCRT of the FMS tasks becomes immune to changes of δ. This observation
demonstrates the benefits of the combined task and memory mapping optimization, where
the interference of the tasks on shared platform resources is explicitly considered.
8 Comparison of FTTS to Existing Mixed-Criticality Scheduling Policies
In this section, we evaluate the efficiency of the FTTS policy in finding admissible
schedules against state-of-the-art scheduling policies that have been proposed
for mixed-criticality systems. For the following discussion, we divide previous
mixed-criticality approaches into two categories: (i) scheduling policies that do
not consider sharing of platform resources, such as the memory and NoC, and (ii)
scheduling policies that aim at eliminating or bounding the inter-core interference
on shared platform resources. The FTTS policy falls into the second category. In
the next sub-sections, we present quantitative or qualitative comparisons of FTTS to
representative policies of the two categories.
8.1 FTTS vs Resource-Agnostic Scheduling Policies
The benefit of using FTTS in the presence of shared platform resources, e.g., memory
banks and networks-on-chip, has been discussed and evaluated in Sec. 7. Here,
we evaluate the limitations posed by the (flexible) time-triggered implementation of
FTTS and their impact on schedulability. Hence, we compare FTTS to more dynamic,
state-of-the-art MC scheduling strategies, particularly the EDF-VD algorithm
for single-core [7] and its variant GLOBAL for multicore systems [28]. Since these
algorithms do not consider resource sharing, the comparison is based upon synthetic task
sets that require no accesses to shared resources.

Fig. 13 Schedulable task sets (%) vs. normalized system utilization for FTTS and EDF-VD (m = 1),
UL = 0.05, UH = 0.75, ZL = 1, ZH = 8, P = 0.3, 1000 task sets per utilization point.
(a) Wi ∈ {100, 200, 300, 400, 500}; (b) Wi ∈ {100}; (c) Wi ∈ {200, 400}; (d) Wi ∈ {200, 400, 800}.
For task set generation we use the algorithm of [28] (TaskGen, Fig. 4 in [28])
for 2 criticality levels. The per-task utilization Ui is selected uniformly from [UL, UH] =
[0.05, 0.75] and the ratio Zi of the level-2 utilization to the level-1 utilization is selected
uniformly from [ZL, ZH] = [1, 8]. The probability that a task τi has χi = 2 is set
to P = 0.3. The period Wi is randomly selected from the set {100, 200, 300, 400, 500}.
Because FTTS cannot handle dynamic preemption, if the assigned execution time of
a task is larger than the maximum frame length of the FTTS scheduling cycle, the
task is split into sub-tasks, each "fitting" within an FTTS frame.
Fig. 14 Schedulable task sets (%) vs. normalized system utilization for FTTS and GLOBAL (m = 4),
UL = 0.05, UH = 0.75, ZL = 1, ZH = 8, P = 0.3, 100 task sets per utilization point.
(a) Wi ∈ {100, 200, 300, 400, 500}; (b) Wi ∈ {100}; (c) Wi ∈ {200, 400}; (d) Wi ∈ {200, 400, 800}.

Fig. 13(a)-13(d) (FTTS vs. EDF-VD) and Fig. 14(a)-14(d) (FTTS vs. GLOBAL)
show the fraction of task sets that are deemed schedulable by the considered
algorithms as a function of the ratio Usys/m (normalized system utilization). Usys is
defined in [28] as follows:

$$U_{sys} = \max\left( U_{LO}^{LO}(\tau) + U_{HI}^{LO}(\tau),\; U_{HI}^{HI}(\tau) \right), \qquad (16)$$

where $U_x^y(\tau)$ represents the total utilization of the tasks with criticality level x for
their y-level execution profiles (LO ≡ 1, HI ≡ 2). Note that the normalized utilization
increases from 0.25 to 1.10 in steps of 0.05. For each utilization point in the graphs,
100 or 1000 randomly generated task sets (as annotated in the respective figure) are
considered. To check schedulability of each randomly generated task set for FTTS, we
use the optimization framework of Sec. 6.1 and check condition (2) for the optimized
solution. For the design space exploration, we use the same configuration for the
simulated annealing algorithm (Listing 1 of Sec. 6) as in Sec. 7 and a time budget of 10
minutes. In all cases, however, the optimizer converged to a solution in less than 5
minutes. Additionally, to check schedulability of each randomly generated task set
for EDF-VD and GLOBAL, we check the sufficient conditions from [7, 28]. These
conditions are given below.
For EDF-VD on single cores [7]:

$$U_{HI}^{HI}(\tau) + U_{LO}^{LO}(\tau) \cdot \frac{U_{HI}^{LO}(\tau)}{1 - U_{LO}^{LO}(\tau)} \le 1. \qquad (17)$$
For GLOBAL on multicores with m cores [28]:

$$U_{LO}^{LO}(\tau) + \min\left( U_{HI}^{HI}(\tau),\; \frac{U_{HI}^{LO}(\tau)}{1 - 2 \cdot U_{HI}^{HI}(\tau)/(m+1)} \right) \le \frac{m+1}{2}. \qquad (18)$$
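The utilization metric (16) and the two sufficient tests (17) and (18) can be evaluated with a few lines of code. The sketch below reads the conditions as follows: (17) requires U^HI_HI + U^LO_LO · U^LO_HI / (1 − U^LO_LO) ≤ 1, and (18) requires U^LO_LO + min(U^HI_HI, U^LO_HI / (1 − 2·U^HI_HI/(m+1))) ≤ (m+1)/2. The function and argument names are hypothetical, and the handling of a non-positive denominator is an assumption made here, not taken from [7] or [28].

```python
def u_sys(u_lo_at_lo: float, u_hi_at_lo: float, u_hi_at_hi: float) -> float:
    """System utilization, Eq. (16)."""
    return max(u_lo_at_lo + u_hi_at_lo, u_hi_at_hi)


def edf_vd_schedulable(u_lo_at_lo: float, u_hi_at_lo: float, u_hi_at_hi: float) -> bool:
    """Sufficient EDF-VD test on a single core, Eq. (17)."""
    if u_lo_at_lo >= 1.0:
        # Assumption: treat a non-positive denominator as a failed test.
        return False
    return u_hi_at_hi + u_lo_at_lo * u_hi_at_lo / (1.0 - u_lo_at_lo) <= 1.0


def global_schedulable(u_lo_at_lo: float, u_hi_at_lo: float,
                       u_hi_at_hi: float, m: int) -> bool:
    """Sufficient GLOBAL test on m cores, Eq. (18)."""
    denom = 1.0 - 2.0 * u_hi_at_hi / (m + 1)
    # Assumption: if the denominator is not positive, only U_HI^HI bounds the min.
    scaled = u_hi_at_lo / denom if denom > 0 else float("inf")
    return u_lo_at_lo + min(u_hi_at_hi, scaled) <= (m + 1) / 2.0


# Illustrative utilization values (not taken from the experiments).
print(u_sys(0.3, 0.2, 0.6))
print(edf_vd_schedulable(0.3, 0.2, 0.6))
print(edf_vd_schedulable(0.5, 0.3, 0.8))
print(global_schedulable(0.3, 0.2, 0.6, m=4))
print(global_schedulable(2.0, 1.0, 2.0, m=4))
```

Each generated task set would be reduced to its three aggregate utilizations before these tests are applied, which is why the comparison with FTTS needs no per-task scheduling simulation for EDF-VD and GLOBAL.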
On single-core systems, FTTS faces two limitations compared to EDF-VD, i.e.,
the fixed preemption points and the time-triggered frames. EDF-VD is more flexible,
scheduling task jobs as they arrive and preempting them at any time. The results of
Fig. 13(a) show that as the utilization increases, EDF-VD can schedule from 0 up to 52.9%
(Usys = 0.85) more MC task sets than FTTS (on average, 17.9% higher
schedulability than FTTS). The impact of the FTTS limitations on schedulability becomes
even clearer if we repeat the experiment such that these limitations are avoided. This
happens when all tasks have the same period (Wi = 100), hence the FTTS cycle
consists of only 1 frame. The corresponding results in Fig. 13(b) now exhibit the reverse
trend, with FTTS being able to schedule up to 57.2% (Usys = 1.0) more task sets
than EDF-VD (on average, 10.5% higher schedulability than EDF-VD). In fact, if we
consider safety-critical applications with harmonic task periods, such as the FMS of
Sec. 7, the performance of FTTS is comparable to that of EDF-VD. This can be seen
in Fig. 13(c) and 13(d), where the task periods for the generated task sets are selected
uniformly from the sets {200, 400} (2 periods) and {200, 400, 800} (3 periods),
respectively. In the case of 2 harmonic periods, FTTS can schedule up to 16.2% (on average
2.2%) more task sets than EDF-VD. In the case of 3 periods, FTTS can schedule
up to 8.1% more task sets (Usys = 1), but on average across all utilization points,
it schedules 1.2% fewer task sets than EDF-VD. The comparable performance of the
two policies in terms of schedulability for equal or harmonic task periods is a
significant outcome, given that FTTS was designed targeting timing isolation rather than
efficiency.
On multicores, we expect GLOBAL to perform more efficiently than FTTS not
only because of the previously discussed advantages, but also because it enables task
migration. Namely, several jobs of the same task can be scheduled on different cores
and a preempted job can be resumed on a different core. The results of Fig. 14(a)
show that the effectiveness of GLOBAL in finding admissible schedules for the
generated task sets is up to 65% higher (Usys = 0.40) than for FTTS. Recall, however,
that the increased efficiency comes at the cost of ignoring the timing effects of shared
resources, which are not negligible, especially in the presence of task migrations. If the
limitations of FTTS are avoided as before, the results (Fig. 14(b)) are again reversed.
Then, FTTS schedules up to 82.3% (Usys = 0.65) more task sets than GLOBAL (on
average, 20.8% higher schedulability than GLOBAL). Fig. 14(c) and 14(d) show the
schedulability vs. utilization trends when the task periods are selected from the
harmonic sets {200, 400} and {200, 400, 800}, respectively. In the case of 2 harmonic
periods, FTTS can schedule up to 53% (on average 3.8%) more task sets than GLOBAL.
In the case of 3 periods, it can schedule up to 31% more task sets, but on average
across all utilization points, it schedules 3.6% fewer task sets than GLOBAL.
Therefore, schedulability is again comparable between the two policies when the task
periods are harmonic.
It follows that FTTS, despite its imposed limitations for achieving timing iso-
lation, e.g., the lack of dynamic preemption, the static partitioning of tasks among
cores, and the fixed-length frames, can actually compete with state-of-the-art schedul-
ing algorithms, which were designed with efficiency in mind. In other words, FTTS
is not only a policy that enables global timing isolation for certifiability, but also a
competent solution for efficient (processing, memory, communication) resource uti-
lization in mixed-criticality environments.
8.2 FTTS vs Resource-Aware Scheduling Policies
The category of resource-aware scheduling includes policies, such as [50, 18, 34, 22],
which were described in Sec. 2 in the context of mixed-criticality resource sharing.
Currently, a direct comparison of FTTS to the approaches of these works is not
applicable for the following reasons:
- The memory controllers suggested in [34, 22] are custom hardware solutions for
  mixed hard real-time and soft real-time systems. In our work, the proposed
  mixed-safety-criticality scheduling policy and response time analysis are developed for
  a specific cluster-based platform model, motivated by the Kalray MPPA®-256
  architecture. Namely, the considered system model is fundamentally different
  between [34, 22] and our work.
- The OS-supported memory throttling mechanisms proposed in [50, 18] do not
  consider the existence of a NoC on the platform and the interference on the shared
  memory by incoming traffic from the NoC, as commonly existing in cluster-based
  platforms. Unless these methods are extended to account for the NoC traffic, a
  comparison to FTTS in terms of schedulability (based on the analysis
  presented in this paper) is not meaningful.
To the best of our knowledge, there is no previous work combining mixed-
criticality scheduling with analysis of inter-core interference on both shared memory
and NoC resources. This is the reason why a quantitative comparison between FTTS
and existing policies of the resource-aware scheduling category cannot be provided.
9 Conclusion
This article extends the state-of-the-art for mixed-criticality systems by presenting
a unified analysis approach for computing, memory, and communication schedul-
ing. It targets modern cluster-based manycore architectures with two shared resource
classes: a shared multi-bank memory within each cluster, and a network-on-chip
(NoC) for inter-cluster communication and access to external memories. To model
such architectures and the communication flows through the NoC, we extend and
concretize the system model that was introduced in previous work [20], having the
Kalray MPPA®-256 architecture as reference. Additionally, we introduce a protocol
for inter-cluster communication with formally provable timing properties. For the
scheduling of mixed-criticality applications on cluster-based architectures, we pro-
pose a mixed-criticality scheduling policy (FTTS), which enforces global timing iso-
lation between applications of different criticalities in order to provide certifiability
properties. This is achieved by allowing only applications of equal criticality to be
executed in parallel and hence, interfere on the shared memory and communication
infrastructure. Response time analysis for this policy, which was introduced in [20,
21] for systems with shared memory, is substantially extended to (i) model interfer-
ence on the shared memory of a cluster by concurrently executing tasks in the cluster,
but also by incoming traffic from the NoC; (ii) bound safely and tightly the end-to-
end delays for data transfers through the NoC; (iii) model the incoming traffic from
the NoC in the form of arrival curves from the real-time calculus; (iv) integrate the
results of the extended memory interference analysis and the novel NoC analysis.
Moreover, design exploration methods are presented that target the optimization of
resource utilization within a cluster at the levels of computing (core utilization),
memory (exploitation of internal memory structure for data partitioning), and
communication (management of incoming traffic from a NoC). The applicability and efficiency
of the optimization approach are demonstrated for an industrial implementation of
a flight management system. Finally, the proposed scheduling policy is compared
quantitatively (in terms of schedulability) and qualitatively (when a direct comparison
was not applicable) to state-of-the-art policies for mixed-criticality systems. The
quantitative comparison shows that FTTS achieves comparable schedulability for
harmonic workloads and higher schedulability when all tasks share the same period.
As future work, we would like to step from the system-level design optimization
to the actual deployment of a mixed-criticality application, such as the flight management
system, on a commercial many-core platform. The Kalray MPPA®-256 is
a potential target platform since it is a concrete example of our abstract architecture
model and its runtime environment provides support for intra-cluster barrier synchro-
nization, explicit data partitioning among memory banks, and NoC flow regulation,
all key features for the low-overhead implementation of the FTTS scheduler and the
validity of the analysis methods.
Acknowledgements The authors would like to thank the anonymous reviewers for their valuable
feedback. This work has received funding from the European Union Seventh Framework Programme
(FP7/2007-2013) project CERTAINTY under grant agreement number 288175.
References
1. The dol-critical framework for mixed-criticality applications on multicores.
https://www.tik.ee.ethz.ch/~certainty/download.html.
2. European Commission’s 7th Framework Programme: Certification of Real-Time Applications Designed
for Mixed Criticality (CERTAINTY). www.certainty-project.eu.
3. RTCA/DO-178B, Software Considerations in Airborne Systems and Equipment Certification, 1992.
4. R. Alur and D. L. Dill. A theory of timed automata. Theoretical Computer Science, 126(2):183–235,
1994.
5. J. Anderson, S. Baruah, and B. Brandenburg. Multicore operating-system support for mixed criticality.
In Workshop on Mixed Criticality: Roadmap to Evolving UAV Certification, 2009.
6. ARINC. ARINC 653-1 avionics application software standard interface. Technical report, 2003.
7. S. Baruah, V. Bonifaci, G. D’Angelo, H. Li, A. Marchetti-Spaccamela, S. Van der Ster, and L. Stougie.
The preemptive uniprocessor scheduling of mixed-criticality implicit-deadline sporadic task systems.
In ECRTS, pages 145–154, 2012.
8. S. Baruah, B. Chattopadhyay, H. Li, and I. Shin. Mixed-criticality scheduling on multiprocessors.
Real-Time Systems, 50(1):142–177, 2014.
9. S. Baruah and G. Fohler. Certification-cognizant time-triggered scheduling of mixed-criticality sys-
tems. In RTSS, pages 3–12, 2011.
10. S. Baruah, H. Li, and L. Stougie. Towards the design of certifiable mixed-criticality systems. In RTAS,
pages 13–22, 2010.
11. A. Burns and R. Davis. Mixed criticality systems: A review.
http://www-users.cs.york.ac.uk/burns/review.pdf.
12. Certainty. D8.3 - validation results. Technical report, 2014.
13. C.-S. Chang. Performance Guarantees in Communication Networks. Springer, 2000.
14. R. L. Cruz. A calculus for network delay. I. Network elements in isolation. IEEE Transactions on
Information Theory, 37(1):114–131, 1991.
15. B. de Dinechin, R. Ayrignac, P.-E. Beaucamps, P. Couvert, B. Ganne, P. de Massas, F. Jacquet,
S. Jones, N. Chaisemartin, F. Riss, and T. Strudel. A clustered manycore processor architecture for
embedded and accelerated applications. In HPEC, pages 1–6, 2013.
16. B. de Dinechin, D. van Amstel, M. Poulhies, and G. Lager. Time-critical computing on a single-chip
massively parallel processor. In DATE, pages 1–6, 2014.
17. J. Diemer and R. Ernst. Back suction: Service guarantees for latency-sensitive on-chip networks. In
NOCS, pages 155–162, 2010.
18. J. Flodin, K. Lampka, and W. Yi. Dynamic budgeting for settling DRAM contention of co-running hard
and soft real-time tasks. In SIES, pages 151–159, 2014.
19. G. Giannopoulou, K. Lampka, N. Stoimenov, and L. Thiele. Timed model checking with abstractions:
Towards worst-case response time analysis in resource-sharing manycore systems. In EMSOFT, pages
63–72, 2012.
20. G. Giannopoulou, N. Stoimenov, P. Huang, and L. Thiele. Scheduling of mixed-criticality applications
on resource-sharing multicore systems. In EMSOFT, pages 17:1–17:15, 2013.
21. G. Giannopoulou, N. Stoimenov, P. Huang, and L. Thiele. Mapping mixed-criticality applications on
multi-core architectures. In DATE, pages 1–6, 2014.
22. S. Goossens, B. Akesson, and K. Goossens. Conservative open-page policy for mixed time-criticality
memory controllers. In DATE, pages 525–530, 2013.
23. S. Hahn, J. Reineke, and R. Wilhelm. Towards compositionality in execution time analysis-definition
and challenges. In Workshop on Compositional Theory and Technology for Real-Time Embedded
Systems, 2013.
24. H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. R. Rajkumar. Bounding memory
interference delay in COTS-based multi-core systems. In RTAS, pages 145–154, 2014.
25. Y. Kim, J. Lee, A. Shrivastava, and Y. Paek. Operation and data mapping for CGRAs with multi-bank
memory. In LCTES, pages 17–26, 2010.
26. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science,
220:671–680, 1983.
27. J.-Y. Le Boudec and P. Thiran. Network calculus: a theory of deterministic queuing systems for the
internet, volume 2050. Springer, 2001.
28. H. Li and S. Baruah. Global mixed-criticality scheduling on multiprocessors. In ECRTS, pages 166–
175, 2012.
29. L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu. A software memory partition approach for
eliminating bank-level interference in multicore systems. In PACT, pages 367–376, 2012.
30. Z. Lu, M. Millberg, A. Jantsch, A. Bruce, P. van der Wolf, and H. T. Flow regulation for on-chip
communication. In DATE, pages 578–581, 2009.
31. D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley, G. Haugou, F. Clermidy, and D. Dutoit.
Platform 2012, a many-core computing accelerator for embedded SoCs: Performance evaluation of
visual analytics applications. In DAC, pages 1137–1142, 2012.
32. W. Mi, X. Feng, J. Xue, and Y. Jia. Software-hardware cooperative DRAM bank partitioning for chip
multiprocessors. In Network and Parallel Computing, volume 6289 of LNCS, pages 329–343. 2010.
33. M. Mollison, J. Erickson, J. Anderson, S. Baruah, J. Scoredos, et al. Mixed-criticality real-time
scheduling for multicore systems. In ICCIT, pages 1864–1871, 2010.
34. M. Paolieri, E. Quiñones, F. J. Cazorla, G. Bernat, and M. Valero. Hardware support for WCET analysis
of hard real-time multicore systems. In ISCA, pages 57–68, 2009.
35. R. Pathan. Schedulability analysis of mixed-criticality systems on multiprocessors. In ECRTS, pages
309–320, 2012.
36. R. Pellizzoni, B. D. Bui, M. Caccamo, and L. Sha. Coscheduling of CPU and I/O transactions in COTS-based
embedded systems. In RTSS, pages 221–231, 2008.
37. Y. Qian, Z. Lu, and W. Dou. Analysis of communication delay bounds for network on chips.
38. Y. Qian, Z. Lu, and W. Dou. Analysis of worst-case delay bounds for on-chip packet-switching net-
works. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(5):802–
815, May 2010.
39. J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee. PRET DRAM controller: Bank privatization for
predictability and temporal isolation. In CODES+ISSS, pages 99–108, 2011.
40. D. Tamas-Selicean and P. Pop. Design optimization of mixed-criticality real-time applications on
cost-constrained partitioned architectures. In RTSS, pages 24–33, 2011.
41. L. Thiele, S. Chakraborty, and M. Naedele. Real-time calculus for scheduling hard real-time systems.
In ISCAS, pages 101–104, 2000.
42. L. Thiele and N. Stoimenov. Modular performance analysis of cyclic dataflow graphs. In EMSOFT,
pages 127–136, 2009.
43. S. Tobuschat, P. Axer, R. Ernst, and J. Diemer. IDAMC: A NoC for mixed criticality systems. In RTCSA,
pages 149–156, 2013.
44. S. Vestal. Preemptive scheduling of multi-criticality systems with varying degrees of execution time
assurance. In RTSS, pages 239–243, 2007.
45. E. Wandeler, A. Maxiaguine, and L. Thiele. Performance analysis of greedy shapers in real-time
systems. In DATE, pages 444–449, Munich, Germany, 2006.
46. E. Wandeler, L. Thiele, M. Verhoef, and P. Lieverse. System architecture evaluation using modular
performance analysis - a case study. International Journal on Software Tools for Technology Transfer,
8(6):649–667, 2006.
47. R. Wilhelm, D. Grund, J. Reineke, M. Schlickling, M. Pister, and C. Ferdinand. Memory hierarchies,
pipelines, and buses for future architectures in time-critical embedded systems. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 28(7):966–978, 2009.
48. Z. P. Wu, Y. Krish, and R. Pellizzoni. Worst case analysis of DRAM latency in multi-requestor systems.
In RTSS, pages 372–383, 2013.
49. H. Yun, R. Mancuso, Z.-P. Wu, and R. Pellizzoni. PALLOC: DRAM bank-aware memory allocator for
performance isolation on multicore platforms. In RTAS, pages 155–166, 2014.
50. H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha. Memory access control in multiprocessor for
real-time systems with mixed criticality. In ECRTS, pages 299–308, 2012.
51. J. Zhan, N. Stoimenov, J. Ouyang, L. Thiele, V. Narayanan, and Y. Xie. Designing energy-efficient
NoC for real-time embedded systems through slack optimization. In DAC, pages 1–6, 2013.
Appendix: Notation Summary
System Model - Sec. 3

| Notation | Meaning | Source |
|---|---|---|
| $\tau$ | Task set | Fixed |
| $L$ | Number of criticality levels | Fixed |
| $W_i$, $\chi_i$ | Period and criticality level of task $\tau_i$ | Fixed |
| $C_i(\ell)$ | Execution profile with lower and upper bounds on execution time ($e_i$) and number of memory accesses ($\mu_i$) of $\tau_i$ at level $\ell \le \chi_i$ | Fixed |
| $C_{i,deg}$ | Execution profile of $\tau_i$ at level of assurance $\ell > \chi_i$ | Fixed |
| $Dep$ | Dependency graph | The edges (dependencies) are known, but their weight (minimum distance constraint) is determined from NoC analysis (Sec. 5.2) |
| $P$ | Set of processing cores on a target cluster | Fixed |
| $T_{acc}$ | Memory access latency | Fixed |
| $M_\tau$ | Mapping of tasks of $\tau$ to cores of $P$ | Optimized in Sec. 6 |

Remote Fetch Protocol - Sec. 3.2

| Notation | Meaning | Source |
|---|---|---|
| $\tau_{init,i}$, $\tau_i$ | Initiating task for remote data fetch, task using fetched data | Fixed |
| $WCNT$ | Worst-case time for transfer of notification from $\tau_{init,i}$ to listener in remote cluster | Computed in Sec. 5.2 (Eq. 12) |
| $WCRST$ | Worst-case time for set-up of DMA transfer in remote cluster | Based on measurements |
| $WCDFT$ | Worst-case time for complete transfer of data from remote cluster | Computed in Sec. 5.2 (Eq. 12) |

Flexible Time-Triggered and Synchronization based Scheduling - Sec. 4

| Notation | Meaning | Source |
|---|---|---|
| $H$ | FTTS cycle | Computed as hyper-period of $\tau$ |
| $F$ | Set of FTTS frames | Computed after selecting frame lengths |
| $L_f$ | Length of FTTS frame $f$ | Selected manually |
| $barriers(f, \ell)_k$ | Worst-case length of $k$-th sub-frame in frame $f$ at level $\ell$ | Computed for a given $M_\tau$, $M_{mem}$ in Sec. 5.1 (Eq. 6) |

Response Time Analysis - Sec. 5

| Notation | Meaning | Source |
|---|---|---|
| $I$ | Memory interference graph | Consisting of $E_1$, $E_2$ |
| $E_1$ | Mapping of tasks to memory blocks | Fixed |
| $E_2$ | Mapping of memory blocks to memory banks | Optimized in Sec. 6 |
| $M_{mem}$ | Equivalent to $E_2$ | Optimized in Sec. 6 |
| $D$ | Mutual delay matrix | Computed in Sec. 5.1.1 and 5.1.2 |
| Rx | Special node in $I$ indicating a high-priority NoC Rx access | Added to $I$ in Sec. 5.1.2 |
| $\delta$ | Weight of edge between Rx and accessed memory block(s) | Computed in Sec. 5.2 (Eq. 13) |
| $WCRT_i(f, \ell)$ | Worst-case response time of $\tau_i$ in frame $f$ at level $\ell$ | Computed in Sec. 5.1.1 (Eq. 4) |
| $CWCRT_{p,k}(f, \ell)$ | Worst-case response time of tasks executing on core $p$ in the $k$-th sub-frame of frame $f$ at level $\ell$ | Computed in Sec. 5.1.1, updated in Sec. 5.1.2 |

Design Optimization - Sec. 6

| Notation | Meaning | Source |
|---|---|---|
| $\lVert barriers \rVert_3$ | 3rd norm of $barriers$ for all $f \in F$, $\ell \in \{1, \ldots, L\}$ | Computed for each candidate $M_\tau$ (Eq. 6) |
| $T_0$ | Initial temperature | Parameter of SA algorithm |
| $a$ | Temperature decreasing factor | Parameter of SA algorithm |
| $T_{final}$ | Final temperature | Parameter of SA algorithm |
| $time_{max}$ | Time budget | Parameter of SA algorithm |
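
Three of the quantities in the notation summary can be illustrated with a short Python sketch: the FTTS cycle as the hyper-period of the task periods, the 3rd norm of the barriers values, and the role of the SA temperature parameters. This is only a sketch under stated assumptions, not the authors' implementation: it assumes integer periods, a flat list holding all barriers values, and a geometric cooling rule in which the temperature is multiplied by the decreasing factor a after each step. All function names are hypothetical.

```python
from functools import reduce
from math import gcd


def hyper_period(periods):
    """FTTS cycle H as the least common multiple of the task periods W_i.

    Assumption: periods are positive integers in a common time unit.
    """
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


def barriers_3_norm(barriers):
    """3-norm of the worst-case sub-frame lengths.

    Assumption: barriers(f, l)_k for all frames, levels and sub-frames
    has been flattened into one list of non-negative numbers.
    """
    return sum(abs(b) ** 3 for b in barriers) ** (1.0 / 3.0)


def cooling_schedule(t0, a, t_final):
    """Yield T0, a*T0, a^2*T0, ... while the temperature exceeds T_final.

    Assumption: geometric cooling; the table only names a as the
    "temperature decreasing factor" without giving the update rule.
    """
    if not 0.0 < a < 1.0:
        raise ValueError("a must lie strictly between 0 and 1")
    if t_final <= 0.0:
        raise ValueError("t_final must be positive")
    t = t0
    while t > t_final:
        yield t
        t *= a
```

For example, `hyper_period([10, 20, 50])` evaluates to 100, and `list(cooling_schedule(100.0, 0.5, 10.0))` visits the temperatures 100, 50, 25 and 12.5. The time budget from the table is left out here; a full search loop would also stop once that budget is exhausted.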