The ForeC Synchronous Deterministic Parallel Programming Language
for Multicores
(Invited Paper)
Eugene Yip, Alain Girault, Partha S. Roop, and Morteza Biglari-Abhari
Software Technologies Research Group, University of Bamberg, 96045 Germany. Email: eugene.yip@uni-bamberg.de
Inria, France. Université Grenoble Alpes, Lab. LIG, Grenoble, France.
CNRS, Lab. LIG, F-38000 Grenoble, France. Email: alain.girault@inria.fr
Department of ECE, The University of Auckland, New Zealand. Email: {p.roop, m.abhari}@auckland.ac.nz
Abstract—Cyber-physical systems (CPSs) are embedded sys-
tems that are tightly integrated with their physical environ-
ment. The correctness of a CPS depends on the output of
its computations and on the timeliness of completing the
computations. This paper proposes the ForeC language for
the deterministic parallel programming of CPS applications
on multi-core execution platforms. ForeC’s synchronous se-
mantics is designed to greatly simplify the understanding and
debugging of parallel programs. ForeC allows programmers
to express many forms of parallel patterns while ensuring
that programs are amenable to static timing analysis. One
of ForeC’s main innovations is its shared variable seman-
tics that provides thread isolation and deterministic thread
communication. Through benchmarking, we demonstrate that
ForeC can achieve better parallel performance than Esterel,
a widely used synchronous language for concurrent safety-
critical systems, and OpenMP, a popular desktop solution for
parallel programming. We demonstrate that the worst-case
execution time of ForeC programs can be estimated precisely.
1. Introduction
Safety-critical embedded systems must be dependable
and functionally safe [1] and certified against safety stan-
dards, such as DO-178B [2]. Certification is a costly and
time consuming exercise and is exacerbated by the use of
multi-core processors. The correctness of a safety-critical
embedded system depends on the output of its compu-
tations and on the timeliness of completing the computa-
tions [3] in response to inputs. Thus, a key to building
successful embedded systems using multi-cores is the un-
derstanding of the timing behaviors of the computations [4].
This is typically achieved with static worst-case execution
time (WCET) analysis [5] and is a complex process because
it depends on the underlying execution platform.
C [6] is a popular language for programming embedded
systems with support for multi-threading and parallelism
provided by third-party libraries, such as Pthreads [7] and
OpenMP [8]. Unfortunately, these multi-threading solutions
are inherently non-deterministic [9] and programmers fall
into common parallel programming traps and pitfalls [10].
Figure 1. Synchronous model of computation.
To help alleviate this issue, runtime environments that en-
force deterministic thread scheduling and memory accesses
can be used. CoreDet [11] maps all thread interactions
onto a logical timeline. The program’s execution alternates
between parallel and serial phases. However, understanding
the program’s behavior at compile time remains difficult
because the determinism is enforced by the runtime. Threads
in CoreDet maintain their own snapshot of the shared mem-
ory state, which is resynchronized in every serial phase.
This concept is used and formally defined in Concurrent
revisions [12]. An alternative is to extend the C language
with deterministic parallelism, such as SharC [13], but its
time predictability is unknown.
Synchronous languages [14], such as Esterel [15], pro-
vide an alternative by offering deterministic concurrency
that is based on sound mathematical semantics, which facil-
itates system verification by formal methods [16]. Figure 1
depicts a synchronous program, defined as a set of concur-
rent threads, within its physical environment. A synchronous
program reacts to inputs by producing corresponding out-
puts. Each reaction is triggered by a hypothetical (logical)
global clock. Central to synchronous languages is the syn-
chrony hypothesis [14], which states that the execution of
each reaction is considered to be atomic and instantaneous.
Concurrent threads communicate instantaneously with each
other (dashed arrows in Figure 1) using signals. Once the
embedded system is implemented, the synchrony hypothesis
is validated by ensuring that the WCET of any global tick
does not exceed the minimal inter-arrival time of the inputs.
This is known as worst-case reaction time analysis and
techniques have been developed for multi-cores [17], [18].
C-based synchronous languages, such as PRET-C [19] and
SyncCharts in C [20], appeal to C programmers because the
learning barrier for synchronous languages is reduced. How-
ever, their inherent sequential execution semantics render
them unsuitable for multi-core execution.
Synchronous programs are difficult to par-
allelize [21] due to the need to resolve instantaneous thread
communication and associated causality issues. At runtime,
all potential signal emitters must be executed before all
testers of a signal. If this is not possible, then a causality
issue arises. Hence, concurrency is typically compiled away
to produce only sequential code [22] that is then paral-
lelized [21], [23]. The Synchronized Distributed Executive
(SynDEx) [24] approach considers communication costs
when distributing code to each processing element. Yuan
et al. [25] offer static and dynamic scheduling approaches
for Esterel on multi-cores.
Contributions. The synchronous languages designed
for the single-core era must be redesigned to address the
multi-core challenges. Section 2 describes the multi-core
architecture considered by this paper. We propose a C-based
synchronous parallel programming language, called ForeC,
for simplifying the deterministic parallel programming of
embedded multi-cores. ForeC brings together the formal
deterministic semantics of synchronous languages and the
benefits of C’s control and data structures (Section 3). A
key innovation is ForeC’s shared variable semantics that
provides thread isolation and deterministic thread commu-
nication. Moreover, many forms of parallel programming
patterns can be expressed in ForeC, such as the software
pipeline design pattern in Section 4. ForeC can be com-
piled for direct execution on embedded multi-cores or for
execution on desktop multi-cores using an operating system
(Section 5). Through benchmarking in Section 6, we demon-
strate that ForeC can achieve better parallel performance
than Esterel and OpenMP, while being amenable to static
timing analysis. Section 7 concludes the paper.
2. Multi-Core Architecture
The architecture of the predictable multi-core used in
this paper is representative of existing predictable de-
signs [26]. It is a homogeneous multi-core processor that
we have designed using Xilinx MicroBlaze [27] cores. Each
MicroBlaze core has a three-stage in-order pipeline that is
free of timing anomalies and is connected to private data
and instruction scratchpads. The scratchpads are statically
allocated and loaded at compile time. A shared bus with
TDMA arbitration connects the cores to shared resources,
such as global memory and peripherals. We developed a
multi-core MicroBlaze simulator for benchmarking purposes
by extending an existing MicroBlaze simulator [28] to sup-
port cycle-accurate simulation, an arbitrary number of cores,
and a shared bus with TDMA arbitration. We note that
the focus of this paper is not on architectural innovations for
time predictability, but rather on language-theoretic innovations
(Section 3).
3. The ForeC Language
ForeC inherits the benefits of synchrony, such as deter-
minism and reactivity, along with the benefits and power of
the C language, such as control and data structures. This
is unlike conventional synchronous languages, which treat
C as an external host language. A key goal of ForeC is
to provide deterministic shared variable semantics that is
agnostic to scheduling. This is essential for reasoning about
and debugging parallel programs.
3.1. Overview of ForeC
ForeC extends a safety-critical subset of C [29] with a
minimal set of synchronous constructs. Although C [6] is
popular for programming safety-critical embedded systems,
it has unspecified and undefined behaviors. Safety-critical
programmers follow strict coding guidelines [30] to help
write deterministic programs that are understandable, main-
tainable, and easier to debug. We describe the statements,
specifiers, and qualifiers allowed in the C subset:
C statements (cst): Expressions can only be constants,
variables, pointers, and array elements that are composed
with the logical, bitwise, relational, and arithmetic operators
of C. Although the use of pointers and arrays is allowed,
pointer aliasing makes static analysis difficult [31]. Thus, we
assume that pointers are only assigned once. All C control
statements, except goto, can be used.
C type specifiers and qualifiers: All the C primitives
and qualifiers can be used. Custom data types can be defined
using struct, union, and enum.
C storage class specifiers: The C typedef, extern,
static, auto, and register specifiers can be used.
Figure 2 gives the extended syntax of ForeC and Table 1
summarizes the informal semantics. A statement (st) in
ForeC can be a traditional C statement (c_st), or a barrier
(pause), fork/join (par), or preemption (abort) state-
ment. Using the sequence operator ( ; ), a statement in ForeC
can be an arbitrary composition of other statements. Like
C, extra properties can be specified for variables using type
qualifiers. A type qualifier (tq) in ForeC is a traditional C
type qualifier (c_tq), an environment interface (input and
output), or a shared variable amongst threads (shared).
3.1.1. I/O, Threads, and Pausing. Like traditional C pro-
grams, the function main is the program’s entry point
and serves as the initial thread of execution. To recap, the
threads of a synchronous program execute in lock-step to
the ticking of a global clock. During each global tick, the
threads sample the environment, perform their computations,
Statements:
  st ::= c_st | pause | par(st, st)
       | weak? abort st when immediate? (exp) | st ; st
Type Qualifiers:
  tq ::= c_tq | input | output | shared
Figure 2. Syntactic extensions to C.
TABLE 1. ForeC constructs and their semantics.
input: Type qualifier to declare an input, the value of which is updated
by the environment at the start of every global tick.
output: Type qualifier to declare an output, the value of which is emitted
to the environment at the end of every global tick.
shared: Type qualifier to declare a shared variable, which can be accessed
by multiple threads.
pause: Pauses the executing thread until the next global tick.
par(st,st): Forks two statements st as parallel threads. The par terminates
when both threads terminate (join back).
weak? abort st when immediate? (exp): Preempts its body st when
the expression exp evaluates to a non-zero value. The optional weak and
immediate keywords modify its temporal behavior.
and emit their results to the environment. When a thread
completes its computation, we say that it completes its local
tick. When all threads complete their local ticks, we say that
the program completes its global tick. The program below
declares two input and two output variables, and a main
thread that forks the execution of two threads that each have
three sequential statements:
input int X, Y; output int A = 0, B = 0;
void main(void) { par(t1(), t2()); }
void t1(void) {
  int a = 1 + X; pause; A = a * X;
}
void t2(void) {
  int b = 1 + Y; pause; B = b * Y;
}
Inputs are read-only and their values are updated by the
environment at the start of each global tick. Outputs emit
their values to the environment at the end of each global
tick. The program starts its first global tick from the main
thread. The main thread executes the par statement that
forks its arguments (the functions t1 and t2) into two
parallel child threads. The par is a blocking statement and
terminates only when both its child threads terminate, i.e.,
join together. The child threads t1 and t2 initialize their
local variables by incrementing the input values by 1. Let
the values of the inputs be X=1 and Y=2 during the first
global tick. Hence, thread t1 assigns a=2 and thread t2
assigns b=3. Both child threads execute a pause statement,
which pauses their execution and acts as a synchronization
barrier. We say that both threads have completed their local
ticks. Next, the program completes its first global tick and
the outputs A=0 and B=0 are emitted.
The program starts its second global tick by resuming
the child threads t1 and t2 from their respective pause
statements. That is, both threads begin their next local ticks.
Let the values of the inputs be X=3 and Y=4 during the
second global tick. Thread t1 assigns A=6 and thread t2
assigns B=12 to the output variables. Both child threads
terminate, causing the par statement in the main thread
to terminate. The main thread resumes its execution by
reaching the end of its body and terminating. The second
global tick ends and the outputs A=6 and B=12 are emitted.
3.1.2. Shared Variables. All variables in ForeC follow the
scoping rules of C. By default, all variables are private and
can only be accessed (read or written) by one thread through-
out its scope. To allow a variable to be accessed by multiple
threads, it must be declared as a shared variable by using the
shared type qualifier. Thus, any misuse of private variables
is easy to detect at compile time. The semantics for shared
variables permit them to be accessed deterministically in
parallel, without needing the programmer to explicitly use
mutual exclusion. We modify our program by making the
child threads t1 and t2 share the variable x:
input int X, Y; output int A = 0;
void main(void) {
  shared int x = 1 combine all with plus;
  par(t1(&x), t2(&x));
  A = x;
}
void t1(shared int *x) {
  *x = 1 + X; pause; *x = *x * X;
}
void t2(shared int *x) {
  *x = 1 + Y; pause; *x = *x * Y;
}
int plus(int th1, int th2) {
  return (th1 + th2);
}
The main thread now declares a shared variable called x.
C’s call by reference is used to pass x to the child threads.
When the child threads start their local tick, they each create
a local copy of x. When the child threads need to access
x, they access their copy of x instead. Hence, their copies
of x remain distinct from the shared variable x declared in
the main thread. The changes made by one thread cannot
be observed by any other, yielding mutual exclusion and
thread isolation. Thread isolation minimizes the need to
serialize parallel accesses to shared variables, thereby max-
imizing runtime parallelism. This is key to enhancing exe-
cution performance and is unlike conventional synchronous-
reactive languages [21]. Moreover, only sequential reasoning
is needed within the thread’s local tick. Next, let the values
of the inputs be X=1 and Y=2 during the first global tick.
Threads t1 and t2 assign 2 and 3, respectively, to their
local copies of x before pausing. The first global tick ends
and the local copies of x are automatically combined into
a single value by a programmer-specified combine function.
The combine function for the shared variable x is plus,
specified in the combine clause of its declaration. Thus, the
combined value of both copies is plus(2,3)=5 and is as-
signed to the shared variable x. We call the combined value
that is assigned to the shared variable the resynchronized
value, and the process of updating the shared variable
resynchronization. Resynchronizing shared variables at the
end of each global tick ensures deterministic outputs at the
Expressions:
  exp ::= val | var | ptr[exp] | (exp) | u_op exp | exp b_op exp
Unary Operators:
  u_op ::= * | & | ! | - | ~
Binary Operators:
  b_op ::= || | && | ^ | "|" | & | << | >> | == | != | < | > | <=
         | >= | + | - | * | / | %
Figure 3. Syntax of preemption conditions.
end of each global tick. Finally, the first global tick ends
and the output A=0 is emitted.
When the program starts its second global tick, the child
threads start their next local ticks by creating a local copy of
x. Their copies are initialized with x’s resynchronized value
of 5. Let the values of the inputs be X=3 and Y=4 during the
second global tick. Threads t1 and t2 assign 15 and 20,
respectively, to their local copies of x and terminate. When
all the child threads of a par terminate, their local copies
are automatically combined and assigned to their parent
thread. In this case, the combined value of both copies is
plus(15,20)=35. The main thread resumes and assigns
the combined value of x to the output A and then terminates.
The second global tick ends and the output A=35 is emitted.
3.1.3. Combine Functions and Policies. The signature
of any combine function is c : Val × Val → Val.
The two input parameters are the two copies to be com-
bined. A combine function is invoked multiple times
when more than two copies need to be combined, e.g.,
c(v1, c(v2, ··· c(vn−1, vn) ··· )).
deterministic, associative, and commutative. That is, they
produce the same outputs from the same inputs, regardless
of previous invocations and how the copies are ordered or
grouped. ForeC’s combine functions are inspired by Es-
terel [15] but similar solutions can be found in other parallel
programming frameworks, e.g., OpenMP’s reduction op-
erators [8]. Solutions developed for these frameworks could
be reworked into ForeC combine functions and policies.
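For example, a combine function that keeps the larger of two copies satisfies the determinism, associativity, and commutativity requirements above (this is an illustrative function, not one used in the paper's benchmarks):

  int max_int(int copy1, int copy2) {
    /* Deterministic, associative, and commutative. */
    return (copy1 > copy2) ? copy1 : copy2;
  }

By contrast, a function that subtracts one copy from the other would be rejected because subtraction is neither associative nor commutative.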
It may be useful to ignore some of the copies when
resynchronizing a shared variable. This is achieved by spec-
ifying a combine policy that determines what copies will
be ignored. The combine policies are new, mod, and all
and they are used in the combine clause during variable
declaration, e.g., combine new with. The new policy
ignores the copies that have the same value as their shared
variable’s resynchronized value. The mod policy ignores the
copies that were not assigned a value during the global tick.
The default policy is all, where no copies are ignored. Note
that the combine function is not invoked when only one copy
remains. Instead, that copy becomes the resynchronized value.
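As an illustrative sketch (the variable names are hypothetical, and plus is the combine function defined earlier), the three policies are selected in the combine clause of the declaration:

  shared int a = 0 combine all with plus;   /* all copies are combined (the default policy)              */
  shared int b = 0 combine new with plus;   /* ignore copies equal to the previous resynchronized value  */
  shared int c = 0 combine mod with plus;   /* ignore copies not assigned during the global tick         */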
3.1.4. Preemption. Inspired by Esterel [15], the abort st
when (exp) statement provides preemption, which is the
termination of the abort body st when the condition exp
evaluates to true. Preemption can be used to model hier-
archical state machines succinctly. The condition exp must
be a side-effect free expression produced from the syntax
shown in Figure 3. The program below is an example of a
non-immediate and strong abort:
void main(void) {
  int x = 1;
  abort {
    x = 2; pause; x = 3; pause; x = 4;
  } when (x > 2);
}
After initializing variable x to 1, execution reaches the
abort. x=2 is executed and then the first global tick ends.
At the start of the second global tick (and at each subsequent
global tick), the preemption condition is evaluated before the
abort body can execute. This allows shared variables in the
condition to be evaluated with their resynchronized value.
If the preemption condition evaluates to true (any non-zero
value following the C convention), then the abort termi-
nates without executing its body. For this abort example,
the preemption condition is true at the start of the third
global tick, as x=3. An abort will also terminate when
execution reaches the end of its body.
Like Esterel, the optional weak and immediate key-
words change the temporal behavior of the preemptions. The
weak keyword delays the termination of the abort body
until the body cannot execute any further, e.g., reaches a
pause statement. The following is an example of a non-
immediate and weak abort:
void main(void) {
  int x = 1;
  weak abort {
    x = 2; pause; x = 3; pause; x = 4;
    pause;
  } when (x > 2);
}
Here, although the preemption condition is true at the start
of the third global tick, the termination of the abort is
delayed until the third pause is reached.
The immediate keyword allows the abort to ter-
minate as soon as execution reaches it for the
first time. The following is an example of an immediate and
strong abort:
void main(void) {
  int x = 3;
  abort {
    x = 2; pause; x = 3; pause; x = 4;
  } when immediate (x > 2);
}
Here, the initial value of x is 3, meaning that the preemption
condition is true when execution first reaches the abort.
Hence, the abort body is not executed at all.
Lastly, both the weak and immediate keywords can
be used together to define an immediate and weak abort:
void main(void) {
  int x = 3;
  weak abort {
    x = 2; pause; x = 3; pause; x = 4;
  } when immediate (x > 2);
}
Here, the preemption condition is true when execution first
reaches the abort. However, the termination of the abort
is delayed until the first pause is reached.
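To illustrate the earlier claim that preemption can model hierarchical state machines, the following sketch (with hypothetical inputs STOP and NEXT, not taken from the paper) nests two aborts: the inner aborts switch between two sub-modes, while the outer abort terminates the whole machine:

  input int STOP, NEXT;
  void main(void) {
    abort {                          /* outer mode: runs until STOP is non-zero */
      while (1) {
        abort {                      /* sub-mode A: runs until NEXT is non-zero */
          while (1) { /* behavior of sub-mode A */ pause; }
        } when (NEXT);
        abort {                      /* sub-mode B: runs until NEXT is non-zero */
          while (1) { /* behavior of sub-mode B */ pause; }
        } when (NEXT);
      }
    } when (STOP);
  }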
3.2. Comparison with Esterel and Concurrent
Revisions
This section compares ForeC with Esterel [15] and
Concurrent revisions [12]. Concurrent revisions is a pro-
gramming model that supports the forking and joining of
asynchronous threads. When a thread is forked, it creates
a snapshot of the shared variables. Changes performed by
the thread are applied to its snapshot, thus, ensuring thread
isolation. The snapshots are merged together using a deter-
ministic merge function when the threads join. The merge
function always considers all the copies, i.e., equivalent to
ForeC’s combine policy all. Thread communication is,
therefore, always delayed until the child threads join. In con-
trast, ForeC threads may execute over several global ticks
and thread communication is only delayed to the end of each
global tick. Esterel threads communicate instantaneously
by emitting and receiving signals during each global tick.
Signal emissions may have associated values that must be
combined using a programmer-specified combine function
before the signal can be read. Esterel’s combine function
only considers the emitted values, i.e., equivalent to ForeC’s
combine policy mod.
In Concurrent revisions, the parent thread can execute
alongside its children and, e.g., fork more threads in re-
sponse to higher input workloads. This is not the case with
ForeC and Esterel because the parent thread blocks until
all of its children have joined. The commutativity and associa-
tivity of ForeC, Esterel, and Concurrent revisions’ parallel
construct depends on the commutativity and associativity of
their combine and merge functions, respectively.
Preemptions in Esterel are triggered instantaneously by
instantaneous signal communication. However, preemptions
in ForeC are triggered after a delay of one global tick
because preemption conditions are evaluated using values
computed in the previous global tick. Concurrent revisions
does not support preemptions. Esterel programs may be
non-causal [14] because of instantaneous feedback cycles.
Thanks to delayed communication, programs in ForeC and
Concurrent revisions are always causal by construction.
4. Software Pipeline Design Pattern in ForeC
Design patterns [32] are reusable templates that pro-
grammers can use to solve recurring problems of any size
and achieve high execution performance. For example, the
software pipeline pattern [32] solves the problem of needing
to process an audio or video data stream in stages. An
instance of this pattern in ForeC is given in Figure 4. The
data processing is broken down into pipeline stages that
work in parallel on different chunks of the data stream. The
pattern forks a thread for each pipeline stage. Each stage gets
a chunk of data, processes the data, and passes the result to
input int in; output int out;
shared int s1 = 0, s2 = 0;
void main(void) {
  par(stage1(), par(stage2(), stage3()));
}
void stage1(void) {
  while (1) { s1 = process1(in); pause; }
}
void stage2(void) { pause;
  while (1) { s2 = process2(s1); pause; }
}
void stage3(void) { pause; pause;
  while (1) { out = process3(s2); pause; }
}
Figure 4. Software pipeline.
the next stage. This is repeated until the data stream ends.
Data is passed from each stage using shared variables.
The explicit use of buffers is not needed because threads
always work on local copies of the shared variables. The
pipeline is synchronous because the stages pause before
processing their next chunk of data. Hence, the throughput
is determined by the slowest pipeline stage. To initialize
the pipeline correctly, each stage must wait for its initial
chunk of data. The waiting is achieved by placing, at the
start of the threads, a number of pause statements equal
to the number of preceding stages. The stages execute in
parallel after all have waited for their initial chunk of data.
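Another common pattern, fork-join data parallelism, also maps naturally onto ForeC's constructs. The sketch below (with hypothetical functions work and process splitting, not taken from the paper) forks two workers every tick; each worker writes its local copy of the shared accumulator, and the copies are combined by the combine function when the workers join:

  input int in; output int out;
  shared int acc = 0 combine mod with plus;

  void main(void) {
    while (1) {
      par(worker(0), worker(1));   /* fork two workers on halves of the work */
      out = acc;                   /* the parent reads the combined result   */
      pause;
    }
  }

  void worker(int half) {
    acc = work(in, half);          /* each worker assigns its own copy of acc */
  }

  int plus(int th1, int th2) { return th1 + th2; }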
5. Compiling ForeC for Parallel Execution
The ForeC compiler generates statically scheduled code
for direct execution on a predictable parallel architecture
described in Section 2. The aim is to generate performant
code that is amenable to static timing analysis. The compiler
translates the ForeC statements into equivalent C code and
generates thread scheduling routines for each core. The
programmer statically allocates the threads to the cores and
passes the allocations into the compiler. Based on the allo-
cations, the compiler defines a total order for all the threads,
using a depth-first traversal of the program’s control-flow
graph. A static and non-preemptive (cooperative) schedule is
created for each core, encoded as a doubly linked list, similar
to the approach of the Columbia Esterel Compiler [22]. Each
thread is represented as a node in the linked list of its allo-
cated core. The node stores the thread’s continuation point
(a program counter) and links to the threads (nodes) that
are scheduled before and after it. The initial continuation
point of a thread is the start of its body. Each core begins
its execution by jumping to the continuation point of the
first thread in its linked list. The thread executes without
interruption until it reaches a context-switching point: a
par or pause statement, or the end of its body. At this
point, the thread stores its next continuation point into its
node and jumps to the continuation point in the next node.
Thus, inserting or removing a thread or routine from the
list controls whether it is included or excluded, respectively,
from execution.
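To make the scheme concrete, the following is a minimal sketch of the per-core schedule (not the compiler's actual output; all names are illustrative). It uses function pointers as continuation points for readability, whereas the generated code jumps directly between program counters:

  #include <stddef.h>

  /* One node per thread allocated to this core (illustrative names). */
  typedef struct Node {
    void (*continuation)(struct Node *self); /* thread's next continuation point */
    struct Node *prev;                       /* thread scheduled before this one */
    struct Node *next;                       /* thread scheduled after this one  */
  } Node;

  /* A core runs each thread in its list until the thread reaches a
   * context-switching point (par, pause, or end of body) and returns,
   * having stored its next continuation point in its node. */
  static void run_core_schedule(Node *head) {
    for (Node *n = head; n != NULL; n = n->next) {
      n->continuation(n);
    }
  }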
A problem with encoding the static schedule as a linked
list is that the global tick in which threads fork and join can
only be determined at runtime. This means that threads need
to be inserted or removed from the linked lists whenever
they are forked or whenever they terminate. This is managed
by inserting synchronization routines into the linked lists that
send and receive scheduling information between cores at
runtime. For example, when a parent thread on one core
reaches a par, the other cores receive this information
via their synchronization routines and will insert the newly
forked threads into their linked lists. The notion of a global
tick is preserved by ending each linked list with a global tick
synchronization routine that implements barrier synchro-
nization. One core is nominated to perform the following
housekeeping tasks: resynchronizing the shared variables,
emitting the outputs, and sampling the inputs.
Shared variables are hoisted up to the program’s global
scope to allow all cores to access them. The copies of shared
variables are implemented as unique global variables. In
each thread, all shared variable accesses are replaced by
accesses to their copies.
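As a minimal sketch of the generated code (with illustrative names, assuming the two-thread example from Section 3.1.2), the hoisted shared variable, its per-thread copies, and the end-of-tick resynchronization performed by the nominated core might look like this:

  /* Hoisted to global scope so that every core can access them. */
  int x;      /* resynchronized value of the shared variable x              */
  int x_t1;   /* thread t1's copy; t1's accesses to x are rewritten to x_t1 */
  int x_t2;   /* thread t2's copy; t2's accesses to x are rewritten to x_t2 */

  int plus(int th1, int th2) { return th1 + th2; }

  /* Part of the housekeeping performed by the nominated core at the end of
   * each global tick (combine policy "all": every copy is combined). */
  void resynchronize_x(void) {
    x = plus(x_t1, x_t2);
    x_t1 = x;   /* copies start the next local tick from the resynchronized value */
    x_t2 = x;
  }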
6. Benchmarks
This section begins by evaluating the static worst-case
reaction time (WCRT) analysis of ForeC programs. We have
previously shown [18] that the WCRT of ForeC programs
could be estimated to a high degree of precision, which is
very useful for implementing real-time embedded systems.
Further, we provide a performance comparison between
ForeC and Esterel. Although ForeC has been designed
with embedded multi-cores in mind, we believe that the
ForeC language can be applied to the deterministic parallel
programming of desktop multi-cores. Performant parallel
programs can be achieved by using parallel design patterns
within a synchronous language, like ForeC. We evaluate
this hypothesis with a performance comparison to OpenMP,
a popular desktop solution for parallel programming that
primarily exploits loop data parallelism.
6.1. Time Predictability
ForeC is amenable to static timing analysis because we
implement all language features using static scheduling.
Additionally, bounded loops are used to ensure bounded
execution times, synchronous preemption is used instead of
(asynchronous) hardware interrupts, and threads execute in
isolation thanks to the shared variable semantics. Our C++
ForeCast tool [18] statically analyzes the WCRT of ForeC
programs on embedded multi-cores. This section highlights
our key findings [18] for the MicroBlaze multi-core simula-
tor described in Section 2, with the configuration shown in
Figure 5. The 802.11a [33] benchmark is production code
from Nokia that tests various signal processing algorithms
needed to decode 802.11a data transmissions. 802.11a has
complex data- and control-dominated computations, com-
prising 2147 lines of ForeC code that forks 26 threads,
of which up to 10 can execute in parallel. 802.11a was
Xilinx MicroBlaze, 3-stage pipeline, no branch prediction, no caches,
no out-of-order execution, 8 KB private data and instruction scratchpads
on each core (1 cycle access time), 32 KB global memory (5 cycle
access time), TDMA shared bus (5 cycle time slots per core and, thus,
a 5 × (number of cores) cycle long bus schedule), programs compiled with
MB-GCC-4.1.2 -O0 and decompiled with MB-OBJDUMP-4.1.2.
Figure 5. MicroBlaze multi-core configuration.
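To illustrate the TDMA timing in Figure 5: with four cores and 5-cycle slots, the bus schedule is 5 × 4 = 20 cycles long, so in the worst case a core waits 5 × 3 = 15 cycles for its slot before a 5-cycle global memory access can begin, assuming the access completes within one slot.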
Figure 6. WCRT results for 802.11a in clock cycles: observed and computed WCRTs (in thousands of cycles) for 1 to 10 cores.
distributed on up to 10 cores and ForeCast computed the
WCRT of each distribution. To evaluate the precision of
the computed WCRTs, 802.11a was executed by the multi-
core simulator for one million reactions, or until the program
terminated. Test vectors were generated to elicit the worst-
case execution path by studying the program’s control-flow.
The simulator reported the execution time of each global
tick and the longest was taken as the observed WCRT.
The observed and computed WCRTs of 802.11a are plot-
ted as a line graph in Figure 6. This graph shows that Fore-
Cast is very precise, even as the number of cores increased.
ForeCast computed WCRTs that were only 3.2% longer
than the observed WCRTs. Similar levels of precision have
been demonstrated with other benchmark programs [18].
The computed WCRTs of 802.11a in Figure 6 also reflect
the benefit of multi-core execution. The computed WCRT
decreased when the number of cores increased from one to
five. The computed WCRT at five cores corresponded to
the execution time of one thread that was already allocated
to its own core. Thus, the WCRT could not be improved
by distributing the remaining threads. The WCRT increased
after five cores because of increased scheduling overheads
and global memory access times. Ju et al. [17] offer the
only other known static WCRT analysis approach for syn-
chronous programs on multi-cores. Unfortunately, we cannot
compare with that work because their results are only for a
four core system with no precision results reported.
6.2. Comparison with Esterel
The dynamic scheduling approach by Yuan et al. [25]
for Esterel programs has been shown to perform well on a
TABLE 2. Average WCRT speedup results for ForeC and Esterel on four cores, normalized to single-threaded C.
Version Life Lzss Mandelbrot MatrixMultiply
Esterel 2.28 2.42 1.20 3.87
ForeC 3.25 3.30 3.68 3.76
MicroBlaze multi-core platform similar to ours. Yuan’s ap-
proach requires a special hardware FIFO queue to help allo-
cate threads to cores for resolving signal statuses. For bench-
marking, the MicroBlaze multi-core simulator described in
Section 2 was extended with a hardware queue to support the
dynamic scheduling. Because a WCRT analysis technique
for Yuan et al.’s approach does not exist, we compare instead
the average WCRT speedup achieved by ForeC and Esterel
for the following programs: Life simulates Conway’s Game
of Life for a fixed number of iterations and a given grid of
cells. Lzss uses the Lempel-Ziv-Storer-Szymanski algorithm
to compress a fixed amount of text. Mandelbrot computes
the Mandelbrot set for a square region of the complex
number plane. MatrixMultiply computes the matrix mul-
tiplication of two equally sized square matrices. Single-
threaded C, ForeC, and Esterel versions of each benchmark
program were created and handcrafted for best performance.
The same input vector was given to each version to ensure
that the same computations were performed. The MicroB-
laze simulator returns the execution time when a program
terminates.
Table 2 shows the speedups achieved by ForeC and
Esterel when the benchmark programs were executed on
four cores. Speedup is calculated as:
Speedup(P) = (Execution time of single-threaded C) / (Execution time of P)    (1)
where P is the ForeC or Esterel version of the benchmark
program being tested. Except for MatrixMultiply, ForeC
shows better performance than Esterel, even though Es-
terel uses dynamic scheduling with hardware acceleration.
For Esterel, all possible signal emitters must execute before
any signal consumers and this invariant is achieved using a
signal locking protocol [25] that is costly. In comparison,
shared variables in ForeC only need to be resolved at the
end of each global tick. The significance of the overhead is
evident in the Mandelbrot results, where the Esterel version
has 24 unique signals and only achieved a speedup of 1.2.
Because of minimal data dependencies in MatrixMultiply,
the scheduling overheads of the ForeC and Esterel versions
were minimal, resulting in similar speedups.
6.3. Comparison with OpenMP
We extended the ForeC compiler to target desktop multi-
cores running an operating system. For each core, the com-
piler creates a Pthread [7] to run its static thread schedule.
Hence, a fixed pool of Pthreads executes the ForeC threads,
so the cost of creating each Pthread is only incurred once.
Although the Pthreads are dynamically scheduled by the
operating system, the original ForeC threads follow their
TABLE 3. Average speedup results for ForeC and OpenMP on four cores, normalized to single-threaded C.
Version FmRadio Life Lzss
OpenMP 2.02 2.82 3.46
ForeC 2.63 2.98 3.90
static schedule. Because execution is inherently speculative
on desktop multi-cores, we present benchmarking results
for the average execution time speedup for the following
programs: FmRadio [33] transforms a fixed stream of radio
signals into audio. The history of the radio signals is used
to guide the transformation of the remaining signals. Life
and Lzss are desktop versions of those used in Section 6.2.
Single-threaded C, ForeC, and OpenMP versions of each
benchmark program were created and handcrafted for best
performance. For the ForeC versions, FmRadio used the
software pipeline and fork-join patterns and Lzss used the
fork-join pattern. The Intel VTune Amplifier XE [34] soft-
ware helped identify areas of code that could be parallelized
for the OpenMP versions of the benchmarks. Benchmarking
was carried out on a 3.4 GHz four-core Intel Core i5-3570
desktop running Linux 3.6 with 8 GB of RAM. Hyper-
Threading, Turbo Boost, and SpeedStep were disabled. The
benchmarks were compiled with GCC-4.8 -O2.
Table 3 shows the average speedups achieved by ForeC
and OpenMP. Speedups were calculated using equation (1).
Although ForeC and OpenMP both achieved speedups of be-
tween two and four, ForeC demonstrates greater speedup
than OpenMP in these preliminary results. Dynamic and
static thread scheduling pragmas were used in the OpenMP
versions and dynamic scheduling does introduce slight over-
heads, especially thread locking, but these overheads should
be amortized over the overall run of the benchmarks. This
scheduling approach of OpenMP contrasts with the ForeC
approach, where all work scheduling is static and determined
automatically by the ForeC compiler.
7. Conclusions
This paper introduced the ForeC language that enables
the deterministic parallel programming of multi-cores. The
language features of ForeC help bridge the differences
between synchronous-reactive programming and general-
purpose parallel programming. The local copying of shared
variables ensures thread isolation and determinism, while
minimizing the need to serialize parallel accesses to shared
variables. The behavior of shared variables can be tailored
to the application at hand by specifying suitable combine
functions and policies. A critical comparison showed that
ForeC combines the benefits offered by synchronous lan-
guages with those offered by deterministic runtime solu-
tions. To the best of our knowledge, no other synchronous
language achieves parallel execution and time predictability
similar to ForeC. For future work, the ForeC compiler could
be improved to generate more efficient code that remains
amenable to static timing analysis. In particular, different
static scheduling strategies could be explored for different
parallel programming patterns. The allocation of ForeC
threads could be refined automatically by feeding ForeCast’s
WCRT analysis results back into the ForeC compiler.
Acknowledgments
This work was supported in part by the RIPPES INRIA
International Lab, and the PRETSY2 project under DFG
Funding No. ME 1427/6-2. The authors would like to thank
Simon Yuan for setting up the Esterel benchmarks and
Avinash Malik for setting up the OpenMP benchmarks.
References
[1] M. Paolieri and R. Mariani, “Towards Functional-Safe Timing-
Dependable Real-Time Architectures,” in 17th IEEE International
On-Line Testing Symposium (IOLTS), 2011, pp. 31 – 36.
[2] Radio Technical Commission for Aeronautics, “Software Considera-
tions in Airborne Systems and Equipment Certification,” Apr. 1992,
standard DO-178B.
[3] R. Wilhelm and D. Grund, “Computation Takes Time, but How
Much?” Commun. ACM, vol. 57, no. 2, pp. 94 – 103, Feb. 2014.
[4] P. Axer, R. Ernst, H. Falk, A. Girault, D. Grund, N. Guan, B. Jon-
sson, P. Marwedel, J. Reineke, C. Rochange, M. Sebastian, R. V.
Hanxleden, R. Wilhelm, and W. Yi, “Building Timing Predictable
Embedded Systems,” ACM Transactions on Embedded Computing
Systems, vol. 13, no. 4, pp. 82:1–82:37, Mar. 2014.
[5] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing,
D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra,
F. Mueller, I. Puaut, P. Puschner, J. Staschulat, and P. Stenström,
“The Worst-Case Execution-Time Problem - Overview of Methods
and Survey of Tools,” ACM Trans. Embed. Comput. Syst., vol. 7,
no. 3, pp. 1 – 53, 2008.
[6] ISO/IEC JTC1/SC22/WG14, “ISO/IEC 9899:2011,” 2011.
[7] The IEEE and The Open Group, “POSIX.1-2008,” 2008, standard
Issue 7.
[8] OpenMP Architecture Review Board, “OpenMP Application Program
Interface,” Jul. 2013, standard 4.0.
[9] E. A. Lee, “The Problem with Threads,” Computer, vol. 39, pp. 33
– 42, 2006.
[10] S. Lu, S. Park, E. Seo, and Y. Zhou, “Learning from Mistakes:
A Comprehensive Study on Real World Concurrency Bug Charac-
teristics,” in Proceedings of the 13th International Conference on
Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS XIII. ACM, 2008, pp. 329 – 339.
[11] T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman,
“CoreDet: A Compiler and Runtime System for Deterministic Multi-
threaded Execution,” in Proceedings of the 15th ASPLOS on Architec-
tural Support for Programming Languages and Operating Systems,
ser. ASPLOS. ACM, 2010, pp. 53 – 64.
S. Burckhardt and D. Leijen, “Semantics of Concurrent Revisions,”
in Proceedings of the 20th European Conference on Programming
Languages and Systems, ser. ESOP/ETAPS, 2011, pp. 116 – 135.
[13] R. Raman, J. Zhao, V. Sarkar, M. Vechev, and E. Yahav, “Efficient
Data Race Detection for Async-Finish Parallelism,” in Proceedings
of the 1st International Conference on Runtime Verification, ser. RV.
Springer-Verlag, 2010, pp. 368 – 383.
[14] A. Benveniste, P. Caspi, S. A. Edwards, N. Halbwachs, P. L. Guernic,
and R. de Simone, “The Synchronous Languages 12 Years Later,”
Proceedings of the IEEE, vol. 91, no. 1, pp. 64 – 83, Jan. 2003.
[15] G. Berry and G. Gonthier, “The Esterel Synchronous Programming
Language: Design, Semantics and Implementation,” Science of Com-
puter Programming, vol. 19, no. 2, pp. 87 – 152, 1992.
[16] J. Souyris, E. L. Pavec, G. Himbert, V. Jégu, G. Borios, and R. Heck-
mann, “Computing the Worst Case Execution Time of an Avionics
Program by Abstract Interpretation,” in International Workshop on
Worst-case Execution Time, Mallorca, Spain, Jul. 2005, pp. 21 – 24.
[17] L. Ju, B. K. Huynh, A. Roychoudhury, and S. Chakraborty, “Timing
Analysis of Esterel Programs on General-Purpose Multiprocessors,”
in Proceedings of the 47th Design Automation Conference (DAC).
ACM, 2010, pp. 48 – 51.
[18] E. Yip, P. S. Roop, M. Biglari-Abhari, and A. Girault, “Programming
and Timing Analysis of Parallel Programs on Multicores,” in 13th
International Conference on Application of Concurrency to System
Design (ACSD), Jul. 2013.
[19] S. Andalam, P. S. Roop, A. Girault, and C. Traulsen, “Predictable
Framework for Safety-Critical Embedded Systems,” IEEE Transac-
tions on Computers, vol. 63, no. 7, pp. 1600 – 1612, Jul. 2014.
[20] R. von Hanxleden, “SyncCharts in C - A Proposal for Light-Weight,
Deterministic Concurrency,” in Proceedings of the 9th ACM/IEEE
International conference on Embedded software, Oct. 2009.
[21] A. Girault, “A Survey of Automatic Distribution Method for Syn-
chronous Programs,” in International Workshop on Synchronous Lan-
guages, Applications and Programs, SLAP’05, ser. ENTCS, F. Maran-
inchi, M. Pouzet, and V. Roy, Eds. Elsevier Science, Apr. 2005.
[22] S. A. Edwards and J. Zeng, “Code Generation in the Columbia Esterel
Compiler,” EURASIP Journal on Embedded Systems, vol. 2007, 2007.
[23] D. Baudisch, J. Brandt, and K. Schneider, “Multithreaded Code
from Synchronous Programs: Extracting Independent Threads for
OpenMP,” in Design, Automation and Test in Europe (DATE). EDA
Consortium, 2010, pp. 949 – 952.
[24] D. Potop-Butucaru, A. Azim, and S. Fischmeister, “Semantics-
Preserving Implementation of Synchronous Specifications Over Dy-
namic TDMA Distributed Architectures,” in International Conference
on Embedded Software (EMSOFT). ACM, Nov. 2010, pp. 199 – 208.
[25] S. Yuan, “Architectures Specific Compilation for Efficient Execution
of Esterel,” Ph.D. dissertation, Electrical and Electronic Engineering,
The University of Auckland, Jul. 2013.
[26] C. Cullmann, C. Ferdinand, G. Gebhard, D. Grund, C. M. Burguiere,
J. Reineke, B. Triquet, and R. Wilhelm, “Predictability Considerations
in the Design of Multi-Core Embedded Systems,” Embedded Real
Time Software and Systems (ERTS), 2010.
[27] Xilinx, “MicroBlaze Processor Reference Guide,” 2012, [Online]
http://www.xilinx.com/support/documentation/sw_manuals/xilinx13_4/mb_ref_guide.pdf.
[28] J. Whitham, “Scratchpad Memory Management Unit,” 2012, [Online]
http://www.jwhitham.org/c/smmu.html.
[29] S. Blazy and X. Leroy, “Mechanized Semantics for the Clight Subset
of the C Language,” Journal of Automated Reasoning, vol. 43, no. 3,
pp. 263 – 288, 2009.
[30] Motor Industry Software Reliability Association, “MISRA-C: 2012:
Guidelines for the Use of the C Language in Critical Systems,” p.
226, 2013, standard.
[31] M. Buss, D. Brand, V. Sreedhar, and S. A. Edwards, “A Novel
Analysis Space for Pointer Analysis and Its Application for Bug
Finding,” Sci. Comput. Program., vol. 75, no. 11, Nov. 2010.
[32] M. McCool, A. D. Robison, and J. Reinders, Structured Parallel
Programming. Morgan Kaufmann, Jun. 2012.
[33] A. Pop and A. Cohen, “A Stream-Computing Extension to OpenMP,”
in Proceedings of the 6th International Conference on High Perfor-
mance and Embedded Architectures and Compilers, ser. HiPEAC.
ACM, 2011, pp. 5 – 14.
[34] Intel, “Intel® VTune™ Amplifier,” 2014, [Online] https://software.intel.com/en-us/intel-vtune-amplifier-xe.
... The ForeC language [46,110] is a C-like language that introduces deterministic concurrency via Esterel-constructs. ...
... The ForeC language [110] is C-like language extended with a synchronous semantics similar to Esterel. In [46], a compilation scheme for ForeC is proposed. ...
Thesis
Full-text available
A real-time system is a system whose correctness depends not only on the correctness of the values it produces, but also on the time when it produces those values. The rate at which it must produce values is defined by the environment it operates in. A typical example is an aircraft controller which must be able to react to external perturbations such as a gust of wind in a timely manner to guarantee the aircraft's safety.When programming such a system, it is important that the programming language allows to reason about the constraints introduced by this context. Synchronous languages are well-adapted to the programming of critical real-time systems thanks to their clean formal semantics and to their formally defined compilation process.However, real-time systems and their requirements have considerably evolved since the inception of these languages. In this work, we will present extensions to the synchronous language Prelude to tackle two issues: Programming multicore systems predictably and handling system reconfiguration during execution.
... The ForeC language [46,110] is a C-like language that introduces deterministic concurrency via Esterel-constructs. ...
... The ForeC language [110] is C-like language extended with a synchronous semantics similar to Esterel. In [46], a compilation scheme for ForeC is proposed. ...
Thesis
A real-time system is a system whose correctness depends not only on the correctness of the values it produces, but also on the time when it produces those values. The rate at which it must produce values is defined by the environment it operates in. When programming such a system, it is important that the programming language allows to reason about the constraints introduced by this context. Synchronous languages [14] are well-adapted to theprogramming of critical real-time systems thanks to their clean formal semantics and to their formally defined compilation process. In this work, we will present extensions to the synchronous language Prelude [67] to tackle two issues: Programming multicore systems predictably and handling system reconfiguration during execution.Multicore hardware platforms have the potential to increase the performance of real-time systems. However, their architecture, especially the shared central memory, is prone to hard-to-predict delays, outweighing the potential benefits. To address this issue, models such as PREM [71] and AER [32] have been proposed. Our first contribution aims at producing AER-compliant multicore C code from a high-level Prelude program. This shifts the responsibility of low-level implementation concerns related to task communications onto the compiler, saving tedious and error-prone development efforts.A multi-mode real-time system must respect different functional requirements during its execution. A mode of execution represents a possible system configurations, for an aircraft control system these may include take-off, cruise and landing. Mode change protocols define transitions to change safelyfrom one mode to another. Our second contribution proposes clock views to decouple the rate of tasks and transitions. The resulting multi-mode support is both formally defined and generic enough to allows programmers to choose the kind of protocol they need for their application. A clock calculus based on refinement typing [39, 83] infers and checks the consistency of rates and views.
... FairThreads [4] is another extension of C inspired by Esterel, implemented via native threads, that offers macros to express automata. Precision Timed C (PRET-C) [1], which focuses on temporal predictability and assumes a target architecture with specific support for thread scheduling and abort handling, and ForeC [35], which targets multi-core architectures, introduce a modal behavior into C programs via pause statements. Among these synchronous extensions to C, perhaps closest to our proposal are SyncCharts in C, which augment C with a lightweight language extension, realized as C macros, that provides modes based on SyncCharts [31]. ...
Preprint
Full-text available
Complex software systems often feature distinct modes of operation, each designed to handle a particular scenario that may require the system to respond in a certain way. Breaking down system behavior into mutually exclusive modes and discrete transitions between modes is a commonly used strategy to reduce implementation complexity and promote code readability. However, such capabilities often come in the form of self-contained domain specific languages or language-specific frameworks. The work in this paper aims to bring the advantages of modal models to mainstream programming languages, by following the polyglot coordination approach of Lingua Franca (LF), in which verbatim target code (e.g., C, C++, Python, Typescript, or Rust) is encapsulated in composable reactive components called reactors. Reactors can form a dataflow network, are triggered by timed as well as sporadic events, execute concurrently, and can be distributed across nodes on a network. With modal models in LF, we introduce a lean extension to the concept of reactors that enables the coordination of reactive tasks based on modes of operation. The implementation of modal reactors outlined in this paper generalizes to any LF-supported language with only modest modifications to the generic runtime system.
... The purpose of a trap is to catch a return of the enclosed process and turn it into a normal termination. 4 If can s ptrap P q " pΠ, Kq then can s ptrap P q " pΠ, K 1 q where 0 P K 1 ô t0, 2u X K ‰ H 1 P K 1 ô 1 P K 2 P K 1 ô never. ...
... Industrial solutions include here Simulink Real-Time from MathWorks, and Scade KCG6 Parallel from ANSYS/Esterel Technologies [3]. Academic results in this direction include [15,16,17,18]. However, none of these tools provide strong schedulability guarantees when integrating multiple synthesized tasks: separate timing and schedulability analysis must be performed after synthesis. ...
Thesis
The implementation of hard real-time systems involves a lot of steps that are traditionally manual. The growing complexity of such systems and hardware platforms on which they are executed makes increasingly difficult to ensure the correctness of those steps, in particular for the timing properties of the system on multi-core platform. This leads to the need for automation of the whole implementation process. In this thesis, we provide a method for automatic parallel implementation of real-time systems. The method bridge the gap between real-time systems implementation and compilation by integrating parallelization, scheduling, memory allocation, and code generation around a precise timing model and analysis that rely on strong hypothesis on the execution platform and the form of the generated code. The thesis also provides an implementation model for dataflow multithreaded software. Using the same formal ground as the first contribution, the dataflow synchronous formalisms, the model represents multithreaded implementations in a Lustre-like language extended with mapping annotations. This model allows formal reasoning on the correctness of all the mapping decisions used to build the implementation. We propose an approach toward the proof of correctness of the functionality of the implementation with respect to the functional specifications.
Article
Embedded real-time systems are tightly integrated with their physical environment. Their correctness depends both on the outputs and timeliness of their computations. The increasing use of multi-core processors in such systems is pushing embedded programmers to be parallel programming experts. However, parallel programming is challenging because of the skills, experiences, and knowledge needed to avoid common parallel programming traps and pitfalls. This paper proposes the ForeC synchronous multi-threaded programming language for the deterministic, parallel, and reactive programming of embedded multi-cores. The synchronous semantics of ForeC is designed to greatly simplify the understanding and debugging of parallel programs. ForeC ensures that ForeC programs can be compiled efficiently for parallel execution and be amenable to static timing analysis. ForeC’s main innovation is its shared variable semantics that provides thread isolation and deterministic thread communication. All ForeC programs are correct by construction and deadlock-free because no nondeterministic constructs are needed. We have benchmarked our ForeC compiler with several medium-sized programs (e.g., a 2.274 line ForeC program with up to 26 threads and distributed on up to 10 cores, which was based on a 2.155 line non-multi-threaded C program). These benchmark programs show that ForeC can achieve better parallel performance than Esterel, a widely used imperative synchronous language for concurrent safety-critical systems, and is competitive in performance to OpenMP, a popular desktop solution for parallel programming (which implements classical multi-threading, hence is intrinsically nondeterministic). We also demonstrate that the worst-case execution time of ForeC programs can be estimated to a high degree of precision.
Article
Full-text available
Software-intensive systems in most domains, from autonomous vehicles to health, are becoming predominantly parallel to efficiently manage large amount of data in short (even real-) time. There is an incredibly rich literature on languages for parallel computing, thus it is difficult for researchers and practitioners, even experienced in this very field, to get a grasp on them. With this work we provide a comprehensive, structured, and detailed snapshot of documented research on those languages to identify trends, technical characteristics, open challenges, and research directions. In this article, we report on planning, execution, and results of our systematic peer-reviewed as well as grey literature review, which aimed at providing such a snapshot by analysing 225 studies.
Article
Traditional imperative synchronous programming languages heavily rely on a strict separation between data memory and communication signals. Signals can be shared between computational units but cannot be overwritten within a synchronous reaction cycle. Memory can be destructively updated but cannot be shared between concurrent threads. This incoherence makes traditional imperative synchronous languages cumbersome for the programmer. The recent definition of sequentially constructive synchronous languages offers an improvement. It removes the separation between data memory and communication signals and unifies both through the notion of clock synchronized shared memory . However, it still depends on global causality analyses which precludes black-box procedural abstraction. This complicates reuse and composition of software components. This article shows how black-box procedural abstraction can be accommodated inside the sequentially constructive model of computation. We present the Sequentially Constructive Procedural Language ( SCoPL ) and its semantic theory of policy-constructive synchronous processes. SCoPL supports black-box procedural abstractions using policy interfaces to ensure that procedure calls are memory-safe, wait-free and their scheduling is determinate and causal. At the same time, a policy interface constrains the level of freedom for the implementation and subsequent refactoring of a procedure. As a result, policies enable separate compilation and composition of procedures. We present our extensions abstractly as a formal semantics for SCoPL and motivate it concretely in the context of the open-source, embedded, real-time language Blech .
Article
The trend towards using multi-core processors for higher performance is increasingly present in safety-critical systems. Synchronous dataflow programming is naturally well-suited to parallel execution, thanks to the fact that all data dependencies are always explicit. MiniSIGNAL is a multi-task code generation tool for the synchronous dataflow language SIGNAL. The existing MiniSIGNAL code generation strategies mainly consider coarse-grained parallelism based on the Ada multi-task model. However, when we applied it to industrial case studies, this code generation scheme proved inefficient: architectural aspects of the target platform have to be taken into account to achieve fine-grained parallelism. To generate more efficient target code for industrial cases, this paper presents a new multi-task code generation method for MiniSIGNAL. Starting at the level of synchronous clocked guarded actions (S-CGA), an intermediate language of the MiniSIGNAL compilation process, the transformation consists of two parts: at the platform-independent level, transforming the S-CGA representation into an abstract multi-task structure (called Virtual Multi-Tasks, VMT); at the platform-dependent level, adopting the thread-pool pattern (a concurrent JobQueue) to support fine-grained parallel Ada code generation from the VMT structure (a sketch of this pattern is given below). Moreover, the formal syntax and the operational semantics of VMT are mechanized in the proof assistant Coq. Finally, the effectiveness of our approach is illustrated by an application to a real-world Guidance, Navigation and Control system.
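As a rough illustration of the thread-pool/JobQueue pattern mentioned above (the actual MiniSIGNAL back end generates Ada tasks; this plain-C pthread sketch only shows the general pattern, and all names in it are ours): worker threads repeatedly take fine-grained jobs from a shared queue protected by a mutex and a condition variable.

    /* Minimal thread-pool / job-queue sketch in C (our illustration of the
     * general pattern, not the tool's generated code). */
    #include <pthread.h>
    #include <stdio.h>

    #define NJOBS    8
    #define NWORKERS 3

    typedef void (*job_fn)(int);

    static job_fn queue[NJOBS];
    static int queue_arg[NJOBS];
    static int head = 0, tail = 0, done = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    static void fine_grained_action(int i) {   /* stands for one guarded action */
      printf("action %d executed\n", i);
    }

    static void *worker(void *arg) {
      (void)arg;
      for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail && !done)
          pthread_cond_wait(&nonempty, &lock);
        if (head == tail && done) { pthread_mutex_unlock(&lock); return NULL; }
        job_fn job = queue[head];
        int a = queue_arg[head];
        head++;
        pthread_mutex_unlock(&lock);
        job(a);                                /* run the job outside the lock */
      }
    }

    int main(void) {
      pthread_t tid[NWORKERS];
      for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

      /* Producer: enqueue the fine-grained jobs of one reaction. */
      for (int i = 0; i < NJOBS; i++) {
        pthread_mutex_lock(&lock);
        queue[tail] = fine_grained_action;
        queue_arg[tail] = i;
        tail++;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
      }

      pthread_mutex_lock(&lock);
      done = 1;
      pthread_cond_broadcast(&nonempty);
      pthread_mutex_unlock(&lock);

      for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
      return 0;
    }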
Article
Safety-critical embedded systems, commonly found in automotive, space, and health-care applications, are highly reactive and concurrent. Their most important characteristic is that they require both functional and timing correctness. C has been the language of choice for programming such systems. However, C lacks many features that could make the design process of such systems seamless while also maintaining predictability. This paper addresses the need for a C-based design framework for achieving time predictability. To this end, we propose the PRET-C language and the ARPRET architecture. PRET-C offers a small set of extensions to a subset of C to facilitate effective concurrent programming. We present a new synchronous semantics for PRET-C. It guarantees that all PRET-C programs are deterministic and reactive, and it provides thread-safe communication via shared memory access. This considerably simplifies the design of safety-critical systems. We also present the architecture of a precision timed machine (PRET) called ARPRET. It offers the ability to design time-predictable architectures through simple customizations of soft-core processors. We have designed ARPRET particularly for the efficient and predictable execution of PRET-C. We demonstrate through extensive benchmarking that PRET-C-based system design excels in comparison to existing C-based paradigms. We also qualitatively compare our approach to the Berkeley-Columbia PRET approach. We have demonstrated that the proposed approach provides an ideal framework for designing and validating safety-critical embedded systems.
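As a rough, reconstructed illustration of the PRET-C style (not taken from the paper; the PAR and EOT spellings and the environment functions read_sensor and actuate are assumptions): threads are composed with PAR, synchronize at EOT, and execute in a fixed static order within each tick, which is what makes the shared-memory communication deterministic.

    /* Sketch in the spirit of PRET-C (macro names assumed). Threads are
     * composed with PAR and mark the end of their local tick with EOT;
     * within a tick, threads run in the fixed order given to PAR, so the
     * access order to the shared variable "speed" is deterministic. */
    int read_sensor(void);    /* hypothetical environment input */
    void actuate(int value);  /* hypothetical environment output */

    int speed = 0;            /* plain C variable shared between threads */

    void sense(void) {
      while (1) {
        speed = read_sensor();
        EOT;                  /* end of this thread's tick */
      }
    }

    void control(void) {
      while (1) {
        actuate(speed);       /* runs after sense() in the same tick */
        EOT;
      }
    }

    int main(void) {
      PAR(sense, control);    /* sense is always scheduled before control */
      return 0;
    }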
Article
A large class of embedded systems is distinguished from general-purpose computing systems by the need to satisfy strict requirements on timing, often under constraints on available resources. Predictable system design is concerned with the challenge of building systems for which timing requirements can be guaranteed a priori. Perhaps paradoxically, this problem has been made more difficult by the introduction of performance-enhancing architectural elements, such as caches, pipelines, and multithreading, which introduce a large degree of uncertainty and make guarantees harder to provide. The intention of this article is to summarize the current state of the art in research on how to build predictable yet performant systems. We suggest precise definitions for the concept of "predictability" and present predictability concerns at different abstraction levels in embedded system design. First, we consider the timing predictability of processor instruction sets. Thereafter, we consider how programming languages can be equipped with predictable timing semantics, covering both a language-based approach using the synchronous programming paradigm and an environment that provides timing semantics for a mainstream programming language (in this case C). We present techniques for achieving timing predictability on multicores. Finally, we discuss how to handle predictability at the level of networked embedded systems, where randomly occurring errors must be considered.
Conference Paper
Multicore processors provide better power-performance trade-offs compared to single-core processors. Consequently, they are rapidly penetrating market segments which are both safety critical and hard real-time in nature. However, designing time-predictable embedded applications over multicores remains a considerable challenge. This paper proposes the ForeC language for the deterministic parallel programming of embedded applications on multicores. ForeC extends C with a minimal set of constructs adopted from synchronous languages. To guarantee the worst-case performance of ForeC programs, we offer a very precise reachability-based timing analyzer. To the best of our knowledge, this is the first attempt at the efficient and deterministic parallel programming of multicores using a synchronous C-variant. Experimentation with large multicore programs revealed an average over-estimation of only 2% for the computed worst-case execution times (WCETs). By reducing our representation of the program's state-space, we reduced the analysis time for the largest program (with 43,695 reachable states) by a factor of 342, to only 7 seconds.
Article
The determination of upper bounds on execution times, commonly called worst-case execution times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has components such as caches, pipelines, branch prediction, and other speculative components. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.
Article
Research on the automatic distribution of synchronous programs started in 1987, almost twenty years ago. Basically, from a single synchronous program, possibly along with distribution specifications (but not necessarily), it involves automatically producing several synchronous programs that communicate among themselves so as to achieve the same behavior as the initial centralized program. Since 1987, many approaches have been proposed, some general and others tailored specifically to a given synchronous language. Also, the methods that were basic at the beginning are now really sophisticated, offering many features. Finally, the successes obtained with these methods have led to the definition of a new model of computation, known as Globally Asynchronous Locally Synchronous (GALS). The goal of this article is to present a survey of all existing distribution methods.
Article
The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically. A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually. Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.
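The buffering approach can be pictured with a small pthread sketch (our simplification under assumed details, not the authors' runtime): each thread records its writes in a private buffer, and at a barrier the buffers are committed to shared memory in a fixed thread-id order, so the final values do not depend on the scheduler.

    /* Simplified illustration of versioned buffering with a deterministic
     * commit order (not the paper's actual runtime). Each thread buffers
     * its writes privately; at the barrier, buffers are committed in
     * ascending thread-id order, independent of scheduling. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 2

    int shared_x = 0;                 /* the shared location */
    int write_buf[NTHREADS];          /* one buffered value per thread */
    int has_write[NTHREADS] = {0};
    pthread_barrier_t barrier;

    void *worker(void *arg) {
      int id = *(int *)arg;

      /* The "write" goes to the private buffer, never directly to shared_x. */
      write_buf[id] = (id + 1) * 10;
      has_write[id] = 1;

      pthread_barrier_wait(&barrier);

      /* Deterministic commit: thread 0 applies all buffers in id order. */
      if (id == 0) {
        for (int t = 0; t < NTHREADS; t++)
          if (has_write[t])
            shared_x = write_buf[t];  /* last committer (highest id) wins */
      }
      return NULL;
    }

    int main(void) {
      pthread_t tid[NTHREADS];
      int ids[NTHREADS] = {0, 1};

      pthread_barrier_init(&barrier, NULL, NTHREADS);
      for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, &ids[i]);
      for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
      pthread_barrier_destroy(&barrier);

      printf("shared_x = %d\n", shared_x);  /* always 20, however scheduled */
      return 0;
    }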
Article
A few years ago, in the pages of this magazine, Edward Lee argued that computing needs time. This article focuses on the natural assumption that computing also takes time. We examine the problem of determining how much time. It is the problem of verifying the real-time behavior of safety-critical embedded systems. For such systems, for example anti-lock brakes and airbags, punctual behavior is of utmost importance: if the controlling computations take too long, quality of service degrades or the systems fail completely; your braking distance is longer or your head hits the steering wheel, respectively. The basis for verifying the timeliness of system reactions is reliable information on the execution times of all computational tasks involved. It is the job of timing analysis, also called worst-case execution-time (WCET) analysis, to determine such information.
Conference Paper
Embedded systems with hard real-time constraints need sound timing-analysis methods for proving that these constraints are satisfied. Computer architects have made this task harder by improving average-case performance through the introduction of components such as caches, pipelines, out-of-order execution, and different kinds of speculation. This article argues that some architectural features make timing analysis very hard, if not infeasible, but also shows how smart configuration of existing complex architectures can alleviate this problem.
Article
Embedded hard real-time systems need reliable guarantees for the satisfaction of their timing constraints. Experience with the use of static timing-analysis methods, and of the tools based on them, in the automotive and aeronautics industries is positive. However, both the precision of the results and the efficiency of the analysis methods depend highly on the predictability of the execution platform. In fact, the architecture determines whether a static timing analysis is practically feasible at all and whether the most precise obtainable results are precise enough. Results presented in the paper also show that the measurement-based methods still used in industry are not adequate for the complex processors now in common use. This dependence on architectural developments is of growing concern to the developers of timing-analysis tools and their customers, the developers in industry. The problem reaches a new level of severity with the advent of multi-core architectures in the embedded domain. This article describes the architectural influence on static timing analysis and gives recommendations as to profitable and unacceptable architectural features.