CPU-GPU Response Time and Mapping Analysis
for High-Performance Automotive Systems
Robert Höttger, Junhyung Ki, The Bao Bui, Burkhard Igel
IDiAL Institute
Dortmund University of Applied Sciences and Arts
Dortmund, Germany
{robert.hoettger, igel}@fh-dortmund.de,
{junhyung.ki001, the.bui003}@stud.fh-dortmund.de
Olaf Spinczyk
Computer Science Institute
Osnabrück University
Osnabrück, Germany
olaf.spinczyk@uos.de
Abstract—In accordance with the rising interest in autonomous driving, automotive software requires increasing computing power, while formal verification methods form crucial requirements in order to cope with safety, reliability, real-time, and fault-tolerance demands. Image processing is a mandatory part of this trend, which makes the use of GPUs reasonable. This paper outlines methods to address formal verification challenges in modern automotive environments and presents results along with an industrial model provided by the WATERS workshop community. Most of the challenges addressed in this paper are part of the WATERS industrial challenge 2019 [5].
Index Terms—AMALTHEA, APP4MC, Mapping, Automotive,
RTA, GPU
I. INTRODUCTION
Recent development activities in the automotive domain have raised challenges for applying response time analysis (RTA) as well as memory contention and access latency estimation to AUTOSAR-compliant models. Mapping tasks to processing units in order to optimize task chain latencies and response times, while considering memory-to-GPU offloading costs, highly heterogeneous hardware architectures with different memory types, processing speeds, peripherals, and accelerators, as well as sophisticated memory contention models, increases complexity and requires appropriate changes to conventional formal RTA methods [5].
This paper uses RTA for fully-preemptive tasks under rate
monotonic scheduling running on CPUs using the windowing
technique [11] and combines it with weighted round robin
(WRR) RTA for GPU tasks [13].
Additionally, we define a memory contention model for GPU copy engines and discuss the differences between asynchronous and synchronous GPU offloading mechanisms.
While memory accesses are already accounted for within the ticks of GPU tasks, CPU task response times must account for memory contention caused by tasks running on different processing units that use diverse memory controller clients and/or the GPU's copy engine, as well as for memory access latencies, in addition to ticks and preemption.
Finally, we present results of a task-to-CPU/GPU mapping exploration conducted with the help of a genetic algorithm. We repeat the mapping calculation for different optimization goals and measure various metrics such as the response time sum, the standard deviation of processing unit utilization values, as well as latencies of cumulated memory accesses, task chains, the copy engine operations, and contention, for worst-, best-, and average-case execution times and for synchronous and asynchronous offloading.
The remainder of this paper is structured as follows. Section II introduces related work, the context this paper's work refers to, as well as the basics of the used system model. Afterwards, Sections III and IV form the main contributions of this paper and formulate the specific challenges and their solution approaches for response time analysis and task mapping across CPUs and GPUs in the automotive domain. The solutions account for data access costs, memory contention, Copy Engine (CE) operations, synchronous or asynchronous offloading, and rate monotonic CPU and WRR GPU scheduling, whereas the task mapping optimizes either (a) the sum across all task response times, (b) task chain latencies, or (c) load balancing. Finally, Section V presents measurements and results, whereas Section VI concludes this paper.
II. RELATED WORK, ASSUMPTIONS, & SYSTEM MODEL
Current research shows that compute and memory band-
width isolation is an effective approach to reduce shared
cache conflicts, bus and shared cache contention, as well as
buffer conflicts or request reordering in the memory con-
troller [9]. The WATERS community has been working on
solving various automotive challenges since 2015 [10] such
as worst-case end-to-end latencies along complex cause-effect
chains [6], communication paradigms [4], WCET / WCRT
for advanced shared memory architectures [14], optimized
application mapping, and sophisticated models for multi-core
execution platforms.
The WATERS 2019 challenge [5] forms the basis of this paper via the following assumptions:
- Fixed priority (mixed) preemptive scheduling in form of rate monotonic scheduling ($D_i = T_i$)
- CPU contention given as
  $$\gamma_{i,k} = bl_{i,k} + (K_k \cdot \#C_i) + s_{GPU} \cdot b_{GPU} \quad (1)$$
  with $\#C_i$ denoting the number of cores that run at least one task which accesses at least one label accessed by $\tau_i$. The baseline $bl_{i,k}$ is derived from a processing unit's access latency to memory as well as from the labels accessed by $\tau_i$. $K$ and $s_{GPU}$ are constants derived from the memory contention model [2] in conjunction with information given in the forum¹.
- GPU contention given as $\gamma_{i,GPU} = bl_{GPU} + 0.5 \cdot \#C$
- Copy operations of the CE are handled by the GPU
- The execution engine's memory accesses are already covered in GPU ticks
- Data is always transferred as an integer multiple of a complete cache line (i.e., 64 bytes)

The challenge model is given as an AMALTHEA model² that can be accessed by the APP4MC³ platform.
Table I collects all indexes and notations used throughout this paper, which are based on the Burns Standard Notation [3].
TABLE I
INDEXES AND NOTATIONS

Entity               Index | Entity                Index
Task                 i     | Runnable              p
Processing unit      m     | Memory                l
Label                n     | Global memory         g
Lower priority tasks o     | Higher priority tasks h

Description          Symbol              | Description              Symbol
Deadline             $D_i$               | Period                   $T_i$
WC execution time    $C_i^+$             | WC exec. time on $pu_m$  $C_{i,m}^+$
Runnable exec. time  $c_p$               | Task                     $\tau_i$
WC response time     $R_i^+$             | Busy period              $W_i$
Processing unit      $pu_m$              | Priority                 $P_i$
Utilization          $U_{i,m}$           | Runnable                 $r_p$
Frequency in Hz      $f_m$               | Latency                  $L$
Read latency         $L_{m\leftarrow l}$ | Write latency            $L_{m\rightarrow l}$
Read labels          $\mathcal{R}_i$     | Written labels           $\mathcal{W}_i$
Label                $\mathcal{L}$       | Label size               $S$
Given those indexes, we define the following latencies:
- $L_{a,i}$: a task's access latency derived from all its label accesses to memory
- $L_{c,i}$: a task's contention latency derived from the challenge as denoted in Eq. 1
- $L_{l,i}$: a task's locking latency, which is subdivided into:
  - the global locking latency defined by spin locks for resources shared across tasks running on different processing units
  - the local locking latency derived from the priority ceiling protocol (PCP)

Locking should be considered for semaphores or other lock types that ensure deterministic system behavior as well as for cooperative tasks that can only be preempted at runnable bounds.
¹ WATERS forum thread https://bit.ly/2IlLXTe, accessed 05.2019
² AMALTHEA 0.9.3 documentation http://eclip.se/fy, accessed 06.2019
³ APP4MC website http://eclip.se/eU, accessed 06.2019
Several suggestions to improve real-time behavior and determinism for AUTOSAR applications running on a heterogeneous multi-core system have been proposed [12], [7] but are not yet included in the AUTOSAR specification. Considering locking latencies is planned as a later extension of this paper's work. Finally, a task is a tuple of an execution time $C_i$, a period $T_i$, a read label set $\mathcal{R}_i$, a written label set $\mathcal{W}_i$, a runnable set $F_i$, a priority $P_i$, and a deadline $D_i$, as shown in Eq. 2.

$$\tau_i = \{C_i, T_i, \mathcal{R}_i, \mathcal{W}_i, F_i, P_i, D_i\}$$
$$C_i^+ = \sum_{p \in F_i} c_p^+$$
$$F_i = \{r_{i,0}, \ldots\} \quad (2)$$
III. CHALLENGE I: RTA FOR CPU-GPU
A. CPU Response Time Analysis
1) Data Access Costs:
Before a task starts executing on a CPU, labels need to be read from memory. After a task finishes execution, its results (labels) must be written back to memory. Eq. 3 describes the memory access latency that is added to each CPU task's response time.

$$L_{a,i}^+ = \sum_{x \in \mathcal{R}_i} \frac{S_x}{64} \cdot \frac{L_{m\leftarrow l}}{f_m} + \sum_{y \in \mathcal{W}_i} \frac{S_y}{64} \cdot \frac{L_{m\rightarrow l}}{f_m} \quad (3)$$
The constant 64 is used here as the baseline derived from the challenge [5]. $S_x$ and $S_y$ denote the label sizes, and $L_{m\leftarrow l}$ and $L_{m\rightarrow l}$ are the given read and write label latencies specified in the given AMALTHEA model. Recent publications such as [8] have shown that memory mapping significantly influences task response times.
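As an illustration of Eq. 3, a minimal Python sketch follows; the label sizes, cycle latencies, and the 2 GHz frequency are hypothetical stand-ins for values that would be read from the AMALTHEA model.

```python
CACHE_LINE = 64  # bytes per cache line (challenge baseline)

def label_access_latency(read_sizes, write_sizes, read_lat, write_lat, f_hz):
    """Eq. 3: cumulated label access latency L+_{a,i} of a CPU task.

    read_sizes, write_sizes -- label sizes in bytes (labels in R_i and W_i),
                               assumed to be multiples of the cache line size
    read_lat, write_lat     -- read/write latencies in cycles to the labels' memory
    f_hz                    -- frequency of the processing unit in Hz
    """
    read = sum(s / CACHE_LINE * read_lat / f_hz for s in read_sizes)
    write = sum(s / CACHE_LINE * write_lat / f_hz for s in write_sizes)
    return read + write

# hypothetical task: two read labels, one written label on a 2 GHz core
print(label_access_latency([128, 64], [64], read_lat=5, write_lat=6, f_hz=2e9))
```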
2) Memory Contention:
Memory contention latencies are added to task response times
using Eq. 1.
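The contention term can be sketched as a direct transcription of Eq. 1; all parameter values below are placeholders for the constants taken from [2] and the challenge forum.

```python
def cpu_contention(bl_ik, k_k, n_shared_cores, s_gpu, b_gpu):
    """Eq. 1: contention latency gamma_{i,k} of task tau_i on processing unit k.

    bl_ik          -- baseline latency of the processing unit's memory accesses
    k_k            -- contention constant K_k from the memory contention model [2]
    n_shared_cores -- #C_i: cores running at least one task sharing a label with tau_i
    s_gpu, b_gpu   -- constants modeling the GPU copy engine as an additional client
    """
    return bl_ik + k_k * n_shared_cores + s_gpu * b_gpu
```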
3) CPU Response Time Analysis:
Based on [14], we consider the worst-case response time ($R_i^+$) for rate monotonic scheduling within a level-i busy period window using the recurrence relation shown in Eq. 5. However, the previously outlined latencies must be added to the task execution times in order to obtain a more accurate analysis. While label access latencies always occur (also for the BCET), worst-case execution times take all described latencies into account, as shown in Eq. 4.

$$C_i^{+,CPU} = C_i^+ + L_{a,i} + L_{c,i} + L_{l,i} \quad (4)$$
$$W_i = \sum_{he \in hep(i)} \left\lceil \frac{W_i}{T_{he}} \right\rceil \cdot C_{he} \quad \text{with } hep(i) = hp(i) \cup \{i\}$$
$$K_i = \left\lceil \frac{W_i}{T_i} \right\rceil$$
$$f_i^{k,+} = \sum_{h \in hp(i)} \left\lceil \frac{f_i^{k-1,+}}{T_h} \right\rceil \cdot C_h^{+,CPU} + k \cdot C_i^{+,CPU}$$
$$R_i^{+,CPU} = \max_{k \in [1,K_i]} \left( f_i^{k,+} - (k-1) \cdot T_i \right) \quad (5)$$
$W_i$ is the busy period length (window) that has to be considered for task $\tau_i$. $K_i$ is the number of $\tau_i$ instances within the busy period that need to be checked. $f_i^k$ is the finish time of the k-th instance of $\tau_i$, and $R_i^+$ is the worst-case response time of $\tau_i$, which is the maximum among all k-th finish times minus the respective k-th release of $\tau_i$, derived from the task's period. Equations 5 are entirely derived from [14], depend on the task-to-processing-unit mapping, and consider only the set of tasks mapped to the same processing unit. This implementation allows arbitrary deadlines beyond rate-monotonic derived deadlines.
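A compact Python sketch of the busy-window iteration of Eq. 5 follows; the task set at the bottom is hypothetical, and each C is assumed to already include the latencies of Eq. 4.

```python
from math import ceil

def wcrt_busy_window(tasks, i):
    """Busy-window RTA of Eq. 5 for fixed-priority preemptive tasks [14].

    tasks -- list of (C, T) tuples sorted by priority (index 0 = highest),
             where C is C^{+,CPU} of Eq. 4
    i     -- index of the task under analysis
    Returns the worst-case response time R^+_i.
    """
    C_i, T_i = tasks[i]
    hep = tasks[: i + 1]  # higher or equal priority tasks (incl. tau_i)
    hp = tasks[:i]        # strictly higher priority tasks

    # level-i busy period W_i via fixed-point iteration
    W = sum(C for C, _ in hep)
    while True:
        W_new = sum(ceil(W / T) * C for C, T in hep)
        if W_new == W:
            break
        W = W_new

    K = ceil(W / T_i)     # number of tau_i instances inside the busy period
    R = 0
    for k in range(1, K + 1):
        f = k * C_i       # finish time of the k-th instance (fixed point)
        while True:
            f_new = sum(ceil(f / T) * C for C, T in hp) + k * C_i
            if f_new == f:
                break
            f = f_new
        R = max(R, f - (k - 1) * T_i)
    return R

# hypothetical task set: (C, T) in ms, highest priority first
print(wcrt_busy_window([(1, 5), (2, 8), (3, 20)], i=2))  # 7
```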
4) Asynchronous and Synchronous Offloading:
The two offloading cases are illustrated in Figure 1.

Fig. 1. Synchronous vs. asynchronous GPU task offloading (without copy operations) in a Gantt chart

Since
passive waiting allows other tasks to execute (cf. Task2 in Figure 1, asynchronous offloading), the overall throughput is higher for asynchronous offloading. However, a penalty has to be added for asynchronous offloading to represent the latency between the end of the GPU kernel and the start of the post-processing phase, denoted as AO (asynchronous offloading costs) in Figure 1. These additional costs AO require fewer processing resources compared with the relatively longer active waiting period during synchronous offloading. Synchronous offloading can be implemented using the conventional RTA from [11]. Consequently, for the synchronous case, the triggering task $\tau_i$'s execution time is:
$$C_i^{s,+} = C_i^{+,CPU} + CE_i^+ + R_j^{+,GPU} \quad (6)$$
With $CE_i^+$ denoted in Eq. 12 for considering copy engine (CE) operations. This means we take the normal execution time of the task and add the CE time for the triggered task as well as its response time on the GPU.
In order to calculate response times that consider the passive waiting of the asynchronous offloading situation, we split the triggering task into two parts, i.e., pre and post processing tasks. While the former simply receives the execution time of everything until the trigger event, the latter receives all execution time after the trigger event, including post processing as well as the AO penalty. Additionally, the latter task obtains an offset value which equals the pre task's length plus the triggered GPU task's response time. Consequently, we use another RTA that considers offsets, based on [15], for the asynchronous offloading approach, in which passive waiting can be utilized by other tasks. The offset consideration makes use of the "imposed interference" method, since the critical instant derivation used for the synchronous offloading is not viable when having offsets in the asynchronous case. Therefore, task sets with the same periodic activation but different offsets are combined into transactions. Each transaction's ($\Gamma_d$) effectively imposed interference ($W_{db}(\tau_i, t)$) during an iteratively increasing time interval ($t$) is computed. The iteration ends via fix-point lookup for the response time calculation of the task under consideration, i.e., $R_i^{+,offs} = R_i^{+,offs(n)}$ with $R_i^{+,offs(n)} = R_i^{+,offs(n-1)}$.
$$W_{db}(\tau_i, t) = \sum_{j \in hp_d(\tau_i)} \left( \left\lfloor \frac{t'}{T_d} \right\rfloor + 1 \right) \cdot C_{dj} - x_{djb}(t')$$
$$t' = t - phase(\tau_{dj}, \tau_{db})$$
$$phase(\tau_{dj}, \tau_{db}) = (T_d + (O_{dj} - O_{db})) \bmod T_d$$
$$x_{djb}(t') = \begin{cases} 0 & \text{for } t' < 0 \\ \max(0,\ C_{dj} - (t' \bmod T_d)) & \text{otherwise} \end{cases}$$
$$W_d(\tau_i, t) = \max_{b \in hp_d(\tau_i)} \left( W_{db}(\tau_i, t) \right)$$
$$R_i^0 = C_i^+$$
$$R_i^{+,offs;(n+1)} = C_i^+ + \sum_{\Gamma_d \in \Gamma} W_d\!\left(\tau_i, R_i^{+,offs;(n)}\right) \quad (7)$$
All equations 7 are derived from [15], where $d$ is the transaction index. The part of task $\tau_{dj}$ that cannot be executed during interval $t'$ is denoted as $x_{djb}$.
B. GPU Response Time Analysis
Before a GPU task starts executing, the copy engine needs
to copy all accessed labels into the dedicated GPU region.
1) Copy Engine:
The copy engine CE reads all labels accessed by a task from different memories and writes them into a dedicated GPU memory location within the global memory; after the GPU execution has finished, all labels are written back to their original locations. The copy engine access latency is described in Eq. 8.

$$CE_{a,i}^+ = \sum_{n \in (\mathcal{R}_i \cup \mathcal{W}_i)} \left( \frac{S_n \cdot (L_{m(i)\leftarrow l(n)} + L_{m(i)\rightarrow l(n)})}{f_m} \right) + \frac{S_i \cdot (L_{m(i)\leftarrow g} + L_{m(i)\rightarrow g})}{f_m} \quad (8)$$
with $n$ being the label index of labels accessed by $\tau_i$, $S_n$ the label size in baseline units, i.e., $S_n = \lceil \#Bytes_n / 64 \rceil$, $l$ the index of the memory label $n$ is mapped to, $f_m$ the frequency in Hz of the GPU $\tau_i$ is mapped to, $L_{m(i)\leftarrow l(n)}$ the read latency between (a) $GPU_m$, to which $\tau_i$ is mapped, and (b) memory $M_l$, to which label $n$ is mapped, and $L_{m(i)\rightarrow l(n)}$ the respective write latency. The label size sum of a task is the cumulated size across all labels accessed by task $\tau_i$, as shown in Eq. 9.

$$S_i = \sum_{n \in (\mathcal{R}_i \cup \mathcal{W}_i)} S_n \quad (9)$$
Both read and write latencies to the label's original location ($mem_l$) are multiplied with the label sizes, since the copy engine reads during the copy-in phase and writes during the copy-out phase. Since the labels are copied into the global memory region for the GPU, the write and read latencies to global memory must further be multiplied with the label size sum $S_i$ for the copy-in and copy-out operations, respectively. The resulting data flow is $CE_{read,l} \rightarrow CE_{write,g} \rightarrow GPU_{execution} \rightarrow CE_{read,g} \rightarrow CE_{write,l}$. For instance, the $CE_a$ time for a task $\tau_1$ accessing a single label of 128 bytes from an A57 core, which has 5 cycles read and 6 cycles write latency to the memory the label is mapped to, and 7 cycles read and 8 cycles write latency to global memory, is $CE_{a,1} = \frac{(128/64)\cdot(5+6)}{2\cdot 10^9} + \frac{2\cdot(7+8)}{2\cdot 10^9} = 26\,ns$. In addition to the actual copy operation time, contention ($CE_{c,i}^+$) and queuing ($CE_{q,i}^+$) delays need to be considered. The CE contention time is derived from the challenge description, i.e.:
$$CE_{c,i}^+ = \frac{\#Bytes}{64} \cdot (bl + K \cdot \#C) \quad (10)$$
The CE queuing delay is derived from the maximal copy operations among tasks mapped to the GPU and triggered from a CPU, as shown in Eq. 11. We assume FIFO-ordered CE queuing.

$$CE_{q,i}^+ = \sum_{m:\ (m\ \text{is CPU},\ m \neq m(\tau_i))} \max_j CE_{a,j}^+ \quad (11)$$

with $\tau_j$ triggered by $pu_m$ and $m(\tau_i)$ denoting the CPU index $\tau_i$ has been triggered from.
Given a predefined label mapping as well as the access latencies from each processing unit to each memory and the frequency of each core, we finally derive the total copy engine time, which considers queuing and contention, via Eq. 12.

$$CE_i^+ = 2 \cdot (CE_{q,i}^+ + CE_{c,i}^+) + CE_{a,i}^+ \quad (12)$$
During the implementation of the copy engine latency calculation, situations were identified in which labels were written back to their original location although they had not been changed during the GPU execution. Consequently, a small adjustment of the CE operation was implemented, which only accounts for written (changed) labels being written back to their original location. This adjustment reduced the CE operation latency to a small extent, as outlined in Section V.
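The CE timing of Eqs. 8, 9, and 12 can be sketched in Python as follows. For brevity, the sketch assumes all labels of the task reside in the same memory (a single read/write latency pair); the example values reproduce the 26 ns worked example above.

```python
from math import ceil

CACHE_LINE = 64

def ce_access_time(label_sizes_bytes, lat_read_l, lat_write_l,
                   lat_read_g, lat_write_g, f_hz):
    """Eq. 8: copy engine access latency CE+_{a,i} of an offloaded task.

    Assumes all labels lie in the same memory, so one read/write latency
    pair (in cycles) applies; lat_*_g are the latencies to global memory.
    """
    sizes = [ceil(b / CACHE_LINE) for b in label_sizes_bytes]  # S_n in cache lines
    per_label = sum(s * (lat_read_l + lat_write_l) / f_hz for s in sizes)
    s_total = sum(sizes)                                       # S_i, Eq. 9
    return per_label + s_total * (lat_read_g + lat_write_g) / f_hz

def ce_total(ce_a, ce_c, ce_q):
    """Eq. 12: total CE time; queuing and contention apply to copy-in and copy-out."""
    return 2 * (ce_q + ce_c) + ce_a

# the paper's example: one 128-byte label, 5/6 read/write cycles to the label's
# memory, 7/8 cycles to global memory, 2 GHz copy engine -> 26 ns
print(ce_access_time([128], 5, 6, 7, 8, 2e9))  # 2.6e-08
```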
2) GPU RTA using Weighted Round Robin Scheduling:
After the copy engine time has been calculated, we can analyze the response times of GPU tasks scheduled by a WRR scheduler. A task set is schedulable if Eq. 13 holds.

$$\sum_i \left( C_i^{+,GPU} \cdot \frac{T_{max}}{T_i} \right) \leq T_{max} \quad \text{with } \tau_i \text{ mapped to GPU} \quad (13)$$
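Eq. 13 translates into a one-line feasibility check; the task tuples below are hypothetical.

```python
def wrr_feasible(gpu_tasks):
    """Eq. 13: schedulability test for the WRR-scheduled GPU task set.

    gpu_tasks -- list of (C_gpu, T) with C_gpu = C^{+,GPU} including CE time
    """
    t_max = max(t for _, t in gpu_tasks)
    return sum(c * t_max / t for c, t in gpu_tasks) <= t_max

# hypothetical GPU task set: (C, T) in ms
print(wrr_feasible([(5, 33), (8, 66), (2, 12)]))  # True
```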
We implemented the response time analysis described in [13], apart from the specific burst stimulus consideration, which is not part of the model in scope. The RTA under round robin uses the windowing technique proposed by Lehoczky in [11] to check the worst-case response time of tasks with arbitrary deadlines within the critical instant, i.e., the situation when all tasks are released at the same time. The implemented algorithm considers interference of other tasks ($I_{k,j}$) within a round robin turn ($k$), task interference of previous round robin turns ($pad$), requested execution times up to each time slice window, as well as periodic tasks with different execution times and time slices, in order to derive accurate round robin timing behavior without much pessimism. The following calculation of $R_i^{+,GPU}(q)$ uses $C_i^{+,GPU} = C_i^{+,CPU} + CE_i^+$.
$$R_i^{+,GPU}(q) = q \cdot C_i^{+,GPU} + I(q) - \delta_i^-(q)$$
$$I(q) = \sum_{k=1}^{K(q)} I_k, \quad \text{where } K(q) = \left\lceil \frac{q \cdot C_i^{+,GPU}}{\theta_i} \right\rceil$$
$$I_k = \sum_{j=1}^{n-1} I_{k,j}$$
$$I_{k,j} = \sum_{v=1}^{V} pad_{k,j}(v), \quad V \text{ such that } pad_{k,j}(v) \neq 0$$
$$pad_{k,j}(v) = \min\left( l_{k,j}(v),\ C_j^{max} \cdot \eta_j^+\!\left( t_{k,j} + \sum_{s=1}^{v-1} pad_{k,j}(s) \right) - \sum_{p=1}^{k-1} I_{p,j} - \sum_{s=1}^{v-1} pad_{k,j}(s) \right)$$
$$l_{k,j}(v) = \theta_j - \sum_{s=1}^{v-1} pad_{k,j}(s)$$
$$t_{k,j} = \sum_{p=1}^{k-1} I_p + (k-1)\,\theta_i + \sum_{u=1}^{j-1} I_{k,u}$$
$$pad_{k,j}(1) = \begin{cases} \theta_j & \text{if } E_{k,j} \geq \theta_j \\ E_{k,j} & \text{if } E_{k,j} < \theta_j \end{cases}$$
$$E_{k,j} = C_j^{+,GPU} \cdot \eta_j^+(t_{k,j}) - \sum_{p=1}^{k-1} I_{p,j} \quad (14)$$
Here, $q$ is the activation index, $\eta_j^+$ is the upper-bound arrival function of task $\tau_j$, $\delta_i^-$ is the minimal distance of $\tau_i$ events (set to $T_i$ in this paper), $\theta_j$ is the time slot of $\tau_j$, $K(q)$ is the total number of RR turns required by the q-th activation of $\tau_i$ to complete, $E_{k,j}$ is the remaining execution demand of task $\tau_j$ at the beginning of $\theta_j$ in the k-th RR turn, and $l_{k,j}$ is the unused time in time slot $\theta_j$ at the beginning of $pad_{k,j}(v)$, as stated in [13].
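Implementing Eq. 14 faithfully requires the full pad-based bookkeeping of [13]. As an intuition aid only, the following deliberately pessimistic sketch bounds each RR turn's interference by the other tasks' full time slices and covers a single activation (q = 1); it is a safe upper bound under the critical instant, not the analysis of Eq. 14.

```python
from math import ceil

def wrr_wcrt_coarse(c_i, theta_i, other_slices):
    """Coarse WRR response time bound (pessimistic simplification of Eq. 14).

    Assumes every competing GPU task is always backlogged and therefore
    consumes its full time slice theta_j in each round-robin turn.
    """
    turns = ceil(c_i / theta_i)            # RR turns needed by tau_i, K(1)
    return c_i + turns * sum(other_slices)  # own demand plus slot interference

# hypothetical: C_i = 4 ms served in 2 ms slices against two 1 ms competitors
print(wrr_wcrt_coarse(4.0, 2.0, [1.0, 1.0]))  # 8.0 ms
```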
C. Task Chain Latencies
We re-use an existing outline of task chain latency calculation derived from [1], shown in Eq. 15, with $\delta_e$ denoting the task chain latency (the chain length in time).

$$TC = \{\delta_0, \ldots\}$$
$$\delta_e = R_{j=\delta_0}^+ + \sum_{j=\delta_1}^{j<|TC|} (T_j + R_j^+) \quad (15)$$
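In Python, Eq. 15 reduces to a few lines; the (T, R+) pairs below are hypothetical.

```python
def task_chain_latency(chain):
    """Eq. 15: worst-case task chain latency delta_e.

    chain -- ordered list of (T, R_plus) pairs; the first task contributes
             only its response time, each successor adds its period
             (sampling delay) plus its worst-case response time.
    """
    _, r_first = chain[0]
    return r_first + sum(t + r for t, r in chain[1:])

# hypothetical three-task chain, (T, R+) in ms: 3 + (20 + 5) + (10 + 4) = 42
print(task_chain_latency([(10, 3), (20, 5), (10, 4)]))
```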
Since no task chains are given in the existing model, we added the following two task chains manually:
1) TC1: {EKF → Localization → Planner}
2) TC2: {DASM → Pre SFM Post → Pre Local. Post}
No differences between synchronous and asynchronous latency measurements were found for TC1, since no offloading task is included in that chain. However, TC1's latency measurement is still used for the task chain latency sum optimization outlined in the next Section IV, denoted as TCSO, for which results are also presented in Figure 4.
IV. CHALLENGE II: TASK MAPPING

Before the actual task mapping is performed, the authors noticed that the task Planner has a periodic activation of 12 ms whereas its execution time is >12 ms on any CPU. Consequently, we changed its period to 15 ms to make it schedulable.
Task mapping is encoded via a genetic algorithm using the jenetics library [17]. In order to restrict the solution space for tasks that are available on either CPU cores only, GPU cores only, or both, each task mapping is encoded within a dedicated chromosome consisting of a single integer gene, instead of encoding a single chromosome with multiple integer genes. Consequently, genes can have different integer domains.
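The actual implementation builds on the Java jenetics library; the following Python sketch only mirrors the encoding idea of one single-gene chromosome per task with an individual integer domain. The task names, processing unit indexes, and the minimal selection/mutation loop are illustrative, not the paper's implementation.

```python
import random

# hypothetical per-task domains: indexes of allowed processing units
DOMAINS = {
    "Detection": [6],           # GPU only
    "SFM":       [0, 1, 6],     # Denver cores or GPU
    "Planner":   [0, 1, 2, 3],  # ARM CPU cores only
}

def random_mapping():
    """One individual: a dedicated single-gene chromosome per task,
    each gene drawn from the task's own integer domain."""
    return {task: random.choice(dom) for task, dom in DOMAINS.items()}

def mutate(mapping, p=0.2):
    """Re-draw a task's gene from its own domain with probability p."""
    return {task: random.choice(DOMAINS[task]) if random.random() < p else pu
            for task, pu in mapping.items()}

def evolve(fitness, generations=100, pop_size=20):
    """Minimal selection/mutation loop standing in for the jenetics engine;
    lower fitness values are considered better."""
    population = [random_mapping() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]
        population = parents + [mutate(random.choice(parents)) for _ in parents]
    return min(population, key=fitness)
```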
We assess results by summing up the following values along with the GA's fitness function:
I) RTSO = Response Time Sum Optimized
   a) The worst-case response time sum over the CPUs and GPUs, i.e., $R_{tot}^+ = \sum_i R_i^+$, which involves Eqs. 5, 14, and 7
   b) The total CPU memory access latency (cf. Eq. 3)
   c) The total CE latency (cf. Eq. 12)
   d) The total task contention ($TTC = \sum_i \gamma_i$; cf. Eq. 1)
II) TCSO = Task Chain latency Sum Optimized
   The sum of all task chain latencies, $TCS = \sum_e \delta_e$, which involves Eq. 15
III) LBO = Load Balancing Optimized
   The standard deviation across the utilization values (given in Eq. 16) in %.

$$U_{m,\%} = \sum_i \left( \frac{C_{i,m}^+ \cdot 100}{T_i} \right) \quad \text{with } \tau_i \text{ mapped to } pu_m \quad (16)$$
It is important to note here that a combination of any of
the above calculations can easily be implemented. Additional
analyses as well as further model entity analyses can be
integrated within the genetic algorithm’s fitness function.
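As an example of such a fitness component, the LBO objective of Eq. 16 can be sketched as follows; the task tuples are hypothetical.

```python
from statistics import pstdev

def utilization_per_pu(tasks, n_pus):
    """Eq. 16: utilization of each processing unit in percent.

    tasks -- list of (C_on_mapped_pu, T, pu_index) per task
    """
    u = [0.0] * n_pus
    for c, t, m in tasks:
        u[m] += c * 100.0 / t
    return u

def lbo_fitness(tasks, n_pus):
    """Load balancing objective: standard deviation of the PU utilizations."""
    return pstdev(utilization_per_pu(tasks, n_pus))

# hypothetical: three tasks on two processing units -> utilizations 20% and 10%
print(lbo_fitness([(1, 10, 0), (2, 20, 0), (3, 30, 1)], n_pus=2))  # 5.0
```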
V. RESULTS
The following subsections outline results obtained by applying the challenge solutions described in Sections III and IV to the given WATERS model.
A. Synchronous vs Asynchronous Offloading
A CPU task which offloads a GPU task and actively waits for its finalization (i.e., synchronous offloading) obtains significantly more execution time and, correspondingly, a larger response time. Since the challenge model already features highly utilized processing units, synchronous offloading is infeasible with the given model: no mapping could be found that keeps every processing unit within its capacity.

Fig. 2. Task chain latencies for different mappings as well as worst and best case execution times
However, when considering best-case execution times and increasing the period of the task 'PRE_SFM_gpu_POST' from 33 ms to 66 ms, valid results were found (cf. 'syncBCET' with diagonal lines in Figure 2). With two tasks mapped to the GPU in this scenario, their triggering tasks' execution times increased to 911% and 1999% of their asynchronous execution times. This change results in an increase of the total response time sum fitness value to 158%.
B. Task Chain Latencies
Task chain latency results for TC2 are shown in Figure 2. The RTSO mapping features the highest task chain latencies among the different mappings, whereas the TCSO mapping provides the lowest task chain latencies. Asynchronous and synchronous measurements are identical for the TCSO mapping, since the respective offloaded tasks (SFM and Localization) are mapped to a CPU, which, according to the challenge [5], results in zero execution time for the Pre and Post processing runnables within the triggering task. The other mappings map at least one of those two tasks to the GPU, which results in different latency measurements between the synchronous and asynchronous offloading approaches.
The synchronous task chain latency measurements of Figure 2 violate at least one deadline, except for the 'syncBCET' solution, due to the significant increase of wasted CPU cycles during active waiting. While the task chain optimized mapping clearly features the lowest task chain latencies, synchronous execution results in a task chain latency increase of up to 9.2%.
C. Various Metric Results along Different Mappings
Final utilization results are presented in Figure 3. The utilization already shows the infeasibility of the given mapping due to the GPU being utilized by more than 141%. Even the Denver0 processing unit will hardly meet its deadlines, since the instructions alone already fill more than 98% of the processing unit's capacity. Utilization is computed via Eq. 16.
Fig. 3. Processing unit utilization values (Denver0, Denver1, ARM0–ARM3, GPU) for (a) EV = early valid mapping, (b) RTSO = response time sum optimized, (c) TCSO = task chain latency sum optimized, (d) LBO = load balancing optimized mapping results, and (e) the given mapping
Fig. 4. Different metric measurements (response time sum, task contention, label access costs, CE latency, task chain latency, utilization standard deviation) for the calculated mapping results
In addition to Figure 3, Figure 4 presents measurements of different metrics for the three optimized mappings as well as for an early valid mapping. Due to infeasibility, some of those metrics could not be calculated for the given mapping model, such that the given mapping is omitted in Figure 4. Interestingly, the load balancing approach does not provide the best cumulated response times, although it clearly provides the lowest utilization standard deviation with 40%. The configuration for the measurements of Figures 3 and 4 considers WCET, asynchronous offloading, only written labels for the CE copy back to the host, and $1\,ms \cdot P_i$ for the GPU time slice derivation.
D. Time Slice Derivation for GPU WRR Scheduling
While investigating time slice lengths and weights for the round robin scheduling on the GPU, the ticks of the Detection task were reduced to a tenth ($C_{Detection}^+ \cdot \frac{1}{10}$) in order to have three tasks running on the GPU without exceeding the GPU's capacity, i.e., to obtain a feasible task set of three tasks on the GPU. Apart from Detection, Localization and SFM were mapped to the GPU.
$$sl = \frac{\sum_i (T_i - R_i^+)}{\#Tasks_{GPU}} \quad (17)$$
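A sketch of the slack metric of Eq. 17 together with one of the compared weightings (priority-based slices) follows; the task set and the weighting factors are hypothetical.

```python
def avg_gpu_slack(gpu_tasks):
    """Eq. 17: average slack time of the tasks mapped to the GPU.

    gpu_tasks -- list of (T, R_plus) pairs of the GPU task set
    """
    return sum(t - r for t, r in gpu_tasks) / len(gpu_tasks)

def priority_weighted_slices(base_theta, priorities):
    """Priority-based weighting: scale the base time slice by each task's
    priority value, so the highest priority obtains the largest slice."""
    return {task: base_theta * p for task, p in priorities.items()}

# hypothetical GPU task set, (T, R+) in ms, and a 1 ms base slice
print(avg_gpu_slack([(33, 20), (66, 40), (12, 9)]))  # 14.0
print(priority_weighted_slices(1.0, {"Detection": 3, "SFM": 2, "Localization": 1}))
```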
Figure 5 shows average task slack times derived from Eq. 17 for different base time slices $\theta$ (x-axis) and weights, i.e., individual time slices derived from the base time slice $\theta$. The figure compares equal weights (same $\theta$), priority weights (derived from RMS, where the highest priority has the highest value $P_i$), utilization weights (cf. Eq. 16) multiplied with the number of tasks mapped to the respective GPU, and utilization weights only. For all $\theta$ values except 100 ms, the priority-based time slice derivation shows the highest (best) slack times. In addition to the slack times presented in Figure 5, Figure 6 provides insights into each time slice weighting approach's standard deviation.
VI. CONCLUSIONS
This paper provides solutions to typical response time analysis and task mapping challenges for high-performance automotive systems. To this end, formal CPU and GPU response time analyses are given that consider different scheduling paradigms (RMS and WRR), contention models, memory access and offloading patterns (synchronous vs. asynchronous), task chains, as well as locking, queuing, and blocking latencies. Results are presented by applying the solutions to the WATERS 2019 challenge and its given AMALTHEA model.

Fig. 5. Influence of time slice derivation methods and different base time slice lengths (θ) on slack times: (a) equal, (b) priority-, (c) utilization·#Tasks-, and (d) utilization-based time slices

Fig. 6. Average slack time deviations of different time slice derivation methods
The WRR scheduling has been investigated with respect to four different time slice derivation methods, of which the base time slice multiplied by task priority has shown the lowest response times on average. Mappings have been calculated via a genetic algorithm, and the results are presented for various configurations covering BCET/WCET consideration, synchronous/asynchronous GPU offloading, the time slice derivation method, as well as the copy engine operation type.
Measurements of three different task-to-processing-unit mappings, each optimized towards a different goal, and of the given mapping allow a rather differentiated reasoning about the quality of results along various metrics such as memory contention, processing unit utilization standard deviation, cumulated label access costs, CE latency, task chain latency, and the sum of response times. However, each of the calculated mappings significantly outperforms the given mapping in most metrics.
One of our next steps is to further analyze blocking times on runnable level based on [16], as well as to extend the current RTA to consider mixed-preemptive scheduling in terms of cooperative tasks that can only be preempted at runnable bounds.
VII. LIMITATIONS & REMARKS
It took us a couple of days and several forum questions to fully understand the challenge model entities. We have been implementing our solutions since the publication of the AMALTHEA challenge model, without fully dedicating all resources to the challenges themselves. Most effort was spent on implementing the concepts of [13] and [15]. Related progress, implementation, and information regarding this paper are intended to be collected in the corresponding WATERS forum thread⁴.
REFERENCES
[1] Jalil Boudjadar and Simin Nadjm-Tehrani. Schedulability and Memory
Interference Analysis of Multicore Preemptive Real-time Systems. In
Proceedings of the Int. Conference on Performance Engineering, ICPE,
pages 263–274. ACM, 2017.
[2] Roberto Cavicchioli, Nicola Capodieci, and Marko Bertogna. Memory
Interference Characterization Between CPU Cores and Integrated GPUs
in Mixed-Criticality Platforms. In Proceedings of the Int. Conference
on Emerging Technologies and Factory Automation, ETFA, pages 1–10,
2017.
[3] R.I. Davis. Burns Standard Notation for Real Time Scheduling. In
N. Audsley and S.K. Baruah, editors, Real-Time Systems: The Past, The
Present and The Future, pages 38–41. Mar 2013.
[4] Arne Hamann, Dakshina Dasari, Simon Kramer, Michael Pressler, Falk
Wurst, and Dirk Ziegenbein. WATERS Industrial Challenge 2017. In
Int. Workshop on Analysis Tools and Methodologies for Embedded and
Real-time Systems, WATERS, 2017.
[5] Arne Hamann, Dakshina Dasari, Falk Wurst, Ignacio Sanudo, Nicola
Capodieci, Paolo Burgio, and Marko Bertogna. WATERS Industrial
Challenge, 2019. Online: https://www.ecrts.org/forum/viewtopic.php?f=
43&t=124&sid=1da5e37bde907477b2b991c411c03a03.
[6] Arne Hamann, Dirk Ziegenbein, Simon Kramer, and Martin
Lukasiewycz. FMTV 2016 Verification Challenge. In Real-Time and
Embedded Technology and Applications Symposium, RTAS, 2016.
[7] Robert Höttger, Burkhard Igel, and Olaf Spinczyk. On Reducing
Busy Waiting in AUTOSAR via Task-Release-Delta-based Runnable
Reordering. In Proceedings of the 2017 Design, Automation & Test
in Europe Conference & Exhibition, DATE, pages 1510–1515. IEEE,
March 2017.
[8] Robert Höttger, Lukas Krawczyk, Burkhard Igel, and Olaf Spinczyk.
Memory Mapping Analysis for Automotive Systems. In Work in
Progress Paper, 25th IEEE Real-Time and Embedded Technology and
Applications Symposium, RTAS, April 2019.
[9] Saksham Jain, Iljoo Baek, Shige Wang, and Ragunathan Raj Rajkumar.
Fractional GPUs : Software-based Compute and Memory Bandwidth
Reservation for GPUs. In Proceedings of the Real-Time and Embedded
Technology and Applications Symposium, RTAS, pages 29–41, 2019.
[10] Simon Kramer, Dirk Ziegenbein, and Arne Hamann. Real World
Automotive Benchmarks For Free. In Int. Workshop on Analysis Tools
and Methodologies for Embedded and Real-time Systems, WATERS,
2015.
[11] John P. Lehoczky. Fixed Priority Scheduling of Periodic Task Sets
with Arbitrary Deadlines. In Proceedings of the Real-Time Systems
Symposium, RTSS, pages 201–209, 1990.
[12] Renata Martins Gomes, Fabian Mauroner, and Marcel Baunach. Col-
laborative Resource Management for Multi-Core AUTOSAR OS. In
Wolfgang A. Halang and Olaf Spinczyk, editors, Betriebssysteme und
Echtzeit, pages 99–108. Springer Berlin Heidelberg, 2015.
[13] Razvan Racu, Li Li, Rafik Henia, Arne Hamann, and Rolf Ernst.
Improved Response Time Analysis of Tasks Scheduled Under Preemp-
tive Round-Robin. In Proceedings of the Int. Conference on Hard-
ware/Software Codesign and System Synthesis, CODES+ISSS, pages
179–184. ACM, 2007.
[14] Ignacio Sanudo, Paolo Burgio, and Marko Bertogna. Schedulability and
Timing Analysis of Mixed Preemptive-Cooperative Tasks on a Parti-
tioned Multi-Core System. In Int. Workshop on Analysis Tools and
Methodologies for Embedded and Real-time Systems, WATERS, 2016.
[15] K. Traore, E. Grolleau, A. Rahni, and M. Richard. Response-Time
Analysis of Tasks with Offsets. In Proceedings of the Conference on
Emerging Technologies and Factory Automation, pages 1–8, Sep. 2006.
[16] Alexander Wieder and Björn Brandenburg. On Spin Locks in AU-
TOSAR: Blocking Analysis of FIFO, Unordered, and Priority-ordered
Spin Locks. In Proceedings - Real-Time Systems Symposium, pages
45–56, 2013.
[17] F. Wilhelmstötter. Jenetics is an advanced Genetic Algorithm, Evolu-
tionary Algorithm and Genetic Programming library, written in modern
day Java, 2019. Online available at http://jenetics.io/.
⁴ WATERS forum thread for this paper: https://bit.ly/2IEJPpz, accessed 06.2019