An analytical approach for calculating end-to-end response times in autonomous driving applications

Lukas Krawczyk, Mahmoud Bazzal, Ram Prasath Govindarajan, Carsten Wolff
IDiAL Institute
Dortmund University of Applied Sciences and Arts
44227 Dortmund, Germany
lukas.krawczyk@fh-dortmund.de
Abstract—This work proposes a solution for the WATERS
industrial challenge 2019. The first part addresses the response
time analysis challenge and covers analyzing the end-to-end
latency of the given application by evaluating the critical path
from its sensor tasks to its actuator tasks under the assumption
of (i) an implicit and (ii) a LET-based communication paradigm.
The second part discusses the optimization of the application's
end-to-end latency using a genetic-algorithm-based approach that
modifies (i) the application's allocation of tasks to cores as well
as (ii) the time slices for tasks allocated to the GPU.
Index Terms—Real-time systems, response-time analysis, automotive, APP4MC
I. INTRODUCTION
The demands on automotive computing platforms are continuously
rising due to the increasing amount of software that is driven
by new automotive functionalities. In order to cope with these
needs, future E/E architectures will consist of a variety of
heterogeneous applications with unique characteristics, mixed
levels of criticality, and different models of computation, such
as classical periodic control, event-based planning, or
stream-based perception applications, which will typically
co-exist on the same hardware platform [10]. Deploying these
applications will introduce several challenges, such as
maintaining freedom from interference in safety-critical
applications, as required by the ISO 26262 standard, or meeting
constraints such as timing requirements. The latter is
especially challenging due to the varying computational models.
The WATERS industrial challenge 2019 describes the prototype
of such an end-to-end automotive driving application and
proposes two challenges typically encountered in designing
such systems. The application consists of 10 tasks, 4 of which
may be further accelerated by offloading them to an
accelerator (GPU). The heterogeneous hardware platform is
split into two processor islands, with the first consisting of a
dual-core NVidia Denver 2 CPU and the second of a quad-core
ARM Cortex-A57 CPU. Additionally, the platform
integrates a 256-core NVidia Pascal GPU grouped into two
streaming multiprocessors.
The research leading to these results has received funding from the Federal
Ministry for Education and Research (BMBF) under Grant 01IS18047D in
the context of the ITEA3 EU-Project PANORAMA.
The overall goal of the first challenge is to determine the
application's end-to-end response time, i.e. the latency along
the data propagation path from its sensor tasks to its actuator
tasks. The second challenge addresses the problem of minimizing
the response time of task chains by optimizing the application's
deployment to the underlying hardware platform (NVIDIA Jetson
TX2 SoM).
This paper is organized as follows. Our assumptions for
the remainder of this paper are introduced in Section II,
followed by the system model along with its notation and
terminology in Section III. Section IV addresses our proposed
solution to the analysis challenge and presents the approach
for determining the application's worst case response time.
Section V presents our solution to the design space
exploration (optimization) challenge, which is based on a
genetic algorithm that optimizes the application's deployment
towards a lower end-to-end response time by modifying the
available degrees of freedom. Finally, Section VI discusses our
results and concludes this paper.
II. ASSUMPTIONS
The challenge states that all tasks, both for CPU and GPU,
follow a read-execute-write semantic, which implies e.g. an
implicit or LET communication paradigm. Accordingly, we
determine the respective end-to-end latency for both
communication paradigms, although we expect implicit
communication to result in lower response times compared to
LET, as shown in [4]. Moreover, we assume that all tasks access
the global memory (DRAM) exactly once at the beginning of their
execution for a copy-in operation and exactly once at the end
of their execution for a copy-out operation, i.e. all cores
(CPU, GPU) contain a local memory that stores local variable
copies. Consequently, the time required for accessing these
local variables is assumed to be included in a task's ticks and
is as such out of the scope of this paper.
Furthermore, we assume that all operations of the CPU as
well as the GPU's Copy Engine (CE) and Execution Engine
(EE) are fully preemptive and that the resulting delays of
concurrent accesses are covered by an additional contention
overhead. All tasks are allocated to exactly one CPU core or the
GPU. The allocation of each task and the duration of each
kernel's time slice are fixed at design time, i.e. tasks do not
migrate at run-time. Finally, all tasks on the CPU follow a
partitioned fixed priority scheduling policy, whereas the GPU
schedules its tasks using a weighted round-robin (WRR) policy
with fixed time slices. Priorities for tasks are unique, i.e.
two tasks with the same period have different fixed priorities.
Since this work considers tasks $\tau \in \mathcal{T}$ executed on
heterogeneous processing units with different scheduling
algorithms and different characteristics, we distinguish between
tasks executed on the CPU and tasks offloaded to the GPU. Tasks
that are executed on the CPU are further divided into tasks that
offload some of their work to a GPU and regular tasks that do
not perform offloading. An offloading task can be further
subdivided into three phases: (i) a pre-processing phase that
prepares a given data set for its processing on the GPU, (ii) an
offloading phase in which the task offloads some of its
execution to the GPU and suspends itself until it receives a
response from the GPU, and (iii) a post-processing phase that
processes the resulting data set from the GPU.
Moreover, offloading tasks can perform either synchronous or
asynchronous offloading. In the synchronous case, the task
actively blocks lower priority tasks while it is being executed.
In the asynchronous case, lower priority tasks may be executed
during its waiting phase at the cost of an asynchronous
offloading overhead that increases the offloading task's
execution time. For the remainder of this work, we apply
asynchronous offloading if tasks with lower priority are
executed on the same processing unit as the offloading task, in
order to allow these lower priority tasks to execute during the
offloading task's waiting phase. Synchronous offloading is
applied if no lower priority tasks are co-scheduled on the same
processing unit, with the goal of minimizing the offloading
task's execution time by avoiding the asynchronous offloading
overhead.
III. SYSTEM MODEL
Each task $\tau_i \in \mathcal{T}$ that is executed on a CPU is described
as a tuple $\tau_i = (\{\tau_{i1}, \ldots, \tau_{i|\tau_i|}\}, P_i, \pi_i)$ with its period $P_i$, a
unique priority $\pi_i$, and a list of $|\tau_i|$ sub-tasks $\tau_{ij}$. A sub-task
can either be executed on the CPU or offloaded to the GPU. In
the former case, it is described as $\tau^{C}_{ij} = (C_{ij,\rho}, O_{ij}, J_{ij})$ with
an offset $O_{ij}$ and a jitter $J_{ij}$. The set $C_{ij,\rho} = \{C^{+}_{ij,\rho}, C^{-}_{ij,\rho}\}$
denotes a pair of execution times on processing unit $\rho$, with
$C^{+}_{ij,\rho}$ being the best case execution time and $C^{-}_{ij,\rho}$ the worst
case execution time.
A sub-task that will be offloaded to the GPU, $\tau^{G}_{ij} \in \mathcal{T}^{G}$, is
similarly described as a tuple $\tau^{G}_{ij} = (C_{ij,\rho}, O_{ij}, J_{ij}, \phi_{ij})$, with
$\phi_{ij}$ being the length of its time slice for the WRR scheduling
on the GPU. Although the challenge assumes the presence of
a single GPU, some of the GPU tasks may be executed on the
CPU. Accordingly, we maintain the concept of distinguishing
between processing unit specific execution times for each
processing unit $\rho$.
Our approach utilizes hyper periods of task sets, i.e. we
address the $k$-th instance (job) of $\tau_i$ as $\tau_{i.k}$. The relative
distance of a job's arrival to the beginning of the hyper period
is represented by $r_{i.k}$, and the relative distance to its worst case
response is denoted by $R^{-}_{i}$. Data is passed between tasks using
labels. A label $l \in L_{ij}$ represents a variable that is accessed
(i.e. either read or written) by a sub-task $\tau_{ij}$.
Finally, a task chain $\sigma^{n}_{i} \in \mathcal{S}$ is denoted as a finite
sequence of $n \geq 1$ tasks $(\tau_1, \ldots, \tau_n)$ and represents the data
propagation flow over the given task set, i.e. each path that
can be constructed from a job $\tau_{m.\alpha}$ to a job of $\tau_{m+1}$ needs
to satisfy the criterion $r_{m+1} \geq r_{m.\alpha} + R^{-}_{m}$, i.e. the absolute
arrival of the successor needs to occur after the response of
its predecessor.
IV. PROPOSED SOLUTION FOR ANALYSIS CHALLENGE
A. Task Chains
In order to determine the application's end-to-end response
time, we have to derive the longest data propagation path
(critical path) from any of its sensor tasks (i.e. tasks with no
predecessor) to any of its actuator tasks (i.e. tasks with no
successor).
A convenient approach for this is to analyze the communication
graph spanned by runnables and their label accesses
provided by the AMALTHEA model. However, the temporal
behavior is ambiguous due to labels with multiple read and
write accesses from runnables executed by different tasks. For
instance, the label Cloud map host is both read and written by
the runnables Lidar Function (task Lidar Grabber) and
Localization Preprocessing (task PRE Localization gpu POST)
(cf. Fig. 1a). Since both tasks can be executed in parallel,
it is impossible to determine which task acts as the source of
the communication and which as its target. As a result, we
derive the communication flow based on the challenge's
description in [3] and ensure data consistency by introducing
transitive labels for task-to-task communication that ensure a
proper data flow by limiting the number of writers per label to
one (cf. Fig. 1b).
Fig. 1. Example of inconsistent data accesses due to multiple write sources
to a label (a) and with consistent accesses by adding transitive labels (b)
Since mapping decisions may lead to an excessively or even
indefinitely blocked task, each task within a data propagation
path may increase the path's response time. Accordingly, we
consider all routes within the application from a sensor task to
the actuator task and specify appropriate task chains. Possible
sensor tasks for a critical path are identified as tasks without
incoming edges, i.e. the tasks Lidar Grabber, Detection, CAN,
SFM, and Lane Detection, whereas the sink is realized by
DASM, which is the only task without an outgoing edge.
The task chains that become part of our analysis are further
specified in Tab. I.
TABLE I
IDENTIFIED TASK CHAINS
Task Chain   Tasks
σ1           Lidar Grabber → Loc → EKF → Planner → DASM
σ2           CAN → Loc → EKF → Planner → DASM
σ3           SFM → Planner → DASM
σ4           Lane detection → Planner → DASM
σ5           Detection → Planner → DASM
B. End-to-end Latency
The application's end-to-end latency depends on the applied
communication paradigm and the corresponding time at which
data from one task is propagated to its follower in a task chain.
In the following, we describe the approaches for determining
a task chain's end-to-end latency for implicit and LET-based
communication paradigms that will be used for deriving the
application's worst case end-to-end timing.
The worst case end-to-end latency of the application $L_{E2E}$
is determined by the highest latency among the set $\mathcal{S}$ of all
task chains, formally denoted as $L_{E2E} = \max_{\sigma \in \mathcal{S}}(L_{TC}(\sigma))$
with $L_{TC}$ being the function that determines the task chain
worst case latency.
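As a minimal illustration of this maximum, the following Python sketch picks the dominating chain from the implicit-communication latencies reported later in Tab. III. The dictionary keys and the function name are illustrative choices and not part of the original tooling.

```python
# Worst case chain latencies for implicit communication, copied from Tab. III.
# The keys "sigma1".."sigma5" are illustrative names for the chains of Tab. I.
CHAIN_LATENCY = {
    "sigma1": 859.9,
    "sigma2": 836.9,
    "sigma3": 59.9,
    "sigma4": 71.9,
    "sigma5": 221.9,
}


def end_to_end_latency(latencies):
    """Return (chain, latency) of the chain that determines L_E2E."""
    critical = max(latencies, key=latencies.get)
    return critical, latencies[critical]


critical_chain, l_e2e = end_to_end_latency(CHAIN_LATENCY)
```

With these values the critical chain is σ1, which matches the chain highlighted in the conclusion.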
Implicit communication: In order to determine the application's
end-to-end latency while assuming an implicit communication
paradigm, we apply the approach by Kloda et al. [5].
The worst case latency of a task chain $\sigma^n$ is derived by
analyzing the latency of each data propagation path that
originates during the hyper period $H = \mathrm{lcm}\{P_i \mid \tau_i \in \mathcal{T}\}$, which is
sufficient due to the recurring execution and communication
pattern.
$$L_{TC} = \max_{k \,\mid\, kT_1 < H} L_{TC}(\sigma^n, kT_1) \tag{1}$$
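The hyper period and the set of start instants that Eq. 1 iterates over can be sketched as follows. This is a minimal Python sketch that assumes integer periods taken from column P of Tab. II; the function names are hypothetical.

```python
from functools import reduce
from math import gcd


def hyper_period(periods):
    """Least common multiple of all task periods (H in the text)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def release_instants(first_period, hyper):
    """All release instants k * T_1 of the chain's first task with k * T_1 < H."""
    return list(range(0, hyper, first_period))


# Periods as listed in column P of Tab. II.
PERIODS = [5, 10, 12, 15, 33, 66, 100, 200, 400]
H = hyper_period(PERIODS)

# Start instants for a chain whose first task has period 33 (e.g. Lidar Grabber).
starts = release_instants(33, H)
```

For the periods of Tab. II this gives a hyper period of 13200 and 400 start instants for a chain headed by a task with period 33, so the number of paths to evaluate per chain stays small.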
As shown in [5], an upper bound for the worst-case latency of
an instance of task chain $\sigma^n$ starting at time instant $r_p$ is determined
by Eq. 2, with the producer's release time $r_p$, the consumer's
release time $r_c$, and a sub-chain $\sigma^{n-1}$ that contains all tasks
from $\sigma^n$ except for the first element.
$$L_{TC}(\sigma^n, r_p) \leq \begin{cases} (r_c - r_p) + L_{TC}(\sigma^{n-1}, r_c) & n \geq 2 \\ R^{-}_{p} & n = 1 \end{cases} \tag{2}$$
We illustrate this function in Fig. 2, which shows 3 tasks
that form a task chain with a total latency of 14 time units.
Naturally, data propagation is delayed due to different periods
and arrival times of tasks. For instance, let us consider the first
sub-chain consisting of a producer $\tau_p$, represented by Task A
with a release time $r_p = 54$, and a consumer $\tau_c$, represented
by Task B that arrives at $r_c = 56$. The latency due to different
arrival times is $r_c - r_p = 2$ time units, and the task chain's
total latency becomes $2 + L_{TC}(\sigma^{n-1}, 56)$. In the second
sub-chain, Task B becomes the producer $\tau_p$ and arrives at time
instant $r_p = 56$, whereas the next consumer (Task C) capable
of processing the producer's output arrives at instant $r_c = 60$.
Accordingly, the delay becomes $r_c - r_p = 4$ time units, plus
the worst case response time of Task C (8 time units), leading
to the chain's total latency of $2 + 4 + 8 = 14$ time units.
Fig. 2. Data propagation flow in a task chain with 3 tasks
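The recursion of Eq. 2 can be replayed on the example of Fig. 2 with a few lines of Python. This is only a sketch: it assumes that the release instants of the jobs along the path are already known (54, 56 and 60 in the example) and that the worst case response time of the last task is 8 time units, as stated above. The function name is hypothetical.

```python
def chain_latency(releases, wcrt_last):
    """Upper bound of Eq. 2 for one data propagation path.

    releases  -- release instants of the jobs along the path, first task first
    wcrt_last -- worst case response time of the last task in the chain
    """
    if len(releases) == 1:
        # n = 1: only the last task remains, so its response time is the bound.
        return wcrt_last
    # n >= 2: waiting time until the consumer is released, plus the sub-chain.
    return (releases[1] - releases[0]) + chain_latency(releases[1:], wcrt_last)


# Task A released at 54, Task B at 56, Task C at 60 with a WCRT of 8.
example = chain_latency([54, 56, 60], 8)
```

The call reproduces the 2 + 4 + 8 = 14 time units derived in the text.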
The release time $r_c$ of a consumer $\tau_c$ is derived by Eq. 3.
If both the consumer and the producer are allocated
to the same processing unit and the producer has a higher
priority than the consumer, any job of the consumer
that satisfies $r_c \geq r_p$ will always be executed after the
respective producer's job. Accordingly, a safe release time for
the consumer that guarantees that it will read the most recent
output of the producer is derived by the first case in Eq. 3.
In all other cases, only a consumer that is released after the
producer finishes its work, i.e. $r_c > r_p + R^{-}_{p}$, will be capable
of reading the producer's output.
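$$r_c = \begin{cases} \left\lceil \frac{r_p}{T_c} \right\rceil T_c & \text{iff } \pi(\tau_p) > \pi(\tau_c) \text{ and } P(\tau_p) = P(\tau_c) \\ \left\lceil \frac{r_p + R^{-}_{p}}{T_c} \right\rceil T_c & \text{otherwise} \end{cases} \tag{3}$$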
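A small Python sketch of Eq. 3 is given below. It assumes integer time values and strictly periodic consumers released at multiples of their period; the flag `producer_runs_first` stands for the condition of the first case (same processing unit and higher producer priority). The periods and response times in the example calls are hypothetical values chosen so that the release instants of Fig. 2 are met.

```python
def consumer_release(r_p, wcrt_p, period_c, producer_runs_first):
    """Release instant of the first consumer job that reads the producer's output.

    r_p                 -- release instant of the producer job
    wcrt_p              -- worst case response time of the producer
    period_c            -- period T_c of the consumer
    producer_runs_first -- True if producer and consumer share a processing unit
                           and the producer has the higher priority
    """
    earliest = r_p if producer_runs_first else r_p + wcrt_p
    # Integer ceiling: round up to the next multiple of the consumer's period.
    return -(-earliest // period_c) * period_c


# Hypothetical parameters: consumer periods 8 and 12, producer WCRTs 2 and 3.
r_b = consumer_release(54, 2, 8, producer_runs_first=False)
r_c = consumer_release(r_b, 3, 12, producer_runs_first=False)
```

Feeding the resulting instants into the recursion of Eq. 2 yields the latency of one data propagation path.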
LET communication: The application's end-to-end latency
is determined using an adjusted variant of the approach from [5]
that allowed an efficient implementation for this challenge. Due
to the lack of space, we omit a detailed description.
C. Response Time
Before we present an approach for determining the worst
case response time in the given heterogeneous architecture,
we need to analyze if and how tasks scheduled on different
processing units impact each other. Therefore we illustrate the
execution of a very simple heterogeneous example system with
similar characteristics to the challenge in Fig. 3.
The system consists of four tasks $\tau_1$ to $\tau_4$ and three
processing units Core1, Core2 and GPU. Tasks $\tau_2$ and $\tau_3$ denote
regular tasks while $\tau_1$ and $\tau_4$ denote offloading tasks. An
offloading task $\tau_x$ is further subdivided into
• $\tau^{C}_{x1}$, being its pre-processing phase,
• $\tau^{G}_{x2}$, being its offloading or suspension phase that starts
as soon as its sub-task is launched on the accelerator and
lasts until the accelerator has finished its execution, and
• $\tau^{C}_{x3}$, being its post-processing phase that is instantly
released once the task offloaded to the GPU finishes
execution.
Fig. 3. Example of schedule with 3 tasks executed by 2 CPUs and 2 offloaded
tasks to the GPU.
It becomes apparent that an offloading task follows the
same periodic activation pattern on both the CPU and
the GPU. However, each subsequent phase of the offloading
task is delayed by a minimum offset equal to the best case
response time of the previous phase (cf. Fig. 3, 2nd iteration),
and a maximum offset equal to the worst case response time
(cf. Fig. 3, 4th iteration). In other words, the release of the
suspension phase will occur no earlier than the best case
response time of its pre-processing phase, and no later than
the pre-processing phase's worst case response time. In order
to consider this behavior during our response time analysis,
we need to adjust the values for offset and jitter for offloading
and post-processing phases. Without loss of generality, we can
describe this dependency by reformulating the notation from
[7] for subsequently released sub-tasks $\tau_{ij}$ belonging to the
same task $\tau_i$ as presented in Eq. 4.
$$\forall j \in \{2, \ldots, |\tau_i|\}: \quad O_{ij} = R^{+}_{ij-1}, \qquad J_{ij} = R^{-}_{ij-1} - R^{+}_{ij-1} \tag{4}$$
As both of our scheduling specific response time analy-
sis approaches applied in the following sub-sections have a
monotonic dependency of the response time on the jitter terms
[7], an iterative algorithm will guarantee the jitter value to
converge.
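The update rule of Eq. 4 can be sketched in Python as shown below. The sketch assumes that the best and worst case response times of all phases of one task are given as lists, and that the first phase starts with an offset and a jitter of zero; the latter is an assumption of this example and not stated in the text. In the full analysis this update would be repeated together with the response time computation until the jitter values no longer change.

```python
def offsets_and_jitters(r_best, r_worst):
    """Offsets O_ij and jitters J_ij for the phases of one task (Eq. 4).

    r_best[j]  -- best case response time of phase j (R+)
    r_worst[j] -- worst case response time of phase j (R-)
    """
    offsets = [0]  # assumption: the first phase has no offset
    jitters = [0]  # assumption: the first phase has no jitter
    for j in range(1, len(r_best)):
        offsets.append(r_best[j - 1])
        jitters.append(r_worst[j - 1] - r_best[j - 1])
    return offsets, jitters


# Hypothetical response times of a pre-processing, offloading and
# post-processing phase.
offsets, jitters = offsets_and_jitters([2, 7, 9], [3, 10, 14])
```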
In the following subsections we will present the respective
response time analysis approaches for tasks executed on the
CPU resp. on the GPU.
1) Tasks executed on the CPU: Tasks executed on the
CPU follow a fully preemptive fixed priority based scheduling
strategy. As their total computation time needs to account
for memory access times, we begin by determining the total
execution time for each sub-task that is scheduled on the CPU.
As presented in the previous section, we need to consider
bounds for the best and worst cases. Since the calculation
for both cases is generally the same and only differs in the
values that are used for $C_{ij,\rho}$ and $A_\rho$, i.e. $C^{+}_{ij,\rho}$ and $A^{+}_{\rho}$ for
the best case as well as $C^{-}_{ij,\rho}$ and $A^{-}_{\rho}$ for the worst case, we
provide a general notation that can be applied for both cases.
The total execution time $W_{ij}$ of a sub-task executed on
the CPU consists of the raw processing time $C_{ij,\rho}$ for the
processing unit $\rho$, and the delays introduced by accessing
labels and contention effects caused by other processing units
(CPU, GPU) accessing shared resources such as DRAM.
We formalize this behavior in Eq. 5 with $A_\rho$ being the access
time for accessing the shared memory from a processing unit $\rho$
and $\lambda_{ij}$ being the number of memory accesses. For simplicity,
we sum up the read and write accesses, since both have the
same access time in the given challenge.
$$W_{ij} = C_{ij,\rho} + \lambda_{ij} \cdot A_\rho \tag{5}$$
The number of memory accesses is calculated by
summing up the number of memory accesses per label $l$ in
Eq. 6, with $L_{ij}$ being the set of labels accessed by sub-task
$\tau_{ij}$.
$$\lambda_{ij} = \sum_{l \in L_{ij}} \frac{\mathrm{size}(l)}{\mathrm{size}(\mathrm{cacheline})} \tag{6}$$
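Eqs. 5 and 6 translate into a short Python sketch. Two points are assumptions of this example and not taken from the text: the cache line size of 64 bytes, and the rounding up to whole cache lines per label (Eq. 6 as printed only shows the quotient). Label sizes, execution time and access time are hypothetical integer values.

```python
def memory_accesses(label_sizes, cacheline=64):
    """Number of memory accesses lambda_ij for a list of label sizes in bytes.

    Assumption: each label occupies a whole number of cache lines, so the
    quotient of Eq. 6 is rounded up per label.
    """
    return sum(-(-size // cacheline) for size in label_sizes)


def total_execution_time(c, label_sizes, access_time, cacheline=64):
    """Total execution time W_ij = C_ij + lambda_ij * A (Eq. 5)."""
    return c + memory_accesses(label_sizes, cacheline) * access_time


# Hypothetical sub-task: 1000 ns of computation, three labels, 220 ns per access.
accesses = memory_accesses([64, 100, 1])
w = total_execution_time(1000, [64, 100, 1], 220)
```

The same function covers the best and the worst case by passing the respective execution and access times.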
In order to determine the worst case response time of any
task $\tau_i$, we need to identify the response time of its last
sub-task $\tau_{i|\tau_i|}$ in Eq. 7.
$$R^{-}_{i} = R^{-}_{i|\tau_i|} \tag{7}$$
What remains now is deriving bounds on the best and worst
case response times $R^{+}_{ij}$ and $R^{-}_{ij}$ for each of the sub-tasks.
A good approximation for the best case response time can be
obtained by summing up the total computation times of a
sub-task's predecessors including itself, as shown in Eq. 8 [7].
$$R^{+}_{ij} = \sum_{k=1 \ldots j} W^{+}_{ik} \tag{8}$$
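Eq. 8 is a running sum over the best case execution times of a task's sub-tasks, which the following Python sketch makes explicit. The input values are hypothetical.

```python
from itertools import accumulate


def best_case_response_times(w_best):
    """Best case response time R+ of every sub-task of one task (Eq. 8).

    w_best[k] is the best case total execution time W+ of sub-task k,
    listed in execution order.
    """
    return list(accumulate(w_best))


# Hypothetical pre-processing, offloading and post-processing phases.
r_best = best_case_response_times([2, 5, 3])
```

Together with Eq. 4, the first two entries of the result would serve as the offsets of the second and third phase.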
The problem of finding the worst case response time in
task sets with offsets has been shown to be NP-complete
by Tindell et al. [11], as the effort grows exponentially with
the number of tasks. Although the challenge subject to this
paper consists of a comparatively small example with 10 tasks,
the analysis presented in this section will become part of the
optimization challenge addressed in Sec. V, which again is
an NP-complete problem. Consequently, we find it desirable to
apply an efficient approximation that will not only scale well
with larger problem sizes, i.e. have a polynomial complexity,
but also provide solutions close to the exact response time
analysis.
Therefore, we focus on the upper-bound approximation
approach that has been developed by Tindell et al. [11] and
later refined by Palencia et al. [7] for tasks with dynamic
offsets.
The maximum response time $R^{-}_{ij}$ for a given sub-task $\tau_{ij}$
is obtained by Eq. 9 [7] by checking every critical instant
for an instance $p$ that falls into the task's busy period, with the
function $hp_i$ returning all higher priority sub-tasks that belong
to task $\tau_i$.
$$R^{-}_{ij} = \max_{\forall c \in hp_i(\tau_{ij}) \cup \tau_{ij}} \left[ \max_{p = p_{0,ijc}, \ldots, p_{L,ijc}} \left( R_{ijc}(p) \right) \right] \tag{9}$$
The response time for a given instance $p$ when the critical
instant coincides with the activation of task $\tau_{ic}$ is then derived
by Eq. 10 [7] by subtracting the phase between both tasks and
its release time from the absolute response time, and adding
its offset.
$$R_{ijc}(p) = w_{ijc}(p) - \Phi_{ijc} - (p-1)T_i + O_{ij} \tag{10}$$
Due to lack of space and since we did not modify the original
approach, we omit a further description and refer to the
original source in [7] for a complete formulation of the
remaining functions.
2) Tasks executed on the GPU: Sub-tasks executed
on GPUs (kernels) follow a weighted round robin scheduling
policy. A context switch to the next queued kernel occurs
(i) after the kernel's predefined time slice elapses or (ii) when
its execution is finished, whichever is encountered first. Since
kernels also follow a strict read-execute-write policy, their total
execution time $W^{G}_{ij}$ needs to account for the copy engine's
copy-in $C^{CE}_{in}$ and copy-out $C^{CE}_{out}$ operations along with the
execution engine's execution time $C^{EE}$ (Eq. 11).
$$W^{G}_{ij} = C^{CE}_{in,ij} + C^{EE}_{ij} + C^{CE}_{out,ij} \tag{11}$$
The delay introduced by the copy engine's copy-in and copy-out
operations is derived similarly to the memory access of
CPU tasks. In order to create or copy back a local label, the
memory has to be accessed twice. Due to the equal access
times for read and write accesses, the duration of a copy engine
operation can be determined equivalently to Eq. 5 by multiplying
the number of accesses $\lambda_{ij}$ by the access time $A_\rho$ and a factor
of 2 to account for the respective write access for each read
access (Eq. 12).
$$C^{CE}_{in,ij} + C^{CE}_{out,ij} = 2 \cdot \lambda_{ij} \cdot A_\rho \tag{12}$$
Similar to tasks executed on the CPU, this approach can be
used for determining both worst and best case execution
times by considering the respective execution and access times,
i.e. $C^{+}_{ij}$ and $A^{+}_{\rho}$ for the best case and $C^{-}_{ij}$ and $A^{-}_{\rho}$ for
the worst case.
For the remainder of this section, we adapt the approach in
[8] for determining the best and worst case response times in
weighted round robin scheduled tasks by checking different
time windows for a given busy period.
With a single stream, the worst case response time $R^{G}_{ij}$ of a
sub-task $\tau_{ij}$ is derived (slightly reformulated compared to [8])
as presented in Eq. 13 by determining the maximum of all
response times $R^{G}_{ij}(q)$ for the $q$-th time window of the busy
period, with $n^{+}_{ij}(\Delta t)$ being an arrival function that returns the
maximum number of activations for the given interval $\Delta t$ [9].
$$R^{G}_{ij} = \max_{1 \leq q \leq n^{+}_{ij}(w_{ij}(q))} R^{G}_{ij}(q) \tag{13}$$
The execution time for the first $q$ activations, including
all interferences $I$ caused by other tasks that are executed on
the GPU, is given by $w_{ij}(q)$ in Eq. 14.
$$w_{ij}(q) = q \cdot W^{G}_{ij} + I_{ij}(q) \tag{14}$$
The response time for a given time window $R^{G}_{ij}(q)$ is
then derived in Eq. 15 by subtracting the earliest absolute
release time $\delta^{-}_{ij}(q)$ [9] of the $q$-th time window from the total
execution time.
$$R^{G}_{ij}(q) = w_{ij}(q) - \delta^{-}_{ij}(q) \tag{15}$$
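The structure of Eqs. 14 and 15 can be illustrated with the Python sketch below. Note that the interference term used here is a coarse, classical round-robin bound (every other kernel consumes its full time slice in each round) and is explicitly not the improved analysis of [8] that the paper relies on. All parameter values are hypothetical integers.

```python
def wrr_response_bound(w, own_slice, other_slices, q=1, release_offset=0):
    """Coarse response time bound for the q-th time window under WRR.

    w              -- total execution time W^G of one activation
    own_slice      -- time slice of the analyzed kernel
    other_slices   -- time slices of all other kernels on the GPU
    q              -- number of activations in the busy window
    release_offset -- earliest release of the q-th activation (delta^-(q))
    """
    demand = q * w
    # Number of own time slices needed to serve the demand (integer ceiling).
    rounds = -(-demand // own_slice)
    # Assumption: each other kernel uses its full slice in every round.
    interference = rounds * sum(other_slices)
    busy_window = demand + interference  # structure of Eq. 14
    return busy_window - release_offset  # structure of Eq. 15


# Hypothetical kernel: 10 time units of work, slice 4, two competing kernels.
bound_q1 = wrr_response_bound(10, 4, [3, 2])
bound_q2 = wrr_response_bound(10, 4, [3, 2], q=2, release_offset=30)
```

Following Eq. 13, the worst case response time would then be the maximum of these values over all time windows of the busy period.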
For determining the individual interference $I_{ij}(q)$ for the
$q$-th time window as well as the modifications required for
determining the best case response time $R^{G+}_{ij}$, we refer to
the original work in [8].
3) Memory Access Latency: As stated in the challenge's
description [3], the time $A_\rho$ for reading or writing to the memory
from processing unit $\rho$ can be calculated by Eq. 16 for
CPUs and Eq. 17 for GPUs, with $CC_\rho$ being the number
of cycles for accessing the memory, $f_\rho$ the processing unit's
frequency, and $\zeta$ the number of processing units concurrently
accessing the memory. Moreover, the constant $\kappa_\rho$ denotes
the increase in latency for each interfering CPU, whereas $\gamma_\rho$
represents the increase in latency if the GPU's copy engine is
performing operations.
$$A_\rho = \frac{CC_\rho}{f_\rho} + \kappa_\rho \cdot \zeta + \gamma_\rho \tag{16}$$
$$A_\rho = \frac{CC_\rho}{f_\rho} + 0.5\,\mathrm{ns} \cdot \zeta \tag{17}$$
In the following, we illustrate the worst case in which all
$\zeta = 5$ neighboring cores as well as the GPU's copy engine
cause contention. Given the provided values for $\kappa_\rho$ and $\gamma_\rho$ in
[3], the previous equations can be simplified into a worst case
access time $A^{-}_{\rho}$ in Eq. 18.
$$A^{-}_{\rho} = \begin{cases} 220\,\mathrm{ns} & \text{for CPUs (A57)} \\ 38\,\mathrm{ns} & \text{for CPUs (Denver)} \\ 6\,\mathrm{ns} & \text{for GPUs} \end{cases} \tag{18}$$
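For the best case, we assume that none of the neighboring
cores cause contention, which leads to the best case access
time $A^{+}_{\rho}$ in Eq. 19.
$$A^{+}_{\rho} = \begin{cases} 20\,\mathrm{ns} & \text{for CPUs (A57)} \\ 8\,\mathrm{ns} & \text{for CPUs (Denver)} \\ 3\,\mathrm{ns} & \text{for GPUs} \end{cases} \tag{19}$$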
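The relation between Eqs. 16 to 19 can be checked with the Python sketch below. The base latencies are the best case values of Eq. 19. The per-core and copy-engine increments, as well as the number of six interfering cores for the GPU, are assumptions of this example that were chosen to be consistent with Eq. 18; the authoritative values are those given in the challenge description [3].

```python
def access_time(base_ns, per_core_ns, interfering_cores, copy_engine_ns=0.0):
    """Memory access time in ns following the structure of Eqs. 16 and 17.

    base_ns           -- uncontended latency CC / f (best case of Eq. 19)
    per_core_ns       -- increase per interfering processing unit (kappa)
    interfering_cores -- number of concurrently accessing units (zeta)
    copy_engine_ns    -- increase while the copy engine is active (gamma)
    """
    return base_ns + per_core_ns * interfering_cores + copy_engine_ns


# Assumed increments (kappa, gamma); not taken from the paper itself.
worst_a57 = access_time(20, 20, 5, 100)
worst_denver = access_time(8, 2, 5, 20)
worst_gpu = access_time(3, 0.5, 6)

# Best case: no interfering units and an idle copy engine.
best_a57 = access_time(20, 20, 0)
```

Under these assumptions the sketch reproduces the 220 ns, 38 ns and 6 ns of Eq. 18.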
4) Results for analysis model: The results of our analysis
are presented in Tab. II using the notation from Sec. III. Tasks
denoted with an asterisk (*) represent offloading tasks that are
executed on the CPU iff the offloaded task is being executed
on the GPU. It becomes apparent that task Planner cannot
be scheduled, as its total worst case execution time $W_i =
12.4 + 0.8 = 13.2$ ms exceeds its period $T = 12$ ms. Accordingly,
we decide to reduce its number of ticks by 10% in order to
make it schedulable on a Denver core while maintaining a
tight bound.
As the remaining model is not schedulable due to its
allocations [3], we will apply the remainder of our analysis,
i.e. the end-to-end analysis and the response time analysis,
on the feasible solution that is generated in the design space
exploration challenge.
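The schedulability argument for task Planner can be retraced with integer arithmetic, as in the following Python sketch. Times are expressed in microseconds to avoid floating point rounding; the function name is hypothetical, and the check is only the necessary condition that a task's own demand must fit into its period.

```python
def fits_in_period(exec_us, comm_us, period_us):
    """Necessary condition: execution plus communication time fits the period."""
    return exec_us + comm_us <= period_us


# Planner on a Denver core (Tab. II): C- = 12.4 ms, lambda * A- = 0.8 ms, P = 12 ms.
PLANNER_EXEC_US = 12400
PLANNER_COMM_US = 800
PLANNER_PERIOD_US = 12000

original_ok = fits_in_period(PLANNER_EXEC_US, PLANNER_COMM_US, PLANNER_PERIOD_US)

# Reducing the number of ticks by 10% scales the execution time accordingly.
reduced_exec_us = PLANNER_EXEC_US * 90 // 100
reduced_ok = fits_in_period(reduced_exec_us, PLANNER_COMM_US, PLANNER_PERIOD_US)
```

The reduced execution time of 11.16 ms corresponds to the rounded 11.2 ms listed for Planner in Tab. IV.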
TABLE II
WORST CASE EXECUTION AND COMMUNICATION TIMES FOR EACH TASK AND PROCESSING UNIT (all values in ms)
                      |          Denver          |           A57            |          Pascal
Name              P   | C⁻     C⁺     λ·A⁻  λ·A⁺ | C⁻     C⁺     λ·A⁻  λ·A⁺ | C⁻     C⁺     λ·A⁻  λ·A⁺
DASM              5   | 1.3    1.0    0.0   0.0  | 1.9    1.3    0.0   0.0  | -      -      -     -
CANbus polling    10  | 0.6    0.4    0.0   0.0  | 0.6    0.4    0.0   0.0  | -      -      -     -
Planner           12  | 12.4   9.5    0.8   0.2  | 13.2   9.6    4.4   0.4  | -      -      -     -
EKF               15  | 4.4    4.1    0.0   0.0  | 4.8    4.0    0.0   0.0  | -      -      -     -
Lidar Grabber     33  | 10.9   9.8    2.1   0.4  | 13.7   10.2   12.0  1.1  | -      -      -     -
SFM               33  | 27.8   22.2   2.4   0.5  | 29.5   24.1   13.9  1.3  | 7.9    7.0    0.4   0.2
SFM*              33  | 6.7    5.4    3.6   0.8  | 7.9    6.3    20.8  1.9  | -      -      -     -
Lane detection    66  | 42.2   38.4   2.4   0.5  | 51.0   47.8   13.8  1.3  | 27.3   24.5   0.4   0.2
Lane detection*   66  | 7.6    6.1    2.4   0.5  | 8.3    6.8    13.8  1.3  | -      -      -     -
OS Overhead       100 | 50.0   50.0   0.0   0.0  | 50.0   50.0   0.0   0.0  | -      -      -     -
Detection         200 | -      -      -     -    | -      -      -     -    | 116.0  108.0  0.5   0.3
Detection*        200 | 4.1    3.0    3.3   0.7  | 4.7    4.0    19.0  1.8  | -      -      -     -
Localization      400 | 294.8  276.7  1.8   0.4  | 387.4  366.5  10.3  0.9  | 124.0  117.0  0.3   0.1
Localization*     400 | 14.5   6.1    1.8   0.4  | 17.6   7.3    10.3  0.9  | -      -      -     -
V. PROPOSED SOLUTION FOR OPTIMIZATION CHALLENGE
Our scope in the optimization challenge lies in minimizing
the application's end-to-end latency, i.e. the term $L_{E2E} =
\max_{\sigma \in \mathcal{S}}(L_{TC}(\sigma))$ as specified in the previous section. With
regard to the communication paradigms, the only approach
for reducing the end-to-end latency in LET communication
would be modifying the tasks' periods $P_i$, which is out of
the scope of this paper. Therefore, we focus on finding a
feasible allocation for this case that ensures schedulability.
For the implicit case, the application's latency can be
optimized by altering the tasks' worst case response times.
Consequently, we use the tasks' priorities, the allocation of
tasks to processing units, and the time slices for tasks offloaded
to the GPU as degrees of freedom in our approach.
For priority assignment, we apply Audsley's approach [1],
which determines priorities leading to a feasible schedule,
and combine it with the recurrence relation for assigning offsets
and jitters for tasks performing offloading. The remainder
of our optimization is performed by a genetic-algorithm-based
approach that aims at minimizing the application's end-to-end
latency. It is implemented in Java using the open source library
Jenetics [12] and extends the native DSE capability [6] of
App4MC. The genetic algorithm has been executed on an
Intel Core i5-3570K quad-core CPU operating at 3.4 GHz with
an initial population of 500 randomly initialized individuals
and a termination criterion of 1000 iterations after a steady, i.e.
non-improving, fitness value.
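To illustrate the general shape of such an optimization loop, the following Python sketch evolves task-to-core allocations with elitism, one-point crossover and mutation. It is not the Jenetics-based Java implementation described above: the task set, the number of cores, the fixed number of generations and, in particular, the fitness function (maximum core utilization as a stand-in for the end-to-end latency analysis of Sec. IV) are simplifying assumptions of this example.

```python
import random

# Hypothetical task set: (period, worst case execution time) in ms.
TASKS = [(5, 1.9), (10, 0.6), (15, 4.8), (33, 13.7), (66, 51.0), (100, 50.0)]
N_CORES = 3


def fitness(allocation):
    """Stand-in fitness: utilization of the most loaded core (lower is better)."""
    load = [0.0] * N_CORES
    for core, (period, wcet) in zip(allocation, TASKS):
        load[core] += wcet / period
    return max(load)


def evolve(seed_individual, generations=50, pop_size=20, rng=None):
    """Evolve allocations; the best half of each generation always survives."""
    rng = rng or random.Random(0)
    population = [list(seed_individual)]
    while len(population) < pop_size:
        population.append([rng.randrange(N_CORES) for _ in TASKS])
    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[: pop_size // 2]  # elitism
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            cut = rng.randrange(1, len(TASKS))  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            if rng.random() < 0.2:  # mutation: move one task to a random core
                child[rng.randrange(len(child))] = rng.randrange(N_CORES)
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)


baseline = [0] * len(TASKS)  # all tasks on core 0
best = evolve(baseline)
```

Because the best individuals are carried over unchanged, the returned allocation is never worse than the seeded baseline. In the paper's setting, the genes would additionally encode GPU time slices, and the fitness would be the end-to-end latency obtained from the analysis.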
The application's end-to-end latency for each of the previously
described task chains is listed in Tab. III, with the
deployment leading to these results in Tab. IV. The deployment
was found after approx. 5 minutes, with the majority
of CPU time (287 out of 291 seconds) taken by the fitness
calculation as described in Sec. IV.
We can observe that implicit communication slightly (3%–28%,
avg. 10%) outperforms LET communication, which
seems plausible considering that the response times of those
tasks that form a task chain are very close to the respective
task's period. As expected, one of the Denver cores is forced to
exclusively execute the task Planner due to a lack of alternatives.
TABLE III
WORST-CASE END-TO-END LATENCY FOR LET COMMUNICATION (LEFT)
AND IMPLICIT COMMUNICATION (RIGHT)
Task Chain   LET end-to-end   Implicit end-to-end
σ1           886              859.9
σ2           865              836.9
σ3           67               59.9
σ4           100              71.9
σ5           230              221.9
TABLE IV
WORST CASE EXECUTION, COMMUNICATION, AND RESPONSE TIMES FOR
TASKS (all times in ms)
Name              P     π    C⁻      λ·A⁻   R⁻      φ
Core 0 (Denver)
Planner           12    9    11.2    0.8    12.0
Core 1 (Denver)
SFM*              33    6    6.7     3.6    31.5
Lane detection    66    2    42.2    1.2    53.6
Core 2 (A57)
CANbus polling    10    5    0.6     0.0    0.6
EKF               15    1    4.8     0      5.4
Core 3 (A57)
Localization      400   4    387.4   5.2    392.6
Core 4 (A57)
Lidar Grabber     33    8    13.7    12.0   25.7
Detection*        200   7    4.7     1.8    198.0
Core 5 (A57)
OS Overhead       100   0    50      0.0    79.9
DASM              5     3    1.9     0.0    1.9
GP10B (GPU)
Detection         200   -    116.0   0.5    166.2   375
SFM               33    -    7.9     0.0    19.9    11.6
VI. CONCLUSION AND OUTLOOK
This work presents an approach for analyzing the end-to-end
latencies in applications for heterogeneous embedded systems.
It provides a detailed description of the challenges when
(i) analyzing the application's end-to-end response time and
(ii) minimizing it by optimizing the application's deployment.
Since the initial system was not schedulable under our worst-case
assumption (i.e. the total worst case execution time of
task Planner exceeded its period), we reduced the number of
ticks by 10% for the task Planner only. We have presented our
results for both challenges, consisting of the application's worst
case response time (σ1) for the implicit and LET communication
paradigms as well as the worst case execution times,
worst case communication overheads including contention,
worst case response times, and the optimized and feasible
deployment of the application.
Due to the complexity of the WATERS 2019 challenge and
the time required to fully comprehend its essential
characteristics, we had to narrow down the scope of our
contribution, leaving room for future work.
While our current solution only covers a single streaming
multiprocessor, it would be desirable to further exploit the
hardware's capabilities by utilizing both SMs while considering
any contention effects this would introduce. Moreover, we
plan to consider other scheduling approaches that allow e.g.
migrating tasks at run-time. Finally, we would like to back up
our findings with benchmarks of the prototypical application
by executing it on an NVidia TX2 in order to increase the
accuracy of our approaches.
REFERENCES
[1] Neil C. Audsley. Optimal priority assignment and feasibility of static
priority tasks with arbitrary start times. 2007.
[2] AUTOSAR. Specification of RTE 4.2.2, 2015. URL:
https://www.autosar.org/fileadmin/user upload/standards/classic/4-0/
AUTOSAR SWS RTE.pdf.
[3] Arne Hamann, Dakshina Dasari, and Falk Wurst. WATERS Industrial
Challenge 2019. 2019. URL: https://www.ecrts.org/waters/.
[4] Arne Hamann, Dakshina Dasari, Simon Kramer, Michael Pressler, and
Falk Wurst. Communication Centric Design in Complex Automo-
tive Embedded Systems. In Marko Bertogna, editor, 29th Euromi-
cro Conference on Real-Time Systems (ECRTS 2017), volume 76 of
Leibniz International Proceedings in Informatics (LIPIcs), pages 10:1–
10:20, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum
fuer Informatik. URL: http://drops.dagstuhl.de/opus/volltexte/2017/7162,
doi:10.4230/LIPIcs.ECRTS.2017.10.
[5] Tomasz Kloda, Antoine Bertout, and Yves Sorel. Latency analysis
for data chains of real-time periodic tasks. In 23rd IEEE Interna-
tional Conference on Emerging Technologies and Factory Automation,
ETFA 2018, Torino, Italy, September 4-7, 2018, pages 360–367. IEEE,
2018. URL: https://doi.org/10.1109/ETFA.2018.8502498,
doi:10.1109/ETFA.2018.8502498.
[6] Lukas Krawczyk, Carsten Wolff, and Daniel Fruhner. Automated
distribution of software to multi-core hardware in model based em-
bedded systems development. In Giedre Dregvaite and Robertas
Damasevicius, editors, Information and Software Technologies - 21st
International Conference, ICIST 2015, Druskininkai, Lithuania, Oc-
tober 15-16, 2015, Proceedings, volume 538 of Communications in
Computer and Information Science, pages 320–329. Springer, 2015.
URL: https://doi.org/10.1007/978-3-319-24770-0_28,
doi:10.1007/978-3-319-24770-0_28.
[7] J.C. Palencia and M. Gonzalez Harbour. Schedulability analysis for
tasks with static and dynamic offsets. 2002.
doi:10.1109/real.1998.739728.
[8] Razvan Racu, Li Li, Rafik Henia, Arne Hamann, and Rolf Ernst.
Improved response time analysis of tasks scheduled under preemptive
round-robin. In Soonhoi Ha, Kiyoung Choi, Nikil D. Dutt, and
Jürgen Teich, editors, Proceedings of the 5th International Conference
on Hardware/Software Codesign and System Synthesis, CODES+ISSS
2007, Salzburg, Austria, September 30 - October 3, 2007, pages
179–184. ACM, 2007. URL: https://doi.org/10.1145/1289816.1289861,
doi:10.1145/1289816.1289861.
[9] Kai Richter. Compositional Scheduling Analysis Using Standard Event
Models. PhD thesis, Dec 2004. URL:
https://publikationsserver.tu-braunschweig.de/receive/dbbs_mods_00001765.
[10] Selma Saidi, Sebastian Steinhorst, Arne Hamann, Dirk Ziegenbein, and
Marko Wolf. Special Session: Future Automotive Systems Design: Re-
search Challenges and Opportunities. In 2018 International Conference
on Hardware/Software Codesign and System Synthesis (CODES+ISSS),
pages 1–7. IEEE, Sep 2018. URL: https://ieeexplore.ieee.org/document/8525873/,
doi:10.1109/CODESISSS.2018.8525873.
[11] K.W. Tindell. Adding time-offsets to schedulability analysis.
Department of Computer Science, University of York, 1994.
[12] Franz Wilhelmstötter. Jenetics: Java genetic algorithm library, 2019. URL:
http://jenetics.io/.