pFaces∗: An Acceleration Ecosystem for Symbolic Control
Mahmoud Khaled
Hybrid Control Systems Group
Technical University of Munich
Munich, Germany
khaled.mahmoud@tum.de
Majid Zamani
Department of Computer Science
University of Colorado Boulder, USA
Department of Computer Science
Ludwig Maximilian University of Munich, Germany
majid.zamani@colorado.edu
ABSTRACT
The correctness of control software in many safety-critical applications, such as autonomous vehicles, is crucial. One technique to achieve correct control software is called "symbolic control", where complex systems are approximated by finite-state abstractions. Then, using those abstractions, provably-correct digital controllers are algorithmically synthesized for the concrete systems, satisfying complex high-level requirements. Unfortunately, the complexity of synthesizing such controllers grows exponentially in the number of state variables. However, if distributed implementations are considered, high-performance computing platforms can be leveraged to mitigate the effects of the state-explosion problem.

We propose pFaces, an extensible software ecosystem, to accelerate symbolic control techniques. It facilitates designing parallel algorithms and supervises their executions to utilize available computing resources. To demonstrate its capabilities, novel parallel algorithms are designed for abstraction-based controller synthesis. Then, they are implemented inside pFaces and dispatched, for parallel execution, to different heterogeneous computing platforms, including CPUs, GPUs and hardware accelerators (HWAs). Results show a remarkable reduction in computation time, by several orders of magnitude as the number of processing elements (PEs) increases, which easily outperforms all the existing tools.
CCS CONCEPTS
• Computing methodologies → Parallel algorithms; Graphics processors; • Computer systems organization → Embedded and cyber-physical systems; • Software and its engineering → Formal methods; Parallel programming languages; • Hardware → Hardware accelerators;
KEYWORDS
Symbolic Control; Discrete Abstractions; Reactive Synthesis; High Performance Computing; Parallel Algorithms; C++; OpenCL; Message Passing Interface (MPI); CPU; GPU; FPGA; HW Accelerators.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
HSCC '19, April 16–18, 2019, Montreal, QC, Canada
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6282-5/19/04…$15.00
https://doi.org/10.1145/3302504.3311798
ACM Reference Format:
Mahmoud Khaled and Majid Zamani. 2019. pFaces: An Acceleration Ecosystem for Symbolic Control. In 22nd ACM International Conference on Hybrid Systems: Computation and Control (HSCC '19), April 16–18, 2019, Montreal, QC, Canada. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3302504.3311798
1 INTRODUCTION
Recently, the world has witnessed many emerging safety-critical applications such as smart buildings, autonomous vehicles, and smart grids. These applications are examples of so-called cyber-physical systems (CPS). In CPS, embedded control software plays a significant role by monitoring and controlling several physical variables, such as pressure or velocity, through multiple sensors and actuators, and by communicating with other systems or with supporting computing servers. A novel approach to designing provably correct embedded control software in an automated fashion is via formal method techniques in control [12, 14], and in particular symbolic control.

Symbolic control provides algorithmically provably-correct controllers based on the dynamics of physical systems and some given high-level requirements. In symbolic control, physical systems are approximated by finite abstractions and then discrete controllers are automatically synthesized for those abstractions, using automata-theoretic techniques [6]. Finally, those controllers are refined to hybrid ones applicable to the original physical systems. Unlike traditional design-then-test flows, merging design phases with formal verification ensures that controllers are certified-by-construction.

Current implementations of symbolic control are designed to run serially on one CPU [1, 4, 5, 7, 8, 11, 13]. This way of implementation interacts poorly with the symbolic approach, whose complexity grows exponentially in the number of state variables in the model. This, consequently, limits the current implementations to small dynamical systems whose controllers are to be computed off-line. In this work, we investigate novel parallel algorithms and data structures, and utilize high-performance computing (HPC) platforms, to mitigate the effects of the state-explosion problem.

Traditionally, compute platforms such as supercomputers were expensive and inaccessible to many scientific communities. Motivated by the market (e.g., gaming and cryptocurrency mining), compute units (CUs) like GPUs showed remarkable improvements in speed. This introduced general-purpose GPU (GPGPU) computing to utilize such CUs for scientific data-parallel tasks. One good example is how GPUs played a major role in crunching data collected by the LIGO observatories, in 2015, making the detection of gravitational waves possible.

∗ A patent application for the tool is filed with the European Patent Office (EPO). This work was supported in part by the H2020 ERC Starting Grant AutoCPS.
Figure 1: A computing model considered in pFaces. Colored boxes represent PEs of different computation power.
Lately, cloud-computing providers, like Amazon and Microsoft, have made it possible for customers to build small clusters combining CPUs, GPUs, and HW accelerators (HWAs).

This inspired the authors to use HPC platforms for mitigating the effects of the state-explosion problem in symbolic control. However, current and future techniques need to be (re-)designed for parallel and distributed execution. We propose pFaces as a software ecosystem that facilitates utilizing HPC platforms. The main contributions of this work are:
(1) an extensible software ecosystem to support utilizing HPC platforms, mainly for symbolic control and similar fields (e.g., reachability analysis);
(2) a novel parallelization of general abstraction-based controller synthesis that outperforms all the existing tools.
1.1 Similar Works
The concept behind pFaces is different from currently available tools [1, 4, 5, 7, 8, 11, 13]. They present implementations of separate techniques, while pFaces is intended to host the implementation of any technique. To the best of our knowledge, all existing tools are designed to run serially on one CPU. pFaces is the first-of-its-kind tool to support multiple CUs, including CPUs, GPUs, HWAs, and clusters combining heterogeneous configurations of them.

Surprisingly, most modern CPUs come equipped with internal GPUs that never get utilized by serial programs. For example, the Intel Core i5-6200U processor (HW configuration CPU1 in Table 2), which contains two CPU cores, has an internal GPU that can outperform the CPU cores if a data-parallel task is well implemented. Running any of those existing tools on this processor might not utilize the second CPU core and will never utilize the internal GPU, which is a waste of resources.

On the other hand, we parallelize a symbolic control technique similar to that used in [1, 7, 8, 11], implement it, and compare it with the other existing tools. As reported in Section 5, pFaces accelerates the technique and outperforms them by several orders of magnitude.
2 pFaces: A GENERIC ACCELERATOR
We first discuss what classes of heterogeneous computing platforms are considered in pFaces. Then, we present the general internal structure of pFaces. The work-flow inside pFaces is best understood with a parallel program in hand. Hence, we present it with one parallel implementation starting from Section 3.

Figure 2: Internal structure of pFaces.
2.1 Supported HPC Platforms
Figure 1 shows a general heterogeneous computing model where compute nodes (CNs) combine different CUs (e.g., CPUs, GPUs, and HWAs). All CUs are connected to one, possibly hierarchical, interconnection network. Each CU contributes to the system with a set of PEs. PEs represent the HW circuits doing mathematical and logical operations. PEs vary in computation power. For example, CPUs have a small number of PEs (a.k.a. cores or threads in CPU terminology) that are able to do fast mathematical computations. A GPU has less powerful PEs (a.k.a. pixel/vertex shader units in GPU terminology), but they come in large numbers. PEs of re-configurable HWAs (e.g., FPGAs) are customized HW circuits (e.g., logic circuits doing application-specific math/logic functionalities) for the maximum possible performance.

pFaces aims at providing scalable, distributed execution of parallel algorithms that utilizes all available PEs in such heterogeneous systems. To the best of our knowledge, pFaces is the only tool that can deal with all of these types of CUs simultaneously.
2.2 Internal Structure of pFaces
pFaces introduces a flexible interface for utilizing available computation resources to solve problems arising in the field of symbolic control, or similar fields. Therefore, the core of pFaces is developed independently of the problem targeted for acceleration. This is clearly depicted in Figure 2.

The management ecosystem, depicted in yellow, is independent of the Computation Kernel, depicted in purple and denoted by kernel for simplicity, which represents the job to be accelerated. The management modules are developed in C++ mixed with the Message Passing Interface (MPI). We choose C++ to balance runtime efficiency and portability. MPI enables running instances of pFaces over different CNs that communicate over a network.

Within each CN, a pFaces instance identifies available CUs using the resource identification and management engine. It keeps track of the underlying hardware architecture and runs parts of the kernel using the Kernel Tuner Module to assess the compute power of each of the identified CUs. A Management Engine Module orchestrates the work among the different modules and commands the Task Scheduler Module, which runs the kernel as efficiently as possible, using the data collected after resource identification. A Configuration Interface Module helps users interact with the kernel via text configuration files that follow rules defined by the kernel developer. With the Logging and Debugging Engine Module, pFaces informs the user about the current state of execution and delivers hints, suggestions, and debugging information about the executing kernel.

Kernels encapsulate the parallel algorithm under consideration. They should be developed in OpenCL with some additional extensions defined by pFaces. OpenCL is a standard and programming language for heterogeneous parallel computing. We select OpenCL as it is becoming a widely accepted standard for CPUs, GPUs, many embedded devices, and most recently, HWAs (e.g., FPGAs [2]).
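As an illustration of how a host process can identify the available CUs across vendors, the following is a minimal sketch using the standard OpenCL C++ bindings; it is not pFaces' actual resource-identification code, which layers tuning and scheduling on top of such an enumeration.

```cpp
// Minimal device-discovery sketch using the standard OpenCL C++ bindings.
// Illustrative only; pFaces' resource identification and management engine
// adds kernel tuning and task scheduling on top of such an enumeration.
#include <CL/opencl.hpp>
#include <iostream>
#include <vector>

int main() {
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);  // e.g., Intel, NVIDIA, Xilinx runtimes
  for (auto& platform : platforms) {
    std::vector<cl::Device> devices;
    platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);  // CPUs, GPUs, FPGAs
    for (auto& device : devices) {
      std::cout << device.getInfo<CL_DEVICE_NAME>() << ": "
                << device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>()
                << " compute units\n";
    }
  }
  return 0;
}
```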
3 A KERNEL FOR SYMBOLIC CONTROL
Some theoretical background is presented in this section, followed by a novel parallelized version of one of the common techniques in symbolic control.

Here, finite abstractions are constructed based on the theory in [9], which utilizes a growth-bound (GB) formula to over-approximate the reachable sets. Algorithmic controller synthesis is done based on the technique presented in [11], which uses fixed-point (FP) computations on the constructed finite-state models.

It is not possible to use those techniques directly in pFaces since they are manifested as serial algorithms. Novel parallel algorithms are therefore proposed for constructing finite abstractions and synthesizing symbolic controllers for them. In Section 5, it is shown that the algorithms scale remarkably as the number of PEs increases. We refer to this kernel as pFaces/GBFP.
3.1 Parallel Construction of Finite Abstractions
We consider general nonlinear systems given in the form of a differential equation:
$$\Sigma : \dot{\xi}(t) = f(\xi(t), u), \qquad (1)$$
where $\xi(t) \in X \subseteq \mathbb{R}^n$ is a state vector and $u \in U \subseteq \mathbb{R}^m$ is an input vector. We denote by $\xi_{x,u}(\cdot)$ the trajectory satisfying (1) at almost every $t \in [0, \tau]$, where $\tau \in \mathbb{R}^+$ is a sampling period, started from the initial condition $\xi_{x,u}(0) = x$ and under some input $u$. The set $\bar{X}$ is a finite partition of $X$ constructed by a set of hyper-rectangles of identical widths $\eta \in \mathbb{R}^n_+$. The set $\bar{U}$ is a finite subset of $U$. A finite abstraction of (1) is a finite-state system $\bar{\Sigma} = (\bar{X}, \bar{U}, T)$, where $T \subseteq \bar{X} \times \bar{U} \times \bar{X}$ is a transition relation crafted so that there exists a feedback-refinement relation (FRR) $R \subseteq X \times \bar{X}$ from $\Sigma$ to $\bar{\Sigma}$. Interested readers can find more details about FRRs in [9].
For the vector field of (1), a function $\Omega_f : \bar{X} \times \bar{U} \to X^2$ characterizes the over-approximations of the reachable sets starting from a set $\bar{x} \in \bar{X}$ when the input $\bar{u}$ is applied. For example, if the growth-bound map $\beta : \mathbb{R}^n \times U \to \mathbb{R}^n$ introduced in [9] is used, $\Omega_f$ can be defined as follows: $\Omega_f(\bar{x}, \bar{u}) = (x_{lb}, x_{ub}) := (-r + \xi_{\bar{x}_c,\bar{u}}(\tau),\; r + \xi_{\bar{x}_c,\bar{u}}(\tau))$, where $r = \beta(\eta/2, u)$ and $\bar{x}_c \in \bar{x}$ denotes the centroid of $\bar{x}$. An over-approximation of the reachable sets can then be obtained by the map $O_f : \bar{X} \times \bar{U} \to 2^{\bar{X}}$ defined by $O_f(\bar{x}, \bar{u}) = Q \circ \Omega_f(\bar{x}, \bar{u})$, where $Q$ is a quantization map defined by:
$$Q(x_{lb}, x_{ub}) = \{\bar{x}' \in \bar{X} \mid \bar{x}' \cap [[x_{lb}, x_{ub}]] \neq \emptyset\}, \qquad (2)$$
where $[[x_{lb}, x_{ub}]] = [x_{lb,1}, x_{ub,1}] \times [x_{lb,2}, x_{ub,2}] \times \cdots \times [x_{lb,n}, x_{ub,n}]$.
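To make the quantization map $Q$ of equation (2) concrete, the following is a minimal C++ sketch under the assumption of a uniform grid over $X$ with widths $\eta$ and a known lower-left corner $x_0$; the cell-index encoding is an illustration, not the one used inside pFaces/GBFP.

```cpp
// Minimal sketch of the quantization map Q of equation (2): collect all grid
// cells overlapping the hyper-rectangle [[x_lb, x_ub]]. Assumes a uniform
// grid over X with widths eta and lower-left corner x0 (illustrative only).
#include <cmath>
#include <vector>

std::vector<std::vector<long>> quantize(const std::vector<double>& x_lb,
                                        const std::vector<double>& x_ub,
                                        const std::vector<double>& x0,
                                        const std::vector<double>& eta) {
  const size_t n = x_lb.size();
  std::vector<long> lo(n), hi(n);
  for (size_t k = 0; k < n; ++k) {  // per-dimension range of overlapped cells
    lo[k] = (long)std::floor((x_lb[k] - x0[k]) / eta[k]);
    hi[k] = (long)std::floor((x_ub[k] - x0[k]) / eta[k]);
  }
  std::vector<std::vector<long>> cells;  // multi-indices of overlapping cells
  std::vector<long> idx = lo;
  while (true) {  // enumerate the index box [lo, hi] odometer-style
    cells.push_back(idx);
    size_t k = 0;
    while (k < n && ++idx[k] > hi[k]) idx[k++] = lo[k];
    if (k == n) break;
  }
  return cells;
}
```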
Algorithm 1: Traditional serial algorithm for constructing discrete abstractions.
Input: $\bar{X}$, $\bar{U}$, $O_f$
Output: A transition relation $T \subseteq \bar{X} \times \bar{U} \times \bar{X}$.
1  $T \leftarrow \emptyset$;
2  for all $\bar{x} \in \bar{X}$ do
3    for all $\bar{u} \in \bar{U}$ do
4      for all $\bar{x}' \in O_f(\bar{x}, \bar{u})$ do
5        $T \leftarrow T \cup \{(\bar{x}, \bar{u}, \bar{x}')\}$;
6      end
7    end
8  end
Algorithm 1 depicts the traditional algorithm for constructing finite abstractions of dynamical systems. The algorithm constructs, serially, $T \subseteq \bar{X} \times \bar{U} \times \bar{X}$ by iterating over all elements of $\bar{X} \times \bar{U}$. For any $(\bar{x}, \bar{u})$, the evaluation of $O_f$ and $T|_{(\bar{x},\bar{u})}$ is independent of any other element of $\bar{X} \times \bar{U}$. Algorithm 2 is then proposed as a parallelization of Algorithm 1. Each PE, annotated with an index $p \in \{1, 2, \cdots, P\}$, where $P$ is the number of available PEs, handles one $(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U}$. A function $I : \mathbb{N}^+ \setminus \{\infty\} \to \{1, 2, \dots, P\}$ maps a parallel job (i.e., an iteration of the parallel for-all statement) with index $i$ to a PE with index $p = I(i)$. The algorithm renders the abstraction task an ideal data-parallel task with no communication overhead among the processing elements. It is most suitable for CUs with a massive number of PEs (e.g., GPUs and supercomputers). However, having $P > |\bar{X} \times \bar{U}|$ is a waste of computation power. A variant is also provided in pFaces/GBFP that aggregates the computation of all $(\bar{x}, \bar{u})$ sharing the same $\bar{x}$ in one PE, which is more suitable for CUs with a small number of fast PEs (e.g., CPUs and FPGAs).

Instead of storing symbolic transitions in $T$, $\Omega_f$ is used to construct a distributed container $K := K^1_{loc} \cup K^2_{loc} \cup \cdots \cup K^P_{loc}$, where the subscript $loc$ indicates that $K^p_{loc} \subseteq \bar{X} \times \bar{U} \times X^2$ is stored in a local memory of the PE with index $p$. Lines 10-12 in Algorithm 2 are optional and can be omitted if there is no interest in obtaining a combined abstraction. We show in Subsection 3.2 that only the distributed containers $K^p_{loc}$ are required to synthesize symbolic controllers for $\bar{\Sigma}$.
Algorithm 2: Proposed parallel algorithm for constructing discrete abstractions.
Input: $\bar{X}$, $\bar{U}$, $\Omega_f$
Output: A characteristic set $K \subseteq \bar{X} \times \bar{U} \times X^2$.
1  $K \leftarrow \emptyset$;
2  for all $p \in \{1, 2, \cdots, P\}$ do
3    $K^p_{loc} \leftarrow \emptyset$;
4  end
5  for all $(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U}$ in parallel with index $i$ do
6    $p = I(i)$;
7    $(x_{lb}, x_{ub}) \leftarrow \Omega_f(\bar{x}, \bar{u})$;
8    $K^p_{loc} \leftarrow K^p_{loc} \cup \{(\bar{x}, \bar{u}, (x_{lb}, x_{ub}))\}$;
9  end
10 for all $p \in \{1, 2, \cdots, P\}$ do
11   $K \leftarrow K \cup K^p_{loc}$;
12 end
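As a rough illustration of how Algorithm 2 maps onto a CU, the following C++ sketch emulates PEs with threads and realizes the map $I(\cdot)$ as a strided loop; the container layout and the callable $\Omega_f$ are assumptions of the sketch, not pFaces/GBFP's actual data structures.

```cpp
// A minimal C++ sketch of Algorithm 2. Threads emulate PEs; the indexing map
// I(i) is realized as a strided loop. omega_f is a user-supplied callable
// returning (x_lb, x_ub); all names and layouts here are illustrative.
#include <functional>
#include <thread>
#include <vector>

struct HyperRect { std::vector<double> lb, ub; };  // (x_lb, x_ub)
struct Entry { long x, u; HyperRect reach; };      // one element of K_loc^p

std::vector<std::vector<Entry>> build_abstraction(
    long nX, long nU, int P,
    const std::function<HyperRect(long, long)>& omega_f) {
  std::vector<std::vector<Entry>> K_loc(P);        // lines 1-4: per-PE containers
  std::vector<std::thread> pes;
  for (int p = 0; p < P; ++p) {
    pes.emplace_back([&, p] {
      // lines 5-9: PE p handles exactly the jobs i with I(i) = p
      for (long i = p; i < nX * nU; i += P) {
        long x = i / nU, u = i % nU;
        K_loc[p].push_back({x, u, omega_f(x, u)});
      }
    });
  }
  for (auto& t : pes) t.join();
  return K_loc;  // lines 10-12 (combining into K) are optional and omitted
}
```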
Note that using $K$ rather than $T$ is more efficient, since $|T|$ is sensitive to $|O_f(\bar{x}, \bar{u})|$, while $|K| = 2n|\bar{X} \times \bar{U}|$ is constant and consumes, practically, less memory. This becomes more important when such operations are executed in PEs of GPUs or FPGAs, which are known for having limited memory. In Subsection 3.3, we show that $K^p_{loc}$ can also be omitted and the abstraction done on the fly.
3.2 Parallel Synthesis of Symbolic Controllers
Given $\bar{\Sigma} = (\bar{X}, \bar{U}, T)$, we define the controllable predecessor map $CPre_T : 2^{\bar{X} \times \bar{U}} \to 2^{\bar{X} \times \bar{U}}$ for $Z \subseteq \bar{X} \times \bar{U}$ by:
$$CPre_T(Z) = \{(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U} \mid \emptyset \neq T(\bar{x}, \bar{u}) \subseteq \pi_{\bar{X}}(Z)\}, \qquad (3)$$
where $\pi_{\bar{X}}(Z) = \{\bar{x} \in \bar{X} \mid \exists \bar{u} \in \bar{U} : (\bar{x}, \bar{u}) \in Z\}$, and $T(\bar{x}, \bar{u})$ is an interpretation of the transition set $T$ as a map $T : \bar{X} \times \bar{U} \to 2^{\bar{X}}$ that evaluates a set of post-states from a state-input pair. We consider reachability and invariance specifications given by the LTL formulae $\Diamond\psi$ and $\Box\psi$, respectively, where $\psi$ is a propositional formula over a set of atomic propositions $AP$. We first construct an initial winning set $Z_\psi = \{(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U} \mid L(\bar{x}, \bar{u}) \models \psi\}$, where $L : \bar{X} \times \bar{U} \to 2^{AP}$ is some labeling function.
To synthesize symbolic controllers for reachability specifications, we utilize the monotone function $G(Z) := CPre_T(Z) \cup Z_\psi$ to iteratively compute $Z_\infty = \mu Z.G(Z)$ starting with $Z_0 = \emptyset$. Here, we adopt notation from the $\mu$-calculus, with $\mu$ as the minimal fixed-point operator and $Z$ the operated variable. Interested readers can find more details in [6] and the references therein. The synthesized controller is a map $C : \bar{X}_w \to 2^{\bar{U}}$, where $\bar{X}_w \subseteq \bar{X}$ represents a winning (a.k.a. controllable) set of states. The map $C$ is defined by $C(\bar{x}) = \{\bar{u} \in \bar{U} \mid (\bar{x}, \bar{u}) \in \mu^{j(\bar{x})} Z.G(Z)\}$, where $j(\bar{x}) = \inf\{i \in \mathbb{N} \mid \bar{x} \in \pi_{\bar{X}}(\mu^i Z.G(Z))\}$, and $\mu^i Z.G(Z)$ represents the value of the $i$th iteration of the minimal fixed-point computation. Algorithm 3 shows a serial implementation of the minimal fixed-point computation $Z_\infty = \mu Z.G(Z)$. For the sake of space, we omit a similar discussion about synthesizing controllers for invariance specifications; interested readers can find more details in [11].
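As a compact, runnable companion to Algorithm 3, the following C++ sketch computes the fixed point $Z_\infty = \mu Z.G(Z)$ with sets stored as bitmaps and $T$ given as a post-state map; both layout choices are assumptions of the sketch, and the controller-extraction lines (7)-(10) of Algorithm 3 are omitted.

```cpp
// Minimal C++ sketch of the reachability fixed point Z_inf = mu Z. G(Z),
// with G(Z) = CPre_T(Z) ∪ Z_psi as in equation (3). Sets over X_bar x U_bar
// are stored as bitmaps; T is a post-state map. All layouts are illustrative.
#include <vector>

std::vector<char> reach_fixed_point(
    long nX, long nU,
    const std::vector<std::vector<std::vector<long>>>& T,  // T[x][u] = posts
    const std::vector<char>& Z_psi) {                      // initial winning set
  std::vector<char> Z(nX * nU, 0), Z0;
  do {
    Z0 = Z;                                                // freeze last iterate
    std::vector<char> proj(nX, 0);                         // pi_X(Z0)
    for (long x = 0; x < nX; ++x)
      for (long u = 0; u < nU; ++u)
        if (Z0[x * nU + u]) { proj[x] = 1; break; }
    for (long x = 0; x < nX; ++x)
      for (long u = 0; u < nU; ++u) {
        bool cpre = !T[x][u].empty();                      // ∅ ≠ T(x,u)
        for (long xp : T[x][u]) cpre = cpre && proj[xp];   // T(x,u) ⊆ pi_X(Z0)
        Z[x * nU + u] = cpre || Z_psi[x * nU + u];         // G(Z0)
      }
  } while (Z != Z0);
  return Z;                                                // Z_inf
}
```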
Algorithm 4 is proposed as a parallelization of Algorithm 3. We assume the same indexing map $I(\cdot)$ from Algorithm 2.
Algorithm 3: Traditional serial algorithm to synthesize $C$ enforcing the specification $\Diamond\psi$.
Input: Initial winning domain $Z_\psi \subset \bar{X} \times \bar{U}$ and $T$
Output: A controller $C : \bar{X}_w \to 2^{\bar{U}}$.
1  $Z_\infty \leftarrow \emptyset$;
2  $\bar{X}_w \leftarrow \emptyset$;
3  do
4    $Z_0 \leftarrow Z_\infty$;
5    $Z_\infty \leftarrow CPre_T(Z_0) \cup Z_\psi$;
6    $D \leftarrow Z_\infty \setminus Z_0$;
7    foreach $\bar{x} \in \pi_{\bar{X}}(D)$ with $\bar{x} \notin \bar{X}_w$ do
8      $\bar{X}_w \leftarrow \bar{X}_w \cup \{\bar{x}\}$;
9      $C(\bar{x}) := \{\bar{u} \in \bar{U} \mid (\bar{x}, \bar{u}) \in D\}$;
10   end
11 while $Z_\infty \neq Z_0$;
Algorithm 4: Proposed parallel algorithm to synthesize $C$ enforcing the specification $\Diamond\psi$.
Input: Initial winning domain $Z_\psi \subset \bar{X} \times \bar{U}$ and $T$
Output: A controller $C : \bar{X}_w \to 2^{\bar{U}}$.
1  $Z_\infty \leftarrow \emptyset$;
2  $\bar{X}_w \leftarrow \emptyset$;
3  do
4    $Z_0 \leftarrow Z_\infty$;
5    for all $p \in \{1, 2, \cdots, P\}$ do
6      $Z^p_{loc} \leftarrow \emptyset$;
7      $\bar{X}^p_{w,loc} \leftarrow \emptyset$;
8    end
9    for all $(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U}$ in parallel with index $i$ do
10     $p = I(i)$;
11     $Posts \leftarrow Q \circ K^p_{loc}(\bar{x}, \bar{u})$;
12     if $Posts \subseteq Z_0 \cup Z_\psi$ then
13       $Z^p_{loc} \leftarrow Z^p_{loc} \cup \{(\bar{x}, \bar{u})\}$;
14       $\bar{X}^p_{w,loc} \leftarrow \bar{X}^p_{w,loc} \cup \{\bar{x}\}$;
15       if $\bar{x} \notin \pi_{\bar{X}}(Z_0)$ then
16         $C(\bar{x}) \leftarrow C(\bar{x}) \cup \{\bar{u}\}$;
17       end
18     end
19   end
20   for all $p \in \{1, 2, \cdots, P\}$ do
21     $Z_\infty \leftarrow Z_\infty \cup Z^p_{loc}$;
22     $\bar{X}_w \leftarrow \bar{X}_w \cup \bar{X}^p_{w,loc}$;
23   end
24 while $Z_\infty \neq Z_0$;
Line (11) corresponds to computing $T(\bar{x}, \bar{u})$ from the stored characteristic values $(x_{lb}, x_{ub})$. Since all PEs use $Z_0$ when running lines (12) and (15), synchronization among all PEs is required to ensure that all PEs see the most recent version of $Z_0$. Such synchronization happens in every iteration of the FP by collecting all local versions $Z^p_{loc}$ in line (21), followed by the update in line (4), before starting another parallel synthesis iteration of the parallel for-loop in line (9). Similarly, a variant of the algorithm that is more suitable for CPUs and FPGAs is provided in pFaces/GBFP.
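As a rough C++ illustration of one parallel FP iteration (lines (4)-(23) of Algorithm 4), the following sketch again emulates PEs with threads, with thread joining acting as the synchronization point before merging the local sets; the data layout is an assumption, the controller extraction of lines (14)-(17) is omitted, and the inclusion test of line (12) is realized at the state level through $\pi_{\bar{X}}$, consistent with equation (3).

```cpp
// Minimal C++ sketch of one fixed-point iteration of Algorithm 4. Threads
// emulate PEs; joining them is the synchronization point before the local
// sets Z_loc^p are merged into Z_inf. Layouts and names are illustrative.
#include <thread>
#include <vector>

bool parallel_iteration(long nX, long nU, int P,
    const std::vector<std::vector<std::vector<long>>>& posts,  // Q ∘ K_loc
    const std::vector<char>& Z_psi,
    std::vector<char>& Z_inf) {
  std::vector<char> Z0 = Z_inf;                     // line (4)
  std::vector<char> proj(nX, 0);                    // pi_X(Z0 ∪ Z_psi)
  for (long i = 0; i < nX * nU; ++i)
    if (Z0[i] || Z_psi[i]) proj[i / nU] = 1;
  std::vector<std::vector<long>> Z_loc(P);          // lines (5)-(8)
  std::vector<std::thread> pes;
  for (int p = 0; p < P; ++p)
    pes.emplace_back([&, p] {                       // lines (9)-(19)
      for (long i = p; i < nX * nU; i += P) {       // p = I(i)
        long x = i / nU, u = i % nU;
        bool win = !posts[x][u].empty();
        for (long xp : posts[x][u]) win = win && proj[xp];  // line (12)
        if (win) Z_loc[p].push_back(i);             // line (13)
      }
    });
  for (auto& t : pes) t.join();                     // synchronization point
  bool changed = false;                             // lines (20)-(23)
  for (int p = 0; p < P; ++p)
    for (long i : Z_loc[p])
      if (!Z_inf[i]) { Z_inf[i] = 1; changed = true; }
  return changed;  // caller iterates while Z_inf differs from Z0
}
```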
3.3 A Memory-efficient Kernel
Modern CUs contain hundreds to thousands of PEs. This motivates the concept of more-compute/less-memory, where recomputing results between repeated iterations is favored over storing them. We apply this by eliminating the use of $K^p_{loc}$ in line (11) of Algorithm 4 and computing it on the fly, in the same way as lines (7) and (8) of Algorithm 2. We denote the modified kernel by pFaces/GBFPm, as sketched below.
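Concretely, the change amounts to replacing the stored-container read with a recomputation of $\Omega_f$; in terms of the earlier illustrative types from the Algorithm 2 sketch (again an assumption, not pFaces' actual code):

```cpp
// In GBFPm, the stored-container read of line (11) of Algorithm 4 is replaced
// by recomputing Omega_f on the fly (lines (7)-(8) of Algorithm 2), trading
// extra computation for near-zero abstraction memory. Types HyperRect and
// Entry are the illustrative ones from the Algorithm 2 sketch above.
HyperRect posts_characteristics(long x, long u, bool memory_efficient,
    const std::vector<Entry>& K_loc_p,              // PE-local container
    const std::function<HyperRect(long, long)>& omega_f) {
  if (memory_efficient)                             // pFaces/GBFPm
    return omega_f(x, u);                           // recompute on the fly
  for (const Entry& e : K_loc_p)                    // pFaces/GBFP: line (11)
    if (e.x == x && e.u == u) return e.reach;       // read stored (x_lb, x_ub)
  return {};                                        // not found (never expected)
}
```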
3.4 Implementation Details
Figure 3 shows the work-flow of pFaces/GBFP. Apart from the boxes highlighted with CU, all steps are executed serially in the management ecosystem. The kernel developer implements such steps using subroutines from pFaces. For example, the step Distribute Jobs Based on Collected Data is a simple call to a subroutine in pFaces that computes the best task distribution among the available PEs. Parallel tasks are handled completely by pFaces.
Figure 3: Work-flow of pFaces/GBFP.
After constructing the abstraction, PEs are synchronized to make sure all PEs start the synthesis task with a correct abstraction memory. Also, after each FP iteration, synchronization is required, as discussed in Subsection 3.2. This requirement is mainly a consequence of the check in line (12) of Algorithm 4 and the fact that $Z_0$ is maintained as a distributed data container when the algorithm is executed on multiple CUs. This introduces overhead that reduces the scalability of the synthesis. Fortunately, such overhead is mitigated by the fact that dynamical systems possess some locality. More specifically, when a PE runs the check in line (12) of Algorithm 4, the elements of $Posts$ are likely close (i.e., in the Euclidean distance) to $(\bar{x}, \bar{u})$. Consequently, the check is computed using memory from the same CU or neighboring CUs. pFaces, in turn, distributes the tasks with an encoding $I(\cdot)$ that promotes such locality.
Once the FP settles, pFaces/GBFP collects the controller data and encodes it. pFaces facilitates encoding and saving data objects as raw data, Binary Decision Diagrams (BDDs), bitmaps, or compressed bitmaps. pFaces also offers code generation by integrating the library BDD2Implement [3], helping users export the map $C$ as C/C++ code or VHDL code.
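For intuition, an exported controller is essentially a lookup from winning states to sets of admissible inputs; a hypothetical shape of generated C/C++ code (not BDD2Implement's actual output format) might look like:

```cpp
// Hypothetical shape of a generated C/C++ controller: a lookup from a winning
// state index to the set of admissible input indices C(x). The actual code
// emitted by BDD2Implement may be organized differently; illustration only.
#include <cstdint>
#include <map>
#include <vector>

// Winning states mapped to their sets of admissible input indices.
static const std::map<uint64_t, std::vector<uint32_t>> C_map = {
    {0, {1, 2}}, {1, {2}},  // placeholder entries
};

// Query the controller: returns admissible inputs, empty if x is not winning.
std::vector<uint32_t> controller(uint64_t x) {
  auto it = C_map.find(x);
  return it == C_map.end() ? std::vector<uint32_t>{} : it->second;
}
```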
4 AN EXAMPLE
We show, with an example, how pFaces can be used to mitigate the computational complexity resulting from the state-explosion problem. Consider the truck-with-a-trailer example presented in [10]. It is a three-dimensional system, and the requirement is to reach some target speed while maintaining a safe distance between the truck and the trailer. We generalize the example to Truck_N, where N is the number of trailers, knowing that adding an extra trailer increases the dimension by two. We require the controller to be computed in real time (RT) within a deadline window of 1.0 second. For different N, we focus on the value of $|\bar{X} \times \bar{U}|$ as it directly affects the complexity of Algorithms 2 and 4.
Table 1: Details and results for the example Truck_N.

                           N=1    N=2    N=3     N=4
n                          3      5      7       9
|X̄ × Ū| × 10^6             2.64   36.17  398.29  5520.4
Memory per (x̄,ū) (Byte)    25     41     1       1
Total memory (M.B.)        63     1414   379     5264
pFaces kernel              §      §      §§      §§
HW configuration           MIX1   GPU2   MIX2    MIX2
Time to find C (sec.)      0.89   0.98   0.96    46.6
For the original problem (i.e., Truck_1), the existing tool SCOTS solves the problem in 39 seconds using HW configuration CPU2 in Table 2, which violates the real-time constraint. With the same HW configuration, pFaces/GBFP (denoted by § in Table 1) solves the problem in 0.89 seconds, a speedup of around 44x. Speedups are calculated by dividing the time to run the serial implementation, which is the time reported by SCOTS in the current case, by the time to run the same example with the parallel implementation, for a specific HW configuration.
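For the Truck_1 case, for instance, this works out to
$$\text{speedup} = \frac{T_{\text{serial}}}{T_{\text{parallel}}} = \frac{39\,\text{s}}{0.89\,\text{s}} \approx 44\times.$$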
We upgrade the system and synthesize a controller for Truck_2. In order to respect the RT deadline, we upgrade the HW configuration to GPU2. The problem is then solved in 0.98 seconds. Notice the increase in memory per $(\bar{x}, \bar{u})$ when using pFaces/GBFP (see the discussion about $K$ in Subsection 3.1). To save memory, we use pFaces/GBFPm (denoted by §§ in Table 1) for N > 2.
For Truck_3, the HW configuration that would allow respecting the RT deadline is too expensive to be installed in the truck. We rent CUs from Amazon AWS, build the HW configuration MIX2 combining multiple GPUs, and share it among all trucks of type Truck_3. Now, the problem is solved in 0.96 seconds. Here, all trucks are assumed to have some access to the cloud for submitting requests and receiving a list of control actions, which is assumed to take less than 4 milliseconds of communication.
Now, we experiment on Truck_4 with the same HW configuration. Unfortunately, due to the state-explosion problem, we can no longer solve the problem in RT; it is solved in 46.6 seconds. We emphasize that, to keep the complexity under control, more PEs need to be added in this case.

Table 1 reports the collected results for the four cases. When using pFaces/GBFPm, the 1-byte requirement for the memory per $(\bar{x}, \bar{u})$ applies to controller synthesis, not to the abstraction.
5 BENCHMARKING pFaces/GBFP
For benchmarking, we use the HW configurations listed in Table 2. We conduct a scalability benchmark and report its results in Tables 3 and 4. Here, the dynamics and parameters of the examples dcdc and vehicle are borrowed from [11], while those of the examples robot and khepera are borrowed from [4].
The reference results for speedup computation are the SCOTS times marked with an asterisk (*) in Tables 3 and 4; the highest speedup of each example is achieved on MIX2. N/A denotes "not applicable" and indicates that the tool SCOTS does not run on those HW configurations. The example Truck_1 reported in Table 4 has a smaller η than the one in Table 1. For CLS1, we report the time to solve the problem together with the percentage of it consumed by the network communication overhead. We recommend using clusters only for large problems where the FP computation time is expected to be much longer than the communication overhead.
Table 2: Used HW configurations for the proposed benchmarks.

Code   Name                                             Class    PEs    PE Frequency       Memory (G.B.)  Power (Watt)  Price ($)
CPU1   Intel Core i5-6200U in Lenovo X260 Laptop 2016   CPU      2      2.8 GHz            8              15            281
CPU2   Intel Xeon E5-2630                               CPU      10     3.1 GHz            8              85            667
GPU1   NVIDIA Quadro P5000                              GPU      2560   600 MHz            16             180           1,800
GPU2   NVIDIA Tesla V100                                GPU      5120   800 MHz            16             250           10,664
GPU3   AMD Radeon Pro Vega 20 in Macbook Pro 2018       GPU      1280   1200 MHz           4              ≤50           350
PGA1   Altera DE5-Net Board                             FPGA     2      50 MHz             8              ≤4            6,250
PGA2   Kintex UltraScale FPGA KCU1500                   FPGA     2      300 MHz            16             ≤10           2,500
MIX1   CPU1 and its internal GPU                        Mixed    24     2.8 GHz / 300 MHz  8              25            281
MIX2   8×GPU2 with NVLink interconnection               Mixed    40960  800 MHz            128            2000          85,312
CLS1   Two networked CNs: 32-core CPU and GPU2          Cluster  5152   3.8 GHz / 800 MHz  488            450           37,000
Table 3: Scalability benchmarking for the examples DCDC and Vehicle. Times are in seconds; * marks the reference result for speedup computation.

DCDC: |Ū|=2, |X̄|=639200
             CPU1   CPU2    GPU1    GPU2    PGA1   PGA2   MIX1   MIX2     CLS1
SCOTS        44.3   36.9*   N/A     N/A     N/A    N/A    N/A    N/A      N/A
pFaces/GBFP  1.6    0.41    0.037   0.009   0.189  0.073  0.41   0.003    0.8/98%
Speedup      23x    108x    997x    4100x   195x   505x   90x    12300x   44x

Vehicle: |Ū|=49, |X̄|=91035
             CPU1   CPU2    GPU1    GPU2    GPU3   PGA1   MIX1   MIX2     CLS1
SCOTS        207    203*    N/A     N/A     N/A    N/A    N/A    N/A      N/A
pFaces/GBFP  52.1   10.9    0.98    0.152   1.72   16.8   13.6   0.04     12.0/99%
Speedup      4x     18x     207x    1350x   118x   12x    15x    5075x    16x
Table 4: Scalability benchmarking for the examples Truck_1 and Robot. Times are in seconds; * marks the reference result for speedup computation.

Truck_1: |Ū|=51, |X̄|=229327
             CPU1   CPU2    GPU1    GPU2    PGA1   PGA2   MIX1   MIX2     CLS1
SCOTS        249    191*    N/A     N/A     N/A    N/A    N/A    N/A      N/A
pFaces/GBFP  7.2    1.7     0.462   0.153   1.05   0.84   2.0    0.006    1.4/97%
Speedup      26x    112x    413x    1249x   182x   227x   95x    31850x   136x

Robot: |Ū|=77, |X̄|=1364889
             CPU1   CPU2    GPU1    GPU2    GPU3   PGA1   MIX1   MIX2     CLS1
SCOTS        4423   3949*   N/A     N/A     N/A    N/A    N/A    N/A      N/A
pFaces/GBFP  154    97      8.3     1.85    13.2   147    136    0.309    96.2/96%
Speedup      25x    40x     475x    2134x   299x   26x    29x    12779x   41x
We only compare with the tool SCOTS since it implements, exactly, the serial Algorithms 1 and 3. Nevertheless, pFaces/GBFP also outperforms the tools reported in Section 1.1. For example, in [5], a different technique is used and the example DCDC was reported to take 0.36 seconds, which pFaces/GBFP clearly outperforms as the number of PEs increases. The other tools reported in Section 1.1 are themselves outperformed by SCOTS or by the tool in [5]; therefore, we do not compare pFaces with them.
6 CONCLUSIONS AND FUTURE WORK
A software ecosystem is proposed to facilitate prototyping parallel algorithms serving research areas like symbolic control and reachability analysis. A traditional symbolic control technique is redesigned as a data-parallel task that scales with the number of PEs, helping to tackle its computational complexity. The kernel pFaces/GBFP scales very well but consumes a lot of memory; pFaces/GBFPm is memory-efficient but slower due to repeated computations. Future work will focus on designing distributed data structures that balance memory size against fast write/query times.
ACKNOWLEDGMENTS
We gratefully acknowledge the support of the Intel, Xilinx, NVIDIA, and Amazon corporations. The configuration GPU1 was donated by NVIDIA Corporation. The configuration PGA1 was donated by Intel Corporation. Paid access to Amazon AWS for testing the tool on an EC2-F1 instance was provided by Xilinx Corporation. The tests on the HW configurations GPU2, MIX2, and CLS1 in Amazon AWS were provided through a grant from Amazon.
REFERENCES
[1] K. Hsu, R. Majumdar, K. Mallik, and A. K. Schmuck. 2018. Multi-Layered Abstraction-Based Controller Synthesis for Continuous-Time Systems. In Proceedings of the 21st International Conference on Hybrid Systems: Computation and Control (Part of CPS Week) (HSCC '18). ACM, New York, NY, USA, 120–129. https://doi.org/10.1145/3178126.3178143
[2] L. Kalms and D. Göhringer. 2017. Exploration of OpenCL for FPGAs using SDAccel and comparison to GPUs and multicore CPUs. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE, USA, 1–4. https://doi.org/10.23919/FPL.2017.8056847
[3] M. Khaled. 2017. BDD2Implement: A Code Generation Tool for Symbolic Controllers. https://gitlab.lrz.de/hcs/BDD2Implement
[4] M. Khaled, M. Rungger, and M. Zamani. June 2018. SENSE: Abstraction-Based Synthesis of Networked Control Systems. In Electronic Proceedings in Theoretical Computer Science (EPTCS), 272. Open Publishing Association (OPA), 111 Cooper Street, Waterloo, Australia, 65–78. https://doi.org/10.4204/EPTCS.272.6
[5] Y. Li and J. Liu. 2018. ROCS: A Robustly Complete Control Synthesis Tool for Nonlinear Dynamical Systems. In Proceedings of the 21st International Conference on Hybrid Systems: Computation and Control (Part of CPS Week) (HSCC '18). ACM, New York, NY, USA, 130–135. https://doi.org/10.1145/3178126.3178153
[6] O. Maler, A. Pnueli, and J. Sifakis. 1995. On the synthesis of discrete controllers for timed systems. In 12th Annual Symposium on Theoretical Aspects of Computer Science (STACS 95), E. W. Mayr and C. Puech (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 229–242. https://doi.org/10.1007/3-540-59042-0_76
[7] M. Mazo, A. Davitian, and P. Tabuada. 2010. PESSOA: A Tool for Embedded Controller Synthesis. In Computer Aided Verification, Tayssir Touili, Byron Cook, and Paul Jackson (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 566–569. https://doi.org/10.1007/978-3-642-14295-6_49
[8] S. Mouelhi, A. Girard, and G. Gössler. 2013. CoSyMA: A Tool for Controller Synthesis Using Multi-scale Abstractions. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control (HSCC '13). ACM, New York, NY, USA, 83–88. https://doi.org/10.1145/2461328.2461343
[9] G. Reissig, A. Weber, and M. Rungger. April 2017. Feedback Refinement Relations for the Synthesis of Symbolic Controllers. IEEE Trans. Automat. Control 62, 4 (April 2017), 1781–1796. https://doi.org/10.1109/TAC.2016.2593947
[10] M. Rungger, M. Mazo, Jr., and P. Tabuada. 2013. Specification-guided Controller Synthesis for Linear Systems and Safe Linear-time Temporal Logic. In 16th International Conference on Hybrid Systems: Computation and Control (HSCC '13). ACM, New York, NY, USA, 333–342. https://doi.org/10.1145/2461328.2461378
[11] M. Rungger and M. Zamani. 2016. SCOTS: A Tool for the Synthesis of Symbolic Controllers. In Proceedings of the 19th International Conference on Hybrid Systems: Computation and Control (HSCC '16). ACM, New York, NY, USA, 99–104. https://doi.org/10.1145/2883817.2883834
[12] P. Tabuada. 2009. Verification and Control of Hybrid Systems: A Symbolic Approach. Springer, USA. https://doi.org/10.1007/978-1-4419-0224-5
[13] T. Wongpiromsarn, U. Topcu, N. Ozay, H. Xu, and R. M. Murray. 2011. TuLiP: A Software Toolbox for Receding Horizon Temporal Logic Planning. In 14th International Conference on Hybrid Systems: Computation and Control (HSCC '11). ACM, New York, NY, USA, 313–314. https://doi.org/10.1145/1967701.1967747
[14] M. Zamani, G. Pola, M. Mazo Jr., and P. Tabuada. 2012. Symbolic Models for Nonlinear Control Systems Without Stability Assumptions. IEEE Trans. Automat. Control 57, 7 (July 2012), 1804–1809. https://doi.org/10.1109/TAC.2011.2176409