
pFaces*: An Acceleration Ecosystem for Symbolic Control

Mahmoud Khaled
Hybrid Control Systems Group
Technical University of Munich, Munich, Germany
khaled.mahmoud@tum.de

Majid Zamani
Department of Computer Science, University of Colorado Boulder, USA
Department of Computer Science, Ludwig Maximilian University of Munich, Germany
majid.zamani@colorado.edu

ABSTRACT

The correctness of control software in many safety-critical applications, such as autonomous vehicles, is crucial. One technique to achieve correct control software is called "symbolic control", where complex systems are approximated by finite-state abstractions. Then, using those abstractions, provably-correct digital controllers are algorithmically synthesized for the concrete systems, satisfying complex high-level requirements. Unfortunately, the complexity of synthesizing such controllers grows exponentially in the number of state variables. However, if distributed implementations are considered, high-performance computing platforms can be leveraged to mitigate the effects of the state-explosion problem.

We propose pFaces, an extensible software ecosystem, to accelerate symbolic control techniques. It facilitates designing parallel algorithms and supervises their execution to utilize available computing resources. To demonstrate its capabilities, novel parallel algorithms are designed for abstraction-based controller synthesis. They are then implemented inside pFaces and dispatched, for parallel execution, on different heterogeneous computing platforms, including CPUs, GPUs and hardware accelerators (HWAs). Results show a remarkable reduction in computation time, by several orders of magnitude as the number of processing elements (PEs) increases, which easily outperforms all existing tools.

CCS CONCEPTS

• Computing methodologies → Parallel algorithms; Graphics processors; • Computer systems organization → Embedded and cyber-physical systems; • Software and its engineering → Formal methods; Parallel programming languages; • Hardware → Hardware accelerators;

KEYWORDS

Symbolic Control; Discrete Abstractions; Reactive Synthesis; High Performance Computing; Parallel Algorithms; C++; OpenCL; Message Passing Interface (MPI); CPU; GPU; FPGA; HW Accelerators.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
HSCC '19, April 16–18, 2019, Montreal, QC, Canada
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6282-5/19/04...$15.00
https://doi.org/10.1145/3302504.3311798

ACM Reference Format:
Mahmoud Khaled and Majid Zamani. 2019. pFaces: An Acceleration Ecosystem for Symbolic Control. In 22nd ACM International Conference on Hybrid Systems: Computation and Control (HSCC '19), April 16–18, 2019, Montreal, QC, Canada. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3302504.3311798

1 INTRODUCTION

Recently, the world has witnessed many emerging safety-critical applications such as smart buildings, autonomous vehicles and smart grids. These applications are examples of so-called cyber-physical systems (CPS). In CPS, embedded control software plays a significant role: it monitors and controls several physical variables, such as pressure or velocity, through multiple sensors and actuators, and communicates with other systems or with supporting computing servers. A novel approach to design provably-correct embedded control software in an automated fashion is via formal method techniques in control [12, 14], and in particular symbolic control.

Symbolic control provides algorithmically provably-correct controllers based on the dynamics of physical systems and some given high-level requirements. In symbolic control, physical systems are approximated by finite abstractions and then discrete controllers are automatically synthesized for those abstractions using automata-theoretic techniques [6]. Finally, those controllers are refined to hybrid ones applicable to the original physical systems. Unlike traditional design-then-test flows, merging design phases with formal verification ensures that controllers are certified-by-construction.

Current implementations of symbolic control are designed to run serially on one CPU [1, 4, 5, 7, 8, 11, 13]. This way of implementation interacts poorly with the symbolic approach, whose complexity grows exponentially in the number of state variables in the model. This, consequently, limits the current implementations to small dynamical systems where controllers are to be computed off-line. In this work, we investigate novel parallel algorithms and data structures, and utilize high-performance computing (HPC) platforms to mitigate the effects of the state-explosion problem.

Traditionally, compute platforms such as supercomputers were expensive and inaccessible to many scientific communities. Motivated by the market (e.g., gaming and cryptocurrency mining), compute units (CUs) like GPUs showed remarkable improvements in speed. This introduced general-purpose GPU computing (GPGPU) to utilize such CUs for scientific data-parallel tasks. One good example is how GPUs played a major role in crunching data collected by the LIGO observatories in 2015, making the detection of gravitational

* A patent application for the tool is filed with the European Patent Office (EPO). This work was supported in part by the H2020 ERC Starting Grant AutoCPS.


Figure 1: A computing model considered in pFaces. Colored boxes represent PEs of different computation power. Compute nodes (CNs), each combining CUs such as CPUs, GPUs and HWAs, are connected by an interconnection network.

waves possible. Lately, cloud-computing providers, like Amazon and Microsoft, have made it possible for customers to build small clusters combining CPUs, GPUs and HW accelerators (HWAs).

This inspired the authors to use HPC platforms for mitigating the effects of the state-explosion problem in symbolic control. However, current and future techniques need to be (re-)designed for parallel and distributed execution. We propose pFaces as a software ecosystem that facilitates utilizing HPC platforms. The main contributions of this work are:

(1) an extensible software ecosystem to support utilizing HPC platforms, mainly for symbolic control and similar fields (e.g., reachability analysis);
(2) a novel parallelization of general abstraction-based controller synthesis that outperforms all existing tools.

1.1 Similar Works

The concept behind pFaces is different from currently available tools [1, 4, 5, 7, 8, 11, 13]. They present implementations of separate techniques, while pFaces is intended to host the implementation of any technique. To the best of our knowledge, all existing tools are designed to run serially on one CPU. pFaces is the first-of-its-kind tool to support multiple CUs, including CPUs, GPUs, HWAs, and clusters combining heterogeneous configurations of them.

Surprisingly, most modern CPUs come equipped with internal GPUs that never get utilized by serial programs. For example, the Intel Core i5 6200U processor (HW configuration CPU1 in Table 2), which contains two CPU cores, has an internal GPU that can outperform the CPU cores if a data-parallel task is well implemented. Running any of the existing tools on this processor might not utilize the second CPU core and will never utilize the internal GPU, which is a waste of resources.

On the other hand, we parallelize a symbolic control technique similar to that used in [1, 7, 8, 11], implement it, and compare it with other existing tools. As reported in Section 5, pFaces accelerates the technique and outperforms them by several orders of magnitude.

2 pFaces: A GENERIC ACCELERATOR

We first discuss what classes of heterogeneous computing platforms are considered in pFaces. Then, we present the general internal structure of pFaces. The work-flow inside pFaces is best understood with a parallel program in hand. Hence, we present it with one parallel implementation starting from Section 3.

Figure 2: Internal structure of pFaces. The management ecosystem is developed in C++/MPI, while computation kernels are developed in OpenCL. Users interact via configuration files and receive debug and log files.

2.1 Supported HPC Platforms

Figure 1 shows a general heterogeneous computing model where compute nodes (CNs) combine different CUs (e.g., CPUs, GPUs, and HWAs). All CUs are connected to one, possibly hierarchical, interconnection network. Each CU contributes to the system with a set of PEs. PEs represent the HW circuits doing mathematical and logical operations. PEs vary in computation power. For example, CPUs have a small number of PEs (a.k.a. cores or threads in CPU terminology) that are able to do fast mathematical computations. A GPU has less powerful PEs (a.k.a. pixel/vertex shader units in GPU terminology), but they come in large numbers. PEs of re-configurable HWAs (e.g., FPGAs) are customized HW circuits (e.g., logic circuits implementing application-specific math/logic functionalities) for maximum possible performance.

pFaces aims at providing scalable, distributed execution of parallel algorithms that utilizes all available PEs in such heterogeneous systems. To the best of our knowledge, pFaces is the only tool that can deal with all of these types of CUs simultaneously.

2.2 Internal Structure of pFaces

pFaces

introduces a exible interface for utilizing available compu-

tation resources to solve problems arising in the eld of symbolic

control, or similar elds. Therefore, the core of

pFaces

is developed

independent of the targeted problem for acceleration. This is clearly

depicted in Figure 2.

The management ecosystem, depicted in yellow color, is inde-

pendent of the Computation Kernel, depicted in purple color and

denoted by kernel for simplicity, which represents the job to be

accelerated. The management modules are developed in

C++

mixed

with Message Passing Interfaces (

MPI

). We choose

C++

to balance

run eciency and portability.

MPI

enables running instances of

pFaces over dierent CNs that communicate over a network.

Within each CN, a

pFaces

instance identies available CUs using

the resource identication and management engine. It keeps track of

the underlaying hardware architecture and runs parts of the kernel

using the Kernel Tuner Module to assess the compute power of each

of the identied CUs. A Management Engine Module orchestrates the

pFaces: An Acceleration Ecosystem for Symbolic Control HSCC ’19, April 16–18, 2019, Montreal, QC, Canada

work among dierent modules and commands the Task Scheduler

Module which runs the kernel as ecient as possible, using the data

collected after resource identication. A Conguration Interface

Module helps users interact with the kernel via text conguration

les that follow some rules dened by the kernel developer. With

the Logging and Debugging Engine Module,

pFaces

informs the user

about the current state of execution and delivers hints, suggestions

and debugging information about the executing kernel.

Kernels encapsulate the parallel algorithm under consideration.

They should be developed in

OpenCL

with some additional exten-

sions dened by

pFaces

.

OpenCL

is a standard and programming

language for heterogeneous parallel computing. We select

OpenCL

as it is becoming a widely accepted standard for CPUs, GP Us, many

embedded devices, and most recently, for HWAs (e.g., FPGAs [2]).

3 A KERNEL FOR SYMBOLIC CONTROL

Some theoretical background is presented in this section, followed by a novel parallelized version of one of the common techniques in symbolic control.

Here, finite abstractions are constructed based on the theory in [9], which utilizes a growth-bound (GB) formula to over-approximate the reachable sets. Algorithmic controller synthesis is done based on the technique presented in [11], which uses fixed-point (FP) computations on the constructed finite-state models.

It is not possible to use those techniques directly in pFaces since they are manifested as serial algorithms. Novel parallel algorithms are then proposed for constructing finite abstractions and synthesizing symbolic controllers for them. In Section 5, it is shown that the algorithms scale remarkably as the number of PEs increases. We refer to this kernel as pFaces/GBFP.

3.1 Parallel Construction of Finite Abstractions

We consider general nonlinear systems given in the form of a differential equation:

$$\Sigma: \dot{\xi}(t) = f(\xi(t), u), \qquad (1)$$

where $\xi(t) \in X \subseteq \mathbb{R}^n$ is a state vector and $u \in U \subseteq \mathbb{R}^m$ is an input vector. We denote by $\xi_{x,u}(\cdot)$ the trajectory satisfying (1) at almost every $t \in [0, \tau]$, where $\tau \in \mathbb{R}^+$ is a sampling period, started from initial condition $\xi_{x,u}(0) = x$ and under some input $u$. Set $\bar{X}$ is a finite partition of $X$ constructed by a set of hyper-rectangles of identical widths $\eta \in \mathbb{R}^n_+$. Set $\bar{U}$ is a finite subset of $U$. A finite abstraction of (1) is a finite-state system $\bar{\Sigma} = (\bar{X}, \bar{U}, T)$, where $T \subseteq \bar{X} \times \bar{U} \times \bar{X}$ is a transition relation that is crafted so that there exists a feedback-refinement relation (FRR) $R \subseteq X \times \bar{X}$ from $\Sigma$ to $\bar{\Sigma}$. Interested readers can find more details about FRRs in [9].

For the vector field of (1), a function $\Omega_f : \bar{X} \times \bar{U} \to X^2$ characterizes the over-approximations of the reachable sets starting from a set $\bar{x} \in \bar{X}$ when the input $\bar{u}$ is applied. For example, if the growth-bound map $\beta : \mathbb{R}^n \times U \to \mathbb{R}^n$ introduced in [9] is used, $\Omega_f$ can be defined as follows: $\Omega_f(\bar{x}, \bar{u}) = (x_{lb}, x_{ub}) := (-r + \xi_{\bar{x}_c,\bar{u}}(\tau),\ r + \xi_{\bar{x}_c,\bar{u}}(\tau))$, where $r = \beta(\eta/2, u)$, and $\bar{x}_c \in \bar{x}$ denotes the centroid of $\bar{x}$. An over-approximation of the reachable sets can then be obtained by the map $O_f : \bar{X} \times \bar{U} \to 2^{\bar{X}}$ defined by $O_f(\bar{x}, \bar{u}) = Q \circ \Omega_f(\bar{x}, \bar{u})$, where $Q$ is a quantization map defined by:

$$Q(x_{lb}, x_{ub}) = \{\bar{x}' \in \bar{X} \mid \bar{x}' \cap [[x_{lb}, x_{ub}]] \neq \emptyset\}, \qquad (2)$$

where $[[x_{lb}, x_{ub}]] = [x_{lb,1}, x_{ub,1}] \times [x_{lb,2}, x_{ub,2}] \times \cdots \times [x_{lb,n}, x_{ub,n}]$.

Algorithm 1: Traditional serial algorithm for constructing discrete abstractions.
Input: $\bar{X}$, $\bar{U}$, $O_f$
Output: A transition relation $T \subseteq \bar{X} \times \bar{U} \times \bar{X}$.
1: $T \leftarrow \emptyset$
2: for all $\bar{x} \in \bar{X}$ do
3:   for all $\bar{u} \in \bar{U}$ do
4:     for all $\bar{x}' \in O_f(\bar{x}, \bar{u})$ do
5:       $T \leftarrow T \cup \{(\bar{x}, \bar{u}, \bar{x}')\}$
6:     end
7:   end
8: end
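The quantization map $Q$ of eq. (2) and the serial construction of Algorithm 1 can be sketched as follows. This is an illustrative Python sketch, not the pFaces implementation: the uniform-grid representation, the cell indexing and all function names are assumptions, and $\Omega_f$ is passed in as an opaque function.

```python
import itertools
import math

def quantize(x_lb, x_ub, lb, eta, dims):
    """Q of eq. (2): all grid cells (index tuples) overlapping [[x_lb, x_ub]].
    The grid partitions X starting at lower-left corner `lb` with widths `eta`;
    `dims` gives the number of cells per dimension."""
    ranges = []
    for k in range(len(dims)):
        lo = max(0, math.floor((x_lb[k] - lb[k]) / eta[k]))
        hi = min(dims[k] - 1, math.floor((x_ub[k] - lb[k]) / eta[k]))
        if hi < lo:
            return set()
        ranges.append(range(lo, hi + 1))
    return set(itertools.product(*ranges))

def abstract_serial(cells, inputs, omega_f, lb, eta, dims):
    """Algorithm 1: build T by iterating over all (x_bar, u_bar) pairs serially.
    omega_f(x_bar, u_bar) returns the hyper-interval corners (x_lb, x_ub)."""
    T = set()
    for xb in cells:
        for ub in inputs:
            x_lb, x_ub = omega_f(xb, ub)
            for xp in quantize(x_lb, x_ub, lb, eta, dims):   # O_f = Q ∘ Ω_f
                T.add((xb, ub, xp))
    return T
```

For a 1-D grid over $[0,1]$ with $\eta = 0.25$, `quantize([0.3], [0.6], [0.0], [0.25], [4])` returns the two cells $[0.25, 0.5]$ and $[0.5, 0.75]$, i.e. `{(1,), (2,)}`, mirroring the non-empty-intersection condition of eq. (2).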

Algorithm 1 depicts the traditional algorithm for constructing finite abstractions of dynamical systems. The algorithm constructs, serially, $T \subseteq \bar{X} \times \bar{U} \times \bar{X}$ by iterating over all elements of $\bar{X} \times \bar{U}$. For any $(\bar{x}, \bar{u})$, the evaluation of $O_f$ and $T|_{(\bar{x},\bar{u})}$ is independent of any other element of $\bar{X} \times \bar{U}$. Algorithm 2 is then proposed as a parallelization of Algorithm 1. Each PE, annotated with an index $p \in \{1, 2, \cdots, P\}$, where $P$ is the number of available PEs, handles one $(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U}$. Function $I : \mathbb{N}^+ \setminus \{\infty\} \to \{1, 2, \ldots, P\}$ maps a parallel job (i.e., an iteration of the parallel for-all statement) with index $i$ to a PE with index $p = I(i)$. The algorithm casts the abstraction task as an ideal data-parallel task with no communication overhead among the processing elements. It is more suitable for CUs with a massive number of PEs (e.g., GPUs and supercomputers). However, having $P > |\bar{X} \times \bar{U}|$ is a waste of computation power. A variant is also provided in pFaces/GBFP that aggregates the computation of all $(\bar{x}, \bar{u})$ having the same $\bar{x}$ in one PE, which is more suitable for CUs with a small number of fast PEs (e.g., CPUs and FPGAs).

Instead of storing symbolic transitions in $T$, $\Omega_f$ is used to construct a distributed container $K := K^1_{loc} \cup K^2_{loc} \cup \cdots \cup K^P_{loc}$, where the subscript $loc$ indicates that $K^p_{loc} \subseteq \bar{X} \times \bar{U} \times X^2$ is stored in a local memory of the PE with index $p$. Lines 10-12 in Algorithm 2 are optional and can be omitted if there is no interest in obtaining a combined abstraction. We show in Subsection 3.2 that only the distributed containers $K^p_{loc}$ are required to synthesize symbolic controllers for $\bar{\Sigma}$.

Algorithm 2: Proposed parallel algorithm for constructing discrete abstractions.
Input: $\bar{X}$, $\bar{U}$, $\Omega_f$
Output: A characteristic set $K \subseteq \bar{X} \times \bar{U} \times X^2$.
1: $K \leftarrow \emptyset$
2: for all $p \in \{1, 2, \cdots, P\}$ do
3:   $K^p_{loc} \leftarrow \emptyset$
4: end
5: for all $(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U}$ in parallel with index $i$ do
6:   $p = I(i)$
7:   $(x_{lb}, x_{ub}) \leftarrow \Omega_f(\bar{x}, \bar{u})$
8:   $K^p_{loc} \leftarrow K^p_{loc} \cup \{(\bar{x}, \bar{u}, (x_{lb}, x_{ub}))\}$
9: end
10: for all $p \in \{1, 2, \cdots, P\}$ do
11:   $K \leftarrow K \cup K^p_{loc}$
12: end
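The job-to-PE distribution of Algorithm 2 can be sketched as follows. This sketch simulates the distribution sequentially; the round-robin indexing map `index_map` (playing the role of $I(\cdot)$) and the per-PE container layout are illustrative assumptions, not how pFaces actually schedules jobs.

```python
def index_map(i, P):
    """An illustrative job-to-PE map I: round-robin over the P PEs."""
    return i % P

def abstract_parallel(pairs, omega_f, P):
    """Algorithm 2: job i computes Omega_f for one (x_bar, u_bar) pair and
    stores the hyper-interval in the local container of PE p = I(i).
    Lines 10-12 (the optional reduction into one set K) are included."""
    K_loc = [set() for _ in range(P)]          # one local container per PE
    for i, (xb, ub) in enumerate(pairs):       # executed 'in parallel' in pFaces
        p = index_map(i, P)
        x_lb, x_ub = omega_f(xb, ub)
        K_loc[p].add((xb, ub, (x_lb, x_ub)))
    K = set().union(*K_loc)                    # optional combined abstraction
    return K_loc, K
```

With $P = 2$ and four $(\bar{x}, \bar{u})$ pairs, jobs 0 and 2 land on PE 0 and jobs 1 and 3 on PE 1, while the combined set $K$ holds exactly one record per pair, independently of $P$.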

Note that using $K$ rather than $T$ is more efficient since $|T|$ is sensitive to $|O_f(\bar{x}, \bar{u})|$, while $|K| = 2n|\bar{X} \times \bar{U}|$ is constant and, practically, consumes less memory. This becomes more important when such operations are executed in PEs of GPUs or FPGAs, which are known for having limited memory. In Subsection 3.3, we show that $K^p_{loc}$ can also be omitted and the abstraction done on-the-fly.
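This memory trade-off can be made concrete with a back-of-the-envelope sketch. The byte sizes below are illustrative assumptions (one fixed-width real per dimension bound in $K$, one state identifier per post-state in $T$), not the encodings used by pFaces.

```python
def mem_T(num_pairs, avg_posts, bytes_per_id):
    """Memory for the transition relation T: one state id per post-state,
    so it grows with the average size of O_f(x_bar, u_bar)."""
    return num_pairs * avg_posts * bytes_per_id

def mem_K(num_pairs, n, bytes_per_real):
    """Memory for the characteristic set K: a fixed 2n reals per pair
    (the two corners x_lb, x_ub of the hyper-interval), independent of
    how many abstract states the reachable set overlaps."""
    return num_pairs * 2 * n * bytes_per_real
```

For example, with $n = 3$, 4-byte reals and $10^6$ pairs, $K$ needs 24 MB regardless of the dynamics, while $T$ with an average of 50 post-states and 4-byte identifiers would need 200 MB.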

3.2 Parallel Synthesis of Symbolic Controllers

Given $\bar{\Sigma} = (\bar{X}, \bar{U}, T)$, we define the controllable predecessor map $CPre_T : 2^{\bar{X} \times \bar{U}} \to 2^{\bar{X} \times \bar{U}}$ for $Z \subseteq \bar{X} \times \bar{U}$ by:

$$CPre_T(Z) = \{(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U} \mid \emptyset \neq T(\bar{x}, \bar{u}) \subseteq \pi_{\bar{X}}(Z)\}, \qquad (3)$$

where $\pi_{\bar{X}}(Z) = \{\bar{x} \in \bar{X} \mid \exists \bar{u} \in \bar{U} : (\bar{x}, \bar{u}) \in Z\}$, and $T(\bar{x}, \bar{u})$ is an interpretation of the transition set $T$ as a map $T : \bar{X} \times \bar{U} \to 2^{\bar{X}}$ that evaluates a set of post-states from a state-input pair. We consider reachability and invariance specifications given by the LTL formulae $\Diamond\psi$ and $\Box\psi$, respectively, where $\psi$ is a propositional formula over a set of atomic propositions $AP$. We first construct an initial winning set $Z_\psi = \{(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U} \mid L(\bar{x}, \bar{u}) \models \psi\}$, where $L : \bar{X} \times \bar{U} \to 2^{AP}$ is some labeling function.

To synthesize symbolic controllers for reachability specifications, we utilize the monotone function $G(Z) := CPre_T(Z) \cup Z_\psi$ to iteratively compute $Z_\infty = \mu Z.G(Z)$ starting with $Z_0 = \emptyset$. Here, we adopt a notation from $\mu$-calculus with $\mu$ as the minimal fixed-point operator and $Z$ as the operated variable. Interested readers can find more details in [6] and the references therein. The synthesized controller is a map $C : \bar{X}_w \to 2^{\bar{U}}$, where $\bar{X}_w \subseteq \bar{X}$ represents a winning (a.k.a. controllable) set of states. Map $C$ is defined by: $C(\bar{x}) = \{\bar{u} \in \bar{U} \mid (\bar{x}, \bar{u}) \in \mu^{j(\bar{x})} Z.G(Z)\}$, where $j(\bar{x}) = \inf\{i \in \mathbb{N} \mid \bar{x} \in \pi_{\bar{X}}(\mu^i Z.G(Z))\}$, and $\mu^i Z.G(Z)$ represents the value of the $i$th iteration of the minimal fixed-point computation. Algorithm 3 shows a serial implementation of the minimal fixed-point computation $Z_\infty = \mu Z.G(Z)$. For the sake of space, we omit a similar discussion about synthesizing controllers for invariance specifications. Interested readers can find more details in [11].

Algorithm 3: Traditional serial algorithm to synthesize $C$ enforcing the specification $\Diamond\psi$.
Input: Initial winning domain $Z_\psi \subset \bar{X} \times \bar{U}$ and $T$
Output: A controller $C : \bar{X}_w \to 2^{\bar{U}}$.
1: $Z_\infty \leftarrow \emptyset$
2: $\bar{X}_w \leftarrow \emptyset$
3: do
4:   $Z_0 \leftarrow Z_\infty$
5:   $Z_\infty \leftarrow CPre_T(Z_0) \cup Z_\psi$
6:   $D \leftarrow Z_\infty \setminus Z_0$
7:   foreach $\bar{x} \in \pi_{\bar{X}}(D)$ with $\bar{x} \notin \bar{X}_w$ do
8:     $\bar{X}_w \leftarrow \bar{X}_w \cup \{\bar{x}\}$
9:     $C(\bar{x}) := \{\bar{u} \in \bar{U} \mid (\bar{x}, \bar{u}) \in D\}$
10:  end
11: while $Z_\infty \neq Z_0$

Algorithm 4 is proposed as a parallelization of Algorithm 3. We assume using the same indexing map $I(\cdot)$ from Algorithm 2.

Algorithm 4: Proposed parallel algorithm to synthesize $C$ enforcing the specification $\Diamond\psi$.
Input: Initial winning domain $Z_\psi \subset \bar{X} \times \bar{U}$ and $T$
Output: A controller $C : \bar{X}_w \to 2^{\bar{U}}$.
1: $Z_\infty \leftarrow \emptyset$
2: $\bar{X}_w \leftarrow \emptyset$
3: do
4:   $Z_0 \leftarrow Z_\infty$
5:   for all $p \in \{1, 2, \cdots, P\}$ do
6:     $Z^p_{loc} \leftarrow \emptyset$
7:     $\bar{X}^p_{w,loc} \leftarrow \emptyset$
8:   end
9:   for all $(\bar{x}, \bar{u}) \in \bar{X} \times \bar{U}$ in parallel with index $i$ do
10:    $p = I(i)$
11:    $Posts \leftarrow Q \circ K^p_{loc}(\bar{x}, \bar{u})$
12:    if $Posts \subseteq Z_0 \cup Z_\psi$ then
13:      $Z^p_{loc} \leftarrow Z^p_{loc} \cup \{(\bar{x}, \bar{u})\}$
14:      $\bar{X}^p_{w,loc} \leftarrow \bar{X}^p_{w,loc} \cup \{\bar{x}\}$
15:      if $\bar{x} \notin \pi_{\bar{X}}(Z_0)$ then
16:        $C(\bar{x}) \leftarrow C(\bar{x}) \cup \{\bar{u}\}$
17:      end
18:    end
19:  end
20:  for all $p \in \{1, 2, \cdots, P\}$ do
21:    $Z_\infty \leftarrow Z_\infty \cup Z^p_{loc}$
22:    $\bar{X}_w \leftarrow \bar{X}_w \cup \bar{X}^p_{w,loc}$
23:  end
24: while $Z_\infty \neq Z_0$

Line (11) corresponds to computing $T(\bar{x}, \bar{u})$ from the stored characteristic values $(x_{lb}, x_{ub})$. Since all PEs use $Z_0$ when running lines (12) and (15), synchronization among all PEs is required to ensure all PEs get the most updated version of $Z_0$. Such synchronization happens in every iteration of the FP by collecting all local versions $Z^p_{loc}$ in line (21), followed by the update in line (4), before starting another parallel synthesis iteration of the parallel for-loop in line (9). Similarly, a variant of the algorithm that is more suitable for CPUs and FPGAs is provided in pFaces/GBFP.
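The minimal fixed-point computation of Algorithm 3 can be sketched serially as follows. This is an illustrative set-based Python sketch, not the pFaces kernel: the transition map $T$ is assumed to be given as a dictionary from $(\bar{x}, \bar{u})$ pairs to sets of post-states, and all names are assumptions.

```python
def synthesize_reach(T, Z_psi):
    """Algorithm 3: compute Z_inf = muZ.G(Z) with G(Z) = CPre_T(Z) ∪ Z_psi,
    recording, for each state, the inputs that first won it (controller C)."""
    Z_inf, Xw, C = set(), set(), {}
    while True:
        Z0 = Z_inf
        pz0 = {x for (x, u) in Z0}                     # projection pi_X(Z0)
        # G(Z0): pairs whose nonempty post-state set lies inside pi_X(Z0),
        # per eq. (3), joined with the initial winning set Z_psi
        Z_inf = {(x, u) for (x, u), posts in T.items()
                 if posts and posts <= pz0} | Z_psi
        D = Z_inf - Z0                                 # newly-won pairs
        for x in {x for (x, u) in D} - Xw:             # newly-won states only
            Xw.add(x)
            C[x] = {u for (xx, u) in D if xx == x}     # fixes j(x): first win
        if Z_inf == Z0:                                # fixed point reached
            return C, Xw
```

On a three-state chain a → b → c with the target satisfied at c, the fixed point settles in three productive iterations, and an input that only loops at a (winning a later than its first win) is correctly excluded from $C(a)$, matching the $j(\bar{x})$ semantics.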

3.3 A Memory-efficient Kernel

Modern CUs contain hundreds to thousands of PEs. This motivates the concept of more-compute/less-memory, where recomputing results between repeated iterations is favored over storing them. We apply this by eliminating the use of $K^p_{loc}$ in line (11) of Algorithm 4 and computing it on the fly, in the same way as done in lines (7) and (8) of Algorithm 2. We denote this modified kernel by pFaces/GBFPm.

3.4 Implementation Details

Figure 3 shows the work-flow of pFaces/GBFP. Apart from the boxes highlighted with CU, all steps are executed serially in the management ecosystem. The kernel developer implements such steps using subroutines from pFaces. For example, the step Distribute Jobs Based on Collected Data is a simple call to a subroutine in pFaces that computes the best task distribution among available PEs. Parallel tasks are handled completely by pFaces.

Figure 3: Work-flow of pFaces/GBFP. The flow reads user configuration files, identifies available parallel devices, tunes them with selected samples of the input/output spaces, compiles kernels for the devices, distributes jobs based on collected data, runs the abstraction algorithm and the synthesis iterations in parallel with synchronization points over the devices, and finally dumps the controller from the devices' memories, encodes/saves it, and generates code from it.

After constructing the abstraction, PEs are synchronized to make sure all PEs start the synthesis task with correct abstraction memory. Also, after each FP iteration, synchronization is required as discussed in Subsection 3.2. This requirement is mainly a consequence of the check in line (12) of Algorithm 4 and the fact that $Z_0$ is maintained as a distributed data container when the algorithm is executed on multiple CUs. This introduces overhead that reduces the scalability of the synthesis. Fortunately, such overhead is mitigated by the fact that dynamical systems possess some locality. More specifically, when a PE runs the check in line (12) of Algorithm 4, there is a good possibility that elements of $Posts$ are close (i.e., in the Euclidean distance) to $(\bar{x}, \bar{u})$. Consequently, the check is computed using memory from the same CU or neighboring CUs. pFaces, for its part, distributes the tasks with an encoding $I(\cdot)$ that promotes such locality.

Once the FP settles, pFaces/GBFP collects the controller data and encodes it. pFaces facilitates encoding and saving data objects as raw data, Binary Decision Diagrams (BDDs), bitmaps, or compressed bitmaps. Also, pFaces offers code generation by implementing the library BDD2Implement [3], helping users export the map $C$ as C/C++ code or VHDL code.

4 AN EXAMPLE

We show, with an example, how pFaces can be used to mitigate computation complexities resulting from the state-explosion problem. Consider the truck-with-a-trailer example presented in [10]. It is a three-dimensional system, and the requirement is to reach some target speed while maintaining a safe distance between the truck and the trailer. We generalize the example to Truck_N, where

Table 1: Details and results for the example Truck_N.

                               N=1     N=2     N=3     N=4
n                                3       5       7       9
|X̄ × Ū| × 10^6               2.64   36.17  398.29  5520.4
Memory per (x̄, ū) (Byte)       25      41       1       1
Total memory (M.B.)             63    1414     379    5264
pFaces kernel                    §       §      §§      §§
HW configuration              MIX1    GPU2    MIX2    MIX2
Time to find C (sec.)         0.89    0.98    0.96    46.6

N is the number of trailers, knowing that adding an extra trailer increases the dimension by two. We require the controller to be computed in real time (RT) within a deadline window of 1.0 second. For different N, we focus on the value of $|\bar{X} \times \bar{U}|$ as it directly affects the complexity of Algorithms 2 and 4.

For the original problem (i.e., Truck_1), the existing tool SCOTS solves the problem in 39 seconds using HW configuration CPU2 in Table 2, which violates the real-time constraint. With the same HW configuration, pFaces/GBFP (denoted by § in Table 1) solves the problem in 0.89 seconds, a speedup of around 44x. Speedups are calculated by dividing the time of the serial implementation, which is the time reported by SCOTS in the current case, by the time of the parallel implementation on the same example, for a specific HW configuration.
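The speedup calculation above can be written out explicitly; the values are the Truck_1 figures just quoted, and the function name is illustrative:

```python
def speedup(t_serial, t_parallel):
    """Speedup of a parallel run over the serial baseline on the same example
    and HW configuration: t_serial / t_parallel."""
    return t_serial / t_parallel

# Truck_1 on CPU2: SCOTS (serial) needs 39 s, pFaces/GBFP needs 0.89 s
print(round(speedup(39.0, 0.89)))   # ≈ 44
```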

We upgrade the system and synthesize a controller for Truck_2. In order to respect the RT deadline, we update the HW configuration to GPU2. The problem is then solved in 0.98 seconds. Notice the increase in memory per $(\bar{x}, \bar{u})$ when using pFaces/GBFP (see the discussion about $K$ in Subsection 3.1). To save memory, we use pFaces/GBFPm (denoted by §§ in Table 1) for N > 2.

For Truck_3, the HW configuration that allows respecting the RT deadline is too expensive to be installed in the truck. We rent some CUs from Amazon AWS, build the HW configuration MIX2, combining multiple GPUs, and share it among all trucks of type Truck_3. Now, the problem is solved in 0.96 seconds. Here, all trucks are assumed to have some access to the cloud for submitting requests and receiving a list of control actions, which is assumed to take less than 4 milliseconds of communication.

Now, we experiment on Truck_4 with the same HW configuration. Unfortunately, due to the state-explosion problem, we can no longer solve the problem in RT; it is solved in 46.6 seconds. We emphasize that, in order to control the complexity, more PEs need to be added in this case.

Table 1 reports the collected results for the four cases. When using pFaces/GBFPm, the 1-byte requirement for the memory per $(\bar{x}, \bar{u})$ applies to controller synthesis, not to the abstraction.

5 BENCHMARKING pFaces/GBFP

For benchmarking, we use the HW configurations listed in Table 2. We conduct a benchmark for scalability and report its results in Tables 3 and 4. Here, the dynamics and parameters of the examples dcdc and vehicle are borrowed from [11], while those of the examples robot and khepera are borrowed from [4].

The reference results for speedup computation are marked with a black box. The highest speedup is underlined. N/A denotes "not applicable" and indicates that the tool SCOTS does not run on some HW configurations. The example Truck_1 reported in Table 4 has a smaller $\eta$ than the one in Table 1. For CLS1, we report the time


Table 2: Used HW configurations for the proposed benchmarks.

Code  Name                                            Class    Number of PEs  PE Frequency       Memory (G.B.)  Power (Watt)  Price ($)
CPU1  Intel Core i5-6200U in Lenovo X260 Laptop 2016  CPU      2              2.8 GHz            8              15            281
CPU2  Intel Xeon E5-2630                              CPU      10             3.1 GHz            8              85            667
GPU1  NVIDIA Quadro P5000                             GPU      2560           600 MHz            16             180           1,800
GPU2  NVIDIA Tesla V100                               GPU      5120           800 MHz            16             250           10,664
GPU3  AMD Radeon Pro Vega 20 in Macbook Pro 2018      GPU      1280           1200 MHz           4              ≤50           350
PGA1  Altera DE5-Net Board                            FPGA     2              50 MHz             8              ≤4            6,250
PGA2  Kintex UltraScale FPGA KCU1500                  FPGA     2              300 MHz            16             ≤10           2,500
MIX1  CPU1 and its internal GPU                       Mixed    24             2.8 GHz / 300 MHz  8              25            281
MIX2  8×GPU2 with NVLink interconnection              Mixed    40960          800 MHz            128            2000          85,312
CLS1  Two networked CNs: 32-core CPU and GPU2         Cluster  5152           3.8 GHz / 800 MHz  488            450           37,000

Table 3: Scalability benchmarking for the examples: DCDC and Vehicle.

DCDC: |Ū| = 2, |X̄| = 639200
              CPU1  CPU2   GPU1   GPU2   PGA1   PGA2   MIX1  MIX2    CLS1
SCOTS         44.3  36.9   N/A    N/A    N/A    N/A    N/A   N/A     N/A
pFaces/GBFP   1.6   0.41   0.037  0.009  0.189  0.073  0.41  0.003   0.8/98%
Speedup       23x   108x   997x   4100x  195x   505x   90x   12300x  44x

Vehicle: |Ū| = 49, |X̄| = 91035
              CPU1  CPU2   GPU1   GPU2   GPU3   PGA1   MIX1  MIX2    CLS1
SCOTS         207   203    N/A    N/A    N/A    N/A    N/A   N/A     N/A
pFaces/GBFP   52.1  10.9   0.98   0.152  1.72   16.8   13.6  0.04    12.0/99%
Speedup       4x    18x    207x   1350x  118x   12x    15x   5075x   16x

Table 4: Scalability benchmarking for the examples: Robot and Truck_1.

Truck_1: |Ū| = 51, |X̄| = 229327
              CPU1  CPU2   GPU1   GPU2   PGA1   PGA2   MIX1  MIX2    CLS1
SCOTS         249   191    N/A    N/A    N/A    N/A    N/A   N/A     N/A
pFaces/GBFP   7.2   1.7    0.462  0.153  1.05   0.84   2.0   0.006   1.4/97%
Speedup       26x   112x   413x   1249x  182x   227x   95x   31850x  136x

Robot: |Ū| = 77, |X̄| = 1364889
              CPU1  CPU2   GPU1   GPU2   GPU3   PGA1   MIX1  MIX2    CLS1
SCOTS         4423  3949   N/A    N/A    N/A    N/A    N/A   N/A     N/A
pFaces/GBFP   154   97     8.3    1.85   13.2   147    136   0.309   96.2/96%
Speedup       25x   40x    475x   2134x  299x   26x    29x   12779x  41x

to solve the problem and the percentage of it consumed by the network communication overhead. We recommend using clusters only for large problems, where the FP computation time is expected to be much longer than the communication overhead.

We only compare with the tool SCOTS since it implements, exactly, the serial Algorithms 1 and 3. Nevertheless, pFaces/GBFP outperforms the tools reported in Section 1.1. For example, in [5], a different technique is used and the example DCDC was reported to take 0.36 seconds, which is clearly outperformed by pFaces/GBFP as the number of PEs increases. The other tools reported in Section 1.1 are in turn outperformed by SCOTS or the tool in [5]. Therefore, we do not compare pFaces with them.

6 CONCLUSIONS AND FUTURE WORK

A software ecosystem is proposed to facilitate prototyping parallel algorithms serving research areas like symbolic control and reachability analysis. A traditional symbolic control technique is redesigned as a data-parallel task that scales with the number of PEs, helping to tackle computational complexities. The kernel pFaces/GBFP scales very well but consumes a lot of memory. pFaces/GBFPm is memory-efficient but slower due to repeated computations. Future work will focus on designing distributed data structures that balance memory size against fast write/query time.

ACKNOWLEDGMENTS

We gratefully acknowledge the support of the Intel, Xilinx, NVIDIA, and Amazon corporations. The configuration GPU1 was donated by NVIDIA Corporation. The configuration PGA1 was donated by Intel Corporation. Paid access to Amazon AWS for testing the tool on an EC2-F1 instance was provided by Xilinx Corporation. The tests on the HW configurations GPU2, MIX2 and CLS1 in Amazon AWS were provided through a grant from Amazon.

REFERENCES

[1] K. Hsu, R. Majumdar, K. Mallik, and A. K. Schmuck. 2018. Multi-Layered Abstraction-Based Controller Synthesis for Continuous-Time Systems. In Proceedings of the 21st International Conference on Hybrid Systems: Computation and Control (Part of CPS Week) (HSCC '18). ACM, New York, NY, USA, 120–129. https://doi.org/10.1145/3178126.3178143
[2] L. Kalms and D. Göhringer. 2017. Exploration of OpenCL for FPGAs using SDAccel and comparison to GPUs and multicore CPUs. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE, USA, 1–4. https://doi.org/10.23919/FPL.2017.8056847
[3] M. Khaled. 2017. BDD2Implement: A Code Generation Tool for Symbolic Controllers. https://gitlab.lrz.de/hcs/BDD2Implement
[4] M. Khaled, M. Rungger, and M. Zamani. June 2018. SENSE: Abstraction-Based Synthesis of Networked Control Systems. In Electronic Proceedings in Theoretical Computer Science (EPTCS), 272. Open Publishing Association (OPA), 111 Cooper Street, Waterloo, Australia, 65–78. https://doi.org/10.4204/EPTCS.272.6
[5] Y. Li and J. Liu. 2018. ROCS: A Robustly Complete Control Synthesis Tool for Nonlinear Dynamical Systems. In Proceedings of the 21st International Conference on Hybrid Systems: Computation and Control (Part of CPS Week) (HSCC '18). ACM, New York, NY, USA, 130–135. https://doi.org/10.1145/3178126.3178153
[6] O. Maler, A. Pnueli, and J. Sifakis. 1995. On the synthesis of discrete controllers for timed systems. In 12th Annual Symposium on Theoretical Aspects of Computer Science (STACS 95), E. W. Mayr and C. Puech (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 229–242. https://doi.org/10.1007/3-540-59042-0_76
[7] M. Mazo, A. Davitian, and P. Tabuada. 2010. PESSOA: A Tool for Embedded Controller Synthesis. In Computer Aided Verification, Tayssir Touili, Byron Cook, and Paul Jackson (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 566–569. https://doi.org/10.1007/978-3-642-14295-6_49
[8] S. Mouelhi, A. Girard, and G. Gössler. 2013. CoSyMA: A Tool for Controller Synthesis Using Multi-scale Abstractions. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control (HSCC '13). ACM, New York, NY, USA, 83–88. https://doi.org/10.1145/2461328.2461343
[9] G. Reissig, A. Weber, and M. Rungger. April 2017. Feedback Refinement Relations for the Synthesis of Symbolic Controllers. IEEE Trans. Automat. Control 62, 4 (April 2017), 1781–1796. https://doi.org/10.1109/TAC.2016.2593947
[10] M. Rungger, M. Mazo, Jr., and P. Tabuada. 2013. Specification-guided Controller Synthesis for Linear Systems and Safe Linear-time Temporal Logic. In 16th International Conference on Hybrid Systems: Computation and Control (HSCC '13). ACM, New York, NY, USA, 333–342. https://doi.org/10.1145/2461328.2461378
[11] M. Rungger and M. Zamani. 2016. SCOTS: A Tool for the Synthesis of Symbolic Controllers. In Proceedings of the 19th International Conference on Hybrid Systems: Computation and Control (HSCC '16). ACM, New York, NY, USA, 99–104. https://doi.org/10.1145/2883817.2883834
[12] P. Tabuada. 2009. Verification and Control of Hybrid Systems: A Symbolic Approach. Springer, USA. https://doi.org/10.1007/978-1-4419-0224-5
[13] T. Wongpiromsarn, U. Topcu, N. Ozay, H. Xu, and R. M. Murray. 2011. TuLiP: A Software Toolbox for Receding Horizon Temporal Logic Planning. In 14th International Conference on Hybrid Systems: Computation and Control (HSCC '11). ACM, New York, NY, USA, 313–314. https://doi.org/10.1145/1967701.1967747
[14] M. Zamani, G. Pola, M. Mazo Jr., and P. Tabuada. 2012. Symbolic Models for Nonlinear Control Systems Without Stability Assumptions. IEEE Trans. Automat. Control 57, 7 (July 2012), 1804–1809. https://doi.org/10.1109/TAC.2011.2176409