An approximate dynamic programming approach to probabilistic reachability for stochastic hybrid systems
Alessandro Abate, Maria Prandini, John Lygeros, and Shankar Sastry
Abstract—This paper addresses the computational overhead
involved in probabilistic reachability computations for a general
class of controlled stochastic hybrid systems. An approximate
dynamic programming approach is proposed to mitigate the
curse of dimensionality issue arising in the solution to the
stochastic optimal control reformulation of the probabilistic
reachability problem. An algorithm tailored to this problem is
introduced and compared with the standard numerical solution
to dynamic programming on a benchmark example.
I. INTRODUCTION
Stochastic Hybrid Systems (SHS) are a general class of
models relevant to a wide range of application contexts
involving interacting discrete and continuous dynamics, as
well as probabilistic uncertainty, [5], [7].
In this paper we study the reachability problem for SHS.
Reachability is an important topic in systems theory. Quali-
tatively, it deals with the issue of evaluating whether the state
of a system will reach a certain set during a time interval,
starting from some initial conditions, and possibly subject to
a control input. If such a set represents an unsafe region of
the state space, one is dealing with a safety problem and
has the choice to select the control input so as to avoid
entering that set. In a stochastic context, it appears natural to
interpret the reachability concept in probabilistic terms, and
to investigate problems such as quantifying the probability
of reaching a set or minimizing the probability of entering
an unsafe set. If viable, this characterization appears richer
and in many cases preferable to a worst-case viewpoint,
where one considers each admissible trajectory, neglecting
how likely it is to occur.
Recently, a number of contributions on probabilistic reachability
of SHS have appeared in the literature; see for
example [2], [6], [11], [13], [14]. However, reachability
computations for SHS of practical scale remain a challenging
open problem. In this paper we study this issue
for the general class of discrete-time controlled SHS in [2].
We focus on a safety problem where the objective is to
determine the control policy that maximizes the probability
of remaining within a given safe set during a finite time
horizon. In [2], such a safety problem has been reformulated
as a stochastic optimal control problem with multiplicative
cost for a controlled Markov chain, to which dynamic
programming can be applied. Given that the solution to
the value iteration equations obtained through the dynamic
programming approach cannot be written out explicitly, the
safety problem has to be solved in practice through some
approximation method.

This work was partially supported by MIUR under the project “New
methods for Identification and Adaptive Control for Industrial Systems,” by
the EC under the project iFly TREN/07/FP6AE/S07.71574/037180, and by
the NSF grant CCR-0225610.
A. Abate is with Stanford University, aabate@stanford.edu;
M. Prandini is with the Politecnico di Milano, prandini@elet.polimi.it;
J. Lygeros is with ETH Zurich, lygeros@control.ee.ethz.ch;
S. Sastry is with the University of California, Berkeley, sastry@eecs.berkeley.edu.

The work in [1] has shown that, under appropriate continu-
ity assumptions on the transition probabilities that character-
ize the SHS dynamics, the numerical solution obtained by a
standard gridding scheme is asymptotically convergent as the
gridding scale parameter goes to zero. Non-asymptotic error
bounds can also be given. On the other hand, the overwhelm-
ing computational burden associated with this approach
makes it inapplicable in realistic situations. For problems of
practical scale, storing and manipulating functions over the
discretized hybrid state space becomes prohibitive, and some
approximation scheme is needed. Here, we investigate and
discuss an approximate value iteration algorithm that relies
on a neural approximation of the value function to mitigate
the curse of dimensionality. A comparative study between
this technique and the grid-based numerical approximation
is presented on a benchmark example.
II. DISCRETE-TIME STOCHASTIC HYBRID MODEL
The state of the controlled discrete-time stochastic hybrid
system (DTSHS) model introduced in [2] is characterized
by two components: a discrete and a continuous one. The
continuous state evolves according to a probabilistic law
that depends on the value taken by the discrete state. The
discrete state can transition between different values in a
finite set according to some probabilistic law that depends
on the continuous state. Both the continuous and the discrete
probabilistic evolutions can be affected by some control input
(transition input). Furthermore, whenever a transition in the
discrete state occurs, the continuous state is subject to a
probabilistic reset that may depend on an additional control
input (reset input).
Definition 1 (DTSHS): A discrete-time stochastic hybrid
system is a tuple $H = (Q, n, U, \Sigma, T, T_q, R)$, where
- $Q := \{q_1, q_2, \dots, q_m\}$, $m \in \mathbb{N}$, represents the discrete
state space
- $n : Q \to \mathbb{N}$ assigns to each discrete state value $q \in Q$
the dimension of the continuous state space $\mathbb{R}^{n(q)}$. The
hybrid state space is then $S := \cup_{q \in Q} \{q\} \times \mathbb{R}^{n(q)}$
- $U$ is a Borel space denoting the transition control space
- $\Sigma$ is a Borel space representing the reset control space
- $T : \mathcal{B}(\mathbb{R}^{n(\cdot)}) \times S \times U \to [0,1]$ is a Borel-measurable
stochastic kernel on $\mathbb{R}^{n(\cdot)}$ given $S \times U$, which assigns to
each $s = (q,x) \in S$ and $u \in U$ a probability measure
$T(dx \mid s, u)$ on the Borel space $(\mathbb{R}^{n(q)}, \mathcal{B}(\mathbb{R}^{n(q)}))$
- $T_q : Q \times S \times U \to [0,1]$ is a discrete stochastic kernel
on $Q$ given $S \times U$, which assigns to each $s \in S$ and
$u \in U$ a probability distribution $T_q(q \mid s, u)$ over $Q$
- $R : \mathcal{B}(\mathbb{R}^{n(\cdot)}) \times S \times \Sigma \times Q \to [0,1]$ is a Borel-measurable
stochastic kernel on $\mathbb{R}^{n(\cdot)}$ given $S \times \Sigma \times Q$, which assigns
to each $s \in S$, $\sigma \in \Sigma$, and $q' \in Q$ a probability measure
$R(dx \mid s, \sigma, q')$ on $(\mathbb{R}^{n(q')}, \mathcal{B}(\mathbb{R}^{n(q')}))$.

Proceedings of the 47th IEEE Conference on Decision and Control, Cancun, Mexico, Dec. 9-11, 2008, ThTA08.1. 978-1-4244-3124-3/08/$25.00 ©2008 IEEE

In order to define an execution for a DTSHS we have to
specify how the system is initialized and how the control
inputs to the system are selected.
The system initialization at the initial time $k = 0$ is
specified through a probability measure $\pi_0 : \mathcal{B}(S) \to [0,1]$
on the Borel space $(S, \mathcal{B}(S))$, where $\mathcal{B}(S)$ is the σ-field
generated by the subsets of $S$ of the form $\cup_q \{q\} \times C_q$,
with $C_q$ denoting a Borel set in $\mathbb{R}^{n(q)}$. As for the inputs,
we consider the case where the control inputs are selected
based on the current value of the hybrid state according to
a Markov policy [2]. A Markov policy for $H$ is a sequence
$\mu = (\mu_0, \mu_1, \dots, \mu_{N-1})$ of universally measurable maps [3]
$\mu_k : S \to U \times \Sigma$, $k = 0, 1, \dots, N-1$. We denote the set of
Markov policies as $\mathcal{M}_m$.
Let $\tau_x : \mathcal{B}(\mathbb{R}^{n(\cdot)}) \times S \times U \times \Sigma \times Q \to [0,1]$ be a stochastic
kernel on $\mathbb{R}^{n(\cdot)}$ given $S \times U \times \Sigma \times Q$, which assigns to each
$s = (q,x) \in S$, $u \in U$, $\sigma \in \Sigma$, and $q' \in Q$ a probability
measure on the Borel space $(\mathbb{R}^{n(q')}, \mathcal{B}(\mathbb{R}^{n(q')}))$ as follows:

$$\tau_x(dx' \mid (q,x), u, \sigma, q') = \begin{cases} T(dx' \mid (q,x), u), & \text{if } q' = q, \\ R(dx' \mid (q,x), \sigma, q'), & \text{if } q' \neq q. \end{cases}$$

Based on $\tau_x$ we can introduce the Borel-measurable stochastic
kernel $T_s : \mathcal{B}(S) \times S \times U \times \Sigma \to [0,1]$ on $S$ given
$S \times U \times \Sigma$, which assigns to each $s = (q,x) \in S$ and
$(u,\sigma) \in U \times \Sigma$ a probability measure on the Borel space
$(S, \mathcal{B}(S))$ as follows:

$$T_s(ds' \mid s, (u,\sigma)) = \tau_x(dx' \mid s, u, \sigma, q') \, T_q(q' \mid s, u), \tag{1}$$

$s, s' = (q', x') \in S$, $(u,\sigma) \in U \times \Sigma$.
Definition 2 (Execution): An execution for a DTSHS
$H = (Q, n, U, \Sigma, T, T_q, R)$ associated with a Markov policy
$\mu = (\mu_0, \mu_1, \dots, \mu_{N-1}) \in \mathcal{M}_m$ and an initial distribution
$\pi_0$ is an $S$-valued stochastic process $\{s(k), k \in [0,N]\}$,
whose sample paths are obtained according to the following
algorithm, where all the random extractions considered are
mutually independent:
extract from $S$ a value $s_0$ for $s(0)$ according to $\pi_0$;
for $k = 0$ to $N - 1$
    set $(u_k, \sigma_k) = \mu_k(s_k)$;
    extract from $S$ a value $s_{k+1}$ for $s(k+1)$ according
    to $T_s(\cdot \mid s_k, (u_k, \sigma_k))$;
end
A DTSHS $H$ can then be described as a controlled
Markov process with state space $S$, control space $U \times \Sigma$,
and controlled transition probability function $T_s$ defined
in (1). Thus, the execution $\{s(k), k \in [0,N]\}$ associated
with a specific $\mu \in \mathcal{M}_m$ and a probability $\pi_0$ is a time
inhomogeneous Markov process defined on the canonical
sample space $\Omega = S^N$, endowed with its product topology
$\mathcal{B}(\Omega)$, with probability measure $P^{\mu}_{\pi_0}$ uniquely defined by the
transition kernel $T_s$, the policy $\mu \in \mathcal{M}_m$, and the initial
probability measure $\pi_0$ (see [3, Proposition 7.45]). When $\pi_0$
is concentrated on a point $s \in S$, that is $\pi_0(ds) = \delta_s(ds)$,
we shall write simply $P^{\mu}_{s}$.
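The sampling procedure of Definition 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`simulate_execution`, `sample_initial`, `sample_next`) and the toy two-mode dynamics are hypothetical and do not come from the paper; the toy kernel uses the same Gaussian update whether or not the mode changes, which is one admissible choice of $T$ and $R$.

```python
import numpy as np

def simulate_execution(sample_initial, policy, sample_next, N, rng):
    """Draw one sample path s(0), ..., s(N) following Definition 2."""
    s = sample_initial(rng)                 # s_0 drawn from pi_0
    path = [s]
    for k in range(N):
        u, sigma = policy[k](s)             # (u_k, sigma_k) = mu_k(s_k)
        s = sample_next(s, u, sigma, rng)   # s_{k+1} drawn from T_s(. | s_k, (u_k, sigma_k))
        path.append(s)
    return path

# Hypothetical toy system with modes "on"/"off" and a scalar continuous state.
def sample_initial(rng):
    return ("off", 20.0)                    # pi_0 concentrated on a single point

def sample_next(s, u, sigma, rng):
    q, x = s
    # T_q: the commanded mode u is reached with probability 0.8, otherwise no switch.
    q_next = u if rng.random() < 0.8 else q
    # tau_x: T and R coincide here, so the same update applies with or without a switch.
    drift = 0.3 if q == "on" else -0.1
    x_next = x + drift + 0.1 * rng.standard_normal()
    return (q_next, x_next)

# Stationary Markov policy: heat when below 20, trivial reset input 0.
policy = [lambda s: ("on" if s[1] < 20.0 else "off", 0)] * 10

rng = np.random.default_rng(0)
path = simulate_execution(sample_initial, policy, sample_next, 10, rng)
```

The returned list has $N + 1$ hybrid states, one for each time instant in $[0, N]$.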
III. PROBABILISTIC REACHABILITY
We start by considering the following reachability problem:
given a stochastic hybrid system $H$, determine the
probability that the execution associated with some Markov
policy $\mu \in \mathcal{M}_m$ and initialization $\pi_0$ will remain in a Borel
set $A \in \mathcal{B}(S)$ during the whole time horizon $[0,N]$:

$$p^{\mu}_{\pi_0}(A) := P^{\mu}_{\pi_0}(s(k) \in A \text{ for all } k \in [0,N]). \tag{2}$$

If $\pi_0$ is concentrated on $s \in S$, we use the notation $p^{\mu}_{s}(A)$. If
set $\bar{A} = S \setminus A$ represents an unsafe set for $H$, by computing
$p^{\mu}_{s_0}(A)$, we shall evaluate the safety level for system $H$ when
it starts from $s_0$ and is subject to the policy $\mu \in \mathcal{M}_m$.
Observe that

$$\prod_{k=0}^{N} 1_A(s_k) = \begin{cases} 1, & \text{if } s_k \in A \text{ for all } k \in [0,N], \\ 0, & \text{otherwise}, \end{cases}$$

where $s_k \in S$, $k \in [0,N]$ and $1_A : S \to \{0,1\}$ denotes the
indicator function of set $A$. Then,

$$p^{\mu}_{\pi_0}(A) = E^{\mu}_{\pi_0}\Big[\prod_{k=0}^{N} 1_A(s(k))\Big],$$

where $E^{\mu}_{\pi_0}$ denotes the expected value with respect to the
probability measure $P^{\mu}_{\pi_0}$. Based on this representation of
$p^{\mu}_{\pi_0}(A)$ as a multiplicative cost, reachability can be addressed
by dynamic programming, [2]. Consider a Markov policy
$\mu = (\mu_0, \mu_1, \dots, \mu_{N-1}) \in \mathcal{M}_m$. For each $k \in [0,N]$, $s \in S$,
define $V^{\mu}_k : S \to [0,1]$ as

$$V^{\mu}_k(s) := 1_A(s) \int_{A^{N-k}} \prod_{h=k+1}^{N-1} T_s^{\mu_h}(ds_{h+1} \mid s_h) \, T_s^{\mu_k}(ds_{k+1} \mid s),$$

where $\int_{A^0} \dots = 1$ and $T_s^{\mu_l}(\cdot \mid s) = T_s(\cdot \mid s, \mu_l(s))$, $s \in S$, $l \in [k, N-1]$.
It is easily seen that $V^{\mu}_k(s)$, $s \in S$, represents the
probability of staying inside the safe set $A$ over the (residual)
time horizon $[k,N]$ under policy $\mu \in \mathcal{M}_m$, when the state at
time $k$ is $s \in S$: $V^{\mu}_k(s) = E^{\mu}_{\pi_0}\big[\prod_{h=k}^{N} 1_A(s(h)) \mid s(k) = s\big]$.
Note that $V^{\mu}_k(s)$ does not depend on $\pi_0$. For any $\pi_0$, $p^{\mu}_{\pi_0}(A)$
can then be expressed as

$$p^{\mu}_{\pi_0}(A) = \int_S V^{\mu}_0(s) \, \pi_0(ds).$$

In [2] it is shown that, for a fixed Markov policy $\mu =
(\mu_0, \mu_1, \dots, \mu_{N-1})$, $\mu_k : S \to U \times \Sigma$, $k = 0, 1, \dots, N-1$,
the functions $V^{\mu}_k : S \to [0,1]$, $k = 0, 1, \dots, N$, can be
computed by the following backward recursion:

$$V^{\mu}_k(s) = 1_A(s) \int_S V^{\mu}_{k+1}(s_{k+1}) \, T_s^{\mu_k}(ds_{k+1} \mid s),$$

initialized with $V^{\mu}_N(s) = 1_A(s)$, $s \in S$.
In the case when policy $\mu \in \mathcal{M}_m$ is not fixed a-priori
and we are dealing with a safety problem, we have the
possibility to design $\mu$ so as to maximize the safety level,
[2]. A Markov policy $\mu^* \in \mathcal{M}_m$ is maximally safe if
$p^{\mu^*}_{s}(A) = \sup_{\mu \in \mathcal{M}_m} p^{\mu}_{s}(A)$, $\forall s \in S$.
The problem of computing a maximally safe policy for
a DTSHS $H$ is a stochastic optimal control problem with
multiplicative cost for a controlled Markov process on a
hybrid state space. Not surprisingly, Theorem 1 shows how
to compute a maximally safe policy based on a dynamic
programming backward iterative algorithm, [2].
Theorem 1: Define $V^*_k : S \to [0,1]$, $k = 0, 1, \dots, N$, by
the backward recursion:

$$V^*_k(s) = \sup_{(u,\sigma) \in U \times \Sigma} 1_A(s) \int_S V^*_{k+1}(s_{k+1}) \, T_s(ds_{k+1} \mid s, (u,\sigma)),$$

$s \in S$, initialized with $V^*_N(s) = 1_A(s)$, $s \in S$.
Then, $V^*_0(s) = \sup_{\mu \in \mathcal{M}_m} p^{\mu}_{s}(A)$, $s \in S$.
If $\mu^*_k : S \to U \times \Sigma$, $k \in [0, N-1]$, is such that, $\forall s \in A$

$$\mu^*_k(s) \in \arg\sup_{(u,\sigma) \in U \times \Sigma} 1_A(s) \int_S V^*_{k+1}(s_{k+1}) \, T_s(ds_{k+1} \mid s, (u,\sigma)), \tag{3}$$

then, $\mu^* = (\mu^*_0, \dots, \mu^*_{N-1})$ is a maximally safe Markov
policy.
Note that $V^*_k(s)$ represents the maximal safety level over
the time interval $[k,N]$ starting from $s \in S$: $V^*_k(s) =
\sup_{\mu \in \mathcal{M}_m} V^{\mu}_k(s)$. In the dynamic programming literature $V^{\mu}_k$
and $V^*_k$ are called value function and optimal value function,
respectively, and the backward recursion that yields $V^*_0$ is
known as value iteration.
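To make the recursion of Theorem 1 concrete, the following sketch applies it to a finite controlled Markov chain, where the integral over $S$ reduces to a matrix-vector product. The function name `safety_value_iteration` and the three-state example are assumptions made for illustration; they are not taken from the paper, which works on a hybrid state space.

```python
import numpy as np

def safety_value_iteration(P, safe, N):
    """Backward recursion of Theorem 1 on a finite chain.

    P    : array (num_actions, n, n), P[a, s, s2] = probability of s -> s2 under action a
    safe : boolean array (n,), indicator of the safe set A
    Returns V_0 and the list of greedy policies [mu_0, ..., mu_{N-1}].
    """
    indicator = safe.astype(float)
    V = indicator.copy()                      # V_N = 1_A
    policy = []
    for _ in range(N):                        # k = N-1, ..., 0
        Q = P @ V                             # Q[a, s] = sum_s2 P[a, s, s2] * V_{k+1}(s2)
        policy.append(Q.argmax(axis=0))       # maximizing action at each state, as in (3)
        V = indicator * Q.max(axis=0)         # V_k = 1_A * max_a Q[a, .]
    policy.reverse()
    return V, policy

# Hypothetical example: states 0 and 1 are safe, state 2 is unsafe and absorbing.
P = np.array([
    [[0.9, 0.1, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]],   # action 0
    [[0.0, 1.0, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]],   # action 1
])
safe = np.array([True, True, False])
V0, policy = safety_value_iteration(P, safe, N=2)
```

For this example and $N = 2$ the recursion gives $V^*_1 = (1, 0.6, 0)$ and $V^*_0 = (0.96, 0.5, 0)$, so the maximal probability of staying safe for two steps is 0.96 from state 0 and 0.5 from state 1.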
IV. APPROXIMATE VALUE ITERATION FOR
REACHABILITY COMPUTATIONS
The generality of the stochastic hybrid model introduced
in Definition 1 and the structure of the value iteration in The-
orem 1 suggest that the solution to the reachability problem
will rarely admit an explicit form. Hence, an implementable
version of the procedure in Theorem 1 needs to be proposed.
In particular, the computational aspects associated with the
problem are of key importance for its “practical” solution.
The classical method for the numerical solution to dy-
namic programming rests on the discretization of the state
and control spaces: the (approximate) optimal value func-
tions are represented by piecewise constant functions on a
partition of the state space, and optimization is performed
over the discretized input set. In [1], this approximation
scheme is applied to the value iteration in Theorem 1. Since
the optimal value functions $V^*_k : S \to [0,1]$, $k = 0, 1, \dots, N$,
are identically zero outside the safe set $A = \cup_{q \in Q} \{q\} \times A_q$,
computations are in fact confined to $A$, hence gridding
can be restricted to the sets $A_q \in \mathcal{B}(\mathbb{R}^{n(q)})$, $q \in Q$. In
[1], it is shown that if $A_q$, $q \in Q$, are compact sets, then,
under weak regularity assumptions on the stochastic kernel
$T_s$ (Lipschitz continuity), the so-obtained numerical solution
converges to the actual solution with known rate, as the grid
size goes to zero.
Unfortunately, as is often the case for grid-based methods,
scalability appears to be critical for the applicability
of this numerical approximation scheme to practical problems.
In each iteration one has to manipulate and store
functions that are represented by a number of values that
grows exponentially with the dimension of the continuous
state space. This curse of dimensionality makes the solution
of reachability problems for high-dimensional DTSHS pro-
hibitive and calls for some approximation scheme to reduce
the computational burden.
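The exponential growth can be made tangible with a small count of the values stored per iteration by a uniform grid. This is a back-of-the-envelope sketch: the helper `grid_table_size` is hypothetical, and it assumes that every mode has the same continuous dimension and the same number of cells per axis.

```python
def grid_table_size(num_modes, cells_per_axis, dim):
    """Number of values needed to store one piecewise constant value function."""
    return num_modes * cells_per_axis ** dim

# Three modes, 72 cells per axis, increasing continuous dimension.
sizes = {dim: grid_table_size(3, 72, dim) for dim in (1, 2, 4)}
```

With these illustrative numbers the table grows from 216 entries in one dimension to 15552 in two dimensions and 80621568 in four dimensions, and the integral in the value iteration has to be evaluated at every entry.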
The approximate value iteration method tries to defeat
the curse of dimensionality by approximating the optimal
value functions by finitely parameterized functions of known
structure. Value iteration is then applied to these compactly-
represented approximations, rather than to look-up tables,
as in the grid-based approximation of the original DP.
Depending on the size of the parameters set, this may result
in an effective speedup of the overall computations. Typi-
cally, a linear combination of pre-specified basis functions is
adopted. In our hybrid setting the approximate optimal value
function at time k ∈ {0,1,...,N − 1} takes the form:

$$\hat{V}^*_k(s; w_k) = \sum_{i=1}^{h} w^{q}_{i,k} \, \phi_i(x), \quad s = (q,x) \in S,$$

where the parameter vector $w_k = (w^{q_1}_k, w^{q_2}_k, \dots, w^{q_m}_k)$
consists of $w^{q}_k = (w^{q}_{1,k}, w^{q}_{2,k}, \dots, w^{q}_{h,k})$, $q \in Q$.
Determining the approximate optimal value function at time
$k$ amounts to determining its parameter vector $w^*_k$. Here,
unlike the more standard non-hybrid setting, for any $k$ we
have $m$ approximate functions, one per mode $q \in Q$, and
each one with its own parameter vector $w^{q}_k$.
The approximate optimal value functions $\hat{V}^*_k(\cdot; w^*_k)$, $k =
0, 1, \dots, N-1$, are computed according to a backward
iterative procedure initialized as in Theorem 1, where each
iteration consists of two steps, as described hereafter, at time
$k \in \{0, 1, \dots, N-1\}$. Suppose that $\hat{V}^*_{k+1}(\cdot; w^*_{k+1})$ is known.
Then, $\hat{V}^*_k(s; w^*_k)$ is obtained as follows:
Step 1. apply the value iteration operator to $\hat{V}^*_{k+1}(\cdot; w^*_{k+1})$:

$$\bar{V}_k(s) = \sup_{(u,\sigma) \in U \times \Sigma} 1_A(s) \int_A \hat{V}^*_{k+1}(s_{k+1}; w^*_{k+1}) \, T_s(ds_{k+1} \mid s, (u,\sigma)), \tag{4}$$

Step 2. minimize the weighted $L_2$-norm of the error:

$$w^*_k = \arg\min_{w_k \in \mathbb{R}^{mh}} \int_A \pi(s) \, \big\| \bar{V}_k(s) - \hat{V}^*_k(s; w_k) \big\|^2 ds, \tag{5}$$

where $\pi(s) \geq 0$, $s \in A$, is a weighting function that allows one to
obtain a more accurate fitting for those states that are known
to be critical or frequently visited by the system executions.
In our reachability application $\pi$ takes larger values close to
the boundary of the safe set $A$. The resulting two-step iteration
can be viewed as the application of a modified version
of the value iteration operator that includes a projection with
respect to a weighted $L_2$-norm, [8].
Note that, in the implementation of the approximate value
iteration algorithm, one needs to compute the integral in
equation (4). Despite the fact that $\hat{V}^*_{k+1}(\cdot; w^*_{k+1})$ is known
in analytic form, this integral, in general, has to be solved
numerically. In contrast with the numerical approximation to
DP based on gridding, though, the values of $\hat{V}^*_{k+1}(\cdot; w^*_{k+1})$
that are needed for solving the integral can be determined
on-the-fly, based on its analytic expression. As for the
integral in (5), an accurate approximation can be obtained
by considering $\pi$ as a probability density with support on $A$
and replacing (5) with:

$$w^*_k = \arg\min_{w_k \in \mathbb{R}^{mh}} \sum_{s \in \bar{S}} \big\| \bar{V}_k(s) - \hat{V}^*_k(s; w_k) \big\|^2, \tag{6}$$

where $\bar{S}$ is a set of samples extracted from $A$ according
to $\pi$. The problem of solving (6) can be viewed as that
of training the neural network $\hat{V}^*_k(\cdot; w_k)$ with weights $w_k$
based on the training data set $\{(s, \bar{V}_k(s)), s \in \bar{S}\}$. This
allows one to use well-studied algorithms developed in the neural
networks field for the implementation of the approximate
value iteration [4].
Once the approximate optimal value functions $\hat{V}^*_k(\cdot; w^*_k)$,
$k = 0, 1, \dots, N-1$, are known, it is then possible to
compute the approximate maximally safe policy $\hat{\mu}^* =
(\hat{\mu}^*_0, \dots, \hat{\mu}^*_{N-1})$ as in Theorem 1. The availability of
an analytic, though approximate, expression of the optimal
value functions that is also easy to store makes it more
convenient to compute on-line only the control input to be
applied at the states actually visited by the system during its
execution.
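The two-step iteration can be sketched as follows for a scalar continuous state. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that are not in the paper: a Gaussian continuous kernel that is also used as reset kernel, a plain Riemann sum for the integral in (4), a closed-form least-squares fit for (6) in place of an iterative training algorithm, uniformly spaced samples instead of samples drawn from $\pi$, and the same samples for every mode. All function and variable names are hypothetical.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Gaussian radial basis functions phi_i evaluated at the points x."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def approximate_value_iteration(modes, actions, mean_next, mode_prob,
                                noise_std, quad, samples, centers, width, N):
    """Return weights[k][q] with V_hat_k(q, x) = rbf_features(x) @ weights[k][q]."""
    dx = quad[1] - quad[0]                              # uniform quadrature step on A
    phi_quad = rbf_features(quad, centers, width)
    phi_samples = rbf_features(samples, centers, width)
    v_next = {q: np.ones_like(quad) for q in modes}     # V_N = 1_A, evaluated on A
    weights = [None] * N
    for k in range(N - 1, -1, -1):
        weights[k] = {}
        for q in modes:
            targets = np.empty(len(samples))
            for j, x in enumerate(samples):
                z = (quad - mean_next(q, x)) / noise_std
                density = np.exp(-0.5 * z ** 2) / (noise_std * np.sqrt(2 * np.pi))
                # Integral over A of V_hat_{k+1}(q2, .) against the Gaussian kernel.
                integral = {q2: np.sum(v_next[q2] * density) * dx for q2 in modes}
                # Step 1, eq. (4): maximize over the finite set of transition inputs.
                targets[j] = max(
                    sum(mode_prob(q, u, q2) * integral[q2] for q2 in modes)
                    for u in actions)
            # Step 2, eq. (6): least-squares fit of the weights of mode q.
            weights[k][q] = np.linalg.lstsq(phi_samples, targets, rcond=None)[0]
        v_next = {q: phi_quad @ weights[k][q] for q in modes}
    return weights

# Toy single-room thermostat; all numbers are illustrative.
modes = actions = ("ON", "OFF")          # each action names the mode it requests

def mean_next(q, x):
    return x + (0.1 / 30) * (6.0 - x) + (12 / 30 if q == "ON" else 0.0)

def mode_prob(q, u, q2):
    if u == q:                           # command to stay in the current mode
        return 1.0 if q2 == q else 0.0
    return 0.8 if q2 == u else 0.2       # commanded switch succeeds w.p. 0.8

quad = np.linspace(17.5, 22.0, 200)
samples = np.linspace(17.5, 22.0, 60)
centers = np.linspace(17.5, 22.0, 10)
weights = approximate_value_iteration(modes, actions, mean_next, mode_prob,
                                      np.sqrt(1 / 30), quad, samples, centers, 0.5, N=5)
```

Only the weight vectors are stored between iterations (here 10 numbers per mode and time step), and the value function is re-evaluated from them wherever the quadrature needs it, which is the storage advantage discussed above.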
V. CASE STUDY
In this section we refer to a benchmark case study for
hybrid systems described in [10]. The objective is the simul-
taneous temperature regulation in r rooms, where r ≥ 1, by
means of a single heater that can switch between different
rooms. The task consists of designing a (switching) control
strategy that establishes which room should be heated at what
time based on the measurements of the r rooms temperatures,
so as to maintain the temperature of each room within a
prescribed range over a finite time horizon.
We compare the results obtained by the approximate value
iteration algorithm (denoted as AVI) with those obtained
through the numerical solution to the dynamic programming
equations based on state space gridding (denoted as DP),
which have been shown in [1] to be potentially as close to
the actual solutions as desired.
A. Modeling
The system is modeled by a DTSHS, whose discrete
state component q represents which of the r rooms is
being heated, and whose continuous state component x =
(x1,...,xr) represents the uniform temperature in the r
rooms. The discrete state space can then be defined as
$Q = \{\mathrm{ON}_1, \mathrm{ON}_2, \dots, \mathrm{ON}_r, \mathrm{OFF}\}$, where in mode $\mathrm{ON}_i$
room $i$ is heated and in mode OFF no room is heated.
The map $n : Q \to \mathbb{N}$ is the constant map $n(q) = r$, $\forall q \in Q$.
The reset control space is trivial, $\Sigma = \{0\}$. The transition
control space is $U = \{\mathrm{SW}_1, \mathrm{SW}_2, \dots, \mathrm{SW}_r, \mathrm{SW}_{\mathrm{OFF}}\}$, where
$\mathrm{SW}_i$ and $\mathrm{SW}_{\mathrm{OFF}}$ correspond to the command of heating room
$i$ and heating no room, respectively.
The evolution of the temperature $x_i$ in room $i$ is governed
by the following linear stochastic difference equation:

$$x_i(k+1) = x_i(k) + \sum_{j \neq i} a_{ij} \big(x_j(k) - x_i(k)\big) + b_i \big(x_a - x_i(k)\big) + c_i h_i(k) + n_i(k), \tag{7}$$

which is obtained by discretizing, via the constant-step
Euler-Maruyama scheme with discretization step $\Delta$, a set
of continuous time equations, as described in [2]. The term
$x_a$ is the ambient temperature, which is assumed to be fixed.
The constants $b_i$, $a_{ij}$, and $c_i$ are non-negative and represent
the average heat loss rates of room $i$ to the ambient ($b_i$) and
to room $j \neq i$ ($a_{ij}$), and the heat rate supplied by the heater
in room $i$ ($c_i$), all normalized with respect to the average
thermal capacity of room $i$ and rescaled by $\Delta$ (according
to the integration scheme). The term $h_i(k)$ is a Boolean
function equal to 1 if $q(k) = \mathrm{ON}_i$ (i.e., if room $i$ is
heated at time $k$), and equal to 0 otherwise. Furthermore,
the disturbance $\{n_i(k), k = 0, \dots, N-1\}$ affecting the
temperature evolution is a sequence of i.i.d. Gaussian random
variables with zero mean and variance $\nu^2$ proportional to $\Delta$.
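A direct transcription of equation (7) into code may help in checking the role of each term. The function `temperature_step` and its argument names are assumptions made for this sketch; the parameter values in the example are those reported for the two-room case in the numerical results of the paper, with the noise switched off so that the result can be verified by hand.

```python
import numpy as np

def temperature_step(x, heated_room, a, b, c, x_a, nu, rng):
    """One step of equation (7) for all rooms.

    x           : array (r,), current temperatures
    heated_room : index of the heated room, or None in mode OFF
    a           : array (r, r), exchange coefficients a_ij (diagonal unused)
    b, c        : arrays (r,), loss-to-ambient and heater coefficients
    nu          : standard deviation of the Gaussian disturbance n_i(k)
    """
    r = len(x)
    x_next = np.empty(r)
    for i in range(r):
        exchange = sum(a[i, j] * (x[j] - x[i]) for j in range(r) if j != i)
        h_i = 1.0 if heated_room == i else 0.0          # Boolean term h_i(k)
        noise = nu * rng.standard_normal()
        x_next[i] = x[i] + exchange + b[i] * (x_a - x[i]) + c[i] * h_i + noise
    return x_next

a = np.array([[0.0, 0.25 / 30], [0.25 / 30, 0.0]])
b = np.array([0.1 / 30, 0.1 / 30])
c = np.array([12 / 30, 14 / 30])
rng = np.random.default_rng(0)

# Noise-free step from (20, 19) degrees with room 1 (index 0) heated.
x_next = temperature_step(np.array([20.0, 19.0]), 0, a, b, c, x_a=6.0, nu=0.0, rng=rng)
```

Without noise the heated room moves from 20 to 20.345 and the other room from 19 to 18.965, which shows that with these coefficients the heater term dominates the heat exchange and loss terms over a single step.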
Let $\mathcal{N}(\cdot; m, V)$ denote the probability measure over
$(\mathbb{R}^r, \mathcal{B}(\mathbb{R}^r))$ associated with a Gaussian density function
with mean $m$ and covariance matrix $V$. Then, the continuous,
control-independent transition kernel $T$ (implicitly defined
by (7)) can be expressed as follows:

$$T(\cdot \mid (q,x), u) = T(\cdot \mid (q,x)) = \mathcal{N}(\cdot; x + \Xi x + \Gamma(q), \nu^2 I),$$

where $\Xi$ is a square matrix of size $r$, $\Gamma(q)$ is an $r$-dimensional
column vector that depends on $q \in Q$, and $I$ is the identity
matrix of size $r$. For any $i = 1, \dots, r$, the element in row $i$
and column $j$ of matrix $\Xi$ is given by $[\Xi]_{ij} = a_{ij}$, if $j \neq i$,
and $[\Xi]_{ij} = -b_i - \sum_{k \neq i} a_{ik}$, if $j = i$; as for the vector
$\Gamma(q)$, its $i$th component is $[\Gamma(q)]_i = b_i x_a + c_i$, if $q = \mathrm{ON}_i$,
and $[\Gamma(q)]_i = b_i x_a$, if $q \in Q \setminus \{\mathrm{ON}_i\}$.
We assume that whenever a discrete transition occurs, say
from mode $q$ to mode $q'$, the temperature resets according
to the dynamics of mode $q$. This is modeled by defining
the reset kernel $R(\cdot \mid (q,x), \sigma, q') = T(\cdot \mid (q,x))$, $q, q' \in Q$,
$\sigma \in \Sigma$, $x \in \mathbb{R}^r$.
The transition control input affects the discrete state
evolution through the discrete transition kernel $T_q$. In this
case study, discrete transitions are not influenced by the
continuous state component, so that the discrete state evolves
according to a (finite state and finite input) controlled Markov
chain with controlled transition probabilities $T_q : Q \times Q \times
U \to [0,1]$, where $T_q(q' \mid q, u)$ represents the probability that
mode $q'$ is the successor of mode $q$ when the transition
control input $u$ is applied. For ease of notation we set
$T_q(q' \mid q, u) = \alpha_{qq'}(u)$, $q, q' \in Q$.
B. Control
The objective of the case study is to maintain the temperature
of the $r$ rooms within a certain range over a finite time
horizon by heating one room at a time and switching the
heating action between the different rooms. To this purpose,
we devise a Markov policy that decides at each time instant
which room should be heated based on the current value of
the temperature in the $r$ rooms. This control design problem
can be reformulated as a safety problem for the model
introduced above. The ‘safe’ set $A$ is represented by the
desired temperature range for each room.
The value iteration in Theorem 1 can be used to compute a
maximally safe policy $\mu^* = (\mu^*_0, \mu^*_1, \dots, \mu^*_{N-1})$, $\mu^*_k : S \to
U$, $k = 0, 1, \dots, N-1$, and the maximal safety level function
$V^*_0(s)$, $s \in A$, representing the maximal probability $p^{\mu^*}_{s}(A)$
of remaining within the safe set $A$ over the time horizon
$[0,N]$, starting from the initial condition $s \in A$.
C. Numerical results
We present the results for the case of $r = 1$ and $r =
2$ rooms. The temperature is measured in degrees Celsius
and one discrete time unit corresponds to $\Delta = 2$ minutes.
The discrete time horizon is $[0,N]$ with $N = 300$, which
corresponds to 10 hours.
In the single room case, the discrete state space is $Q =
\{\mathrm{ON}, \mathrm{OFF}\}$ and the continuous state space is $\mathbb{R}$. For $r = 2$,
$Q = \{\mathrm{ON}_1, \mathrm{ON}_2, \mathrm{OFF}\}$ and the continuous space is $\mathbb{R}^2$. The
desired temperature interval is $[17.5, 22]$ in both cases. Thus,
the safe set $A$ is given by $A = Q \times A_x$ with $A_x := [17.5, 22]$
if $r = 1$, while $A_x := [17.5, 22] \times [17.5, 22]$ if $r = 2$.
The parameters in equation (7) for the case $r = 2$ are set
equal to: $x_a = 6$, $b_1 = b_2 = 0.1/30$, $a_{12} = a_{21} = 0.25/30$,
$c_1 = 12/30$, $c_2 = 14/30$, and $\nu^2 = 1/30$. In the single room
case ($r = 1$), only $x_a$, $b_1$, $c_1$, and $\nu^2$ should be considered.
The transition control input takes values in $U =
\{\mathrm{SW}_1, \mathrm{SW}_2, \mathrm{SW}_{\mathrm{OFF}}\}$ for $r = 2$, and $U = \{\mathrm{SW}_{\mathrm{ON}}, \mathrm{SW}_{\mathrm{OFF}}\}$ for
$r = 1$. We suppose that when a command to transition
from one mode to another is issued, then the prescribed
switch actually occurs with a probability 0.8, whereas the
complement probability is evenly shared between the case
where the situation remains unchanged (which models a
delay) or (in the $r = 2$ configuration) the case where
a transition to the third, non-recommended mode occurs
(which models a faulty behavior). Instead, when a command
of remaining in the current mode of operation is issued,
this happens with probability 1. These specifications can be
formalized by appropriately defining the transition probabilities
$\{\alpha_{qq'}(u), q, q' \in Q\}$, for any $u \in U$. In the $r = 1$
case, for $u = \mathrm{SW}_{\mathrm{ON}}$, $\alpha_{\mathrm{ON}\,\mathrm{ON}}(\mathrm{SW}_{\mathrm{ON}}) = 1$, $\alpha_{\mathrm{ON}\,\mathrm{OFF}}(\mathrm{SW}_{\mathrm{ON}}) = 0$,
$\alpha_{\mathrm{OFF}\,\mathrm{ON}}(\mathrm{SW}_{\mathrm{ON}}) = 0.8$, and $\alpha_{\mathrm{OFF}\,\mathrm{OFF}}(\mathrm{SW}_{\mathrm{ON}}) = 0.2$. Instead,
in the $r = 2$ case, for $u = \mathrm{SW}_1$, $\alpha_{\mathrm{ON}_1 \mathrm{ON}_1}(\mathrm{SW}_1) = 1$,
$\alpha_{\mathrm{ON}_2 \mathrm{ON}_1}(\mathrm{SW}_1) = 0.8$, $\alpha_{\mathrm{ON}_2 \mathrm{ON}_2}(\mathrm{SW}_1) = 0.1$, $\alpha_{\mathrm{OFF}\,\mathrm{ON}_1}(\mathrm{SW}_1) =
0.8$, and $\alpha_{\mathrm{OFF}\,\mathrm{OFF}}(\mathrm{SW}_1) = 0.1$, the other probabilities
$\alpha_{qq'}(\mathrm{SW}_1)$ being determined by the normalization condition
$\sum_{q' \in Q} \alpha_{qq'}(\mathrm{SW}_1) = 1$, $q \in Q$.
For the DP approximation, we have adopted a uniform
gridding of the continuous domains. More precisely, in the
$r = 1$ case the safe interval $A_x = [17.5, 22]$ of temperatures
is partitioned into 100 subintervals. In the $r = 2$ case, we
have considered three levels of uniform discretization, made
up of respectively 18, 36, and 72 bins. Hence, by “level of
discretization $k$” we mean that each side of the continuous set
$A_x = [17.5, 22] \times [17.5, 22]$ is partitioned into $k$ subintervals,
thus inducing a partition of $A_x$ into $k^2$ cells.
With regard to the AVI approximation described in Section
IV, in both studies we have employed a set $\bar{S}$ of
300 representative points per mode for the training of the
approximating neural network. As we expect the safety level
to be relatively flat in the central region of the temperature
range, while dropping at its boundaries, these points have
been sampled according to a probability distribution that
favors extractions close to the boundary of the safe set. The
integral in equation (4) has been solved numerically using
uniform gridding as in the DP approximation. Generalized
regression neural networks, that is linearly-parameterized
radial basis networks, have been chosen for approximating
the optimal value functions. The incremental gradient descent
method with an adaptive step has been adopted to solve the
least squares minimization problem for training the neural
network on the estimated training data, [4].
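The switching probabilities specified earlier in this subsection for the two-room case can be assembled into one transition matrix per command, with the entries not listed in the text filled in by the stated rule (success with probability 0.8, remainder shared evenly). The names `MODES`, `TARGET`, and `transition_matrix` are hypothetical helpers introduced for this sketch.

```python
import numpy as np

MODES = ["ON1", "ON2", "OFF"]
TARGET = {"SW1": "ON1", "SW2": "ON2", "SWOFF": "OFF"}   # mode requested by each command

def transition_matrix(u, p_switch=0.8):
    """alpha[i, j] = probability that MODES[j] follows MODES[i] under command u."""
    target = TARGET[u]
    alpha = np.zeros((len(MODES), len(MODES)))
    for i, q in enumerate(MODES):
        if q == target:
            alpha[i, i] = 1.0            # command to remain: happens with probability 1
            continue
        rest = (1.0 - p_switch) / 2.0    # shared between delay and faulty switch
        for j, q2 in enumerate(MODES):
            alpha[i, j] = p_switch if q2 == target else rest
    return alpha

alpha_sw1 = transition_matrix("SW1")
```

Each row of `alpha_sw1` sums to one, and the entries match the values quoted in the text for $u = \mathrm{SW}_1$; the remaining entries, such as the probability 0.1 of moving from $\mathrm{ON}_2$ to OFF, follow from the normalization condition.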
Figure 1 compares the maximally safe policy estimates
obtained from the AVI (left plot) and DP (right plot) approximation
schemes. Function $\hat{\mu}^*_k$ at time step $k = 50$ is plotted,
coding $\mathrm{SW}_{\mathrm{ON}}$ with 1 and $\mathrm{SW}_{\mathrm{OFF}}$ with 0. Figure 2 shows some
executions obtained by selecting the initial condition through
uniform sampling over the safe set, and by applying the
maximally safe policy derived with, respectively, the AVI
(left plot) and the DP (right plot) approximation schemes.

[Figure 1: two panels (AVI left, DP right). Horizontal axis: Temperature [°C], from 17.5 to 22. Vertical axis: Maximally Safe Action, {0,1}.]
Fig. 1. Single room case: maximally safe policy at time k = 50 determined
through the AVI (left) and the DP (right) approximation scheme. The blue
line represents $\hat{\mu}^*_k(\mathrm{OFF}, \cdot)$, and the red line represents $1 - \hat{\mu}^*_k(\mathrm{ON}, \cdot)$.

[Figure 2: two panels (AVI left, DP right). Horizontal axis: Time, from 0 to 300. Vertical axis: Temperature [°C], from 14 to 26.]
Fig. 2. Single room case: executions under the maximally safe policy $\hat{\mu}^*$
determined through the AVI (left) and the DP (right) approximation scheme.

As for the case r = 2, Figure 3 plots the maximal safety
levelˆV∗
as a function of x for the initial mode q = OFF, when
the discretization level is 18 and 36. Figure 4 represents the
corresponding estimates of the maximally safe policy at time
k = 0 for the discretization level 18.
effort involved in the AVI approximation is compared in
Table I with that of the DP approximation based on uniform
gridding, for different discretization levels. The average time
required for reachability computations is reported together
with the standard deviation. Computations were performed
0(q,x) obtained via the AVI and DP approximations
The computational
47th IEEE CDC, Cancun, Mexico, Dec. 9-11, 2008ThTA08.1
4022
Page 6
17
18
19
20
21
22
17
18
19
20
21
22
0.7
0.75
0.8
0.85
0.9
0.95
1
Temperature [o C]
Temperature [o C]
Safety Probability, [0,1]
17
18
19
20
21
22
17
18
19
20
21
22
0
0.2
0.4
0.6
0.8
1
Temperature [o C]
Temperature [o C]
Safety Probability, [0,1]
17
18
19
20
21
22
17
18
19
20
21
22
0.7
0.75
0.8
0.85
0.9
0.95
1
Temperature [o C]
Temperature [o C]
Safety Probability, [0,1]
17
18
19
20
21
22
16
18
20
22
0
0.2
0.4
0.6
0.8
1
Temperature [o C]
Temperature [o C]
Safety Probability, [0,1]
Fig. 3. Two rooms case: maximal safety level function determined through
the AVI (left) and the DP (right) approximation scheme, initial mode OFF,
discretization level 18 (top row) and 36 (bottom row).
17.5
18
18.5
19
19.5
20
20.5
21
21.5
22
17.51818.51919.5
Temperature [o C]
2020.52121.522
Temperature [o C]
17.5
18
18.5
19
19.5
20
20.5
21
21.5
22
17.51818.51919.5
Temperature [o C]
2020.52121.522
Temperature [o C]
Fig. 4.
by the AVI (left) and DP (right) approximation scheme, initial mode OFF,
discretization level 18. Red = SWOFF, dark blue = SW1, light blue = SW2.
Two rooms case: maximally safe policy at time k = 0 determined
TABLE I
COMPUTATIONAL PERFORMANCE OF AVI AND DP APPROXIMATIONS
AVI [sec]
µ = 11.47
σ = 0.04
µ = 21.91
σ = 0.46
µ = 26.67
σ = 0.54
µ = 45.07
σ = 2.67
DP [sec]
µ = 6.13
σ = 0.37
µ = 212.05
σ = 3.03
µ = 892.2
σ = 8.01
µ = 6265.3
σ = 16.1
one room
partition in 100 intervals
two rooms
discretization level 18
two rooms
discretization level 36
two rooms
discretization level 72
using MATLAB on a 2 GHz Intel Xeon CPU with 4 GB of memory.
Table I shows that AVI eventually outperforms the DP
numerical approximation. The difference between the two
approaches emerges from how the computation time scales
with the discretization level. While the performances are
comparable when the discretization level is low, the DP
approximation method slows down considerably as soon as a
higher accuracy is required. In the two rooms case, going from
discretization level 18 to level 72, the mean AVI time roughly
doubles (21.91 s to 45.07 s), while the mean DP time grows
about thirtyfold (212.05 s to 6265.3 s).
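The scaling argument can be checked with a few lines of arithmetic on the mean run times of Table I. The sketch below is only an illustration in Python (the paper's computations were run in MATLAB); the names `avi`, `dp`, `growth` and `speedup` are hypothetical helpers introduced here, and only the two rooms rows of the table are used.

```python
# Mean run times in seconds for the two rooms case, copied from Table I.
# Keys are discretization levels.
avi = {18: 21.91, 36: 26.67, 72: 45.07}
dp = {18: 212.05, 36: 892.2, 72: 6265.3}


def growth(times, low, high):
    """Ratio of the mean run time at level `high` to that at level `low`."""
    return times[high] / times[low]


def speedup(level):
    """Ratio of the mean DP run time to the mean AVI run time at one level."""
    return dp[level] / avi[level]


if __name__ == "__main__":
    for level in sorted(avi):
        print(f"level {level}: DP/AVI time ratio = {speedup(level):.1f}")
    print(f"AVI growth from level 18 to 72: {growth(avi, 18, 72):.2f}x")
    print(f"DP growth from level 18 to 72: {growth(dp, 18, 72):.2f}x")
```

The ratios only compare mean values; the reported standard deviations are small relative to the gaps between the two methods.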
VI. CONCLUSIONS
Prompted by the well-known curse of dimensionality arising
in the numerical solution to dynamic programming and inspired
by [9], [12], this paper proposes a neural approximation for
probabilistic reachability computations. An approximate value
iteration algorithm tailored to the probabilistic safety problem
for a controlled stochastic hybrid model is introduced. A
simulation study on a benchmark example suggests that
approximate value iteration can outperform the standard
gridding-based numerical approximation to dynamic programming
when the dimension of the continuous state space is high. On the
other hand, the approximate value iteration approach requires a
certain amount of setup: the choice of the kernels for the
approximating functions, the selection of the number of samples,
and the tuning of the training algorithm. Moreover, unlike for
the numerical dynamic programming approximation [1], it is quite
difficult to analyze the quality of the approximate value
iteration solution, which depends on the adopted class of
approximating functions and on the error propagation through
iterations. In this respect, this work should be seen as a first
step towards the adoption of neural approximation for
probabilistic reachability computations of stochastic hybrid
systems.
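To make the gridding-based baseline concrete, the following Python sketch runs a backward safety recursion on a uniform grid for a one-dimensional toy model. It is not the benchmark or the code used in the paper: the function name `grid_safety_dp`, the two actions, the drift values and the noise level `sigma` are all assumptions chosen for illustration. The recursion multiplies each action's transition matrix by the current value vector and takes the maximum over actions, starting from a value of 1 on the safe set.

```python
import math

import numpy as np


def gaussian_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def grid_safety_dp(n_cells=50, horizon=20, lo=17.0, hi=22.0, sigma=0.25):
    """Grid-based DP for a toy 1-D safety problem (all parameters assumed).

    Returns the cell centers and the approximate maximal probability of
    staying inside [lo, hi] for `horizon` steps, starting from each center.
    """
    edges = np.linspace(lo, hi, n_cells + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Hypothetical per-step temperature drift for each control action.
    drifts = {"off": -0.3, "on": 0.3}
    cdf = np.vectorize(gaussian_cdf)

    kernels = []
    for drift in drifts.values():
        mean = centers + drift
        # kernel[i, j] = Pr(next state in cell j | current state = center i).
        upper = cdf((edges[None, 1:] - mean[:, None]) / sigma)
        lower = cdf((edges[None, :-1] - mean[:, None]) / sigma)
        kernels.append(upper - lower)

    # Terminal condition: value 1 on the safe set (the grid covers only it).
    value = np.ones(n_cells)
    for _ in range(horizon):
        # Probability mass leaving the grid is lost, i.e. counted as unsafe.
        value = np.max([kernel @ value for kernel in kernels], axis=0)
    return centers, value
```

In this sketch each transition matrix has `n_cells` squared entries. With d continuous dimensions gridded at the same resolution, the number of cells would grow as `n_cells` to the power d, which is one way to see why the gridding approach becomes expensive as the dimension or the discretization level increases.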
REFERENCES
[1] A. Abate, S. Amin, M. Prandini, J. Lygeros, and S. Sastry.
Computational approaches to reachability analysis of stochastic hybrid
systems. In A. Bemporad, A. Bicchi, and G. Buttazzo, editors, Hybrid
Systems: Computation and Control, Lecture Notes in Computer Science 4416,
pages 4–17. Springer Verlag, 2007.
[2] A. Abate, M. Prandini, J. Lygeros, and S. Sastry. Probabilistic
reachability and safety for controlled discrete time stochastic hybrid
systems. Automatica, Nov 2008. In press.
[3] D. P. Bertsekas and S. E. Shreve. Stochastic optimal control: the
discrete-time case. Athena Scientific, 1996.
[4] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming.
Athena Scientific, 1996.
[5] H.A.P. Blom and J. Lygeros, editors. Stochastic Hybrid Systems:
Theory and Safety Critical Applications. LNCIS, vol. 337, Springer
Verlag, 2006.
[6] M. L. Bujorianu. Extended stochastic hybrid systems and their
reachability problem. In R. Alur and G. Pappas, editors, Hybrid Systems:
Computation and Control, Lecture Notes in Computer Science 2993,
pages 234–249. Springer Verlag, 2004.
[7] C.G. Cassandras and J. Lygeros, editors. Stochastic hybrid systems.
Automation and Control Engineering Series 24. Taylor & Francis
Group/CRC Press, 2006.
[8] D.P. de Farias and B. Van Roy. On the existence of fixed points for
approximate value iteration and temporal-difference learning. Journal
of Optimization Theory and Applications, 105, 2000.
[9] B. Djeridane and J. Lygeros. Neural approximation of PDE solutions:
An application to reachability computations. In IEEE Conference on
Decision and Control, pages 3034–3039, San Diego, USA, December
2006.
[10] A. Fehnker and F. Ivančić. Benchmarks for hybrid systems
verification. In R. Alur and G.J. Pappas, editors, Hybrid Systems:
Computation and Control, Lecture Notes in Computer Science 2993,
pages 326–341. Springer Verlag, 2004.
[11] X. Koutsoukos and D. Riley. Computational methods for reachability
analysis of stochastic hybrid systems. In J. Hespanha and A. Tiwari,
editors, Hybrid Systems: Computation and Control, Lecture Notes in
Computer Science 3927, pages 377–391. Springer Verlag, 2006.
[12] K. N. Niarchos and J. Lygeros. A Neural Approximation to Continuous
Time Reachability Computations. In IEEE Conference on Decision
and Control, San Diego, USA, December 2006.
[13] S. Prajna, A. Jadbabaie, and G.J. Pappas. A framework for
worst-case and stochastic safety verification using barrier certificates.
IEEE Transactions on Automatic Control, 52(8):1415–1428, Aug 2007.
[14] M. Prandini and J. Hu. Stochastic reachability: Theory and numerical
approximation. In C.G. Cassandras and J. Lygeros, editors, Stochastic
hybrid systems, Automation and Control Engineering Series 24, pages
107–138. Taylor & Francis Group/CRC Press, 2006.