arXiv:1108.3299v1 [cs.SY] 16 Aug 2011
Submitted to Operations Research
Bounding Procedures for Stochastic Dynamic Programs with Application to the Perimeter Patrol Problem
Myoungkuk Park
Department of Mechanical Engineering, Texas A&M University, College Station, TX 77843, robotian@gmail.com
Krishnamoorthy Kalyanam
Infoscitex Corporation, Dayton, OH 45431, krishna.kalyanam@gmail.com
Swaroop Darbha
Department of Mechanical Engineering, Texas A&M University, College Station, TX 77843, dswaroop@tamu.edu
Phil Chandler
Control Design & Analysis Branch, Air Force Research Laboratory, WPAFB, OH 45433, phillip.chandler@wpafb.af.mil
Meir Pachter
Electrical Engineering Department, Air Force Institute of Technology, WPAFB, OH 45433, meir.pachter@afit.edu
One often encounters the curse of dimensionality in the application of dynamic programming to determine
optimal policies for controlled Markov chains. In this paper, we provide a method to construct sub-optimal
policies along with a bound for the deviation of such a policy from the optimum via a linear programming
approach. The state-space is partitioned and the optimal cost-to-go or value function is approximated by a
constant over each partition. By minimizing a non-negative cost function defined on the partitions, one can
construct an approximate value function which also happens to be an upper bound for the optimal value
function of the original Markov Decision Process (MDP). As a key result, we show that this approximate
value function is independent of the non-negative cost function (or state dependent weights as it is referred
to in the literature) and moreover, this is the least upper bound that one can obtain once the partitions
are specified. Furthermore, we show that the restricted system of linear inequalities also embeds a family
of MDPs of lower dimension, one of which can be used to construct a lower bound on the optimal value
function. The construction of the lower bound requires the solution to a combinatorial problem. We apply
the linear programming approach to a perimeter surveillance stochastic optimal control problem and obtain
numerical results that corroborate the efficacy of the proposed methodology.
Key words: Stochastic Dynamic Programs, Linear Programming, State Aggregation
1. Introduction
The Linear Programming (LP) approach to solving dynamic programs (DPs) originated from the
papers: Manne (1960), d’Epenoux (1963), Denardo (1970), Hordijk and Kallenberg (1979). The
basic feature of an LP approach for solving DPs corresponding to maximization of a discounted
payoff is that the optimal solution of the DP (also referred to as the optimal value function) is
the optimal solution of the LP for every non-negative cost function. The constraint set describing
the feasible solution of the LP and the number of independent variables are typically very large
(curse of dimensionality) and hence, obtaining the exact solution of a DP (stochastic or otherwise)
via an LP approach is not practical. Despite this limitation, an LP approach provides a tractable
method for approximate dynamic programming (Mendelssohn 1980, Schweitzer and Seidmann
1985, Trick and Zin 1997) and the advantages of this approach may be summarized as follows:
1. One can restrict the value function to be of a certain parameterized form, thereby reducing
the dimension of the LP to the size of the parameter set to make it tractable.
2. The solution to the LP provides upper bounds for the value function (lower bounds, if minimiz-
ing a discounted cost, as opposed to maximizing discounted payoff, is considered as the optimization
criteria).
The main questions regarding the tractability and quality of approximate DP revolve around
restricting the value function in a suitable way. The questions are: (1) How does one restrict the
value function, i.e., what basis functions should one choose for parameterizing the value function?
(2) Are there any (a posteriori) bounds that one can provide about the value function from the
solution of a restricted LP? If the restrictions imposed on the value function are consistent with the
physics/structure of the problem, one can expect reasonably tight bounds. There is another question
that naturally arises: In the unrestricted case, the optimal solution of the LP is independent of
the choice of the non-negative cost function. While it is unreasonable to expect that the optimal
value function be a feasible solution of the restricted LP, one can ask if the optimal solution of
the restricted LP is the same for every choice of non-negative cost function for the LP. It has been
reported in the literature that this is unfortunately not the case (De Farias and Van Roy 2003).
If the LP is not properly restricted, it can lead to poor approximation and perhaps, even infea-
sibility (Gordon 1999). A common approach is to approximate the value (cost-to-go) function
by a linear functional of a priori chosen basis functions (Schweitzer and Seidmann 1985). This
approach is attractive in that for a certain class of basis functions, feasibility of the approximate (or
restricted) LP is guaranteed (De Farias and Van Roy 2003). A straightforward method for select-
ing the basis functions is through a state aggregation method. Here the state space is partitioned
into disjoint sets or partitions and the approximate value function is restricted to be the same for
all the states in a partition. The number of variables for the LP therefore reduces to the number of
partitions. State aggregation based approximation techniques were originally proposed by Axsäter
(1983), Bean et al. (1987), Mendelssohn (1982). Since then, substantial work has been reported in
the literature on this topic (see Van Roy (2006) and the references therein). In this article, we adopt
the state aggregation method.
Although imposing restrictions on the value function reduces the size of the restricted LP, the
number of constraints does not change. Since the number of constraints is at least of the same
order as the number of states of the DP, one is faced with a restricted LP with a large number
of constraints. An LP with a large number of constraints may be solved if there is an automatic
way to separate a non-optimal solution from an optimal one (Grötschel et al. 1981); otherwise,
one may have to resort to heuristics or settle for an approximate solution. Separation of a non-
optimal solution from an optimal one is easier if one has a compact representation of constraints
(Morrison and Kumar 1999) or if a subset of the constraints that dominate other constraints can
easily be identified from the structure of the problem (Krishnamoorthy et al. 2011b). Heuristic
methods include aggregation of constraints, sub-sampling of constraints (De Farias and Van Roy
2003), constraint generation methods (Grötschel and Holland 1991, Schuurmans and Patrascu
2001) and other approaches (Trick and Zin 1993).
If the solution of the restricted LP is the same for every non-negative cost function of the LP,
then it suggests that the constraint set for the restricted LP embeds the constraint set for the
exact LP corresponding to a reduced order Markov Decision Process (MDP). If one adopts a naive
approach and “aggregates” every state into a separate partition, we obtain the original exact LP
and clearly, for this LP, the solution is independent of the non-negative cost function. It would
seem reasonable to expect that this would generalize to partitions of arbitrary size and in fact, we
prove this to be the case in this article. One can construct a sub-optimal policy from the solution
to the restricted LP by considering the policy that is greedy with respect to the approximate value
function (Porteus 1975). By construction, the expected discounted payoff for the sub-optimal policy
will be a lower bound to the optimal value function and hence, can be used to quantify the quality
of the sub-optimal policy. Also the lower bound will be closer to the optimal value function than
the approximate value function by virtue of the monotonicity property of the Bellman operator.
But the lower bound computation is not efficient since the procedure involved is tantamount to
policy evaluation which involves the solution to a system of linear equations of the same size as the
state-space. In this work, we have developed a novel disjunctive LP, whose solution can be used
to construct a lower bound to the optimal value function. The contributions of our work may be
summarized as follows:
• If one were to adopt a state aggregation approach, then the solution to the restricted LP
is shown to be independent of the non-negative cost function. Moreover, the optimal solution is
dominated by every feasible solution to the restricted LP.
• We also show that considering alternate LP formulations via lifting of variables or by consider-
ing a bigger feasible set via iterated Bellman inequalities (Wang and Boyd 2010) does not improve
upon the upper bound provided by the restricted LP.
• A subset of the constraints of the restricted LP can be used for constructing a lower bound
for the optimal value function. However, this involves solving a disjunctive LP, which may not be
computationally tractable.
• We demonstrate the use of aggregation based restricted LPs for a perimeter surveillance
stochastic control problem. For the application considered here, we show that both the lower
bounding disjunctive LP and the upper bounding restricted LP can be solved efficiently since they
both reduce to exact LPs corresponding to some lower dimensional MDPs.
The rest of the paper is organized as follows: we provide a general overview of stochastic dynamic
programs in section 2 followed by LP preliminaries in section 2.1. In section 3, we introduce the
aggregation method and discuss the restricted LP approach that can be used to approximate the
optimal value function. In the same section, we also present a novel disjunctive LP that can be
used to compute a lower bound to the optimal value function. We introduce the perimeter alert
patrol problem in section 4 and also elaborate on the efficient LP formulations that arise out of
the structure in the problem. We corroborate the structure in the perimeter patrol problem via
numerical results in section 5. Finally, we support the proposed approximation methodology via
simulation results in section 5.1, followed by summary in section 6. Supplementary material and
lengthy proofs, that have been left out of the main body of the paper, for clarity, have been included
in the Appendix.
2. Stochastic Dynamic Programming
Consider a discrete-time Markov decision process (MDP) with a finite state space $S = \{1,2,\dots,|S|\}$.
For each state $x \in S$, there is a finite set of available actions $U_x$. From the current state $x$, taking
action $u \in U_x$ under the random influence $Y$ results in a reward $R_u(x)$. The system follows the
discrete-time dynamics
$$x(t+1) = f(x(t), u(t), Y(t)), \tag{1}$$
where $t$ indicates time. We assume that the random input $Y$ can take only a finite set of values $Y_l$,
$l = 0,\dots,m$, each with an associated probability $p_l$. The state transition probabilities
$P_u(x,y)$ represent, for each pair $(x,y)$ of states and each action $u \in U_x$, the probability that the
next state will be $y$ given that the current state is $x$ and the current action is $u$, i.e.,
$$P_u(x,y) = \begin{cases} 0, & \text{if } y \neq f(x,u,Y_l) \text{ for every } l \in \{0,\dots,m\}, \\ \sum_{j \in C} p_j, & \text{otherwise, where } C = \{\, l \mid y = f(x,u,Y_l) \,\}. \end{cases} \tag{2}$$
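As an illustration of (2), the following minimal sketch assembles $P_u$ from the dynamics $f$ and the disturbance probabilities $p_l$. The toy system (a saturating walk on four states with three disturbance values) and the helper name `transition_matrix` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def transition_matrix(states, f, u, Y, p):
    """Build P_u(x, y) as nested dicts from the dynamics f and disturbance law p.

    Disturbance values that drive x to the same successor y pool their
    probability mass, which is the sum over the index set C in (2).
    """
    P = {}
    for x in states:
        row = defaultdict(float)
        for Y_l, p_l in zip(Y, p):
            row[f(x, u, Y_l)] += p_l
        P[x] = dict(row)  # successors that are never reached stay absent (probability 0)
    return P

# Hypothetical toy system (an assumption for illustration only).
states = range(4)
Y = [0, 1, 2]          # disturbance values Y_0, Y_1, Y_2
p = [0.5, 0.3, 0.2]    # their probabilities p_0, p_1, p_2

def f(x, u, y):
    # Saturating walk: the state moves up by u * y but never beyond 3.
    return min(x + u * y, 3)

P1 = transition_matrix(states, f, 1, Y, p)
```

In this toy system, state 2 under action 1 reaches state 3 through two different disturbance values, so their probabilities are added, exactly as the set $C$ prescribes.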
Any stationary policy $\pi$ specifies, for each state $x \in S$, a control action $u = \pi(x)$. With a slight abuse of notation,
we write the transition probability matrix associated with policy $\pi$ as $P_\pi$, where $P_\pi(x,y) = P_{\pi(x)}(x,y)$.
Similarly, we write the column vector of immediate payoffs associated with the policy
$\pi$ as $R_\pi$, where $R_\pi(x) = R_{\pi(x)}(x)$. We are interested in solving a stochastic control problem,
which amounts to selecting a policy that maximizes the infinite-horizon discounted reward of the
form,
$$V_\pi(x_0) = E\left[\, \sum_{t=0}^{\infty} \lambda^t R_\pi(x(t)) \,\middle|\, x(0) = x_0 \right],$$
where λ ∈ [0,1) is a temporal discount factor. We obtain the optimal policy by solving Bellman’s
equation,
$$V^*(x) = \max_{u \in U_x} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l V^*(f(x,u,Y_l)) \right\}, \quad \forall x \in S, \tag{3}$$
where, V∗(x) is the optimal value function (or optimal discounted payoff) starting from state x.
The optimal policy then is given by,
$$\pi^*(x) = \arg\max_{u \in U_x} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l V^*(f(x,u,Y_l)) \right\}, \quad \forall x \in S. \tag{4}$$
The Bellman equation (3) can be solved using standard DP methods such as value iteration (Bellman
1957) or policy iteration (Howard 1960); however, this is not computationally tractable if the
state space is very large. For this reason, one is interested in tractable
approximate methods that yield suboptimal solutions with some guarantees on the deviation of
the associated approximate value function from the optimal one.
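For a small state space, value iteration on (3) followed by the greedy selection in (4) can be sketched as below. The four-state ring, its rewards, the disturbance law and the function name `value_iteration` are assumptions made for illustration; they are not the perimeter patrol model.

```python
def value_iteration(states, actions, f, R, Y, p, lam, tol=1e-10):
    """Iterate the Bellman operator of (3); return (V, greedy policy as in (4))."""
    V = {x: 0.0 for x in states}
    while True:
        # Q[x][u] = R_u(x) + lam * sum_l p_l * V(f(x, u, Y_l))
        Q = {x: {u: R(x, u) + lam * sum(p_l * V[f(x, u, Y_l)] for Y_l, p_l in zip(Y, p))
                 for u in actions(x)}
             for x in states}
        V_new = {x: max(Q[x].values()) for x in states}
        gap = max(abs(V_new[x] - V[x]) for x in states)
        V = V_new
        if gap < tol:
            policy = {x: max(Q[x], key=Q[x].get) for x in states}
            return V, policy

# Hypothetical toy problem: a ring of four states, reward 1 for being in
# state 0 and a cost of 0.1 for choosing to move (u = 1).
states = range(4)
Y, p, lam = [0, 1], [0.7, 0.3], 0.9

def actions(x):
    return (0, 1)

def f(x, u, y):
    return (x + u + y) % 4

def R(x, u):
    return float(x == 0) - 0.1 * u

V_star, pi_star = value_iteration(states, actions, f, R, Y, p, lam)
```

Each sweep touches every state, action and disturbance value, which is exactly the cost that becomes prohibitive when $|S|$ is large.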
2.1. Linear Programming Approach
In this subsection, we briefly touch upon two lemmas that we will use in the subsequent sections.
Bellman’s equation suggests that the optimal value function satisfies the following set of linear
inequalities, which we will refer to as the Bellman inequalities:
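$$\begin{aligned} V(x) &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V(f(x,u,Y_l)), \quad \forall u \in U_x,\ \forall x \in S \\ \iff\quad V &\ge R_u + \lambda P_u V, \quad \forall u. \end{aligned} \tag{5}$$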
Consider any integer $L \ge 1$ and for $j = 1,2,\dots,L$, let $V_j$ be a vector satisfying a generalization of
the Bellman inequalities, referred to as the iterated Bellman inequalities (Wang and Boyd 2010):
\begin{align}
V_{j+1}(x) &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V_j(f(x,u,Y_l)), \quad \forall x,u,\ \forall j = 1,2,\dots,L-1, \tag{6} \\
V_1(x) &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V_L(f(x,u,Y_l)), \quad \forall x,u. \tag{7}
\end{align}
Clearly, when L = 1, the above system of inequalities collapses to the Bellman inequalities. The
iterated Bellman inequalities may be compactly represented as:
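$$\begin{aligned} V_{j+1} &\ge R_u + \lambda P_u V_j, \quad \forall u,\ j = 1,2,\dots,L-1, \\ V_1 &\ge R_u + \lambda P_u V_L, \quad \forall u. \end{aligned} \tag{8}$$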
We note that the above set of inequalities has cyclic symmetry, i.e., one gets the same set of
inequalities by replacing the vectors $V_1, V_2, \dots, V_L$ by $V_2, V_3, \dots, V_L, V_1$ respectively. Let $\pi$ be any
stationary policy. Then we have,
\begin{align}
V_{j+1} &\ge R_\pi + \lambda P_\pi V_j, \quad j = 1,2,\dots,L-1, \tag{9} \\
V_1 &\ge R_\pi + \lambda P_\pi V_L. \tag{10}
\end{align}
By recursively applying (9) to $V_L, V_{L-1}, \dots$ in (10), we get,
$$\left[I - \lambda^L P_\pi^L\right] V_1 \ge \left[I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1}\right] R_\pi, \quad \forall \pi.$$
By cyclic symmetry, every $V_j$, $j = 2,3,\dots,L$, also satisfies the above inequality.
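For concreteness, the case $L = 2$ can be worked out in two steps. This is only a sketch; it uses nothing beyond the fact that the entries of $\lambda P_\pi$ are non-negative, so multiplying an inequality by $\lambda P_\pi$ preserves it:

$$\begin{aligned} V_1 &\ge R_\pi + \lambda P_\pi V_2 \\ &\ge R_\pi + \lambda P_\pi \left( R_\pi + \lambda P_\pi V_1 \right) = \left[ I + \lambda P_\pi \right] R_\pi + \lambda^2 P_\pi^2 V_1 . \end{aligned}$$

Moving the last term to the left gives $\left[I - \lambda^2 P_\pi^2\right] V_1 \ge \left[I + \lambda P_\pi\right] R_\pi$, which is the inequality above with $L = 2$.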
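Lemma 1. Let the vector $V$ satisfy the following set of inequalities:
$$\left[I - \lambda^L P_\pi^L\right] V \ge \left[I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1}\right] R_\pi, \quad \forall \pi. \tag{11}$$
Then, we have $V \ge V^*$.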
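A short argument for why (11) implies $V \ge V^*$ can be sketched as follows, assuming only that $\lambda \in [0,1)$ and that $P_\pi$ is row-stochastic (the authors' own proof is not reproduced here and may proceed differently). Since $\lambda^L P_\pi^L$ is non-negative with spectral radius at most $\lambda^L < 1$, the inverse of $I - \lambda^L P_\pi^L$ equals the non-negative Neumann series $\sum_{k \ge 0} \lambda^{kL} P_\pi^{kL}$, so multiplying (11) by it preserves the inequality:

$$V \;\ge\; \sum_{k=0}^{\infty} \lambda^{kL} P_\pi^{kL} \sum_{j=0}^{L-1} \lambda^{j} P_\pi^{j} R_\pi \;=\; \sum_{n=0}^{\infty} \lambda^{n} P_\pi^{n} R_\pi \;=\; V_\pi, \quad \forall \pi .$$

Choosing $\pi = \pi^*$ then gives $V \ge V_{\pi^*} = V^*$.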
Remark 1. We readily see that every feasible solution of the system of inequalities (5) or (8)
is lower bounded by the optimal value function $V^*$. By cyclic symmetry, we conclude that every
feasible $V_j$, $j = 1,\dots,L$, is also lower bounded by $V^*$.
The following result relates the optimal value function to the optimal solution of an LP with a
non-negative cost function and constraints of the form given by the Bellman inequalities (5) or
iterated Bellman inequalities (8).
Lemma 2. Let $c$ be a vector of state-dependent weights with $c(x) \ge 0$ for every $x \in S$. Then $V^*$
minimizes the linear functional $c^T V$ among all $V$ satisfying the Bellman inequalities (5). Correspondingly,
the $L$-tuple $(V^*, \dots, V^*)$ minimizes the linear functional $\sum_{j=1}^{L} c^T V_j$ among all $L$-tuples
$(V_1, \dots, V_L)$ satisfying the iterated Bellman inequalities (8).
Proof of Lemma 2. The proof follows from the fact that $V \ge V^*$ and hence, $c^T (V - V^*) \ge 0$.
Since $V^*$ is feasible for the inequalities (8) for any $L \ge 1$, the result follows. Similarly, since the
$L$-tuple $(V^*, \dots, V^*)$ is feasible for (8) and since $V_j \ge V^*$ for $j = 1,2,\dots,L$, it readily follows that
the $L$-tuple is optimal. □
3. Bounds using Partitioning
Let the set of all states $S$ be partitioned into $M$ disjoint sets, $S_i$, $i = 1,\dots,M$. We will call the
set $S_i$ the $i$-th partition. Henceforth, we will use the following notation: if $f(x,u,Y_s)$ represents the
state the system transitions to starting from $x$ and subject to a control input $u$ and a stochastic
disturbance $Y_s$, then $\bar f(x,u,Y_s)$ represents the partition to which the final state belongs. For a given
$u$ and partition index $i$, we define the tuple $z_x^{i,u} = (\bar f(x,u,Y_0), \bar f(x,u,Y_1), \dots, \bar f(x,u,Y_m))$ for every
$x \in S_i$. We denote by $T(i,u)$ the set of all distinct $z_x^{i,u}$ for a given partition index $i$ and control $u$.
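The sets $T(i,u)$ are straightforward to enumerate once the partition map is available. The sketch below uses a hypothetical four-state ring split into two partitions and 0-based partition indices (the text indexes partitions from 1); `partition_tuples` is an illustrative name, not the authors'.

```python
def partition_tuples(partitions, f, u, Y):
    """Return {i: T(i, u)}, the distinct tuples z_x^{i,u} over the states x in S_i."""
    part_of = {x: i for i, S_i in enumerate(partitions) for x in S_i}
    return {
        i: {tuple(part_of[f(x, u, Y_l)] for Y_l in Y) for x in S_i}
        for i, S_i in enumerate(partitions)
    }

# Hypothetical example: states 0..3 on a ring, partitions {0, 1} and {2, 3}.
partitions = [[0, 1], [2, 3]]
Y = [0, 1]

def f(x, u, y):
    return (x + u + y) % 4

T_move = partition_tuples(partitions, f, 1, Y)
```

Here both states of each partition produce different tuples, so the partition-level transition under a given $u$ and $Y_l$ is ambiguous, which is the situation the restricted LP has to cope with.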
3.1. Restricted Linear Program
We have, from Lemma 2, that the optimal solution to the following LP,
$$\text{ELP} := \min\ c^T V, \quad \text{subject to} \quad V \ge R_u + \lambda P_u V, \quad \forall u, \tag{12}$$
referred to as the “exact LP” in the literature, is the optimal value function V∗. Let us start
with restricting the exact LP by requiring further that V (x) = v(i) for all x ∈ Si, i = 1,...,M.
Augmenting these constraints to the exact LP, one gets the following restricted LP.
$$\begin{aligned} \text{RLP} := \min\ & \sum_{i=1}^{M} \sum_{x \in S_i} c(x)\, v(i), \quad \text{subject to} \\ & v(i) \ge R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v(\bar f(x,u,Y_l)), \quad \forall x \in S_i,\ i = 1,\dots,M,\ \forall u. \end{aligned} \tag{13}$$
The restricted LP can also be written in the following compact form:
$$\text{RLP} = \min\ c^T \Phi v, \quad \text{subject to} \quad \Phi v \ge R_u + \lambda P_u \Phi v, \quad \forall u, \tag{14}$$
where the columns of Φ (commonly referred to as “basis functions” in the literature) are given by,
$$\Phi(x,i) = \begin{cases} 1, & \text{if } x \in S_i, \\ 0, & \text{otherwise,} \end{cases} \qquad i = 1,\dots,M. \tag{15}$$
The restricted LP typically deals with a much smaller number of variables, i.e., $M \ll |S|$. An
approximate value function can be constructed from every feasible solution to RLP according to
$V_{up} = \Phi v \Rightarrow V_{up}(x) = v(i)$, $\forall x \in S_i$, $i = 1,\dots,M$. Since the approximate value function satisfies, by
construction, the Bellman inequalities (5), it is automatically an upper bound to $V^*$ by Lemma 1.
So, if $v^*$ is the optimal solution to RLP (13), then clearly, $\Phi v^* \ge V^*$. Now we are ready to address
one of the main results of the paper.
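The lifting $V_{up} = \Phi v$ and the feasibility test behind the upper bound can be sketched in a few lines. The toy MDP, the 0-based partition indices and the helper names `lift` and `satisfies_bellman` are assumptions for illustration; the constant vector $R_{\max}/(1-\lambda)$ is used as a feasible point only because it is easy to verify by hand.

```python
def lift(v, partitions):
    """V_up = Phi v: every state inherits the value of its partition, as in (15)."""
    return {x: v[i] for i, S_i in enumerate(partitions) for x in S_i}

def satisfies_bellman(V, states, actions, f, R, Y, p, lam, slack=1e-9):
    """Check the Bellman inequalities (5) for a candidate V, up to a numerical slack."""
    return all(
        V[x] >= R(x, u) + lam * sum(p_l * V[f(x, u, Y_l)] for Y_l, p_l in zip(Y, p)) - slack
        for x in states for u in actions
    )

# Hypothetical toy MDP: four states on a ring, two partitions.
states, actions = range(4), (0, 1)
partitions = [[0, 1], [2, 3]]
Y, p, lam = [0, 1], [0.7, 0.3], 0.9

def f(x, u, y):
    return (x + u + y) % 4

def R(x, u):
    return float(x == 0) - 0.1 * u

v_feasible = [10.0, 10.0]   # R_max / (1 - lam) with R_max = 1 satisfies (13)
v_too_small = [0.0, 0.0]    # violates (13) at x = 0, u = 0

ok = satisfies_bellman(lift(v_feasible, partitions), states, actions, f, R, Y, p, lam)
bad = satisfies_bellman(lift(v_too_small, partitions), states, actions, f, R, Y, p, lam)
```

Any `v` that passes this check yields, through `lift`, an upper bound on $V^*$; the RLP looks for the smallest such `v`.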
Theorem 1. The optimal solution, v∗, to the RLP is independent of the cost vector c once the
partitions are specified.
Proof of Theorem 1. The main idea behind the proof is the following: The constraints in the
restricted LP (13) do not, in general, correspond to those of a Markov Decision Process (MDP)
because the transition from one partition to another for a given control $u$ and random input $Y_l$ is
not specified unambiguously. This is because different states in the same partition can transition
to different partitions for the same $u$ and $Y_l$. If one were to think of a “random” selector for a state
in a partition, then the specification of $u$, $Y_l$ together with the random selector specifies exactly
which partition the system would transition to next, from the current partition. Let us specify the
probability of picking a state in a partition, corresponding to the random selector, via the optimal
dual variables for RLP. For a given partition index i, the RLP specifies a constraint on v(i) for
each $x \in S_i$ and $u$. Let the dual variable corresponding to this constraint be $\mu_u^i(x) \ge 0$ and the
corresponding optimal dual variable be $\bar\mu_u^i(x)$. With this definition, we can proceed to prove the
result via the following steps:
1. We show that for every partition index $i$, there is a $u$ such that $\bar\mu_u^i(x) > 0$ for some $x \in S_i$.
This is necessary for constructing an MDP of reduced dimension in the next step; otherwise, the
corresponding value of $v(i)$ is not lower bounded.
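2. We define a reduced order MDP on the partitions with immediate reward and transition
probability given by,
$$r_u(i) = \sum_{x \in S_i} h_u^i(x) R_u(x) \quad \text{and} \quad \tilde P_u(i,j) := \begin{cases} \sum_{x \in S_i} h_u^i(x) \sum_{y \in S_j} P_u(x,y), & \text{if } u \in U_i, \\ 0, & \text{otherwise,} \end{cases}$$
where $u \in U_i$ if $\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. We may interpret the term
$$h_u^i(x) = \frac{\bar\mu_u^i(x)}{\sum_{x \in S_i} \bar\mu_u^i(x)}$$
as the probability of picking the state $x$ from the partition $S_i$.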
3. We show that the so-called “surrogate LP” obtained by aggregating the constraints of RLP
via the optimal dual variables,
$$\begin{aligned} \text{SLP}(\bar\mu) := \min\ & \sum_{i=1}^{M} \underbrace{\sum_{x \in S_i} c(x)}_{\bar c(i)}\, v(i), \quad \text{subject to} \\ & v(i) \ge r_u(i) + \lambda \sum_{j=1}^{M} \tilde P_u(i,j)\, v(j), \quad \forall u \in U_i,\ i = 1,\dots,M, \end{aligned} \tag{16}$$
is the exact LP corresponding to the reduced order MDP defined in step 2 above. In essence, for
a given $c$, the optimal value function of the reduced order MDP is the optimal solution of RLP.
We use the properties of surrogate duality (Greenberg and Pierskalla 1970, Glover 1975, 1968) to
demonstrate that $\text{SLP}(\bar\mu) = \text{RLP}$.
4. Finally, to show that the optimal solution to RLP is independent of $c$, we note that the
constraints of $\text{SLP}(\bar\mu)$ are obtained by taking convex combinations of the constraints in RLP.
Hence, any feasible solution to RLP is also feasible for $\text{SLP}(\bar\mu)$. Since every feasible solution of the
exact LP corresponding to an MDP dominates the optimal solution (from Lemma 1), we conclude
that the optimal solutions corresponding to two different cost functions $c_1$ and $c_2$ necessarily
dominate each other and hence, have to be the same. □
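Theorem 1 can be checked numerically on a small example without an LP solver. The sketch below is not the surrogate-duality argument used in the proof; it relies instead on an assumption that is easy to verify: the constraints of (13) read $v \ge T v$ for the monotone contraction $(Tv)(i) = \max_{x \in S_i, u} \{ R_u(x) + \lambda \sum_l p_l v(\bar f(x,u,Y_l)) \}$, so the fixed point of $T$ is the componentwise smallest feasible $v$ and therefore minimizes $\bar c^T v$ for any strictly positive weights. The toy MDP and all names are hypothetical.

```python
LAM = 0.9
STATES, ACTIONS = range(4), (0, 1)
Y, P = (0, 1), (0.7, 0.3)
PARTS = [[0, 1], [2, 3]]                       # 0-based partition indices
PART_OF = {x: i for i, S_i in enumerate(PARTS) for x in S_i}

def f(x, u, y):
    return (x + u + y) % 4

def R(x, u):
    return float(x == 0) - 0.1 * u

def lookahead(x, u, val, index):
    """R_u(x) + lam * sum_l p_l * val[index(f(x, u, Y_l))]."""
    return R(x, u) + LAM * sum(p_l * val[index(f(x, u, y_l))] for y_l, p_l in zip(Y, P))

def fixed_point(operator, size, tol=1e-12):
    """Iterate a contraction from the zero vector until successive iterates agree."""
    v = [0.0] * size
    while True:
        v_next = operator(v)
        if max(abs(a - b) for a, b in zip(v, v_next)) < tol:
            return v_next
        v = v_next

def bellman(V):
    # Exact Bellman operator on the original state space, as in (3).
    return [max(lookahead(x, u, V, lambda y: y) for u in ACTIONS) for x in STATES]

def aggregated(v):
    # Tightest right-hand side of the RLP constraints (13) for each partition.
    return [max(lookahead(x, u, v, PART_OF.get) for x in S_i for u in ACTIONS)
            for S_i in PARTS]

V_star = fixed_point(bellman, len(STATES))
v_up = fixed_point(aggregated, len(PARTS))
```

Because `v_up` is computed without reference to any cost vector, the sketch makes the independence from $c$ visible; lifting it through $\Phi$ gives the least upper bound attainable with this partition.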
We shall now establish the surrogate LP result via the following lemma with the proof provided
in the Appendix.
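Lemma 3. Consider a surrogate LP for the RLP through a set of dual variables, $\mu$, given by:
$$\begin{aligned} \text{SLP}(\mu) := \min\ & \bar c^T v, \quad \text{subject to} \\ & \sum_{x \in S_i} \mu_u^i(x)\, v(i) \ge \sum_{x \in S_i} \mu_u^i(x) \left[ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v(\bar f(x,u,Y_l)) \right], \quad \forall u,\ i = 1,\dots,M. \end{aligned} \tag{17}$$
Then, $\exists\, \bar\mu \ge 0$ such that $\text{SLP}(\bar\mu) = \text{RLP}$, and, for every partition index $i = 1,\dots,M$, $\exists\, u$ such that
$\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. Moreover, the optimal solution $v^*$ to RLP is independent of the cost vector $\bar c$ and
any other feasible solution $v$ to RLP dominates $v^*$.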
Theorem 1 implies that the upper bound for the optimal value function cannot be improved
by changing the cost function from a linear to a non-linear function or by restricting the feasible
set of RLP further since the optimal solution of RLP is dominated by every feasible solution
of RLP. Also $\Phi v^*$ is the least upper bound to the optimal value function $V^*$ since any other
feasible $v$ to RLP satisfies $\Phi v \ge \Phi v^*$. Hence, a refinement of the upper bound must necessarily
involve an enlargement of the feasible set if one wants to stick to an LP formulation, i.e., it should
include the feasible set of (13) and possibly other tighter upper bounds than the optimal solution
of RLP. Lifting of variables is one way to attempt such an enlargement; in this connection, we show in
the following section that neither a general lifted LP nor one obtained by including the iterated
Bellman inequalities in the constraint set improves the upper bound.
Remark 2. If one considers the sub-optimal dual variables, $\mu_u^i(x) = \frac{1}{|S_i|}$, $\forall x \in S_i$, $\forall u$, then solving
the corresponding surrogate dual, $\text{SLP}(\mu)$, to obtain an approximate value function, would result
in the so-called “hard aggregation” method (see Sec. 4 of Bertsekas (2007)).
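With the uniform weights of Remark 2, the reduced order MDP of step 2 can be assembled directly. The sketch below builds $r_u(i)$ and $\tilde P_u(i,j)$ with $h_u^i(x) = 1/|S_i|$ and then solves the reduced MDP by value iteration, which is substituted here for an LP solver on the grounds that $\text{SLP}(\mu)$ is the exact LP of that MDP. The toy data and function names are hypothetical.

```python
def hard_aggregate(partitions, actions, f, R, Y, p):
    """Reduced rewards r[u][i] and transitions P_agg[u][i][j] with h = 1/|S_i|."""
    part_of = {x: i for i, S_i in enumerate(partitions) for x in S_i}
    M = len(partitions)
    r = {u: [0.0] * M for u in actions}
    P_agg = {u: [[0.0] * M for _ in range(M)] for u in actions}
    for u in actions:
        for i, S_i in enumerate(partitions):
            h = 1.0 / len(S_i)                 # uniform selector over the partition
            for x in S_i:
                r[u][i] += h * R(x, u)
                for Y_l, p_l in zip(Y, p):
                    P_agg[u][i][part_of[f(x, u, Y_l)]] += h * p_l
    return r, P_agg

def solve_reduced(r, P_agg, lam, tol=1e-12):
    """Value iteration on the reduced order MDP defined over the partitions."""
    M = len(P_agg[next(iter(P_agg))])
    v = [0.0] * M
    while True:
        v_next = [max(r[u][i] + lam * sum(P_agg[u][i][j] * v[j] for j in range(M))
                      for u in r)
                  for i in range(M)]
        if max(abs(a - b) for a, b in zip(v, v_next)) < tol:
            return v_next
        v = v_next

# Hypothetical toy MDP: four states on a ring, two partitions.
LAM = 0.9
partitions, actions = [[0, 1], [2, 3]], (0, 1)
Y, p = [0, 1], [0.7, 0.3]

def f(x, u, y):
    return (x + u + y) % 4

def R(x, u):
    return float(x == 0) - 0.1 * u

r, P_agg = hard_aggregate(partitions, actions, f, R, Y, p)
v_hard = solve_reduced(r, P_agg, LAM)
```

Unlike the RLP solution, the hard-aggregation value `v_hard` carries no guarantee of being an upper bound on $V^*$, since the uniform weights are not the optimal dual variables.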
Remark 3. When $\mu$ and $\Phi$ are allowed to have arbitrary positive entries satisfying $\sum_{x=1}^{|S|} \mu_u^i(x) = 1$,
$\forall i \in \{1,\dots,M\}$, and $\sum_{j=1}^{M} \Phi(y,j) = 1$, $\forall y \in S$, the method is referred to as “soft aggregation”
(Singh et al. 1995). Unfortunately, in this case, the optimal solution to the restricted LP formulation
(14) has been shown to be dependent on the cost function (De Farias and Van Roy 2003).
3.2. Lifted Restricted Linear Programs
It may appear that we can get tighter upper bounds than those provided by the RLP by considering
either lifted LPs whose feasible set is larger than that of RLP or LPs with a different objective
function. We will show, in this section, that unfortunately this is not the case. In general, one can
construct a lifted LP of the form:
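$$\text{LLP} := \min\ \bar c^T v + d^T z, \quad \text{subject to}$$
\begin{align}
V(x) &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V(f(x,u,Y_l)), \quad \forall x,u, \tag{18} \\
V(x) &= v(i), \quad \forall x \in S_i,\ i = 1,\dots,M, \tag{19} \\
z &\ge 0, \notag
\end{align}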
where $z$ is the additional vector of variables used in lifting so that the feasible set is not empty. Then,
it follows that if $(\tilde v, \tilde z)$ is optimal to LLP, then $\tilde v$ will be a feasible solution to RLP. Consequently,
$\Phi \tilde v \ge \Phi v^*$, where $v^*$ is the optimal solution of the RLP. In other words, one gets no better bound
via lifting if the constraints (18) and (19) are included. One could also use the iterated Bellman
inequalities (8) for constructing a lifted LP of the form:
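$$\text{IB} := \min\ \sum_{j=1}^{L} \bar c^T v_j, \quad \text{subject to}$$
\begin{align}
v_{j+1}(i) &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v_j(\bar f(x,u,Y_l)), \quad \forall x \in S_i,\ \forall i,u,\ j = 1,\dots,L-1, \tag{20} \\
v_1(i) &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v_L(\bar f(x,u,Y_l)), \quad \forall x \in S_i,\ \forall i,u. \tag{21}
\end{align}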
Again, it turns out that the above lifted LP is incapable of providing a better bound, as can be
seen from the following result.
Theorem 2. If $v_{IB} = (v_1, \dots, v_L)$ is a feasible solution to IB, then $v_j \ge v^*$ for $j = 1,\dots,L$, where
$v^*$ is the optimal solution to RLP.
The proof for Theorem 2 follows along the lines of Lemma 3. We will construct a surrogate LP
for the lifted LP (20) with the optimal dual variables of RLP. We immediately recognize that the
inequalities defining the surrogate LP are, in fact, the iterated Bellman inequalities associated with
the reduced order MDP defined in step 2 of the proof of Theorem 1. So, the result follows from
Lemma 2 and Remark 1.
Proof of Theorem 2. Let $\bar\mu$ be the optimal dual variables to RLP (13). From Lemma 3, for
every partition index $i \in \{1,\dots,M\}$, there exists a $u$ such that $\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. For a fixed $i$ and
$u$, we multiply the inequalities (20, 21) associated with a particular $x \in S_i$ with $\bar\mu_u^i(x)$, sum
over all the $x \in S_i$ and, for $u \in U_i$, divide by $\sum_{x \in S_i} \bar\mu_u^i(x)$. Then, we get the following surrogate LP:
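$$\text{SIB} := \min\ \sum_{j=1}^{L} \bar c^T v_j, \quad \text{subject to}$$
\begin{align}
v_{j+1}(i) &\ge r_u(i) + \lambda \sum_{x \in S_i} h_u^i(x) \sum_{l=0}^{m} p_l\, v_j(\bar f(x,u,Y_l)), \quad \forall u \in U_i,\ \forall i,\ j = 1,\dots,L-1, \tag{22} \\
v_1(i) &\ge r_u(i) + \lambda \sum_{x \in S_i} h_u^i(x) \sum_{l=0}^{m} p_l\, v_L(\bar f(x,u,Y_l)), \quad \forall u \in U_i,\ \forall i, \tag{23}
\end{align}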
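where $u \in U_i$ if $\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. As before, the one-step reward function is
$$r_u(i) = \frac{\sum_{x \in S_i} \bar\mu_u^i(x) R_u(x)}{\sum_{x \in S_i} \bar\mu_u^i(x)}, \quad \text{where} \quad h_u^i(x) = \frac{\bar\mu_u^i(x)}{\sum_{x \in S_i} \bar\mu_u^i(x)}, \quad \forall u \in U_i.$$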
By Lemma 2, the optimal solution to SIB is of the form $v^*_{SIB} = (v^*, \dots, v^*)$, where $v^*$ is the optimal
solution to $\text{SLP}(\bar\mu)$ (and by Lemma 3, also the optimal solution to RLP). Since any feasible
solution to IB, $v_{IB} = (v_1, \dots, v_L)$, is also feasible to SIB, it follows, from Lemma 1, that $v_j \ge v^*$
for every $j = 1,\dots,L$. □
So, we conclude that lifting through the use of iterated Bellman inequalities does not help in
finding a tighter upper bound than the RLP optimal solution. Also using any other non-linear
objective function will not improve the upper bound as long as the iterated Bellman inequalities
(20) and (21) are included in the constraints set. In the next section, we focus our attention on the
construction of a lower bound for the optimal value function.
3.3. Lower Bound for the Optimal Value Function
For any candidate approximate value function $\tilde V$, one can construct a sub-optimal “greedy” policy
according to:
$$\tilde\pi(x) = \arg\max_{u} \left\{ R_u(x) + \lambda \sum_{y} P_u(x,y) \tilde V(y) \right\}, \quad \forall x \in S.$$
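Let us define the improvement in value function, $\tilde\alpha(x) := R_{\tilde\pi}(x) + \lambda \sum_{y} P_{\tilde\pi}(x,y) \tilde V(y) - \tilde V(x)$. Note
that there is no improvement, i.e., $\tilde\alpha \equiv 0$, when $\tilde V = V^*$. The expected discounted payoff, $V_{\tilde\pi}$,
corresponding to the suboptimal policy $\tilde\pi$, satisfies the following bound (Porteus 1975):
$$\tilde V(x) + \frac{1}{1-\lambda} \min_{y} \tilde\alpha(y) \le V_{\tilde\pi}(x) \le V^*(x), \quad \forall x \in S.$$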
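The chain of inequalities above can be checked on a small example. In the sketch below the candidate $\tilde V$ is an arbitrary hypothetical vector, the toy MDP is an assumption, and policy evaluation is done by fixed-point iteration rather than by solving the linear system of size $|S|$.

```python
LAM = 0.9
STATES, ACTIONS = range(4), (0, 1)
Y, P = (0, 1), (0.7, 0.3)

def f(x, u, y):
    return (x + u + y) % 4

def R(x, u):
    return float(x == 0) - 0.1 * u

def lookahead(x, u, V):
    """R_u(x) + lam * sum_y P_u(x, y) V(y), written via the disturbance law."""
    return R(x, u) + LAM * sum(p_l * V[f(x, u, y_l)] for y_l, p_l in zip(Y, P))

def iterate(operator, tol=1e-12):
    V = [0.0] * len(STATES)
    while True:
        V_next = operator(V)
        if max(abs(a - b) for a, b in zip(V, V_next)) < tol:
            return V_next
        V = V_next

V_tilde = [6.0, 4.0, 4.0, 5.0]   # hypothetical candidate value function

# Greedy policy with respect to V_tilde and the improvement term alpha.
pi_tilde = [max(ACTIONS, key=lambda u: lookahead(x, u, V_tilde)) for x in STATES]
alpha = [lookahead(x, pi_tilde[x], V_tilde) - V_tilde[x] for x in STATES]
lower = [V_tilde[x] + min(alpha) / (1 - LAM) for x in STATES]

# Policy evaluation for pi_tilde and the optimal value function for comparison.
V_pi = iterate(lambda V: [lookahead(x, pi_tilde[x], V) for x in STATES])
V_star = iterate(lambda V: [max(lookahead(x, u, V) for u in ACTIONS) for x in STATES])
```

The factor $1/(1-\lambda)$ multiplies the worst-case improvement over all states, which is one reason the resulting bound can be loose when $\tilde V$ is poor in even a single state.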
In our experience, the lower bound to the optimal value function provided by $V_{\tilde\pi}$ is very conservative.
Also, computation of $V_{\tilde\pi}$ involves solving a linear system of equations of size $|S|$, which would be
expensive for a large state-space. So, we construct a novel alternate lower bound as follows. Recall
that for each $x \in S_i$, $V^*(x)$ satisfies the Bellman inequality (5):
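$$\begin{aligned} V^*(x) &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l V^*(f(x,u,Y_l)), \quad \forall u, \\ &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l \min_{y \in \bar f(x,u,Y_l)} V^*(y), \quad \forall u. \end{aligned} \tag{24}$$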
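Let $\bar w(i) := \min_{x \in S_i} V^*(x)$, $i = 1,\dots,M$. Then, it follows from (24) that,
$$\bar w(i) \ge \min_{x \in S_i} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, \bar w(\bar f(x,u,Y_l)) \right\}, \quad \forall u,\ i = 1,\dots,M. \tag{25}$$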
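The above set of inequalities motivates the following non-linear program:
$$\begin{aligned} \text{NLP} := \min\ & \bar c^T w, \quad \text{subject to} \\ & w(i) \ge \min_{x \in S_i} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, w(\bar f(x,u,Y_l)) \right\}, \quad \forall u,\ i = 1,\dots,M. \end{aligned} \tag{26}$$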
Let $w^*$ be the optimal solution to NLP. By construction, we see that $\bar w$ is a feasible solution to
the NLP and hence,
$$\bar c^T w^* \le \bar c^T \bar w = \sum_{i=1}^{M} \bar c(i) \min_{x \in S_i} V^*(x).$$
So, by choosing $\bar c(i) = 1$ and $\bar c(j) = 0$ for all $j \neq i$, one can obtain a lower bound to the optimal
value function for all the states in the $i$-th partition. Moreover, if the problem under consideration
exhibits a special structure, one can show that NLP collapses to an LP that can be efficiently
solved. The perimeter patrol problem considered herein exhibits such a structure; we demonstrate
this in the next section.
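On a toy problem where the minimization over each $S_i$ can simply be enumerated, the lower bound can be illustrated without a disjunctive LP solver. The sketch below is a substitute technique, not the authors' formulation: it assumes that the constraints (26) read $w \ge T w$ for the monotone contraction $(Tw)(i) = \max_u \min_{x \in S_i} \{ R_u(x) + \lambda \sum_l p_l w(\bar f(x,u,Y_l)) \}$, so that the fixed point of $T$ is the componentwise smallest feasible $w$ and, since $\bar w$ is feasible, lies below $\min_{x \in S_i} V^*(x)$. The MDP and all names are hypothetical.

```python
LAM = 0.9
STATES, ACTIONS = range(4), (0, 1)
Y, P = (0, 1), (0.7, 0.3)
PARTS = [[0, 1], [2, 3]]                       # 0-based partition indices
PART_OF = {x: i for i, S_i in enumerate(PARTS) for x in S_i}

def f(x, u, y):
    return (x + u + y) % 4

def R(x, u):
    return float(x == 0) - 0.1 * u

def lookahead(x, u, val, index):
    """R_u(x) + lam * sum_l p_l * val[index(f(x, u, Y_l))]."""
    return R(x, u) + LAM * sum(p_l * val[index(f(x, u, y_l))] for y_l, p_l in zip(Y, P))

def fixed_point(operator, size, tol=1e-12):
    v = [0.0] * size
    while True:
        v_next = operator(v)
        if max(abs(a - b) for a, b in zip(v, v_next)) < tol:
            return v_next
        v = v_next

def bellman(V):
    # Exact Bellman operator on the original state space, as in (3).
    return [max(lookahead(x, u, V, lambda y: y) for u in ACTIONS) for x in STATES]

def max_min(w):
    # For each partition: worst state for every action, then the best action.
    return [max(min(lookahead(x, u, w, PART_OF.get) for x in S_i) for u in ACTIONS)
            for S_i in PARTS]

V_star = fixed_point(bellman, len(STATES))
w_low = fixed_point(max_min, len(PARTS))
```

Each sweep of `max_min` still visits every state, so this enumeration is only practical when the distinct right-hand sides within a partition are few, which is the kind of structure exploited for the perimeter patrol problem.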
Remark 4. The NLP is referred to as a disjunctive linear program (Balas 1979) and the optimal
solution to NLP is the solution that minimizes the same linear objective function over the convex
hull of the feasible solutions of NLP. Balas (1998) provides two methods to solve the problem: one
through a lifted representation for the convex hull of the feasible set of NLP and the other through
a cutting plane technique. The number of lifted variables is $O(M^2 |U|)$; if $M = 10{,}000$, then
one must deal with a lifted LP with on the order of 100 million variables. The original (non-aggregated) LP has
about 10 million variables and hence, the lifted representation method is not practical. For this
reason, the cutting plane technique is the more viable alternative.
Remark 5. The lower bound provided by NLP is a non-trivial one because the optimal solution
is the optimal value function of a reduced order MDP. Hence, the lower bound will be better
than at least the value function associated with some suboptimal policy and so, is non-trivial and
non-conservative.
Remark 6. While $S_i$ may have a lot of states, the number of entries on the right hand side of the
non-linear constraint (26) over which the minimization must be carried out is the cardinality of
$T(i,u)$. NLP is combinatorial in nature, in the sense that one must pick one $(m+1)$-tuple for each
i and u over which the optimization must be carried out. However, for each (m+1) tuple picked,
one obtains an MDP. So, the system of inequalities (26) describes a family of underlying MDPs.
4. Perimeter Patrol Problem
Figure 1. Perimeter patrol scenario with UAV loitering at alert station.
The perimeter patrol problem arose from the Cooperative Operations in Urban Terrain
(COUNTER) project at AFRL (Gross et al. 2006). In this problem, there is a perimeter which must
be monitored by a collection of UAVs (we will consider only one UAV here). Along the perimeter,
there are m alert stations equipped with Unattended Ground Sensors (UGSs) which detect intru-
sions or incursions into the perimeter. For the sake of simplicity, we assume that incursions into the
perimeter can only occur at the stations. An incursion could be a nuisance (false alarm) or a real
threat. The UGS raise an alarm or an alert whenever there is an incursion. The camera equipped
UAV responds to an alert by flying to the alert site and loitering there, while a remotely located
operator steers the gimballed camera looking for the source of the alarm. Here the operator serves
the role of a classifier or a sensor, i.e., the operator must determine, from the video information,
whether the intrusion is a nuisance or a threat. For details on the perimeter alert patrol problem
and the variants thereof, we refer the reader to the authors’ prior work (Chandler et al. 2009,
Darbha et al. 2010, Krishnamoorthy et al. 2011b,a). Figure 1 shows a typical scenario, where there
are 4 alert stations with the UAV at a station (location 0) with an alert. The decision problem
we solve is the following: Given that the arrival process of the alerts is Poisson with known arrival
rate, what is the optimal time a UAV should spend at a station before resuming its patrol? We
associate an information gain with a UAV loitering and servicing an alert and we model this gain
as a monotonically increasing function of the loiter/dwell time d.
4.1. Problem Statement
The patrolled perimeter is a simple closed curve with N(≥ m) nodes which are (spatially) uniformly
separated, of which m correspond to the alert stations. Let the m distinct station locations be
elements of the set Ω⊂ {0,...,N −1}. A typical scenario shown in Figure 1 has 15 nodes, of which,
nodes {0,3,7,11} correspond to the UGS. Here, station locations 3, 7 and 11 have no alerts, and
station location 0 has an alert being serviced by the loitering UAV. At time instant t, let ℓ(t) be
the position of the UAV on the perimeter (ℓ ∈ {0,...,N −1}), d(t) be the dwell time (number of
loiters completed if at an alert site) and τj(t) be the delay in servicing an alert at location j ∈ Ω.
Let yj(t) be a binary, but random, variable indicating the arrival of an alert at location j ∈ Ω.
We will assume that the statistics associated with the random variable yj(t) are known and that
yj;j ∈ Ω are independent. We model the arrival of alerts as follows: There is a single queue with
a Poisson arrival stream of alerts at a rate of α alerts per unit time. After an alert is queued up,
we assume it shows up arbitrarily at any one of the m stations (assuming choice of station is a
uniformly distributed random variable). For this reason, only one alert can arrive at one of the m
stations at any instant of time. Hence, there are m+1 possibilities for the value of the vector of
alerts y(t)=[y1(t)y2(t) ···ym(t)], with the first one being that there is no alert at any station and
the other m correspond to an alert at each of the m stations. The control decisions are indicated
by the variable u. If u=1, then the UAV continues in the same direction as before; if u=−1, then
the UAV reverses its direction of travel and if u =0, the UAV dwells at the current alert station.
We will assume that the UAV advances by one node in unit time if u ≠ 0, and that the
time to complete one loiter is also one unit of time. We denote the UAV’s direction of travel by ω,
where ω =1 and ω =−1 indicate the clockwise and counter-clockwise directions respectively. One
may write the state update equations for the system as follows:
\begin{align}
\ell(t+1) &= [\ell(t) + \omega(t)\,u(t)] \bmod N, \notag\\
\omega(t+1) &= \omega(t)\,u(t) + \delta(u(t)), \notag\\
d(t+1) &= (d(t)+1)\,\delta(u(t)), \tag{27}\\
\tau_j(t+1) &= (\tau_j(t)+1)\,\{1 - \delta(\ell(t)-j)\,\delta(u(t))\}\,\max\{\sigma(\tau_j(t)),\, y_j(t)\}, \quad \forall j \in \Omega, \notag
\end{align}
where δ is the Kronecker delta function and σ(·) = 1 −δ(·). We denote the status of the alert at
station location j ∈Ω at time t by Aj(t), i.e.,
\[
A_j(t) = \begin{cases} 0, & \text{if } \tau_j(t) = 0,\\ 1, & \text{otherwise,} \end{cases} \quad \forall j \in \Omega. \tag{28}
\]
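As a rough illustration, the update equations (27) and the alert status (28) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `kron_delta`, `step` and `alert_status`, and the choice of storing the delays and alerts in dictionaries keyed by station location, are introduced here only for illustration. The sketch neither enforces the admissibility constraints on u nor caps the delays.

```python
def kron_delta(z):
    """Kronecker delta: 1 if z == 0, else 0."""
    return 1 if z == 0 else 0


def step(loc, omega, dwell, tau, u, y, n_nodes):
    """Apply the update equations (27) once.

    tau and y are dicts keyed by station location j (hypothetical layout):
    tau[j] is the current service delay, y[j] is 1 if an alert arrives at j.
    """
    new_loc = (loc + omega * u) % n_nodes
    new_omega = omega * u + kron_delta(u)
    new_dwell = (dwell + 1) * kron_delta(u)
    new_tau = {}
    for j, t in tau.items():
        serviced = kron_delta(loc - j) * kron_delta(u)  # UAV loiters at j
        sigma = 1 - kron_delta(t)                       # 1 if an alert is pending
        new_tau[j] = (t + 1) * (1 - serviced) * max(sigma, y[j])
    return new_loc, new_omega, new_dwell, new_tau


def alert_status(tau):
    """Alert status (28): 0 if the delay is zero, 1 otherwise."""
    return {j: 0 if t == 0 else 1 for j, t in tau.items()}


# Example: the UAV loiters at station 0 (u = 0) while a new alert arrives at station 3.
tau = {0: 3, 3: 0, 7: 2, 11: 0}
y = {0: 0, 3: 1, 7: 0, 11: 0}
next_state = step(0, 1, 0, tau, 0, y, 15)
# next_state == (0, 1, 1, {0: 0, 3: 1, 7: 3, 11: 0})
```

In the example, the delay at the serviced station is reset to zero, the pending alert at station 7 ages by one step, and the new alert at station 3 starts with a delay of one.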
Also, we have the constraints: u(t)=0 only if ℓ(t)∈Ω and d(t)≤ D. If d(t)=D, then u(t)≠0, i.e.,
the UAV is forced to leave the station if it has already completed the maximum (allowed) number
of dwell orbits. Combining the different components in (27), we express the evolution equations
compactly as:
x(t+1)=f(x(t),u(t),y(t)),
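where $x(t)$ is the system state at time $t$ with components $\ell(t)$, $\omega(t)$, $d(t)$ and $\tau_j(t)$, $\forall j \in \Omega$. Let us
denote the $m+1$ possible values that $y(t)$ can take by the row vectors $Y_l$, $l = 0,\dots,m$, where
\[
Y_0 = [\,0\ \ 0\ \cdots\ 0\,], \quad Y_1 = [\,1\ \ 0\ \cdots\ 0\,], \quad \ldots, \quad Y_m = [\,0\ \cdots\ 0\ \ 1\,]. \tag{29}
\]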
Given a Poisson arrival stream of alerts at the rate of α alerts per unit time, the probability that
there is no alert in a unit time interval is $p = e^{-\alpha}$ and hence, the probability that $y(t)$ takes any one
of the $m+1$ possible values in (29) is given by
\[
p_l := \mathrm{Prob}\{y(t) = Y_l\} = \begin{cases} p, & l = 0,\\ \dfrac{1-p}{m}, & l = 1,\dots,m. \end{cases} \tag{30}
\]
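The probabilities in (30) are simple to tabulate. The following minimal sketch computes them and draws one alert index from the resulting distribution; the helper names `alert_probabilities` and `sample_alert_index` are hypothetical, and the values α = 1/60 and m = 4 are those of the numerical example considered later in the paper.

```python
import math
import random


def alert_probabilities(alpha, m):
    """Return [p_0, p_1, ..., p_m] as defined in (30)."""
    p = math.exp(-alpha)              # probability of no alert in unit time
    return [p] + [(1.0 - p) / m] * m  # an alert is equally likely at each station


def sample_alert_index(probs, rng):
    """Draw l in {0, ..., m}; l = 0 means no alert, l >= 1 selects Y_l."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = alert_probabilities(1.0 / 60.0, 4)
l = sample_alert_index(probs, random.Random(0))
```

By construction the m+1 probabilities sum to one, since at most one alert can arrive per time step.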
To be consistent with the notation introduced earlier (in Sec 2), we shall use S to denote the set
of all system states and use x ∈ {1,...,|S|} to denote a particular state. Our objective is to find
a suitable policy that simultaneously minimizes the service delay and maximizes the information
gained upon loitering. The information gain, I, which is based on an operator error model (see
Appendix EC.1), is plotted as a function of dwell time in Figure 2. We model the one-step
payoff/reward function as follows:
\[
R_u(x) = [I(d_x+1) - I(d_x)]\,\delta(u) - \rho\,\min\{\bar{\tau}_x, \Gamma\}, \quad x = 1,\dots,|S|, \tag{31}
\]
Figure 2. Value of information gained versus dwell time (x-axis: dwell time d, from 0 to 6; y-axis: information gain I, from 0 to 0.03).
where $d_x$ is the dwell associated with state $x$ and $\bar{\tau}_x = \max_{j\in\Omega}\tau_{j,x}$ is the worst service delay (among
all stations) associated with state $x$. The parameter $\Gamma$ ($\gg 0$) is a judiciously chosen maximum
penalty. The positive parameter $\rho$ is a constant weighing the incremental information gained upon
loitering once more at the current location against the delay in servicing alerts at other stations.
From the state definition, we can compute the total number of states in the MDP to be
\[
|S| = 2 \times N \times (\Gamma+1)^m + D \times m \times (\Gamma+1)^{m-1}, \tag{32}
\]
where the factor 2 comes from the UAV being bi-directional. For the loiter states, directionality
is irrelevant and hence, when $d \geq 1$, we reset $\omega$ to be 1. Note that, in view of the reward function
definition (31), we do not keep track of delays beyond $\Gamma$; hence the state-space $S$ only includes
states $x$ with $\tau_i \leq \Gamma$, $\forall i \in \Omega$, and so is finite. We immediately see that the problem size is an $m$th-order
polynomial in $\Gamma$, and hence solving for the optimal value function and policy using exact dynamic
programming (DP) methods is intractable for practical values of $\Gamma$ and $m$. Hence, we
employ the restricted LP approach developed earlier to compute approximate value functions, from
which we compute the corresponding greedy sub-optimal policy. In the next section, we exploit the
structure in the perimeter patrol problem to simplify the RLP and NLP formulations and show
that both collapse to exact LPs corresponding to MDPs defined on the M partitions.
4.2. Structure associated with the Perimeter Patrol Problem
In the perimeter patrol problem considered herein, we see that, by definition (31), the reward
function Ru(x) is bounded. Consequently the optimal value function is bounded. To explain the
inherent structure in the reward, consider a station where an alert is being serviced by a UAV.
The information gained by the UAV about the alert is only a function of the service delay at the
station and the amount of time the UAV dwells at the station servicing the alert. There is a natural
partitioning of states: no matter what the delays are at the other stations, the reward is the
same, as long as the maximum delay and the dwell time of the UAV at the station are the same. So,
we aggregate all the states which have the same values for $\ell$, $\omega$, $d$, $A_j$, $\forall j \in \Omega$, and $\bar{\tau} = \max_{j\in\Omega}\tau_j$
into one partition. As a result of aggregation, the number of partitions can be shown to be
\[
M = 2 \times N + 2 \times N \times (2^m - 1) \times \Gamma + m \times D + m \times D \times (2^{m-1} - 1) \times \Gamma, \tag{33}
\]
which is linear in Γ and hence considerably smaller than the total number of states (32).
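To make the size reduction concrete, the counts (32) and (33) can be evaluated directly. The sketch below uses hypothetical function names (`num_states`, `num_partitions`) and the parameter values of the numerical example in Section 5 (N = 15, m = 4, Γ = 15, D = 5), for which the paper reports |S| = 2,048,000 and M = 8900.

```python
def num_states(n_nodes, m, gamma, d_max):
    """Total number of MDP states, equation (32)."""
    return 2 * n_nodes * (gamma + 1) ** m + d_max * m * (gamma + 1) ** (m - 1)


def num_partitions(n_nodes, m, gamma, d_max):
    """Number of partitions after aggregation, equation (33)."""
    return (2 * n_nodes
            + 2 * n_nodes * (2 ** m - 1) * gamma
            + m * d_max
            + m * d_max * (2 ** (m - 1) - 1) * gamma)


total_states = num_states(15, 4, 15, 5)          # 2048000
total_partitions = num_partitions(15, 4, 15, 5)  # 8900
```

For these values the aggregated problem has roughly 230 times fewer unknowns than the original one, and the gap widens as Γ or m grows because (33) is linear in Γ while (32) is of order m.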
We introduce the following notation, which will be used hereafter. Let $\ell_x$, $d_x$, $\omega_x$, $\tau_{j,x}$ and $A_{j,x}$
represent, respectively, the location, dwell, direction of the UAV's motion, and the service delay and
alert status at station location $j \in \Omega$ corresponding to some state $x \in \{1,\dots,|S|\}$. Also, we will
use $\ell(i)$, $d(i)$, $\omega(i)$, $\bar{\tau}(i)$ and $A_j(i)$ to denote the location, dwell, direction, maximum delay, and the
alert status at station location $j \in \Omega$ that correspond to some partition index $i \in \{1,\dots,M\}$. We
will also denote by $x(t; x_0, u_t, y_t)$ the state at time $t > 0$ if the initial state at $t = 0$ is $x_0$ and the
system is driven by the sequence of inputs $u_t = \{u(0), u(1), \dots, u(t-1)\}$ and disturbances $y_t = \{y(0), y(1), \dots, y(t-1)\}$.
We also introduce a partial ordering of the states according to: $x \geq y$ iff $\ell_x = \ell_y$, $d_x = d_y$, $\omega_x = \omega_y$
and $\tau_{j,x} \geq \tau_{j,y}$, $\forall j \in \Omega$. By the same token, we also partially order partitions: $S_i \geq S_j$ iff for every
$z \in S_j$, there exists an $x \in S_i$ such that $x \geq z$. Recall that $\mathcal{T}(i,u)$ is the set of all distinct $(m+1)$-tuples
of partition indices that the system can transition to from partition $S_i$ under control action
$u$. For the sake of notational simplicity, we denote the $l$th component of any tuple $k \in \mathcal{T}(i,u)$ by
$k_{l-1}$ and the cardinality of the set $\mathcal{T}(i,u)$ by $|\mathcal{T}(i,u)|$. Also, we define the partitions to be of two
types: a partition $S_i$ is of type 1, and we write $i \in P_1$, if $\ell(i) \in \Omega$, $d(i) = 0$, $A_{\ell(i)}(i) = 1$, and $A_j(i) = 1$
for some $j \in \Omega$, $j \neq \ell(i)$, i.e., the UAV is at a station with an alert, the dwell time is zero and
there is also an alert at some other station. Otherwise it is of type 2 and we write $i \in P_2$. Given this
definition, we have the following important result, which we will make use of in the remainder of
the paper.
Lemma 4. The cardinality of $\mathcal{T}(i,u)$ is given by:
\[
|\mathcal{T}(i,u)| = \begin{cases} \bar{\tau}(i), & i \in P_1 \text{ and } u = 0,\\ 1, & \text{otherwise.} \end{cases}
\]
Proof of Lemma 4. First we consider a partition index $i$ of type 1 and control input $u = 0$.
Since the UAV has decided to loiter at the current station, i.e., $\ell(i) \in \Omega$, the service delay at
that station, $\tau_{\ell(i)}$, will be reset to zero in the next time step. Hence the future state (and partition)
maximum delay will be determined by the highest of the service delays, say $\tilde{\tau}$, among
the other stations with alerts (at least one such station exists since partition $i$ is of type 1). So
$\forall j \in \{1,\dots,\bar{\tau}(i)\}$, $\exists\, x_j \in S_i$ such that $\tilde{\tau}_{x_j} = j$. The corresponding tuple of future partition indices
$z^{i,0}_{x_j} = (\bar{f}(x_j,u,Y_0), \bar{f}(x_j,u,Y_1), \dots, \bar{f}(x_j,u,Y_m))$ will have maximum delay $j+1$ and so
$\mathcal{T}(i,0) = \bigcup_{j=1}^{\bar{\tau}(i)} \{z^{i,0}_{x_j}\} \Rightarrow |\mathcal{T}(i,0)| = \bar{\tau}(i)$. For all other control choices, $u \neq 0$, all the states $x \in S_i$ will
transition to future states with the same maximum delay $\bar{\tau}(i)+1$. So, for $u \neq 0$, $\mathcal{T}(i,u)$ is a singleton
set and hence $|\mathcal{T}(i,u)| = 1$. For partition indices $j$ of type 2 with $\bar{\tau}(j) > 0$, all the states $x \in S_j$
will transition to future states with the same maximum delay $\bar{\tau}(j)+1$ and so $|\mathcal{T}(j,u)| = 1$, $\forall u$. If
$\bar{\tau}(j) = 0$, then the partition $S_j$ is a singleton set as per the aggregation scheme (see Sec 4.2) and
hence $|\mathcal{T}(j,u)| = 1$, $\forall u$. $\square$
Theorem 3. For the perimeter patrol problem, the NLP (26) reduces to the following LP:
\[
\text{LBLP} := \min\ \bar{c}^T w, \quad \text{subject to} \quad
w(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, w(k_l), \quad \forall u,\ i = 1,\dots,M, \tag{34}
\]
where the tuple $k \in \mathcal{T}(i,u)$ if $|\mathcal{T}(i,u)| = 1$; else $k = k^*$, where $k^* \in \mathcal{T}(i,u)$ is the tuple of partition
indices such that $\bar{\tau}(k^*_l) = \bar{\tau}(i)+1$, $l = 0,\dots,m$. Furthermore, the optimal solution $w^*$ is dominated
by every feasible $w$ for the NLP and, in particular, it is a lower bound to the optimal value function,
i.e., for all $i = 1,\dots,M$, one has $w^*(i) \leq \min_{x \in S_i} V^*(x)$.
Before proceeding further, we make two key claims that are essential for the proof of Theorem 3.
The justification for the claims is provided in the Appendix.
Claim 1. If $x_1 \geq x_2$, then for the same sequence of inputs $u_t$ and disturbances $y_t$, the system state
evolves in such a way that $x(t; x_1, u_t, y_t) \geq x(t; x_2, u_t, y_t)$ for every $t > 0$.
Claim 2. If $x_1 \geq x_2$, then $V^*(x_1) \leq V^*(x_2)$. Furthermore, if $S_i \geq S_j$, then $\min_{x \in S_i} V^*(x) \leq \min_{z \in S_j} V^*(z)$.
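The one-step version of Claim 1 can be spot-checked numerically by enumerating a small instance. The sketch below is not a proof (the justification is in the Appendix) and is not taken from the paper: the instance (4 nodes, stations at nodes 0 and 2, delays up to 3) and the names `step`, `dominates` and `check_one_step_monotonicity` are hypothetical. It applies the update equations (27) to every ordered pair of states with zero dwell and verifies that the partial order is preserved.

```python
from itertools import product


def step(state, u, y, stations, n_nodes):
    """Update equations (27); state = (loc, omega, dwell, tau) with tau a tuple."""
    loc, omega, dwell, tau = state
    stay = 1 if u == 0 else 0
    new_tau = tuple(
        (t + 1) * (1 - (stay if loc == j else 0)) * max(0 if t == 0 else 1, yj)
        for j, t, yj in zip(stations, tau, y)
    )
    return ((loc + omega * u) % n_nodes, omega * u + stay, (dwell + 1) * stay, new_tau)


def dominates(x, z):
    """x >= z: same location, direction and dwell; delays componentwise >=."""
    return x[:3] == z[:3] and all(a >= b for a, b in zip(x[3], z[3]))


def check_one_step_monotonicity(stations, n_nodes, max_delay):
    """Return True if x1 >= x2 implies f(x1,u,y) >= f(x2,u,y) on the whole grid."""
    m = len(stations)
    # The m+1 alert vectors Y_0, ..., Y_m: at most one alert per time step.
    alerts = [tuple(0 for _ in range(m))]
    alerts += [tuple(1 if k == l else 0 for k in range(m)) for l in range(m)]
    delays = list(product(range(max_delay + 1), repeat=m))
    # u = 0 away from a station is inadmissible but harmless for this check.
    for loc, omega, u, y in product(range(n_nodes), (1, -1), (-1, 0, 1), alerts):
        for t1, t2 in product(delays, repeat=2):
            x1, x2 = (loc, omega, 0, t1), (loc, omega, 0, t2)
            if not dominates(x1, x2):
                continue
            if not dominates(step(x1, u, y, stations, n_nodes),
                             step(x2, u, y, stations, n_nodes)):
                return False
    return True


holds = check_one_step_monotonicity(stations=(0, 2), n_nodes=4, max_delay=3)
```

Since location, direction and dwell evolve identically for ordered states, the check amounts to verifying that each delay update in (27) is monotone in the current delay; the statement for every t > 0 then follows by induction on t.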
Proof of Theorem 3. Recall the non-linear constraints (25) satisfied by $\bar{w}(i) := \min_{x \in S_i} V^*(x)$
that motivated the NLP formulation:
\[
\bar{w}(i) \geq \min_{x \in S_i} \left[ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, \bar{w}(\bar{f}(x,u,Y_l)) \right], \quad \forall u,\ i = 1,\dots,M, \tag{35}
\]
which, given the definition of $\mathcal{T}(i,u)$, can be written in the following equivalent form:
\[
\bar{w}(i) \geq r_u(i) + \lambda \min_{k \in \mathcal{T}(i,u)} \sum_{l=0}^{m} p_l\, \bar{w}(k_l), \quad \forall u,\ i = 1,\dots,M, \tag{36}
\]
where $r_u(i)$ is the reward associated with partition index $i$ and, given the partitioning scheme,
satisfies $R_u(x) = r_u(i)$, $\forall x \in S_i$. Given the structure in the perimeter patrol problem, we will show
that (36) collapses to a single linear inequality constraint for every partition index $i$
and control $u$. Let us focus our attention on a partition index $i$ of type 1 and control action $u = 0$.
For this choice, the cardinality of $\mathcal{T}(i,0)$ is $\bar{\tau}(i)$ as per Lemma 4. Indeed, $\exists\, \bar{x} \in S_i$ such that the
corresponding tuple of future partition indices $k^* = (\bar{f}(\bar{x},0,Y_0), \bar{f}(\bar{x},0,Y_1), \dots, \bar{f}(\bar{x},0,Y_m))$ has the
highest possible maximum delay, i.e., $\bar{\tau}(k^*_l) = \bar{\tau}(i)+1$, $l = 0,\dots,m$. Since $k^*_l \geq k_l$, $l = 0,\dots,m$, $\forall k \in \mathcal{T}(i,u)$,
we have from Claim 2 that $\bar{w}(k^*_l) \leq \bar{w}(k_l)$, $l = 0,\dots,m$, $\forall k \in \mathcal{T}(i,u)$. So, the minimum in (36) is
attained at $k^*$, and the non-linear inequality corresponding to partition index $i \in P_1$ and control $u = 0$ becomes:
\[
\bar{w}(i) \geq r_0(i) + \lambda \sum_{l=0}^{m} p_l\, \bar{w}(k^*_l). \tag{37}
\]
If $u \neq 0$, then $|\mathcal{T}(i,u)| = 1$. So there exists exactly one tuple $k$ in $\mathcal{T}(i,u)$ and hence, the non-linear
constraint (36) reduces to the linear inequality:
\[
\bar{w}(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, \bar{w}(k_l). \tag{38}
\]
For partition indices j of type 2, |T (j,u)| = 1, ∀u. So, as before, the non-linear inequality (36)
collapses to the linear inequality (38).
In summary, we have the following: regardless of which partition one considers, the corresponding
non-linear constraint in NLP collapses to a linear constraint and hence, NLP for the perimeter
patrol problem collapses to the following LP:
\[
\text{LBLP} := \min\ \bar{c}^T w, \quad \text{subject to} \quad
w(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, w(k_l), \quad \forall u,\ i = 1,\dots,M, \tag{39}
\]
where the tuple $k \in \mathcal{T}(i,u)$ if $|\mathcal{T}(i,u)| = 1$; else $k = k^*$, where $k^* \in \mathcal{T}(i,u)$ is the tuple of partition
indices such that $\bar{\tau}(k^*_l) = \bar{\tau}(i)+1$, $l = 0,\dots,m$.
To prove the second part of the Theorem, we observe that LBLP defined above is the exact LP
corresponding to a reduced order MDP defined on the $M$ partitions. Hence, we readily have from
Lemmas 1 and 2 that the optimal solution $w^*$ lower bounds every feasible solution, including $\bar{w}$, and
hence $w^*(i) \leq \bar{w}(i) = \min_{x \in S_i} V^*(x) \leq V^*(y)$, $\forall y \in S_i$, $i = 1,\dots,M$. $\square$
So, for the perimeter patrol problem, one can compute a lower bound for the optimal value
function efficiently by solving LBLP. The next logical question is whether the upper bound for-
mulation, RLP (13), also simplifies, given the structure in the problem. It turns out that this is
indeed the case, as can be seen from the following theorem.
Theorem 4. For the perimeter patrol problem, the RLP (13) reduces to the following LP:
\[
\text{UBLP} := \min\ \bar{c}^T w, \quad \text{subject to} \quad
w(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, w(k_l), \quad \forall u,\ i = 1,\dots,M, \tag{40}
\]
where the tuple $k \in \mathcal{T}(i,u)$ if $|\mathcal{T}(i,u)| = 1$; else $k = k^*$, where $k^* \in \mathcal{T}(i,u)$ is the tuple of partition
indices such that $\bar{\tau}(k^*_l) = 2$, $l = 0,\dots,m$.
Proof of Theorem 4. Given the partitioning scheme, one can rewrite the Bellman inequalities
(5) as follows: for each $i = 1,\dots,M$,
\[
V^*(x) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, V^*(f(x,u,Y_l)), \quad \forall u,\ \forall x \in S_i. \tag{41}
\]
With the restriction that $V(x) = v(i)$, $\forall x \in S_i$, we get the following constraint for RLP (13):
\[
v(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, v(k_l), \quad \forall k \in \mathcal{T}(i,u),\ \forall u,\ i = 1,\dots,M. \tag{42}
\]
For a partition index $i \in P_1$, $\exists\, \bar{x} \in S_i$ that transitions to future states with the least possible
maximum delay, 2. Hence $f(\bar{x},0,Y_l) \leq f(x,0,Y_l)$, $l = 0,\dots,m$, $\forall x \in S_i$, and so from Claim 2 we have
$V^*(f(\bar{x},0,Y_l)) \geq V^*(f(x,0,Y_l))$, $l = 0,\dots,m$, $\forall x \in S_i$. So, for $i \in P_1$ and $u = 0$, the inequalities (41)
can be written as follows:
\[
V^*(x) \geq r_0(i) + \lambda \sum_{l=0}^{m} p_l\, V^*(f(\bar{x},0,Y_l)), \quad \forall x \in S_i. \tag{43}
\]
The above implies that the $\bar{\tau}(i)$ constraints (42) in RLP can be replaced by the single constraint
\[
v(i) \geq r_0(i) + \lambda \sum_{l=0}^{m} p_l\, v(k^*_l), \tag{44}
\]
where $k^* = (\bar{f}(\bar{x},0,Y_0), \bar{f}(\bar{x},0,Y_1), \dots, \bar{f}(\bar{x},0,Y_m))$ is the tuple of future partition indices
(corresponding to $\bar{x}$) with the least possible maximum delay, i.e., $\bar{\tau}(k^*_l) = 2$, $l = 0,\dots,m$. For the other
control choices, $u \neq 0$, there exists only one tuple $\bar{k}$ in $\mathcal{T}(i,u)$ (since $|\mathcal{T}(i,u)| = 1$) and hence the
constraint (42) is the single constraint
\[
v(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, v(\bar{k}_l), \quad u \neq 0. \tag{45}
\]
Similarly, for partitions Sj of type 2, |T (j,u)| = 1, ∀u, and so the constraint (42) is the single
constraint (45).
In summary, we have the following: regardless of which partition index $i \in \{1,\dots,M\}$ and control
action $u$ are considered, the corresponding $|\mathcal{T}(i,u)|$ linear constraints in RLP collapse to a single
constraint and hence, RLP for the perimeter patrol problem reduces to the following exact LP:
\[
\text{UBLP} := \min\ \bar{c}^T w, \quad \text{subject to} \quad
w(i) \geq r_u(i) + \lambda \sum_{l=0}^{m} p_l\, w(k_l), \quad \forall u,\ i = 1,\dots,M, \tag{46}
\]
where the tuple $k \in \mathcal{T}(i,u)$ if $|\mathcal{T}(i,u)| = 1$; else $k = k^*$, where $k^* \in \mathcal{T}(i,u)$ is the tuple of partition
indices such that $\bar{\tau}(k^*_l) = 2$, $l = 0,\dots,m$. $\square$
In conclusion, we have two complementary LP formulations, UBLP and LBLP that can be
used to efficiently compute upper bound and lower bound approximate value functions respectively,
for the perimeter alert patrol problem. Note that the two formulations involve computing the
optimal value functions for reduced order MDPs defined over the M partitions and in that sense
are computationally attractive (compared to solving the original problem) since $M \ll |S|$. In the
following section, we will provide numerical results that corroborate the key claims made earlier
regarding the structure in the perimeter alert patrol problem.
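Because UBLP and LBLP are exact LPs of reduced order MDPs, their solutions coincide with the optimal value functions of those MDPs. The sketch below illustrates this on a hypothetical three-partition, two-action MDP; the rewards, transition rows and the names `backup`, `value_iteration` and `constraint_slacks` are assumptions made for illustration and are not data from the paper. It swaps in value iteration for an LP solver, and then checks that the resulting vector satisfies every constraint of the form (34)/(40), with at least one constraint tight in each partition. Each transition row plays the role of the weights p_l placed on the successor partitions k_l.

```python
# Hypothetical reduced-order MDP: REWARDS[u][i] and TRANSITIONS[u][i][j].
REWARDS = [[-0.10, -0.20, -0.30],
           [-0.15, -0.05, -0.40]]
TRANSITIONS = [
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.3, 0.0, 0.7]],  # action 1
]
LAMBDA = 0.9  # discount factor


def backup(w, rewards, transitions, lam, u, i):
    """Right hand side of the LP constraint for partition i and control u."""
    return rewards[u][i] + lam * sum(p * w[j] for j, p in enumerate(transitions[u][i]))


def value_iteration(rewards, transitions, lam, tol=1e-12, max_iter=100000):
    """Iterate w <- max_u backup(w) until successive iterates differ by < tol."""
    n_actions, n = len(rewards), len(rewards[0])
    w = [0.0] * n
    for _ in range(max_iter):
        new_w = [max(backup(w, rewards, transitions, lam, u, i) for u in range(n_actions))
                 for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new_w, w)) < tol
        w = new_w
        if converged:
            break
    return w


def constraint_slacks(w, rewards, transitions, lam):
    """slack[i][u] = w(i) - (r_u(i) + lam * sum_j P_u(i,j) w(j)); feasible iff >= 0."""
    return [[w[i] - backup(w, rewards, transitions, lam, u, i) for u in range(len(rewards))]
            for i in range(len(w))]


w_star = value_iteration(REWARDS, TRANSITIONS, LAMBDA)
slacks = constraint_slacks(w_star, REWARDS, TRANSITIONS, LAMBDA)
```

For problems with thousands of partitions, an LP solver or policy iteration may be preferable; the point of the sketch is only that the collapsed formulations are ordinary discounted MDPs over the partitions.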
5. Numerical Results
We consider a perimeter with N = 15 nodes of which node numbers {0,3,7,11} are alert stations
and a maximum allowed dwell of $D = 5$ orbits. The other parameters were chosen to be: weighting
factor $\rho = 0.005$ and temporal discount factor $\lambda = 0.9$. Based on experience, we chose the alert
arrival rate $\alpha = 1/60$. This reflects a rather low arrival rate where we expect 2 alerts to occur on
average in the time taken by the UAV to complete an uninterrupted patrol around the perimeter.
We set the maximum delay time, that we keep track of, to be Γ =15; for which the total number
of states comes out to be |S| = 2,048,000. Before venturing into the simulation, we first provide
numerical results that corroborate the key Claim 2, made earlier in the paper. For this, we solve
for the optimal value function V∗. This is possible since the size of the example problem considered
in this section is small and hence an exact solution can be obtained. In Figure 3, we show results
supporting the claim that for partially ordered states $x_1 \geq x_2$, the corresponding optimal value
functions satisfy $V^*(x_1) \leq V^*(x_2)$. For this, we plot the optimal value function $V^*$ corresponding
to states with alert status $A_j = 1$, $\forall j \in \Omega$ (all stations have alerts), dwell $d = 0$, direction $\omega = 1$ and
the UAV located at one of the four station locations $\ell \in \Omega$. The partially ordered states represented
Figure 3. Monotonically decreasing value function corresponding to partially ordered states with increasing maximum delay (x-axis: states; y-axis: optimal value function; one curve each for stations 0, 3, 7 and 11).
Figure 4. Monotonically decreasing least value function corresponding to partially ordered partitions with increasing maximum delay (x-axis: states; y-axis: value function; curves: optimal value function and least value in partition).
on the X-axis are non-decreasing from left to right with maximum delay $\bar{\tau}$ varying from 2 to $\Gamma$. The
dotted grid lines in the plot separate the different partitions that the states fall into. In Figure 4, we
show results supporting the claim that for partially ordered partitions $S_i \geq S_j$, the corresponding
optimal value functions satisfy $\min_{x \in S_i} V^*(x) \leq \min_{y \in S_j} V^*(y)$. For this, we plot the value functions
corresponding to states with alert status $A = 1001$ (station locations 0 and 11 have alerts), dwell
$d = 0$, direction $\omega = 1$ and $\ell = 0$. The partially ordered partitions demarcated by the dotted grid
lines on the X-axis are non-decreasing from left to right with maximum delay $\bar{\tau}$ varying from 2 to
$\Gamma$. Within each partition, we plot the value function associated with every state in the partition
and also the least value function in the partition, shown as the green line. One can easily see that
the claim above is satisfied.
In the next section, we shall consider the same example problem and show that the proposed
approximate methodology is effective. For this, we compute the approximate value functions via
the restricted LP formulation and compare them with the optimal value function. In addition, we
also compute the greedy sub-optimal policy corresponding to the approximate value function and
compare it with the optimal policy in terms of the two performance metrics: alert service delay
and information gained upon loitering.
5.1. Simulation Results
We aggregate the states in the example problem based on the reward function (see section 4.2
for details). This results in M = 8900 partitions, which is considerably smaller than the original
number of states, $|S|$. We solve both the UBLP and LBLP formulations, which give us the upper
and lower bounds, $v^*$ and $w^*$ respectively, to the optimal value function $V^*$. Since we have the
optimal value function for the example problem, we use it for comparison with the approximations.
Note that for higher values of $m$ and $\Gamma$, the problem would essentially become intractable and one
would not have access to the optimal value function. Nevertheless, one can compute $v^*$ and $w^*$,
and the difference between the two would give an estimate of the quality of the approximation. We
give a representative sample of the approximation results by choosing all the states in partitions
corresponding to alert status $A_j = 1$, $\forall j \in \Omega$ (all stations have alerts) and maximum delay $\bar{\tau} = 2$.
Figure 5 compares the optimal value function $V^*$ with the upper and lower bound approximate
value functions, $V_{up} = \Phi v^*$ and $V_{low} = \Phi w^*$, for this subset of the state-space. The first 15 partitions
shown on the X-axis of Figure 5, i.e., partition numbers $i = 1,\dots,15$, correspond to the clockwise
states:
\[
\ell = i-1, \quad d = 0, \quad \omega = 1, \quad \bar{\tau} = \max_{j \in \Omega} \tau_j = 2, \quad A_j = 1,\ \forall j \in \Omega, \tag{47}
\]
Figure 5. Comparison of approximate value functions with the optimal (x-axis: partition number; y-axis: value function; curves: optimal $V^*$, upper bound $V_{up}$, lower bound $V_{low}$).
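and the last 15 partitions shown on the X-axis, i.e., partition numbers $i = 16,\dots,30$, correspond to
the counter-clockwise states:
\[
\ell = i-N-1, \quad d = 0, \quad \omega = -1, \quad \bar{\tau} = \max_{j \in \Omega} \tau_j = 2, \quad A_j = 1,\ \forall j \in \Omega. \tag{48}
\]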
Interestingly, we notice immediately that the lower bound appears to be tighter than the upper
bound. Recall that our objective is to obtain a good sub-optimal policy and so, we consider the
policy that is greedy with respect to $V_{low}$:
\[
\pi_s(x) = \arg\max_{u} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, V_{low}(f(x,u,Y_l)) \right\}, \quad \forall x \in \{1,\dots,|S|\}. \tag{49}
\]
To assess the quality of the sub-optimal policy, we also compute the expected discounted payoff
$V_{sub}$ that corresponds to the sub-optimal policy $\pi_s$ by solving the system of equations:
\[
(I - \lambda P_{\pi_s})\, V_{sub} = R_{\pi_s}. \tag{50}
\]
Since $V_{sub}$ corresponds to a sub-optimal policy, and in view of the monotonicity property of the
Bellman operator, the following inequalities hold:
\[
V_{low} \leq V_{sub} \leq V^* \leq V_{up}.
\]
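The two steps (49) and (50), extracting a greedy policy from an approximate value function and then evaluating it exactly, can be sketched on a small hypothetical MDP as follows. The toy rewards, transition rows, the vector `v_low` and the function names are assumptions for illustration only; in the paper the expectation is taken over the successor states f(x,u,Y_l) with probabilities p_l, which the sketch encodes as rows of a transition matrix. The linear system (50) is solved by plain Gaussian elimination to keep the example free of external dependencies.

```python
# Hypothetical MDP: REWARDS[u][x] and TRANSITIONS[u][x][x_next].
REWARDS = [[-0.10, -0.20, -0.30],
           [-0.15, -0.05, -0.40]]
TRANSITIONS = [
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.3, 0.0, 0.7]],  # action 1
]
LAMBDA = 0.9


def greedy_policy(v, rewards, transitions, lam):
    """Equation (49): pick, in every state, the action maximizing the one-step lookahead."""
    policy = []
    for x in range(len(v)):
        q = [rewards[u][x] + lam * sum(p * v[z] for z, p in enumerate(transitions[u][x]))
             for u in range(len(rewards))]
        policy.append(max(range(len(q)), key=q.__getitem__))
    return policy


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def evaluate_policy(policy, rewards, transitions, lam):
    """Equation (50): solve (I - lam * P_pi) V = R_pi for the policy's value."""
    n = len(policy)
    a = [[(1.0 if x == z else 0.0) - lam * transitions[policy[x]][x][z] for z in range(n)]
         for x in range(n)]
    b = [rewards[policy[x]][x] for x in range(n)]
    return solve_linear(a, b)


v_low = [-1.0, -1.5, -3.0]  # stand-in for an approximate value function
pi_s = greedy_policy(v_low, REWARDS, TRANSITIONS, LAMBDA)
v_sub = evaluate_policy(pi_s, REWARDS, TRANSITIONS, LAMBDA)
```

For the full problem, where |S| is in the millions, a sparse or iterative solver would be the natural replacement for the dense elimination used here.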
Figure 6. Comparison of value function corresponding to suboptimal policy $\pi_s$ with the optimal (x-axis: partition number, 1 to 15; y-axis: value function; curves: optimal $V^*$, lower bound $V_{low}$, sub-optimal policy $V_{sub}$).
In Figure 6, we compare $V_{sub}$ with the optimal value function $V^*$ for the clockwise states defined
in (47) and note that the approximation is quite good. Finally, we compare the performance of the
sub-optimal policy $\pi_s$ with that of the optimal strategy $\pi^*$ in terms of the two important metrics:
service delay and information gain (measured via the dwell time). To collect the performance
statistics, we ran Monte Carlo simulations with alerts generated from a Poisson arrival stream
with rate $\alpha = 1/60$ over a 60000 time unit simulation window. Both the optimal and sub-optimal
policies were tested against the same alert sequence. Figure 7 shows histogram plots for the number
of loiters completed (top plot) and the service delay (bottom plot) for all serviced alerts in the simulation run.
The corresponding mean and worst case service delays and the mean dwell time are also shown in
Table 1. We see that there is hardly any difference in terms of either metric between the optimal
and the sub-optimal policies. This substantiates the claim that the aggregation approach gives us a
sub-optimal policy that performs almost as well as the optimal policy itself. This is to be expected,
given that the value functions corresponding to the optimal and sub-optimal policies are close to
each other (see Figure 6). Since the alert arrival rate α is fairly low, we see from the bottom plot of
Figure 7 that roughly 90% of the alerts were cleared within ten time steps. Also from the top plot
of Figure 7, we see that maximum information was gained (5 loiters completed) on almost 90% of
the serviced alerts.

Figure 7. Comparison of service delay and number of loiters between optimal and sub-optimal policies. Top panel: percentage of data points versus number of loiter orbits. Bottom panel: percentage of data points versus service delay. Each panel shows the sub-optimal policy πs and the optimal policy.

Table 1. Comparison of alert servicing performance between optimal and sub-optimal policies.

Policy   Mean number of loiters   Mean service delay   Worst service delay
π∗       4.7                      5.6                  15
πs       4.7                      5.6                  18
6. Conclusions
We have provided a state aggregation based restricted LP method to construct sub-optimal poli-
cies for stochastic DPs along with a bound for the deviation of such a policy from the optimum
value function. As a key result, we have shown that the solution to the aggregation based LP is
independent of the underlying cost function and we do so by demonstrating that the restricted LP
is, in fact, the exact LP that corresponds to a lower dimensional MDP defined over the partitions.
We also provide a novel non-linear program that can be used to compute a non-trivial lower bound
to the optimal value function. In particular, for the perimeter patrol stochastic control problem, we
have shown that both the upper and lower bound formulations simplify to exact LPs corresponding
to some reduced order MDPs. To do so, we have exploited the partial ordering of the states that
comes about because of the structure inherent in the reward function. It would be interesting to
see if the simplification can be achieved for other problems that exhibit a similar structure. For
the perimeter patrol problem, numerical results obtained via Monte Carlo simulations show that
the sub-optimal policy obtained via the approximate value functions performs almost as well as the
optimal policy. The literature suggests that, in general, the solution to a restricted LP depends on
the underlying cost function when the value function is parameterized by arbitrary basis functions.
We have shown that, for the special case of hard aggregation, this is not true. Surely, there exist
other basis functions with the same property and it would be useful to uncover the class of basis
functions, for which the independence result holds.
References
Axsäter, S. 1983. State aggregation in dynamic programming: An application to scheduling of independent
jobs on parallel processors. Oper. Res. Letters 2 171–176.
Balas, E. 1979. Disjunctive programming, Annals of Discrete Mathematics, vol. 5. North-Holland Publishing
Company, 3–51.
Balas, E. 1998. Disjunctive programming: Properties of the convex hull of feasible points. Discrete Applied
Math. 89(1-3) 3–44.
Bean, J. C., J. R. Birge, R. L. Smith. 1987. Aggregation in dynamic programming. Oper. Res. 35 215–220.
Bellman, R. E. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
Bertsekas, D. P. 2007. Dynamic Programming and Optimal Control, vol. II, chap. Approximate Dynamic
Programming. 3rd ed. Athena Scientific.
Chandler, P., J. Hansen, R. Holsapple, S. Darbha, M. Pachter. 2009. Optimal perimeter patrol alert servicing
with Poisson arrival rate. AIAA Guidance, Navigation and Control Conf.. Chicago, IL.
Darbha, S., K. Krishnamoorthy, M. Pachter, P. Chandler. 2010. State aggregation based linear programming
approach to approximate dynamic programming. Proc. IEEE Conf. Decision and Control. Atlanta,
GA, 935–941.
De Farias, D. P., B. Van Roy. 2003. The linear programming approach to approximate dynamic programming.
Oper. Res. 850–865.
Denardo, E. V. 1970. On linear programming in a Markov decision problem. Management Sci. 16(5) 282–288.
d’Epenoux, F. 1963. A probabilistic production and inventory problem. Management Sci. 10(1) 98–108.
Glover, F. 1968. Surrogate constraints. Oper. Res. 16(4) 741–749.
Glover, F. 1975. Surrogate constraint duality in mathematical programming. Oper. Res. 23(3) 434–451.
Gordon, G. 1999. Approximate solutions to Markov decision processes. Ph.D. thesis, Carnegie Mellon
University, Pittsburgh, PA.
Greenberg, H. J., W. P. Pierskalla. 1970. Surrogate mathematical programming. Oper. Res. 18(5) 924–939.
Gross, D., S. Rasmussen, P. Chandler, G. Feitshans. 2006. Cooperative Operations in UrbaN TERrain
(COUNTER). Defense and Security Sympos.. SPIE, Orlando, FL.
Grötschel, M., O. Holland. 1991. Solution of large-scale symmetric travelling salesman problems. Math.
Programming 51 141–202.
Grötschel, M., L. Lovász, A. Schrijver. 1981. The ellipsoid method and its consequences in combinatorial
optimization. Combinatorica 1(2) 169–197.
Hordijk, A., L. C. M. Kallenberg. 1979. Linear programming and Markov decision chains. Management Sci.
25(4) 352–362.
Howard, R. A. 1960. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA.
Krishnamoorthy, K., M. Pachter, P. Chandler, D. Casbeer, S. Darbha. 2011a. UAV perimeter patrol operations
optimization using efficient dynamic programming. American Control Conf.. San Francisco,
CA.
Krishnamoorthy, K., M. Pachter, S. Darbha, P. Chandler. 2011b. Approximate dynamic programming with
state aggregation applied to UAV perimeter patrol. Internat. J. of Robust and Nonlinear Control 21.
Manne, A. S. 1960. Linear programming and sequential decisions. Management Sci. 6(3) 259–267.
Mendelssohn, R. 1980. Improved bounds for aggregated linear programs. Oper. Res. 28(6) 1450–1453.
Mendelssohn, R. 1982. An iterative aggregation procedure for Markov decision processes. Oper. Res. 30(1)
62–73.
Morrison, J. R., P. R. Kumar. 1999. New linear program performance bounds for queueing networks. J.
Optim. Theory and Appl. 100(3) 575–597.
Porteus, E. L. 1975. Bounds and transformations for discounted finite Markov decision chains. Oper. Res.
23(4) 761–784.
Schuurmans, D., R. Patrascu. 2001. Direct value-approximation for factored MDPs. Advances in Neural
Information Processing Systems, vol. 14. MIT Press, Cambridge, MA, 1579–1586.
Schweitzer, P. J., A. Seidmann. 1985. Generalized polynomial approximations in Markovian decision
processes. J. Math. Anal. and Appl. 110(2) 568–582.
Singh, S. P., T. Jaakkola, M. I. Jordan. 1995. Reinforcement learning with soft state aggregation. Advances
in Neural Information Processing Systems 7: Proceedings of the 1994 Conference. MIT Press, 361–368.
Trick, M., S. Zin. 1993. A linear programming approach to solving stochastic dynamic programs.
Trick, M., S. Zin. 1997. Spline approximation to value functions: A linear programming approach.
Macroeconomic Dynamics 1 255–277.
Van Roy, B. 2006. Performance loss bounds for approximate value iteration with state aggregation. Math.
Oper. Res. 31(2) 234–244.
Wang, Y., S. Boyd. 2010. Approximate dynamic programming via iterated Bellman inequalities. URL
http://www.stanford.edu/~boyd/papers/adp_iter_bellman.html.
Appendix to “Bounding Procedures for Stochastic Dynamic
Programs with Application to the Perimeter Alert Patrol
Problem” by Park et al.
This appendix contains supplementary material to the paper and also lengthy proofs that were
left out of the main document.
EC.1. Operator Error Model
We treat the operator as a sensor-in-the-loop automaton. The operator is not infallible and we
account for that statistically in the optimization. To quantify the operator’s performance, we
consider two random variables: the variable X that specifies whether the alert is a real threat
(target T) or a nuisance (false target FT) and the operator decision Z which specifies whether
he determines the alert to be a real threat Z1 or a nuisance Z2. We stipulate that the a priori
probability that an alert is a real target,
$$\mathrm{Prob}\{X = T\} = p \ll 1. \tag{EC.1}$$
We assume, based on experience, that p = 0.01 in this work. The conditional probabilities which
specify whether the operator correctly reported a threat and a nuisance are assumed to be functions
of the dwell time, d:
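$$\begin{aligned}
P_{TR}(d) &:= \mathrm{Prob}\{Z = Z_1 \mid X = T\} = a + b\,(1 - e^{-\mu_1 d}),\\
P_{FTR}(d) &:= \mathrm{Prob}\{Z = Z_2 \mid X = FT\} = c + g\,(1 - e^{-\mu_2 d}),
\end{aligned} \tag{EC.2}$$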
where the acronyms TR and FTR stand for Target Report and False Target Report respectively.
The parameters $a$, $b$, $\mu_1$, $c$, $g$, $\mu_2$ characterize the “confusion matrix” and the performance of the
operator as a sensor; for details on sensor performance modeling, see Sec 7.2 in Kish et al. (2009).
The parameters satisfy the constraints:
$$0 < a + b \le 1, \quad 0 < c + g \le 1, \quad \mu_1 \ge 0 \quad \text{and} \quad \mu_2 \ge 0.$$
In this work, we chose a=c=0.5, b=g=0.45 and µ1=µ2=1. The choice a=c=0.5 corresponds
to an uninformed or unbiased operator, i.e., the operator cannot tell if the alert is a threat or a
nuisance without having seen any video footage of the alert site. We wish to maximize the mutual
information - derived along the lines of information theory (Cover and Thomas 2006) - between
the random variables X and Z given by:
$$I(X;Z) = H(X) - H(X \mid Z) = \sum_{x,z} \mathrm{Prob}\{X = x, Z = z\} \log \frac{\mathrm{Prob}\{X = x, Z = z\}}{\mathrm{Prob}\{X = x\}\,\mathrm{Prob}\{Z = z\}}, \tag{EC.3}$$
where H(X) is the entropy of X and H(X|Z) is the conditional entropy of X given Z. Using
Bayes’ rule and the probabilities (EC.1) and (EC.2), one can show that the mutual information is
a function of dwell time, d:
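$$\begin{aligned}
I(d) ={}& p\,P_{TR} \log \frac{P_{TR}}{p\,P_{TR} + (1-p)(1-P_{FTR})}
 + p\,(1-P_{TR}) \log \frac{1-P_{TR}}{p\,(1-P_{TR}) + (1-p)\,P_{FTR}}\\
&+ (1-p)(1-P_{FTR}) \log \frac{1-P_{FTR}}{p\,P_{TR} + (1-p)(1-P_{FTR})}
 + (1-p)\,P_{FTR} \log \frac{P_{FTR}}{p\,(1-P_{TR}) + (1-p)\,P_{FTR}},
\end{aligned} \tag{EC.4}$$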
since the conditional probabilities $P_{TR}$ and $P_{FTR}$ are both functions of $d$; see (EC.2).
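As a rough numerical companion to (EC.2)-(EC.4), the following Python sketch evaluates the mutual information for the parameter values quoted above. The function names are illustrative, and the use of the natural logarithm (so that the result is in nats) is an assumption, since the text does not fix the base of the logarithm.

```python
import math

def report_probabilities(d, a=0.5, b=0.45, mu1=1.0, c=0.5, g=0.45, mu2=1.0):
    """Return (P_TR, P_FTR) from (EC.2) for dwell time d."""
    p_tr = a + b * (1.0 - math.exp(-mu1 * d))
    p_ftr = c + g * (1.0 - math.exp(-mu2 * d))
    return p_tr, p_ftr

def _term(joint, px, pz):
    # One summand of (EC.3); a zero-probability event contributes nothing.
    return 0.0 if joint == 0.0 else joint * math.log(joint / (px * pz))

def mutual_information(d, p=0.01, **params):
    """Evaluate I(d) of (EC.4) in nats (natural logarithm is an assumption)."""
    p_tr, p_ftr = report_probabilities(d, **params)
    pz1 = p * p_tr + (1.0 - p) * (1.0 - p_ftr)   # Prob{Z = Z1}
    pz2 = p * (1.0 - p_tr) + (1.0 - p) * p_ftr   # Prob{Z = Z2}
    return (
        _term(p * p_tr, p, pz1)                        # X = T,  Z = Z1
        + _term(p * (1.0 - p_tr), p, pz2)              # X = T,  Z = Z2
        + _term((1.0 - p) * (1.0 - p_ftr), 1.0 - p, pz1)  # X = FT, Z = Z1
        + _term((1.0 - p) * p_ftr, 1.0 - p, pz2)       # X = FT, Z = Z2
    )

if __name__ == "__main__":
    for dwell in (0.0, 0.5, 1.0, 2.0, 5.0):
        print(dwell, mutual_information(dwell))
```

With a = c = 0.5 the value at d = 0 vanishes up to rounding, because the operator's report is then independent of X, and the value grows with the dwell time as the reports become more reliable.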
EC.2. Proofs to lemma in Section 2.1
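Lemma 1. Let the vector $V$ satisfy the following set of inequalities:
$$\left(I - \lambda^L P_\pi^L\right) V \ge \left(I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1}\right) R_\pi, \quad \forall \pi. \tag{EC.5}$$
Then, we have $V \ge V^*$.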
Proof of Lemma 1. For every stationary policy $\pi$, we have:
$$\left(I - \lambda^L P_\pi^L\right) V \ge \left(I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1}\right) R_\pi. \tag{EC.6}$$
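Since $P_\pi$ is a stochastic matrix (i.e., it is non-negative and each of its rows sums to 1), and $\lambda \in [0,1)$,
the matrix $\left(I - \lambda^L P_\pi^L\right)^{-1}$ admits the following analytic series expansion:
$$\left(I - \lambda^L P_\pi^L\right)^{-1} = I + \lambda^L P_\pi^L + \lambda^{2L} P_\pi^{2L} + \cdots.$$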
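So, all the entries of $\left(I - \lambda^L P_\pi^L\right)^{-1}$ are non-negative and hence (EC.6) implies the following
(although the converse is not true!):
$$V \ge \left(I - \lambda^L P_\pi^L\right)^{-1} \left(I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1}\right) R_\pi = \sum_{i=0}^{\infty} \lambda^i P_\pi^i R_\pi, \quad \forall \pi. \tag{EC.7}$$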
So, $V$ dominates the expected payoff associated with every policy $\pi$, including the optimal policy
$\pi^*$. Hence $V \ge V^*$. □
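Lemma 1 can be sanity-checked numerically on a small example. The sketch below uses a hypothetical two-state, two-action MDP (the matrices, rewards, discount factor and horizon L are made-up illustrative values, not data from the paper), scans a grid of candidate vectors V, keeps those that satisfy (EC.5) for every stationary deterministic policy, and confirms that each of them dominates the optimal value function obtained by value iteration.

```python
from itertools import product

LAM, L = 0.9, 3  # discount factor and horizon; illustrative values only

# Hypothetical 2-state, 2-action MDP: P[u][x][y] and R[u][x].
P = {0: [[0.8, 0.2], [0.3, 0.7]], 1: [[0.1, 0.9], [0.6, 0.4]]}
R = {0: [1.0, 0.0], 1: [0.5, 2.0]}
STATES = range(2)

def matvec(M, v):
    return [sum(M[x][y] * v[y] for y in STATES) for x in STATES]

def policy_data(pi):
    """Rows of P_pi and entries of R_pi for a stationary policy pi."""
    return [P[pi[x]][x] for x in STATES], [R[pi[x]][x] for x in STATES]

def satisfies_ec5(V, pi):
    """Check (I - lam^L P^L) V >= (I + lam P + ... + lam^(L-1) P^(L-1)) R."""
    P_pi, R_pi = policy_data(pi)
    rhs, term, PLV = [0.0, 0.0], list(R_pi), list(V)
    for k in range(L):
        rhs = [rhs[x] + (LAM ** k) * term[x] for x in STATES]
        term = matvec(P_pi, term)  # now P^(k+1) R
        PLV = matvec(P_pi, PLV)    # now P^(k+1) V
    lhs = [V[x] - (LAM ** L) * PLV[x] for x in STATES]
    return all(lhs[x] >= rhs[x] - 1e-12 for x in STATES)

def optimal_value(iters=2000):
    """Plain value iteration for the maximizing MDP."""
    V = [0.0, 0.0]
    for _ in range(iters):
        V = [max(R[u][x] + LAM * sum(P[u][x][y] * V[y] for y in STATES)
                 for u in P) for x in STATES]
    return V

V_STAR = optimal_value()
POLICIES = list(product(P.keys(), repeat=2))

def check_lemma(grid):
    """Return the grid points satisfying (EC.5) for all policies."""
    feasible = [V for V in grid
                if all(satisfies_ec5(V, pi) for pi in POLICIES)]
    # Lemma 1: every such V must dominate V* componentwise.
    assert all(V[x] >= V_STAR[x] - 1e-9 for V in feasible for x in STATES)
    return feasible

grid = [(0.5 * a, 0.5 * b) for a in range(61) for b in range(61)]
feasible = check_lemma(grid)
print(len(feasible), "of", len(grid), "grid points satisfy (EC.5)")
```

Enumerating deterministic policies suffices in this toy setting because a finite discounted MDP always has an optimal stationary deterministic policy.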
EC.3. Proof to lemma in Section 3.1
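Lemma 3. Consider a surrogate LP for the RLP through a set of dual variables, $\mu$, given by:
$$\begin{aligned}
\mathrm{SLP}(\mu) := \min\ & \bar c^T v, \quad \text{subject to}\\
& \sum_{x \in S_i} \mu_u^i(x)\, v(i) \ge \sum_{x \in S_i} \mu_u^i(x) \left[ R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v(\bar f(x,u,Y_l)) \right], \quad \forall u,\ i = 1,\ldots,M.
\end{aligned} \tag{EC.8}$$
Then, $\exists\, \bar\mu \ge 0$ such that $\mathrm{SLP}(\bar\mu) = \mathrm{RLP}$, and, for every partition index $i = 1,\ldots,M$, $\exists\, u$ such that
$\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. Moreover, the optimal solution $v^*$ to RLP is independent of the cost vector $\bar c$ and
any other feasible solution $v$ to RLP dominates $v^*$.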
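Proof of Lemma 3. Consider the Lagrangian dual problem to RLP,
$$\mathrm{LD}(\mu) := \min_v \left\{ \bar c^T v - \sum_{i,u} \sum_{x \in S_i} \mu_u^i(x) \left[ v(i) - R_u(x) - \lambda \sum_{l=0}^{m} p_l\, v(\bar f(x,u,Y_l)) \right] \right\}.$$
Let $\phi(v,\mu) = \bar c^T v - \sum_{i,u} \sum_{x \in S_i} \mu_u^i(x) \left[ v(i) - R_u(x) - \lambda \sum_{l=0}^{m} p_l\, v(\bar f(x,u,Y_l)) \right]$. Let $F$ be the feasible
set for RLP and let $F(\mu)$ be the feasible set of SLP($\mu$). Then, for every $\mu \ge 0$, we have,
$$\mathrm{LD}(\mu) = \min_v \phi(v,\mu) \le \min_{v \in F(\mu)} \phi(v,\mu) \le \min_{v \in F(\mu)} \bar c^T v = \mathrm{SLP}(\mu).$$
The second inequality holds because the term subtracted in $\phi$ is non-negative for every $v \in F(\mu)$.
Since $F \subset F(\mu)$ for every $\mu \ge 0$, it readily follows that $\mathrm{SLP}(\mu) \le \mathrm{RLP}$. Also, RLP is feasible. For example,
consider the feasible solution $\tilde v$ given by,
$$\tilde v(i) = \frac{\max_{x,u} R_u(x)}{1-\lambda}, \quad \forall i \in \{1,\ldots,M\}.$$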
Moreover, any feasible $v$ satisfies,
$$v(i) \ge \frac{\min_{x,u} R_u(x)}{1-\lambda}, \quad \forall i \in \{1,\ldots,M\}.$$
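The two bounds above can be verified in one line each. The derivation below is a sketch under stated assumptions: the RLP constraints are taken to be $v(i) \ge R_u(x) + \lambda \sum_{l} p_l v(\bar f(x,u,Y_l))$ for all $x \in S_i$ and all $u$ (the form suggested by (EC.8)), the $p_l$ sum to one, and the shorthand $R_{\max}$, $R_{\min}$ and $i_0$ is introduced here for illustration.

```latex
% Shorthand (not in the original): R_max = max_{x,u} R_u(x), R_min = min_{x,u} R_u(x),
% i_0 = an index minimizing v(i); x is any state in the relevant partition.
\begin{align*}
\text{Feasibility of } \tilde v:\quad
\tilde v(i) = \frac{R_{\max}}{1-\lambda}
  &= R_{\max} + \lambda\,\frac{R_{\max}}{1-\lambda}
  \;\ge\; R_u(x) + \lambda \sum_{l=0}^{m} p_l\, \tilde v(\bar f(x,u,Y_l)),\\
\text{Lower bound:}\quad
v(i_0) &\ge R_u(x) + \lambda \sum_{l=0}^{m} p_l\, v(\bar f(x,u,Y_l))
  \;\ge\; R_{\min} + \lambda\, v(i_0)
  \;\Longrightarrow\; v(i) \ge v(i_0) \ge \frac{R_{\min}}{1-\lambda}.
\end{align*}
```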
So, RLP is also bounded from below and hence it satisfies the requirements of strong duality for
LPs. Hence, there exists a $\bar\mu$ which is optimal for the dual of RLP and also satisfies $\mathrm{LD}(\bar\mu) = \mathrm{RLP}$.
Since $\mathrm{LD}(\bar\mu) \le \mathrm{SLP}(\bar\mu) \le \mathrm{RLP}$, the same $\bar\mu$ must also be such that $\mathrm{SLP}(\bar\mu) = \mathrm{RLP}$. Now for every partition index
$i = 1,\ldots,M$, there exists at least one $u$ for which $\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. If for some $i$, $\bar\mu_u^i(x) = 0$ for
every $x \in S_i$ and for every $u$, then SLP($\bar\mu$) will not have any constraints lower bounding $v(i)$. It
will then admit solutions for $v(i)$ that are arbitrarily negative and correspondingly, one can find a
direction in which the cost of SLP($\bar\mu$) decreases without bound. However, this is a contradiction,
since RLP is lower bounded. So, we can rewrite SLP($\bar\mu$) in the following manner:
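$$\begin{aligned}
\mathrm{SLP}(\bar\mu) := \min\ & \bar c^T v, \quad \text{subject to}\\
& v(i) \ge r_u(i) + \lambda\, \frac{1}{\sum_{x \in S_i} \bar\mu_u^i(x)} \sum_{x \in S_i} \bar\mu_u^i(x) \sum_{l=0}^{m} p_l\, v(\bar f(x,u,Y_l)), \quad \forall u \in U_i,\ i = 1,\ldots,M,
\end{aligned} \tag{EC.9}$$
where $u \in U_i$ if $\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. Clearly, SLP($\bar\mu$) is the exact LP corresponding to an MDP of
reduced dimension with one-step reward function,
$$r_u(i) = \frac{\sum_{x \in S_i} \bar\mu_u^i(x)\, R_u(x)}{\sum_{x \in S_i} \bar\mu_u^i(x)}, \quad \forall u \in U_i,$$
and transition probability matrix $\tilde P_u$ given by,
$$\tilde P_u(i,j) := \begin{cases}
\dfrac{1}{\sum_{x \in S_i} \bar\mu_u^i(x)} \displaystyle\sum_{x \in S_i} \bar\mu_u^i(x) \sum_{y \in S_j} P_u(x,y), & \text{if } u \in U_i,\\[2ex]
0, & \text{otherwise.}
\end{cases}$$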
So, by Lemma 2, the optimal solution $v^*$ is also the optimal value function associated with the
same underlying MDP. Also, any feasible $v$ to RLP is a feasible solution to SLP($\bar\mu$) since the
constraints for SLP($\bar\mu$) are obtained by a convex combination of the constraints of RLP. So, it
follows from Lemma 1 that $v \ge v^*$.
Finally, let RLP($\bar c$) and RLP($\bar d$) denote the restricted LPs corresponding to two different cost
vectors $\bar c$ and $\bar d$ respectively. Let the corresponding optimal solutions be $v_c^*$ and $v_d^*$. Since $v_d^*$ is a
feasible solution for RLP($\bar c$), we have $v_d^* \ge v_c^*$. By the same token, $v_c^* \ge v_d^*$. Hence, $v_c^* = v_d^*$. □
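The reduced MDP that appears in the proof of Lemma 3 is straightforward to assemble once a set of non-negative weights is available. The following Python sketch builds the aggregated rewards, the aggregated transition matrices and the admissible control sets from the formulas above. The function name, the data layout and the small four-state example are hypothetical, and the weights used here are arbitrary illustrative numbers rather than optimal dual variables of any RLP.

```python
def aggregate_mdp(partitions, weights, R, P):
    """Build the reduced MDP from weights mu (illustrative helper).

    partitions[i]    : list of states in S_i
    weights[i][u][x] : mu^i_u(x) >= 0 for x in S_i
    R[u][x]          : one-step reward
    P[u][x][y]       : transition probability
    Returns (r, P_agg, U): r[i][u], P_agg[u][i][j], admissible controls U[i].
    """
    M = len(partitions)
    controls = sorted(R)
    r = [dict() for _ in range(M)]
    P_agg = {u: [[0.0] * M for _ in range(M)] for u in controls}
    U = [[] for _ in range(M)]
    for i, S_i in enumerate(partitions):
        for u in controls:
            total = sum(weights[i][u][x] for x in S_i)
            if total <= 0.0:
                continue  # u is not in U_i; its row of P_agg stays zero
            U[i].append(u)
            # Weighted average of the one-step rewards over S_i.
            r[i][u] = sum(weights[i][u][x] * R[u][x] for x in S_i) / total
            # Weighted probability of moving from S_i into S_j.
            for j, S_j in enumerate(partitions):
                P_agg[u][i][j] = sum(
                    weights[i][u][x] * sum(P[u][x][y] for y in S_j)
                    for x in S_i
                ) / total
    return r, P_agg, U

# Hypothetical 4-state example with two partitions and two controls.
partitions = [[0, 1], [2, 3]]
R = {0: [1.0, 3.0, 0.0, 2.0], 1: [0.0, 0.0, 4.0, 4.0]}
P = {
    0: [[0.5, 0.5, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0],
        [0.0, 0.0, 0.5, 0.5], [0.5, 0.0, 0.0, 0.5]],
    1: [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0],
        [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
}
weights = [
    {0: {0: 1.0, 1: 3.0}, 1: {0: 0.0, 1: 0.0}},  # control 1 is not in U_0
    {0: {2: 1.0, 3: 1.0}, 1: {2: 2.0, 3: 2.0}},
]
r, P_agg, U = aggregate_mdp(partitions, weights, R, P)
```

For every admissible pair, the corresponding row of the aggregated matrix sums to one because each row of the original matrix does, which is why the construction yields a valid MDP on the partition indices.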
EC.4. Proofs to claims in Section 4.2
Claim 1. If $x_1 \ge x_2$, then for the same sequence of inputs $u_t$ and disturbances $y_t$, the system state
evolves in such a way that $x(t;x_1,u_t,y_t) \ge x(t;x_2,u_t,y_t)$ for every $t > 0$.
Proof of Claim 1. We use induction. Clearly at $t = 0$, $x_1 \ge x_2$. By the semi-group property of
state transitions, it is sufficient to show that the result holds for $t = 1$. We define the state, $x$, of
the patrol system to be of two types. If the following holds:
$$\ell_x \in \Omega, \quad d_x = 0, \quad A_{\ell_x,x} = 1, \quad \text{and} \quad A_{j,x} = 1 \ \text{for some } j \in \Omega,\ j \ne \ell_x, \tag{EC.10}$$
i.e., the UAV is at a station with an alert, the dwell time is zero and also there is an alert at some
other station, then the state $x$ is of type 1. Else it is of type 2. Note that if $x_1 \ge x_2$, then the states
$x_1$ and $x_2$ are necessarily of the same type. The key property we will be using in proving Claim 1 is
the following: service delay at a station either remains at zero (if no new alert has occurred there)
or it goes up by 1 (if there is an unserviced alert there) or it is reset to zero (if a UAV decides to
loiter there).
If $x_1$ and $x_2$ are of type 1 and the UAV chooses to loiter, i.e., $u(0) = 0$, we clearly see that neither
the location nor the dwell will differ at $t = 1$. Furthermore, the delays at $t = 1$ associated with the
stations corresponding to initial state $x_1$ will be no less than the delays associated with stations
corresponding to initial state $x_2$ since $x_1 \ge x_2$. If $z_1 = x(1;x_1,0,y(0))$ and $z_2 = x(1;x_2,0,y(0))$, we
see that $\ell_{z_1} = \ell_{z_2}$, $d_{z_1} = d_{z_2}$, $\omega_{z_1} = \omega_{z_2}$, and $\tau_{j,z_1} \ge \tau_{j,z_2}$, $\forall j \in \Omega$, for every disturbance $y(0)$ and so
$z_1 \ge z_2$. The same relationship holds for other possible control choices, $u(0) \ne 0$, as well. By a
similar argument, one can show that $x(1;x_1,u(0),y(0)) \ge x(1;x_2,u(0),y(0))$ holds, regardless of the
control choice, even if the states $x_1$, $x_2$ are of type 2. We use the semi-group property as follows:
suppose the claim holds for all $t$ lying between 0 and $l$ for some $l > 0$. Then, we will treat the state
at $t = l$ as the initial condition for determining the evolution of the state at $t = l+1$. The clock is
reset as: $\tilde t = t - l$, $t \ge l$. By the preceding arguments, Claim 1 holds for $\tilde t = 1$, which is equivalent
to saying that it holds for $t = l+1$. □
Claim 2. If $x_1 \ge x_2$, then $V^*(x_1) \le V^*(x_2)$. Furthermore, if $S_i \ge S_j$, then $\min_{x \in S_i} V^*(x) \le \min_{z \in S_j} V^*(z)$.
Proof of Claim 2. Let $\pi^*$ be the optimal policy; accordingly $\pi^*(x)$ is fixed for every $x \in S$. Then,
for every $t > 0$, we can determine $x(t;x_1,u_t^*,y_t)$ for some sequence of disturbances $y_t$, where the
optimal input sequence $u_t^* = \{u^*(0),\ldots,u^*(t-1)\}$ (starting with $x_1$) can be recursively obtained
as follows:
$$u^*(t) = \pi^*\big(x(t-1;x_1,u_{t-1}^*,y_{t-1})\big), \tag{EC.11}$$
with the initialization $u^*(0) = \pi^*(x_1)$. For the above $u^*$ and $y$, we can then determine the evolution
of the states corresponding to initial state $x_2$. Since $x(t;x_1,u_t^*,y_t) \ge x(t;x_2,u_t^*,y_t)$ by Claim 1, we
notice readily that the reward $R_{u^*}(x(t;x_1,u_t^*,y_t)) \le R_{u^*}(x(t;x_2,u_t^*,y_t))$ for every $t \ge 0$ (since the
one-step reward is based only on the maximum delay, dwell time and control input, the inequality
follows). Since the above holds for any given disturbance sequence, the expected discounted payoff
associated with the state starting from $x_1$, i.e., $V^*(x_1)$, is no more than the expected discounted
payoff associated with the state starting from $x_2$, which we will denote by $V_{u^*}(x_2)$. As a result,
$V^*(x_1) \le V_{u^*}(x_2) \le V^*(x_2)$. The second part of the inequality holds since $u_t^*$ as defined in (EC.11)
is a sub-optimal control policy for the state evolution starting from $x_2$ and hence the expected
discounted payoff associated with that policy is necessarily dominated by the optimal value function
starting from $x_2$. To complete the proof, consider two different partitions $S_i$ and $S_j$ such that
$S_i \ge S_j$. Let $\bar z = \arg\min_{z \in S_j} V^*(z)$; this can always be found since we are dealing with a subset
$S_j$ of a finite state space $S$. Since $S_i \ge S_j$, $\exists\, \bar x \in S_i$ such that $\bar x \ge \bar z$. We have shown that for this
case, $V^*(\bar x) \le V^*(\bar z) = \min_{z \in S_j} V^*(z) \Rightarrow \min_{x \in S_i} V^*(x) \le \min_{z \in S_j} V^*(z)$. □
Acknowledgments
This work was also partly supported by the AFRL Summer Faculty Program and AFOSR award
no. FA9550-10-1-0392.
References
Kish, B., M. Pachter, D. Jacques. 2009. UAV Cooperative Decision and Control: Challenges and Practical
Approaches, chap. Effectiveness Measures for Operations in Uncertain Environments. Advances in
Design and Control, SIAM, 103–124.
Cover, Thomas M., Joy A. Thomas. 2006. Elements of Information Theory. 2nd ed. Wiley-Interscience.