Bounding Procedures for Stochastic Dynamic Programs with Application to the Perimeter Patrol Problem
ABSTRACT One often encounters the curse of dimensionality in the application of
dynamic programming to determine optimal policies for controlled Markov chains.
In this paper, we provide a method to construct suboptimal policies along with
a bound for the deviation of such a policy from the optimum via a linear
programming approach. The state space is partitioned and the optimal cost-to-go
or value function is approximated by a constant over each partition. By
minimizing a nonnegative cost function defined on the partitions, one can
construct an approximate value function which is also an upper bound
for the optimal value function of the original Markov Decision Process (MDP).
As a key result, we show that this approximate value function is
independent of the nonnegative cost function (or state-dependent weights, as
they are referred to in the literature) and, moreover, that this is the least upper
bound that one can obtain once the partitions are specified. Furthermore, we
show that the restricted system of linear inequalities also embeds a family of
MDPs of lower dimension, one of which can be used to construct a lower bound on
the optimal value function. The construction of the lower bound requires the
solution to a combinatorial problem. We apply the linear programming approach
to a perimeter surveillance stochastic optimal control problem and obtain
numerical results that corroborate the efficacy of the proposed methodology.
arXiv:1108.3299v1 [cs.SY] 16 Aug 2011
Submitted to Operations Research
Myoungkuk Park
Department of Mechanical Engineering, Texas A&M University, College Station, TX 77843, robotian@gmail.com
Krishnamoorthy Kalyanam
Infoscitex Corporation, Dayton, OH 45431, krishna.kalyanam@gmail.com
Swaroop Darbha
Department of Mechanical Engineering, Texas A&M University, College Station, TX 77843, dswaroop@tamu.edu
Phil Chandler
Control Design & Analysis Branch, Air Force Research Laboratory, WPAFB, OH 45433, phillip.chandler@wpafb.af.mil
Meir Pachter
Electrical Engineering Department, Air Force Institute of Technology, WPAFB, OH 45433, meir.pachter@afit.edu
Key words: Stochastic Dynamic Programs, Linear Programming, State Aggregation
Author: Park et al.
1. Introduction
The Linear Programming (LP) approach to solving dynamic programs (DPs) originated from the
papers: Manne (1960), d’Epenoux (1963), Denardo (1970), Hordijk and Kallenberg (1979). The
basic feature of an LP approach for solving DPs corresponding to maximization of a discounted
payoff is that the optimal solution of the DP (also referred to as the optimal value function) is
the optimal solution of the LP for every nonnegative cost function. The constraint set describing
the feasible solution of the LP and the number of independent variables are typically very large
(curse of dimensionality) and hence, obtaining the exact solution of a DP (stochastic or otherwise)
via an LP approach is not practical. Despite this limitation, an LP approach provides a tractable
method for approximate dynamic programming (Mendelssohn 1980, Schweitzer and Seidmann
1985, Trick and Zin 1997) and the advantages of this approach may be summarized as follows:
1. One can restrict the value function to be of a certain parameterized form, thereby reducing
the dimension of the LP to the size of the parameter set to make it tractable.
2. The solution to the LP provides upper bounds for the value function (lower bounds, if minimizing
a discounted cost, as opposed to maximizing discounted payoff, is considered as the optimization
criterion).
The main questions regarding the tractability and quality of approximate DP revolve around
restricting the value function in a suitable way. The questions are: (1) How does one restrict the
value function, i.e., what basis functions should one choose for parameterizing the value function?
(2) Are there any (a posteriori) bounds that one can provide about the value function from the
solution of a restricted LP? If the restrictions imposed on the value function are consistent with the
physics/structure of the problem, one can expect reasonably tight bounds. There is another question
that naturally arises: In the unrestricted case, the optimal solution of the LP is independent of
the choice of the nonnegative cost function. While it is unreasonable to expect that the optimal
value function be a feasible solution of the restricted LP, one can ask if the optimal solution of
the restricted LP is the same for every choice of nonnegative cost function for the LP. It has been
reported in the literature that this is unfortunately not the case (De Farias and Van Roy 2003).
If the LP is not properly restricted, it can lead to poor approximation and perhaps even infeasibility
(Gordon 1999). A common approach is to approximate the value (cost-to-go) function
by a linear functional of a priori chosen basis functions (Schweitzer and Seidmann 1985). This
approach is attractive in that for a certain class of basis functions, feasibility of the approximate (or
restricted) LP is guaranteed (De Farias and Van Roy 2003). A straightforward method for selecting
the basis functions is through a state aggregation method. Here the state space is partitioned
into disjoint sets or partitions and the approximate value function is restricted to be the same for
all the states in a partition. The number of variables for the LP therefore reduces to the number of
partitions. State aggregation based approximation techniques were originally proposed by Axsäter
(1983), Bean et al. (1987), Mendelssohn (1982). Since then, substantial work has been reported in
the literature on this topic (see Van Roy (2006) and the references therein). In this article, we adopt
the state aggregation method.
Although imposing restrictions on the value function reduces the size of the restricted LP, the
number of constraints does not change. Since the number of constraints is at least of the same
order as the number of states of the DP, one is faced with a restricted LP with a large number
of constraints. An LP with a large number of constraints may be solved if there is an automatic
way to separate a nonoptimal solution from an optimal one (Grötschel et al. 1981); otherwise,
one may have to resort to heuristics or settle for an approximate solution. Separation of a nonoptimal
solution from an optimal one is easier if one has a compact representation of constraints
(Morrison and Kumar 1999) or if a subset of the constraints that dominate other constraints can
easily be identified from the structure of the problem (Krishnamoorthy et al. 2011b). Heuristic
methods include aggregation of constraints, subsampling of constraints (De Farias and Van Roy
2003), constraint generation methods (Grötschel and Holland 1991, Schuurmans and Patrascu
2001) and other approaches (Trick and Zin 1993).
If the solution of the restricted LP is the same for every nonnegative cost function of the LP,
then it suggests that the constraint set for the restricted LP embeds the constraint set for the
exact LP corresponding to a reduced order Markov Decision Process (MDP). If one adopts a naive
approach and “aggregates” every state into a separate partition, we obtain the original exact LP
and clearly, for this LP, the solution is independent of the nonnegative cost function. It would
seem reasonable to expect that this would generalize to partitions of arbitrary size and in fact, we
prove this to be the case in this article. One can construct a suboptimal policy from the solution
to the restricted LP by considering the policy that is greedy with respect to the approximate value
function (Porteus 1975). By construction, the expected discounted payoff for the suboptimal policy
will be a lower bound to the optimal value function and hence, can be used to quantify the quality
of the suboptimal policy. Also the lower bound will be closer to the optimal value function than
the approximate value function by virtue of the monotonicity property of the Bellman operator.
But the lower bound computation is not efficient since the procedure involved is tantamount to
policy evaluation which involves the solution to a system of linear equations of the same size as the
state space. In this work, we have developed a novel disjunctive LP, whose solution can be used
to construct a lower bound to the optimal value function. The contributions of our work may be
summarized as follows:
• If one were to adopt a state aggregation approach, then the solution to the restricted LP
is shown to be independent of the nonnegative cost function. Moreover, the optimal solution is
dominated by every feasible solution to the restricted LP.
• We also show that considering alternate LP formulations via lifting of variables or by considering
a bigger feasible set via iterated Bellman inequalities (Wang and Boyd 2010) does not improve
upon the upper bound provided by the restricted LP.
• A subset of the constraints of the restricted LP can be used for constructing a lower bound
for the optimal value function. However, this involves solving a disjunctive LP, which may not be
computationally tractable.
• We demonstrate the use of aggregation based restricted LPs for a perimeter surveillance
stochastic control problem. For the application considered here, we show that both the lower
bounding disjunctive LP and the upper bounding restricted LP can be solved efficiently since they
both reduce to exact LPs corresponding to some lower dimensional MDPs.
The rest of the paper is organized as follows: we provide a general overview of stochastic dynamic
programs in section 2 followed by LP preliminaries in section 2.1. In section 3, we introduce the
aggregation method and discuss the restricted LP approach that can be used to approximate the
optimal value function. In the same section, we also present a novel disjunctive LP that can be
used to compute a lower bound to the optimal value function. We introduce the perimeter alert
patrol problem in section 4 and also elaborate on the efficient LP formulations that arise out of
the structure in the problem. We corroborate the structure in the perimeter patrol problem via
numerical results in section 5. Finally, we support the proposed approximation methodology via
simulation results in section 5.1, followed by a summary in section 6. Supplementary material and
lengthy proofs that have been left out of the main body of the paper for clarity are included
in the Appendix.
2. Stochastic Dynamic Programming
Consider a discrete-time Markov decision process (MDP) with a finite state space $S = \{1, 2, \ldots, |S|\}$.
For each state $x \in S$, there is a finite set of available actions $U_x$. From the current state $x$, taking
action $u \in U_x$ under the random influence $Y$ results in a reward $R_u(x)$. The system follows
discrete-time dynamics given by
\[
x(t+1) = f(x(t), u(t), Y(t)), \tag{1}
\]
where $t$ indicates time. We assume that the random input $Y$ can only take a finite set of values $Y_l$, $l = 0, \ldots, m$,
and that the value $Y_l$ occurs with probability $p_l$. State transition probabilities
$P_u(x,y)$ represent, for each pair $(x,y)$ of states and each action $u \in U_x$, the probability that the
next state will be $y$ given that the current state is $x$ and the current action taken is $u$, i.e.,
\[
P_u(x,y) =
\begin{cases}
0, & \text{if } y \neq f(x,u,Y_l) \text{ for every } l \in \{0, \ldots, m\}, \\
\sum_{j \in C} p_j, & \text{otherwise, where } C = \{\, l : y = f(x,u,Y_l) \,\}.
\end{cases} \tag{2}
\]
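As an illustration of equation (2), the following minimal sketch builds the transition probabilities from the dynamics $f$ and the disturbance probabilities $p_l$. The ring-shaped toy model, the function name `transition_probs`, and all numerical values are assumptions made for this example and do not come from the paper.

```python
from collections import defaultdict

def transition_probs(states, actions, f, Y, p):
    """Return P[u][x] = {y: probability}, following equation (2)."""
    P = {}
    for u in actions:
        P[u] = {}
        for x in states:
            row = defaultdict(float)
            for y_l, p_l in zip(Y, p):
                # Disturbances that lead to the same next state pool their mass
                # (the set C in equation (2)); unreachable states are simply absent,
                # which corresponds to probability 0.
                row[f(x, u, y_l)] += p_l
            P[u][x] = dict(row)
    return P

# Hypothetical toy model (not from the paper): 4 states on a ring.
states = [0, 1, 2, 3]
actions = [0, 1]            # u = 0: stay put, u = 1: move by the disturbance
Y = [0, 1, 2]               # disturbance values Y_0, Y_1, Y_2
p = [0.5, 0.25, 0.25]       # their probabilities p_0, p_1, p_2

def f(x, u, y):
    return (x + u * y) % 4

P = transition_probs(states, actions, f, Y, p)
print(P[0][2])   # every disturbance maps state 2 back to itself
print(P[1][3])   # three distinct successor states
```

With `u = 0` all disturbances lead to the same successor, so their probabilities add up to one entry; with `u = 1` each disturbance produces its own successor.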
Any stationary policy $\pi$ specifies, for each state $x \in S$, a control action $u = \pi(x)$. We abuse notation
and also write the transition probability matrix associated with policy $\pi$ as $P_\pi$, where $P_\pi(x,y) =
P_{\pi(x)}(x,y)$. Similarly, we write the column vector of immediate payoffs associated with the policy
$\pi$ as $R_\pi$, where $R_\pi(x) = R_{\pi(x)}(x)$. We are interested in solving a stochastic control problem,
which amounts to selecting a policy that maximizes the infinite-horizon discounted reward of the
form
\[
V_\pi(x_0) = E\left[ \sum_{t=0}^{\infty} \lambda^t R_\pi(x(t)) \,\middle|\, x(0) = x_0 \right],
\]
where $\lambda \in [0,1)$ is a temporal discount factor. We obtain the optimal policy by solving Bellman’s
equation,
\[
V^*(x) = \max_{u \in U_x} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l V^*(f(x,u,Y_l)) \right\}, \quad \forall x \in S, \tag{3}
\]
where $V^*(x)$ is the optimal value function (or optimal discounted payoff) starting from state $x$.
The optimal policy is then given by
\[
\pi^*(x) = \arg\max_{u \in U_x} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l V^*(f(x,u,Y_l)) \right\}, \quad \forall x \in S. \tag{4}
\]
The Bellman equation (3) can be solved using standard DP methods such as value iteration (Howard
1960) or policy iteration (Bellman 1957); however, these methods become computationally intractable
when the state space is very large. For this reason, one is interested in tractable
approximate methods that yield suboptimal solutions with some guarantees on the deviation of
the associated approximate value function from the optimal one.
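For a state space small enough to enumerate, equations (3) and (4) can be solved directly. The sketch below uses plain value iteration on a hypothetical four-state model; the model, the reward function, and the helper names (`q_value`, `value_iteration`, `greedy_policy`) are illustrative assumptions rather than anything specified in the paper.

```python
def q_value(V, x, u, f, Y, p, R, lam):
    """Right-hand side of (3) for a fixed action u."""
    return R(x, u) + lam * sum(p_l * V[f(x, u, y_l)] for y_l, p_l in zip(Y, p))

def value_iteration(states, actions, f, Y, p, R, lam, tol=1e-10):
    """Apply the Bellman operator until successive iterates differ by less than tol."""
    V = {x: 0.0 for x in states}
    while True:
        V_new = {x: max(q_value(V, x, u, f, Y, p, R, lam) for u in actions)
                 for x in states}
        if max(abs(V_new[x] - V[x]) for x in states) < tol:
            return V_new
        V = V_new

def greedy_policy(V, states, actions, f, Y, p, R, lam):
    """Policy (4): pick an action attaining the maximum in (3)."""
    return {x: max(actions, key=lambda u: q_value(V, x, u, f, Y, p, R, lam))
            for x in states}

# Hypothetical toy model (not from the paper).
states = [0, 1, 2, 3]
actions = [0, 1]
Y, p = [0, 1], [0.5, 0.5]
lam = 0.9

def f(x, u, y):
    return (x + u + y) % 4

def R(x, u):
    # Reward for being in state 3, minus a small cost for using action 1.
    return (1.0 if x == 3 else 0.0) - 0.1 * u

V_star = value_iteration(states, actions, f, Y, p, R, lam)
pi_star = greedy_policy(V_star, states, actions, f, Y, p, R, lam)
print(V_star, pi_star)
```

Each sweep costs on the order of $|S| \cdot |U| \cdot (m+1)$ operations, which is what makes this approach impractical once $|S|$ is large.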
2.1. Linear Programming Approach
In this subsection, we briefly touch upon two lemmas that we will use in the subsequent sections.
Bellman’s equation suggests that the optimal value function satisfies the following set of linear
inequalities, which we will refer to as the Bellman inequalities:
\[
V(x) \geq R_u(x) + \lambda \sum_{l=0}^{m} p_l V(f(x,u,Y_l)), \ \forall u \in U_x,\ \forall x \in S
\quad \Longleftrightarrow \quad V \geq R_u + \lambda P_u V, \ \forall u. \tag{5}
\]
Consider any integer $L \geq 1$ and, for $j = 1, 2, \ldots, L$, let $V_j$ be a vector satisfying a generalization of
the Bellman inequalities, referred to as the iterated Bellman inequalities (Wang and Boyd 2010):
\[
V_{j+1}(x) \geq R_u(x) + \lambda \sum_{l=0}^{m} p_l V_j(f(x,u,Y_l)), \quad \forall x, u, \quad \forall j = 1, 2, \ldots, L-1, \tag{6}
\]
\[
V_1(x) \geq R_u(x) + \lambda \sum_{l=0}^{m} p_l V_L(f(x,u,Y_l)), \quad \forall x, u. \tag{7}
\]
Clearly, when $L = 1$, the above system of inequalities collapses to the Bellman inequalities. The
iterated Bellman inequalities may be compactly represented as:
\[
\begin{aligned}
V_{j+1} &\geq R_u + \lambda P_u V_j, \quad \forall u, \ j = 1, 2, \ldots, L-1, \\
V_1 &\geq R_u + \lambda P_u V_L, \quad \forall u.
\end{aligned} \tag{8}
\]
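We note that the above set of inequalities has cyclic symmetry, i.e., one gets the same set of
inequalities by replacing the vectors $V_1, V_2, \ldots, V_L$ by $V_2, V_3, \ldots, V_L, V_1$, respectively. Let $\pi$ be any
stationary policy. Then we have
\[
V_{j+1} \geq R_\pi + \lambda P_\pi V_j, \quad j = 1, 2, \ldots, L-1, \tag{9}
\]
\[
V_1 \geq R_\pi + \lambda P_\pi V_L. \tag{10}
\]
By recursively applying (9) to $V_L, V_{L-1}, \ldots$ in (10), and using the fact that $P_\pi$ has nonnegative entries, we get
\[
[I - \lambda^L P_\pi^L] V_1 \geq [I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1}] R_\pi, \quad \forall \pi.
\]
By cyclic symmetry, every $V_j$, $j = 2, 3, \ldots, L$, also satisfies the above inequality.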
Lemma 1. Let the vector $V$ satisfy the following set of inequalities:
\[
[I - \lambda^L P_\pi^L] V \geq [I + \lambda P_\pi + \cdots + \lambda^{L-1} P_\pi^{L-1}] R_\pi, \quad \forall \pi. \tag{11}
\]
Then, we have $V \geq V^*$.
Remark 1. We readily see that every feasible solution of the system of inequalities (5) or (8)
is lower bounded by the optimal value function $V^*$. By cyclic symmetry, we conclude that every
feasible $V_j$, $j = 1, \ldots, L$, is also lower bounded by $V^*$.
The following result relates the optimal value function to the optimal solution of an LP with a
nonnegative cost function and constraints of the form given by the Bellman inequalities (5) or
iterated Bellman inequalities (8).
Lemma 2. Let $c$ be a vector of state-dependent weights with $c(x) \geq 0$ for every $x \in S$. Then $V^*$
minimizes the linear functional $c^T V$ among all $V$ satisfying the Bellman inequalities (5). Correspondingly,
the $L$-tuple $(V^*, \ldots, V^*)$ minimizes the linear functional $\sum_{j=1}^{L} c^T V_j$ among all $L$-tuples
$(V_1, \ldots, V_L)$ satisfying the iterated Bellman inequalities (8).
Proof of Lemma 2. The proof follows from the fact that $V \geq V^*$ and hence $c^T (V - V^*) \geq 0$.
Since $V^*$ is feasible for the inequalities (8) for any $L \geq 1$, the result follows. Similarly, since the
$L$-tuple $(V^*, \ldots, V^*)$ is feasible for (8) and since $V_j \geq V^*$ for $j = 1, 2, \ldots, L$, it readily follows that
the $L$-tuple is optimal. $\square$
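Lemmas 1 and 2 can be checked numerically on a small instance. The sketch below, written under the assumption of a hypothetical four-state model, takes one easily constructed feasible point of (5), namely the constant vector $R_{\max}/(1-\lambda)$, and confirms that it dominates $V^*$ and has a larger cost $c^T V$ for several nonnegative weight vectors. The weight vectors and all names are illustrative.

```python
# Hypothetical toy model (not from the paper).
states = [0, 1, 2, 3]
actions = [0, 1]
Y, p = [0, 1], [0.5, 0.5]
lam = 0.9

def f(x, u, y):
    return (x + u + y) % 4

def R(x, u):
    return (1.0 if x == 3 else 0.0) - 0.1 * u

def backup(V, x, u):
    """R_u(x) + lam * sum_l p_l V(f(x, u, Y_l))."""
    return R(x, u) + lam * sum(p_l * V[f(x, u, y_l)] for y_l, p_l in zip(Y, p))

def value_iteration(tol=1e-10):
    V = [0.0] * len(states)
    while True:
        V_new = [max(backup(V, x, u) for u in actions) for x in states]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new

def satisfies_bellman_inequalities(V, tol=1e-9):
    """Check V(x) >= R_u(x) + lam * sum_l p_l V(f(x,u,Y_l)) for all x, u (inequalities (5))."""
    return all(V[x] >= backup(V, x, u) - tol for x in states for u in actions)

V_star = value_iteration()

# A simple feasible point: the constant vector R_max / (1 - lam).
r_max = max(R(x, u) for x in states for u in actions)
V_feasible = [r_max / (1.0 - lam)] * len(states)

feasible = satisfies_bellman_inequalities(V_feasible)
dominates = all(V_feasible[x] >= V_star[x] - 1e-9 for x in states)

# For any nonnegative weights c, the cost of the feasible point is at least c^T V*.
weights = [[1, 1, 1, 1], [1, 0, 0, 0], [0.2, 0.0, 3.0, 0.5]]
cost_gaps = [sum(c[x] * (V_feasible[x] - V_star[x]) for x in states) for c in weights]
print(feasible, dominates, cost_gaps)
```

The same check applies to any other vector that passes `satisfies_bellman_inequalities`; the constant vector is used only because its feasibility is easy to see by hand.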
3. Bounds using Partitioning
Let the set of all states $S$ be partitioned into $M$ disjoint sets, $S_i$, $i = 1, \ldots, M$. We will call the
set $S_i$ the $i$th partition. Henceforth, we will use the following notation: if $f(x,u,Y_s)$ represents the
state the system transitions to starting from $x$ and subject to a control input $u$ and a stochastic
disturbance $Y_s$, then $\bar f(x,u,Y_s)$ represents the partition to which the final state belongs. For a given
$u$ and partition index $i$, we define the tuple $z_x^{i,u} = (\bar f(x,u,Y_0), \bar f(x,u,Y_1), \ldots, \bar f(x,u,Y_m))$ for every
$x \in S_i$. We denote by $T(i,u)$ the set of all distinct $z_x^{i,u}$ for a given partition index $i$ and control $u$.
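The notation above can be made concrete with a short sketch that computes the partition map $\bar f$ and the tuple sets $T(i,u)$. The six-state model and the three partitions are hypothetical, and partitions are indexed from 0 in the code rather than from 1 as in the text.

```python
def partition_of(partitions):
    """Map each state to the index of the partition that contains it."""
    return {x: i for i, block in enumerate(partitions) for x in block}

def successor_tuples(partitions, actions, f, Y):
    """T[(i, u)] = set of distinct tuples (f_bar(x,u,Y_0), ..., f_bar(x,u,Y_m)) over x in S_i."""
    part = partition_of(partitions)
    T = {}
    for i, block in enumerate(partitions):
        for u in actions:
            T[(i, u)] = {tuple(part[f(x, u, y_l)] for y_l in Y) for x in block}
    return T

# Hypothetical toy model (not from the paper): 6 states on a ring, 3 partitions.
partitions = [[0, 1], [2, 3], [4, 5]]
actions = [0, 1]
Y = [0, 1]

def f(x, u, y):
    return (x + u + y) % 6

T = successor_tuples(partitions, actions, f, Y)
for key in sorted(T):
    print(key, sorted(T[key]))
```

In this example each $T(i,u)$ holds two tuples, because the two states of a partition land in different partitions under the same control and disturbance; this is exactly the ambiguity that the proof of Theorem 1 has to resolve.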
3.1. Restricted Linear Program
We have, from Lemma 2, that the optimal solution to the following LP,
\[
\text{ELP} := \min \; c^T V \quad \text{subject to} \quad V \geq R_u + \lambda P_u V, \ \forall u, \tag{12}
\]
referred to as the “exact LP” in the literature, is the optimal value function $V^*$. Let us start
with restricting the exact LP by requiring further that $V(x) = v(i)$ for all $x \in S_i$, $i = 1, \ldots, M$.
Augmenting these constraints to the exact LP, one gets the following restricted LP:
\[
\text{RLP} := \min \sum_{i=1}^{M} \sum_{x \in S_i} c(x) v(i) \quad \text{subject to} \tag{13}
\]
\[
v(i) \geq R_u(x) + \lambda \sum_{l=0}^{m} p_l v(\bar f(x,u,Y_l)), \quad \forall x \in S_i, \ i = 1, \ldots, M, \ \forall u.
\]
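The restricted LP can also be written in the following compact form:
\[
\text{RLP} = \min \; c^T \Phi v \quad \text{subject to} \quad \Phi v \geq R_u + \lambda P_u \Phi v, \ \forall u, \tag{14}
\]
where the columns of $\Phi$ (commonly referred to as “basis functions” in the literature) are given by
\[
\Phi(x,i) =
\begin{cases}
1, & \text{if } x \in S_i, \\
0, & \text{otherwise,}
\end{cases}
\quad i = 1, \ldots, M. \tag{15}
\]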
The restricted LP typically deals with a much smaller number of variables, i.e., $M \ll |S|$. An
approximate value function can be constructed from every feasible solution to RLP according to
$V_{up} = \Phi v$, i.e., $V_{up}(x) = v(i)$, $\forall x \in S_i$, $i = 1, \ldots, M$. Since the approximate value function satisfies, by
construction, the Bellman inequalities (5), it is automatically an upper bound to $V^*$ by Lemma 1.
So, if $v^*$ is the optimal solution to RLP (13), then clearly $\Phi v^* \geq V^*$. Now we are ready to address
one of the main results of the paper.
Theorem 1. The optimal solution, $v^*$, to the RLP is independent of the cost vector $c$ once the
partitions are specified.
Proof of Theorem 1. The main idea behind the proof is the following: the constraints in the
restricted LP (13) do not, in general, correspond to those of a Markov Decision Process (MDP)
because the transition from one partition to another for a given control $u$ and random input $Y_l$ is
not specified unambiguously. This is because different states in the same partition can transition
to different partitions for the same $u$ and $Y_l$. If one were to think of a “random” selector for a state
in a partition, then the specification of $u$ and $Y_l$ together with the random selector specifies exactly
which partition the system would transition to next from the current partition. Let us specify the
probability of picking a state in a partition, corresponding to the random selector, via the optimal
dual variables for RLP. For a given partition index $i$, the RLP specifies a constraint on $v(i)$ for
each $x \in S_i$ and $u$. Let the dual variable corresponding to this constraint be $\mu_u^i(x) \geq 0$ and the
corresponding optimal dual variable be $\bar\mu_u^i(x)$. With this definition, we can proceed to prove the
result via the following steps:
1. We show that for every partition index $i$, there is a $u$ such that $\bar\mu_u^i(x) > 0$ for some $x \in S_i$.
This is necessary for constructing an MDP of reduced dimension in the next step; otherwise, the
corresponding value of $v(i)$ is not lower bounded.
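2. We define a reduced order MDP on the partitions with immediate reward and transition
probability given by
\[
r_u(i) = \sum_{x \in S_i} h_u^i(x) R_u(x)
\quad \text{and} \quad
\tilde P_u(i,j) :=
\begin{cases}
\sum_{x \in S_i} h_u^i(x) \sum_{y \in S_j} P_u(x,y), & \text{if } u \in U_i, \\
0, & \text{otherwise,}
\end{cases}
\]
where $u \in U_i$ if $\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. We may interpret the term
\[
h_u^i(x) = \frac{\bar\mu_u^i(x)}{\sum_{x \in S_i} \bar\mu_u^i(x)}
\]
as the probability of picking the state $x$ from the partition $S_i$.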
3. We show that the so-called “surrogate LP” obtained by aggregating the constraints of RLP
via the optimal dual variables,
\[
\text{SLP}(\bar\mu) := \min \sum_{i=1}^{M} \underbrace{\Big( \sum_{x \in S_i} c(x) \Big)}_{\bar c(i)} v(i),
\]
subject to
\[
v(i) \geq r_u(i) + \lambda \sum_{j=1}^{M} \tilde P_u(i,j) v(j), \quad \forall u \in U_i, \ i = 1, \ldots, M, \tag{16}
\]
is the exact LP corresponding to the reduced order MDP defined in step 2 above. In essence, for
a given $c$, the optimal value function of the reduced order MDP is the optimal solution of RLP.
We use the properties of surrogate duality (Greenberg and Pierskalla 1970, Glover 1975, 1968) to
demonstrate that SLP($\bar\mu$) = RLP.
4. Finally, to show that the optimal solution to RLP is independent of $c$, we note that the
constraints of SLP($\bar\mu$) are obtained by taking convex combinations of the constraints in RLP.
Hence, any feasible solution to RLP is also feasible for SLP($\bar\mu$). Since every feasible solution of the
exact LP corresponding to an MDP dominates the optimal solution (from Lemma 1), we conclude
that the optimal solutions corresponding to two different cost functions $c_1$ and $c_2$ necessarily
dominate each other and hence have to be the same. $\square$
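One way to see Theorem 1 at work on a small instance is sketched below. It does not follow the dual-variable argument of the proof and does not call an LP solver. Instead it relies on an observation that is an assumption of this sketch: the constraints of (13) read $v \geq \bar T v$, where $\bar T$ takes, for each partition, the maximum of the right-hand side over all $x \in S_i$ and $u$. Since $\bar T$ is a monotone contraction, iterating it yields the componentwise smallest feasible $v$, which then minimizes $\bar c^T v$ for every nonnegative $\bar c$. The model and all names are hypothetical.

```python
# Hypothetical toy model (not from the paper): 6 states on a ring, 3 partitions.
states = list(range(6))
partitions = [[0, 1], [2, 3], [4, 5]]
part = {x: i for i, block in enumerate(partitions) for x in block}
actions, Y, p, lam = [0, 1], [0, 1], [0.5, 0.5], 0.9

def f(x, u, y):
    return (x + u + y) % 6

def R(x, u):
    return float(x) - 0.5 * u

def backup(value_at, x, u):
    """R_u(x) + lam * sum_l p_l * value_at(f(x, u, Y_l))."""
    return R(x, u) + lam * sum(p_l * value_at(f(x, u, y_l)) for y_l, p_l in zip(Y, p))

def iterate(update, start, tol=1e-10):
    """Fixed-point iteration of a contraction, stopped when iterates differ by less than tol."""
    cur = start
    while True:
        nxt = update(cur)
        if max(abs(a - b) for a, b in zip(nxt, cur)) < tol:
            return nxt
        cur = nxt

# Exact value function V* on the full state space (Bellman equation (3)).
V_star = iterate(lambda V: [max(backup(lambda y: V[y], x, u) for u in actions)
                            for x in states], [0.0] * len(states))

# Smallest v satisfying the RLP constraints (13): tightest constraint per partition.
v_hat = iterate(lambda v: [max(backup(lambda y: v[part[y]], x, u)
                               for x in block for u in actions)
                           for block in partitions], [0.0] * len(partitions))

# v_hat is feasible for (13), tight in every partition, and Phi v_hat bounds V* from above.
feasible = all(v_hat[i] >= backup(lambda y: v_hat[part[y]], x, u) - 1e-8
               for i, block in enumerate(partitions) for x in block for u in actions)
tight = all(abs(v_hat[i] - max(backup(lambda y: v_hat[part[y]], x, u)
                               for x in block for u in actions)) < 1e-8
            for i, block in enumerate(partitions))
upper_bound = all(v_hat[part[x]] >= V_star[x] - 1e-8 for x in states)
print(v_hat, feasible, tight, upper_bound)
```

Note that the cost vector never enters the computation of `v_hat`, which is consistent with the statement of the theorem.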
We shall now establish the surrogate LP result via the following lemma with the proof provided
in the Appendix.
Lemma 3. Consider a surrogate LP for the RLP through a set of dual variables $\mu$, given by:
\[
\text{SLP}(\mu) := \min \; \bar c^T v, \quad \text{subject to}
\]
\[
\sum_{x \in S_i} \mu_u^i(x) v(i) \geq \sum_{x \in S_i} \mu_u^i(x) \left[ R_u(x) + \lambda \sum_{l=0}^{m} p_l v(\bar f(x,u,Y_l)) \right], \quad \forall u, \ i = 1, \ldots, M. \tag{17}
\]
Then, $\exists \bar\mu \geq 0$ such that SLP($\bar\mu$) = RLP, and, for every partition index $i = 1, \ldots, M$, $\exists u$ such that
$\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. Moreover, the optimal solution $v^*$ to RLP is independent of the cost vector $\bar c$, and
any other feasible solution $v$ to RLP dominates $v^*$.
Theorem 1 implies that the upper bound for the optimal value function cannot be improved
by changing the cost function from a linear to a nonlinear function or by restricting the feasible
set of RLP further, since the optimal solution of RLP is dominated by every feasible solution
of RLP. Also, $\Phi v^*$ is the least upper bound to the optimal value function $V^*$ obtainable from RLP, since any other
feasible $v$ to RLP satisfies $\Phi v \geq \Phi v^*$. Hence, a refinement of the upper bound must necessarily
involve an enlargement of the feasible set if one wants to stick to an LP formulation, i.e., it should
include the feasible set of (13) and possibly other tighter upper bounds than the optimal solution
of RLP. Lifting of variables is one candidate for such an enlargement; in this connection, we show in
the following section that neither a general lifted LP nor one obtained by including the iterated
Bellman inequalities in the constraint set improves the upper bound.
Remark 2. If one considers the suboptimal dual variables $\mu_u^i(x) = 1/|S_i|$, $\forall x \in S_i$, $\forall u$, then solving
the corresponding surrogate dual, SLP($\mu$), to obtain an approximate value function would result
in the so-called “hard aggregation” method (see Sec. 4 of Bertsekas (2007)).
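The reduced order MDP of step 2 is easy to assemble once the weights $h_u^i(x)$ are fixed. The sketch below uses the uniform weights of Remark 2, $h_u^i(x) = 1/|S_i|$, on a hypothetical six-state model. The function name `hard_aggregation` and the data are assumptions for illustration, and the optimal dual variables $\bar\mu$ of the proof are not computed here.

```python
def hard_aggregation(partitions, actions, f, Y, p, R):
    """Reduced MDP (r_u(i), P_tilde_u(i, j)) with uniform weights h = 1/|S_i|."""
    part = {x: i for i, block in enumerate(partitions) for x in block}
    M = len(partitions)
    r, P_tilde = {}, {}
    for u in actions:
        for i, block in enumerate(partitions):
            h = 1.0 / len(block)                      # uniform choice of a state in S_i
            r[(u, i)] = sum(h * R(x, u) for x in block)
            row = [0.0] * M
            for x in block:
                for y_l, p_l in zip(Y, p):
                    # Probability mass of moving from partition i to the partition of f(x,u,Y_l).
                    row[part[f(x, u, y_l)]] += h * p_l
            P_tilde[(u, i)] = row
    return r, P_tilde

# Hypothetical toy model (not from the paper).
partitions = [[0, 1], [2, 3], [4, 5]]
actions = [0, 1]
Y, p = [0, 1], [0.5, 0.5]

def f(x, u, y):
    return (x + u + y) % 6

def R(x, u):
    return float(x) - 0.5 * u

r, P_tilde = hard_aggregation(partitions, actions, f, Y, p, R)
print(r[(0, 0)], P_tilde[(0, 0)])
```

Because every weight is positive, every control belongs to $U_i$ in this construction, and each row of $\tilde P_u$ sums to one, so the result is a genuine MDP on $M$ states that can be solved by any standard method.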
Remark 3. When $\mu$ and $\Phi$ are allowed to have arbitrary positive entries satisfying
$\sum_{x=1}^{|S|} \mu_u^i(x) = 1$, $\forall i \in \{1, \ldots, M\}$, and $\sum_{j=1}^{M} \Phi(y,j) = 1$, $\forall y \in S$, the method is referred to as “soft aggregation”
(Singh et al. 1995). Unfortunately, in this case, the optimal solution to the restricted LP formulation
(14) has been shown to be dependent on the cost function (De Farias and Van Roy 2003).
3.2. Lifted Restricted Linear Programs
It may appear that we can get tighter upper bounds than those provided by the RLP by considering
either lifted LPs whose feasible set is larger than that of RLP or LPs with a different objective
function. We will show, in this section, that unfortunately this is not the case. In general, one can
construct a lifted LP of the form:
\[
\text{LLP} := \min \; \bar c^T v + d^T z,
\]
subject to
\[
V(x) \geq R_u(x) + \lambda \sum_{l=0}^{m} p_l V(f(x,u,Y_l)), \quad \forall x, u, \tag{18}
\]
\[
V(x) = v(i), \quad \forall x \in S_i, \ i = 1, \ldots, M, \tag{19}
\]
\[
z \geq 0,
\]
where $z$ is the additional vector of variables used in lifting so that the feasible set is not empty. Then,
it follows that if $(\tilde v, \tilde z)$ is optimal to LLP, then $\tilde v$ will be a feasible solution to RLP. Consequently,
$\Phi \tilde v \geq \Phi v^*$, where $v^*$ is the optimal solution of the RLP. In other words, one gets no better bound
via lifting if the constraints (18) and (19) are included. One could also use the iterated Bellman
inequalities (8) for constructing a lifted LP of the form:
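\[
\text{IB} := \min \sum_{j=1}^{L} \bar c^T v_j,
\]
subject to
\[
v_{j+1}(i) \geq R_u(x) + \lambda \sum_{l=0}^{m} p_l v_j(\bar f(x,u,Y_l)), \quad \forall x \in S_i, \ \forall i, u, \ j = 1, \ldots, L-1, \tag{20}
\]
\[
v_1(i) \geq R_u(x) + \lambda \sum_{l=0}^{m} p_l v_L(\bar f(x,u,Y_l)), \quad \forall x \in S_i, \ \forall i, u. \tag{21}
\]
Again, it turns out that the above lifted LP is incapable of providing a better bound, as can be
seen from the following result.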
Theorem 2. If $v_{IB} = (v_1, \ldots, v_L)$ is a feasible solution to IB, then $v_j \geq v^*$ for $j = 1, \ldots, L$, where
$v^*$ is the optimal solution to RLP.
The proof for Theorem 2 follows along the lines of Lemma 3. We will construct a surrogate LP
for the lifted LP (20) with the optimal dual variables of RLP. We immediately recognize that the
inequalities defining the surrogate LP are, in fact, the iterated Bellman inequalities associated with
the reduced order MDP defined in step 2 of the proof of Theorem 1. So, the result follows from
Lemma 2 and Remark 1.
Proof of Theorem 2. Let $\bar\mu$ be the optimal dual variables to RLP (13). From Lemma 3, for
every partition index $i \in \{1, \ldots, M\}$, there exists a $u$ such that $\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. For a fixed $i$ and
$u$, we multiply the inequalities (20) and (21) associated with a particular $x \in S_i$ by $\bar\mu_u^i(x)$ and sum
over all the $x \in S_i$. Then, we get the following surrogate LP:
\[
\text{SIB} := \min \sum_{j=1}^{L} \bar c^T v_j,
\]
subject to
\[
v_{j+1}(i) \geq r_u(i) + \lambda \sum_{x \in S_i} h_u^i(x) \sum_{l=0}^{m} p_l v_j(\bar f(x,u,Y_l)), \quad \forall u \in U_i, \ \forall i, \ j = 1, \ldots, L-1, \tag{22}
\]
\[
v_1(i) \geq r_u(i) + \lambda \sum_{x \in S_i} h_u^i(x) \sum_{l=0}^{m} p_l v_L(\bar f(x,u,Y_l)), \quad \forall u \in U_i, \ \forall i, \tag{23}
\]
where $u \in U_i$ if $\sum_{x \in S_i} \bar\mu_u^i(x) > 0$. As before, the one-step reward function is
\[
r_u(i) = \frac{\sum_{x \in S_i} \bar\mu_u^i(x) R_u(x)}{\sum_{x \in S_i} \bar\mu_u^i(x)},
\quad \text{where} \quad
h_u^i(x) = \frac{\bar\mu_u^i(x)}{\sum_{x \in S_i} \bar\mu_u^i(x)}, \quad \forall u \in U_i.
\]
By Lemma 2, the optimal solution to SIB is of the form $v^*_{SIB} = (v^*, \ldots, v^*)$, where $v^*$ is the optimal
solution to SLP($\bar\mu$) (and by Lemma 3, also the optimal solution to RLP). Since any feasible
solution to IB, $v_{IB} = (v_1, \ldots, v_L)$, is also feasible to SIB, it follows, from Lemma 1, that $v_j \geq v^*$
for every $j = 1, \ldots, L$. $\square$
So, we conclude that lifting through the use of iterated Bellman inequalities does not help in
finding a tighter upper bound than the RLP optimal solution. Likewise, replacing the linear objective
with a nonlinear one will not improve the upper bound as long as the iterated Bellman inequalities
(20) and (21) are included in the constraint set. In the next section, we focus our attention on the
construction of a lower bound for the optimal value function.
3.3. Lower Bound for the Optimal Value Function
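For any candidate approximate value function $\tilde V$, one can construct a suboptimal “greedy” policy
according to:
\[
\tilde\pi(x) = \arg\max_{u} \left\{ R_u(x) + \lambda \sum_{y} P_u(x,y) \tilde V(y) \right\}, \quad \forall x \in S.
\]
Let us define the improvement in value function, $\tilde\alpha(x) := R_{\tilde\pi}(x) + \lambda \sum_{y} P_{\tilde\pi}(x,y) \tilde V(y) - \tilde V(x)$. Note
that there is no improvement, i.e., $\tilde\alpha \equiv 0$, when $\tilde V = V^*$. The expected discounted payoff, $V_{\tilde\pi}$,
corresponding to the suboptimal policy $\tilde\pi$, satisfies the following bound (Porteus 1975):
\[
\tilde V(x) + \frac{1}{1-\lambda} \min_{y} \tilde\alpha(y) \leq V_{\tilde\pi}(x) \leq V^*(x), \quad \forall x \in S.
\]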
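The bound above can be checked numerically on a small model. In the sketch below, the candidate $\tilde V$ is an arbitrary vector chosen for illustration, the model is hypothetical, and $V_{\tilde\pi}$ is obtained by iterating the policy's own Bellman operator rather than by solving the linear system, which is a simplification suitable only for tiny state spaces.

```python
# Hypothetical toy model (not from the paper).
states = [0, 1, 2, 3]
actions, Y, p, lam = [0, 1], [0, 1], [0.5, 0.5], 0.9

def f(x, u, y):
    return (x + u + y) % 4

def R(x, u):
    return (1.0 if x == 3 else 0.0) - 0.1 * u

def backup(V, x, u):
    """R_u(x) + lam * sum_y P_u(x, y) V(y), written in terms of the dynamics f."""
    return R(x, u) + lam * sum(p_l * V[f(x, u, y_l)] for y_l, p_l in zip(Y, p))

def iterate(update, start, tol=1e-10):
    cur = start
    while True:
        nxt = update(cur)
        if max(abs(a - b) for a, b in zip(nxt, cur)) < tol:
            return nxt
        cur = nxt

V_tilde = [0.0, 1.0, 2.0, 3.0]        # arbitrary candidate value function (assumption)

# Greedy policy with respect to V_tilde and the resulting improvement alpha.
pi = [max(actions, key=lambda u: backup(V_tilde, x, u)) for x in states]
alpha = [backup(V_tilde, x, pi[x]) - V_tilde[x] for x in states]

# Payoff of the greedy policy (policy evaluation) and the optimal value function.
V_pi = iterate(lambda V: [backup(V, x, pi[x]) for x in states], list(V_tilde))
V_star = iterate(lambda V: [max(backup(V, x, u) for u in actions) for x in states],
                 list(V_tilde))

lower = [V_tilde[x] + min(alpha) / (1.0 - lam) for x in states]
print(lower, V_pi, V_star)
```

When $\min_y \tilde\alpha(y)$ is strongly negative, the left-hand side can fall far below $V_{\tilde\pi}$, which illustrates how loose the a priori part of the bound may be for a poor candidate $\tilde V$.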
In our experience, the lower bound to the optimal value function provided by $V_{\tilde\pi}$ is very conservative.
Also, computation of $V_{\tilde\pi}$ involves solving a linear system of equations of size $|S|$, which would be
expensive for a large state space. So, we construct a novel alternate lower bound as follows. Recall
that for each $x \in S_i$, $V^*(x)$ satisfies the Bellman inequality (5):
\[
\begin{aligned}
V^*(x) &\geq R_u(x) + \lambda \sum_{l=0}^{m} p_l V^*(f(x,u,Y_l)), \quad \forall u, \\
&\geq R_u(x) + \lambda \sum_{l=0}^{m} p_l \min_{y \in S_{\bar f(x,u,Y_l)}} V^*(y), \quad \forall u.
\end{aligned} \tag{24}
\]
Let $\bar w(i) := \min_{x \in S_i} V^*(x)$, $i = 1, \ldots, M$. Then, it follows from (24) that
\[
\bar w(i) \geq \min_{x \in S_i} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l \bar w(\bar f(x,u,Y_l)) \right\}, \quad \forall u, \ i = 1, \ldots, M. \tag{25}
\]
The above set of inequalities motivates the following nonlinear program:
\[
\text{NLP} := \min \; \bar c^T w,
\]
subject to
\[
w(i) \geq \min_{x \in S_i} \left\{ R_u(x) + \lambda \sum_{l=0}^{m} p_l w(\bar f(x,u,Y_l)) \right\}, \quad \forall u, \ i = 1, \ldots, M. \tag{26}
\]
Let $w^*$ be the optimal solution to NLP. By construction, we see that $\bar w$ is a feasible solution to
the NLP and hence,
\[
\bar c^T w^* \leq \bar c^T \bar w = \sum_{i=1}^{M} \bar c(i) \min_{x \in S_i} V^*(x).
\]
So, by choosing $\bar c(i) = 1$ and $\bar c(j) = 0$ for all $j \neq i$, one can obtain a lower bound to the optimal
value function for all the states in the $i$th partition. Moreover, if the problem under consideration
exhibits a special structure, one can show that NLP collapses to an LP that can be efficiently
solved. The perimeter patrol problem considered herein exhibits such a structure; we demonstrate
this in the next section.
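The construction of the lower bound can be illustrated on a small instance. The sketch below first verifies that $\bar w(i) = \min_{x \in S_i} V^*(x)$ satisfies (25). It then computes a vector satisfying (26) without knowledge of $V^*$ by iterating the max-min operator $w(i) \leftarrow \max_u \min_{x \in S_i} \{\cdot\}$. This iteration is an assumption of the sketch and is not the disjunctive LP procedure discussed in the paper; it is used here because the operator is a monotone contraction whose fixed point lies below every vector satisfying (26), including $\bar w$. The model and all names are hypothetical.

```python
# Hypothetical toy model (not from the paper): 6 states on a ring, 3 partitions.
states = list(range(6))
partitions = [[0, 1], [2, 3], [4, 5]]
part = {x: i for i, block in enumerate(partitions) for x in block}
actions, Y, p, lam = [0, 1], [0, 1], [0.5, 0.5], 0.9

def f(x, u, y):
    return (x + u + y) % 6

def R(x, u):
    return float(x) - 0.5 * u

def backup(value_at, x, u):
    """R_u(x) + lam * sum_l p_l * value_at(f(x, u, Y_l))."""
    return R(x, u) + lam * sum(p_l * value_at(f(x, u, y_l)) for y_l, p_l in zip(Y, p))

def iterate(update, start, tol=1e-10):
    cur = start
    while True:
        nxt = update(cur)
        if max(abs(a - b) for a, b in zip(nxt, cur)) < tol:
            return nxt
        cur = nxt

# Exact value function, used here only to check the bound.
V_star = iterate(lambda V: [max(backup(lambda y: V[y], x, u) for u in actions)
                            for x in states], [0.0] * len(states))

# w_bar(i) = min over the partition of V*, and a check of inequality (25).
w_bar = [min(V_star[x] for x in block) for block in partitions]
satisfies_25 = all(
    w_bar[i] >= min(backup(lambda y: w_bar[part[y]], x, u) for x in block) - 1e-8
    for i, block in enumerate(partitions) for u in actions)

# Fixed point of the max-min operator: satisfies (26) and needs no knowledge of V*.
w_hat = iterate(lambda w: [max(min(backup(lambda y: w[part[y]], x, u) for x in block)
                               for u in actions) for block in partitions],
                [0.0] * len(partitions))
print(w_hat, w_bar, satisfies_25)
```

For each partition, `w_hat[i]` is then a lower bound on $V^*(x)$ for every $x \in S_i$, computed on the aggregated state space only.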
Remark 4. The NLP is referred to as a disjunctive linear program (Balas 1979), and the optimal
solution to NLP is the solution that minimizes the same linear objective function over the convex
hull of the feasible solutions of NLP. Balas (1998) provides two methods to solve the problem: one
through a lifted representation for the convex hull of the feasible set of NLP and the other through
a cutting plane technique. The number of lifted variables is $O(M^2 |U|)$; if $M = 10{,}000$, then
one must deal with a lifted LP with on the order of 100 million variables. The original (nonaggregated) LP has
about 10 million variables and hence, the lifted representation method is not practical. For this
reason, the cutting plane technique is the more viable alternative.
Remark 5. The lower bound provided by NLP is a nontrivial one because the optimal solution
is the optimal value function of a reduced order MDP. Hence, the lower bound will be better
than at least the value function associated with some suboptimal policy and so, is nontrivial and
nonconservative.
Remark 6. While $S_i$ may have a lot of states, the number of entries on the right-hand side of the
nonlinear constraint (26) over which the minimization must be carried out is the cardinality of
$T(i,u)$. NLP is combinatorial in nature, in the sense that one must pick one $(m+1)$-tuple for each
$i$ and $u$ over which the optimization must be carried out. However, for each choice of $(m+1)$-tuples,
one obtains an MDP. So, the system of inequalities (26) describes a family of underlying MDPs.
4. Perimeter Patrol Problem
The perimeter patrol problem arose from the Cooperative Operations in Urban Terrain
(COUNTER) project at AFRL (Gross et al. 2006). In this problem, there is a perimeter which must