Approximating Optimal Policies for Agents with Limited Execution Resources
Dmitri A. Dolgov and Edmund H. Durfee
Department of Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, MI 48109
{ddolgov, durfee}@umich.edu
Abstract
An agent with limited consumable execution re-
sources needs policies that attempt to achieve good
performance while respecting these limitations.
Otherwise, an agent (such as a plane) might fail
catastrophically (crash) when it runs out of re-
sources (fuel) at the wrong time (in midair). We
present a new approach to constructing policies for
agents with limited execution resources that builds
on principles of real-time AI, as well as research
in constrained Markov decision processes. Specif-
ically, we formulate, solve, and analyze the pol-
icy optimization problem where constraints are im-
posed on the probability of exceeding the resource
limits. We describe and empirically evaluate our
solution technique to show that it is computation-
ally reasonable, and that it generates policies that
sacrifice some potential reward in order to make the
kinds of precise guarantees about the probability of
resource overutilization that are crucial for mission-
critical applications.
1 Introduction
Optimality is the gold standard in rational decision making
(e.g., [Russell and Subramanian, 1995]), and, consequently,
the problem of constructing optimal policies for autonomous
agents has received considerable attention over the years.
Traditionally, this problem has been viewed separately from
the problem of actually carrying out the policies. However,
real agents have limitations as to what they can execute, and,
clearly, a policy is less useful if an agent might run out of
resources while carrying out the policy.
In this paper, we present a new approach for constructing policies for agents
that have limited consumable resources where running out of the resources can
have negative consequences. Whereas AI research has mostly focused on (PO)MDP
[Boutilier et al., 1999; Dean et al., 1993; Howard, 1960; Puterman, 1995]
methods for formulating policies for agents without emphasizing constraints on
their execution resources, the Operations Research literature has developed
constrained MDP (CMDP) [Altman, 1999; Puterman, 1995] techniques that can
account for resource constraints. CMDP methods are particularly useful for domains
where the current resource amounts are unobservable and
cannot be easily estimated by the agent, or where modeling
resource amounts in the state description is computationally
infeasible. In an aircraft scenario, some resources and situations where such
methods are beneficial could include an airplane with a broken fuel gauge (fuel
amount is unobservable), pilot fatigue (attention is a resource that cannot be
easily estimated), or a combination of non-critical resources (e.g., various
refreshments) that should nevertheless not be exhausted,
but explicit modeling of which unnecessarily complicates the
optimization problem and increases policy size. In the rest of
the paper, we will use the fuel example, simply because it is
a very intuitive instance of a consumable resource.
However, as we will explain later, standard risk-neutral
CMDP optimization techniques are not applicable to prob-
lems where violating the constraints can have negative or, in
the limit, catastrophic^1 consequences. The main contribution
of this work is that it extends the standard CMDP techniques
to handle the types of hard constraints that naturally arise in
problems involving critical resources. In particular, we formulate an
optimization problem where constraints are imposed on the probability of
resource overutilization, and show
how the problem can be solved using standard linear pro-
gramming algorithms. Our formulation yields sub-optimal
solutions to the constrained problem because it sacrifices po-
tential reward to make guarantees about the probability of
violating the resource constraints. As we later show, when
violating constraints incurs dire costs, these guarantees are
worthwhile ([Musliner et al., 1995] and references therein).
We introduce our model in section 2, where we review
Markov models, introduce notation, and specify our assump-
tions about the problem domain. Section 3 describes and
compares several candidate solutions for addressing aspects
of our problem. We then present in section 4 our new method,
and empirically evaluate it in section 5. We conclude with a
discussion about the strengths and limitations of our results,
and about our future directions.
2 The Model
The stochastic properties of the environments in the problems that we are
addressing lead us to formulate our optimization problem as a stationary,
discrete-time, Markov decision process. In this section, we review some
well-known facts from the theory of standard [Puterman, 1995; Boutilier et al.,
1999] and constrained [Altman, 1999] fully-observable Markov decision processes
and also discuss the assumptions that are particular to the class of problems
that we address in this work. This section provides the necessary background
for the subsequent sections, where we discuss resource-constrained optimization
problems.
^1 The term catastrophic is, of course, relative. We assume that the
system designer is willing to accept certain risks to receive payoff.
RESOURCE-BOUNDED REASONING
2.1 Markov Decision Processes
A standard Markov decision process can be defined as a tuple $(S, A, P, R)$,
where $S$ is a finite set of states that the agent can be in, $A$ is a finite
set of actions that the agent can execute, $P = [p^a_{ij}] : S \times A \times S
\to [0, 1]$ defines the transition function ($p^a_{ij}$ is the probability that
the agent will go to state $j$ if it executes action $a$ in state $i$), and
$R = [r_{ia}] : S \times A \to \mathbb{R}$ is the reward function (agent
receives a reward of $r_{ia}$ for executing action $a$ in state $i$).
Clearly, the total probability of transitioning out of a state, given a
particular action, cannot be greater than 1, i.e. $\sum_j p^a_{ij} \le 1$. As we
discuss below, we are actually interested in domains where there exist states
for which $\sum_j p^a_{ij} < 1$.
A policy is defined as a procedure for selecting an action in each state. A
policy that makes its choices according to a probability distribution over the
set of actions is called randomized and can be described as a mapping of states
to probability distributions over actions: $\pi = [\pi_{ia}] : S \times A \to
[0, 1]$. A deterministic policy that always chooses the same action for a state
is, of course, a special case of a randomized policy. It can be seen (similarly
to the case of standard CMDPs, as described in [Kallenberg, 1983; Puterman,
1995]) that under our optimization criterion and constraints (section 4),
deterministic policies can be suboptimal. Therefore, in this work, we focus on
approximating optimal policies in the class of randomized ones.
If, at time 0, the agent has an initial probability distribution
$\alpha = [\alpha_i]$ over the state space, a Markov system follows the
following trajectory:
$$\rho_j(t+1) = \sum_{i,a} \rho_i(t)\, \pi_{ia}\, p^a_{ij}, \qquad \rho_i(0) = \alpha_i \qquad (1)$$
where $\rho(t) = [\rho_i(t)]$ is the probability distribution at time $t$.
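The trajectory described above can be illustrated with a minimal sketch. It assumes a one-step update of the form "new mass in state j = sum over (i, a) of current mass in i, times the policy's probability of a in i, times the transition probability to j". The function name `step_distribution` and the two-state model are hypothetical and chosen only for illustration.

```python
from typing import List

def step_distribution(
    rho: List[float],
    policy: List[List[float]],
    trans: List[List[List[float]]],
) -> List[float]:
    """One step: rho_next[j] = sum over i, a of rho[i] * policy[i][a] * trans[i][a][j]."""
    n = len(rho)
    nxt = [0.0] * n
    for i in range(n):
        for a, pi_ia in enumerate(policy[i]):
            for j in range(n):
                nxt[j] += rho[i] * pi_ia * trans[i][a][j]
    return nxt

# Hypothetical 2-state, 2-action model. Rows of trans[i][a] may sum to less
# than 1; the missing mass is the probability of leaving the system.
TRANS = [
    [[0.5, 0.4], [0.0, 0.9]],  # state 0: action 0, action 1
    [[0.0, 0.0], [0.0, 0.0]],  # state 1: all probability mass leaves
]
POLICY = [[0.5, 0.5], [1.0, 0.0]]  # randomized choice in state 0
ALPHA = [1.0, 0.0]                 # initial distribution

rho1 = step_distribution(ALPHA, POLICY, TRANS)
rho2 = step_distribution(rho1, POLICY, TRANS)
print(rho1, sum(rho1))
print(rho2, sum(rho2))
```

The total mass shrinks from step to step, which is the "leakage" of probability that the next subsection relies on.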
2.2 Assumptions
Typically, Markov decision processes are divided into two categories:
finite-horizon problems, where the total number of steps that the agent spends
in the system is finite and is known a priori, and infinite-horizon problems,
where the agent is assumed to stay in the system forever (see [Puterman, 1995]
for a detailed discussion of both types of models).
In this work we concentrate on dynamic real-time domains, where agents have
tasks to accomplish. For example, consider an agent flying a plane, whose goal
is to safely get to its destination and land there. This example does not
naturally correspond to a finite-horizon problem, because the duration of
executing various policies is not predetermined (unless we artificially impose
such a finite duration, which is not easily justifiable). On the other hand, the
problem does not naturally fit the definition of the infinite-horizon model,
because the plane obviously cannot keep on flying forever.
This leads us to make a slightly different and less common (although,
certainly, not novel) assumption about how much time the agent spends executing
its policy. We assume that there is no predefined number of steps that the
agent spends in the system, but that optimal policies always yield transient
Markov processes (decision problems of this type were extensively studied by
Kallenberg [1983]). A policy is said to yield a transient Markov process if the
agent executing that policy will eventually leave the corresponding Markov
chain, after spending a finite number of time steps in it. Given a finite state
space, this assumption implies that there has to be some "leakage" of
probability out of the system, i.e. there have to exist some state-action pairs
$(i, a)$ for which $\sum_j p^a_{ij} < 1$.
One particular case where the above assumption holds is in a system in which
all trajectories lead to absorbing states.^2 Once an agent enters an absorbing
state, it has finished (or failed to finish) some task and has nowhere else to
go, i.e. the probability of transitioning out of an absorbing state $i$ is
zero: $\sum_j p^a_{ij} = 0$ for all $a$. In the plane-flying example, all
trajectories lead to either a safe landing or a crash, and once the agent
enters one of these states, the probability of transitioning to other states is
zero.
The transient nature of our problems leads us to adopt the expected total
reward as the policy evaluation criterion. Given that an agent receives a
reward whenever it executes an action, the total expected utility of a policy
can be expressed as:
$$U(\pi, \alpha) = \sum_{t=0}^{T} \sum_{i,a} \rho_i(t)\, \pi_{ia}\, r_{ia} \qquad (2)$$
where $T$ is the number of steps during which the agent accumulates utility.
For a transient system with bounded rewards, the above sum converges for any $T$.
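As a rough illustration of the expected total reward criterion, the sketch below accumulates the expected number of times each action is executed in each state (called `x[i][a]` here, an assumed name) and sums rewards weighted by these counts. The functions `occupancy` and `total_reward` and the two-state "flying / landed" model are hypothetical. The loop simply iterates the chain until the remaining probability mass is negligible, which is adequate only for small transient models.

```python
from typing import List

def occupancy(alpha, policy, trans, tol=1e-12, max_steps=100_000) -> List[List[float]]:
    """Accumulate x[i][a] = sum over t of rho_i(t) * policy[i][a] for a transient chain."""
    n = len(alpha)
    rho = list(alpha)
    x = [[0.0] * len(policy[i]) for i in range(n)]
    for _ in range(max_steps):
        if sum(rho) < tol:  # almost all probability has left the system
            break
        nxt = [0.0] * n
        for i in range(n):
            for a, pi_ia in enumerate(policy[i]):
                flow = rho[i] * pi_ia
                x[i][a] += flow
                for j in range(n):
                    nxt[j] += flow * trans[i][a][j]
        rho = nxt
    return x

def total_reward(x, rewards) -> float:
    """Expected total reward: sum over i, a of x[i][a] * rewards[i][a]."""
    return sum(x[i][a] * rewards[i][a] for i in range(len(x)) for a in range(len(x[i])))

# Hypothetical model: state 0 = flying (stays with prob 0.9, lands with 0.1),
# state 1 = landed (the agent acts once, then leaves the system).
TRANS = [[[0.9, 0.1]], [[0.0, 0.0]]]
REWARDS = [[1.0], [5.0]]
POLICY = [[1.0], [1.0]]
ALPHA = [1.0, 0.0]

x = occupancy(ALPHA, POLICY, TRANS)
u = total_reward(x, REWARDS)
print(x, u)
```

In this toy model the agent is expected to act about 10 times while flying and once after landing, so the expected total reward is close to 15.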
3 Related Work
In this section, we briefly survey several approaches (based
on well-known techniques) to solving the problem of finding
optimal policies for agents with limited resources, and point
out the assumptions, strengths, and limitations of these meth-
ods. This establishes a landscape of solution algorithms, to
which we can compare our method (presented in the next sec-
tion) in terms of complexity, efficiency, and solution quality.
3.1 Fully Observable MDPs
The most straightforward way of handling resource constraints in an MDP
framework is to explicitly model the resources by making their current amounts
a part of the state description. This yields a standard FOMDP, which can be
solved by a wide variety of efficient methods. The benefit of this approach is
that it allows one to make use of all relevant information to construct the
best possible policy, as the agent is able to base its action choice on the
current state and resource amounts. The downside of this approach is that it
requires an a priori discretization of resource amounts to be made when the
world model is constructed. Also, there is an additional burden of specifying
the rewards and state transitions as functions of current resource amounts.
Furthermore, in this model, the size of the state space and, consequently, the
policy size, explodes exponentially in the number of resources, as compared to
the state space where the resource amounts are not folded into the state
description.
^2 An alternative way of handling absorbing states is to treat them as
infinitely-recurrent, i.e. once the agent gets there, it infinitely
transitions back to itself. We do not adopt this model, because it is less
natural for our domains and also leads to unnecessary complications
in the optimization problems.
The FOMDP approach relies on the fact that the agent can
observe the exact amounts of all resources at runtime. How-
ever, this may not hold, especially in multiagent domains with
shared resources, when an agent does not know what other
agents have been doing and how much of the shared resources
they have been consuming.
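The growth of the state space under the FOMDP approach can be made concrete with a small calculation. The sketch assumes that every resource is discretized into the same number of levels and that each combination of levels is paired with each original state. The function name `augmented_state_count` and the figure of 100 levels are hypothetical.

```python
def augmented_state_count(num_states: int, levels_per_resource: int, num_resources: int) -> int:
    """Number of states once discretized resource amounts are folded into the state."""
    return num_states * levels_per_resource ** num_resources

BASE_STATES = 20  # size of the original state space (illustrative)
LEVELS = 100      # assumed discretization granularity per resource

counts = {k: augmented_state_count(BASE_STATES, LEVELS, k) for k in (1, 2, 3)}
for k, count in counts.items():
    print(f"{k} resource(s): {count} states")
```

Each additional resource multiplies the state count by the number of levels, which is the exponential dependence noted above.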
3.2 Constrained MDPs
An alternative to the FOMDP approach described above is to treat the problem as
a constrained Markov decision process [Altman, 1999], where the resources are
not explicitly modeled, but rather are treated as constraints that are imposed
on the space of feasible solutions (policies).
A constrained MDP for a resource-limited agent differs from a standard MDP in
that the agent, besides getting rewards for executing actions, also incurs
resource costs. Consequently, a constrained MDP (CMDP) can be described as a
tuple $(S, A, P, R, C, Q)$, where $C = [c_{iak}]$ defines a vector of actions'
costs ($c_{iak}$ units of resource $k$ are used when action $a$ is executed in
state $i$), and $Q = [q_k]$ is the vector of amounts of available resources
(there are $q_k$ units of resource $k$ initially available).
The benefit of the CMDP is that it does not require one to explicitly model how
the resources affect the world. Indeed, if all policies satisfy the resource
constraints, one does not have to worry about what happens when the resources
are overutilized. Consequently, the state space and the resulting policies are
exponentially smaller than the ones in the FOMDP model. The standard CMDP
formulation constrains the expected usage of all resources to be below a
certain limit and can be formulated as the following linear program:
$$\begin{aligned}
\max_{x} \quad & \sum_{i,a} x_{ia} r_{ia} \\
\text{subject to} \quad & \sum_{a} x_{ja} - \sum_{i,a} x_{ia} p^a_{ij} = \alpha_j \quad \forall j, \\
& \sum_{i,a} x_{ia} c_{iak} \le q_k \quad \forall k, \\
& x_{ia} \ge 0 \quad \forall i, a,
\end{aligned} \qquad (3)$$
where the variables $x_{ia}$ have the interpretation of the expected number of
times action $a$ is executed in state $i$.
A weakness of this approach is that it yields policies that can be suboptimal,
as compared to the ones constructed by the FOMDP method, because the agent does
not base its decision on the current resource amounts, but rather completely
ignores that aspect of the state. However, as mentioned in the previous
section, in domains where the resources are not observable, or if the policy
size is of vital importance (for example if the agent's architecture imposes
constraints on policies that it can store), this approach could prove fruitful.
However, the biggest problem with this approach (as pointed out, for example,
by Ross and Chen in the telecommunication domain [1988]) is that a standard
CMDP imposes constraints on the expected amounts. Clearly, this method does not
work for critical resources, whose overutilization has negative consequences.
Indeed, an agent that pilots aircraft would not be satisfied with a policy that
on average uses an amount of fuel that does not exceed the tank capacity.
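A small numeric example may help show why a constraint on expected usage is not enough for a critical resource. The distribution of fuel use below is entirely hypothetical. It is chosen so that the average stays under the tank capacity while a substantial fraction of runs would still run dry.

```python
from fractions import Fraction

# Hypothetical distribution of total fuel use under some policy:
# (amount used, probability of that outcome).
USAGE = [(150, Fraction(3, 5)), (320, Fraction(2, 5))]
CAPACITY = 250  # tank capacity (illustrative)

expected_usage = sum(amount * prob for amount, prob in USAGE)
overuse_prob = sum(prob for amount, prob in USAGE if amount > CAPACITY)

print("expected usage:", expected_usage)
print("satisfies expected-usage constraint:", expected_usage <= CAPACITY)
print("probability of exceeding capacity:", overuse_prob)
```

The policy passes a constraint on the mean (218 units against a capacity of 250) but exceeds the capacity in two runs out of five.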
3.3 Sample Path MDPs
As just mentioned, Ross and Chen pointed out the weakness of the CMDP approach
with constraints on the expected amounts. As a possible solution, Ross and
Varadarajan [1989; 1991] propose an approach where constraints are placed on
the actual sample-path costs. In their work, the space of feasible solutions is
constrained to the set of policies whose probability of violating the
constraints (overutilizing the resources) in the long run is zero. However,
their work concentrates on the average per-step costs and rewards, whereas we
are interested in the total amounts, whose distributions are not easily
calculable. The approach of Ross and Varadarajan has the same benefits as the
standard CMDP method, i.e. no explicit modeling of resources is required, and
the state space and policies are small. In addition, unlike the standard CMDP,
this method is suitable for problems with critical resources. However, a
weakness of this approach is that for some problems it might be too
restrictive, in that it allows no possibility of overutilizing the resources.
Indeed, policies produced by the sample path method might have significantly
lower payoff, as compared to policies that have some probability of resource
overutilization. Furthermore, for some domains it might be desirable for the
system designer to be able to control the probability of resource
overutilization, as a means of balancing optimality and risk.
3.4 MDPs With Constraints on Variance
Another approach to handling deviations from the expecta-
tion in Markov models is to impose additional constraints on
(or to assign additional cost to) the variance. Sobel [1985]
proposed to constrain the expected cost and to maximize the
mean-variance ratio of the reward. Huang and Kallenberg
[1994] proposed a unified approach to handling variance via
an algorithm based on parametric linear programming. These
approaches have the same benefits as the standard CMDP and
the sample-path methods (compared to the FOMDP formula-
tion) in terms of state space and policy size, as well as the
complexity of constructing the initial world model. Addi-
tionally, they allow one to somewhat balance payoff and the
deviation from the expected. However, these methods do not
allow one to make hard guarantees about the probability of
overutilizing the resources.
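4 Linear Approximation
As hinted at in the previous sections, we would like to be able to constrain
the feasible solution space to the set of policies whose probability of
overutilizing the resources is below a user-specified threshold $p_0$. In other
words, we would like to be able to solve the following math program:
$$\begin{aligned}
\max_{x} \quad & \sum_{i,a} x_{ia} r_{ia} \\
\text{subject to} \quad & \sum_{a} x_{ja} - \sum_{i,a} x_{ia} p^a_{ij} = \alpha_j \quad \forall j, \\
& P(C_k > q_k) \le p_0 \quad \forall k, \\
& x_{ia} \ge 0 \quad \forall i, a,
\end{aligned} \qquad (4)$$
where $C_k$ is the total amount of resource $k$ that is used by the policy, and
$q_k$ is the upper bound on that resource (as before).
The trouble is that the optimization is in the space of $x_{ia}$, which can be
interpreted as the expected number of times for executing the actions in the
corresponding states. However, it is difficult to express $P(C_k > q_k)$ as a
simple function of the optimization variables $x_{ia}$, because the latter
contain no information about the probability distribution of the random
variable of the number of times the action is actually executed in the
corresponding state - only the expected values.^3 In this section, we present a
linear approximation to the above program, based on the Markov inequality:^4
$$P(C_k \ge q_k) \le \frac{E[C_k]}{q_k} \qquad (5)$$
Using this inequality and the fact that the expected resource usage can be
expressed as $E[C_k] = \sum_{i,a} x_{ia} c_{iak}$, the optimization problem can
be formulated as a linear program:
$$\begin{aligned}
\max_{x} \quad & \sum_{i,a} x_{ia} r_{ia} \\
\text{subject to} \quad & \sum_{a} x_{ja} - \sum_{i,a} x_{ia} p^a_{ij} = \alpha_j \quad \forall j, \\
& \sum_{i,a} x_{ia} c_{iak} \le p_0\, q_k \quad \forall k, \\
& x_{ia} \ge 0 \quad \forall i, a.
\end{aligned} \qquad (6)$$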
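The step from the probabilistic constraint to a linear one can be spelled out as a short derivation. This is a sketch in assumed notation, not a restatement of the authors' own derivation: $C_k$ is the total usage of resource $k$, $q_k$ its limit, $p_0$ the threshold, $x_{ia}$ the expected number of times action $a$ is executed in state $i$, and $c_{iak}$ the cost of that action in units of resource $k$.

```latex
\begin{align*}
P(C_k \ge q_k) &\le \frac{E[C_k]}{q_k}
  && \text{(Markov inequality, valid for } C_k \ge 0,\ q_k > 0\text{)} \\
E[C_k] &= \sum_{i,a} x_{ia}\, c_{iak}
  && \text{(linearity of expectation)} \\
\sum_{i,a} x_{ia}\, c_{iak} \le p_0\, q_k
  \;&\Longrightarrow\; P(C_k \ge q_k) \le p_0 .
\end{align*}
```

For instance, with the illustrative values $q_k = 250$ and $p_0 = 0.2$, the linear constraint caps the expected usage at 50 units. Any policy that meets this cap is guaranteed to exceed the limit with probability at most 0.2, although its true overutilization probability may be much smaller.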
A potential weakness of this approach is that the Markov inequality gives a
very rough upper bound, and the policies that correspond to solutions to this
LP can be too conservative, in that their real probability of overutilizing the
resources can be significantly lower than $p_0$. However, if making hard
guarantees about the probability of overutilizing the resources is of vital
importance, this method might prove valuable, as it yields policies that, in
general, would have higher payoff than the ones obtained by the sample-path
method (as the latter is often too restrictive). On the other hand, unlike the
standard CMDPs that impose constraints on the expected resource usage, and the
MDPs that constrain the variance, this method allows one to explicitly bound
the allowable probability of resource overutilization, and to make precise
guarantees about the behavior of the system in that respect.
It is also worth noting that, unlike the sample-path method or the methods that
constrain the variance, this method relies on solving a standard linear
program, whereas the former require solving quadratic or parametric linear
programs. Therefore, the above formulation appears to be a reasonable
approximation, because it should be no harder to solve than the standard CMDP
(see section 5 for experimental results), while providing a means of balancing
solution quality with the precisely quantifiable risk of resource
overutilization.
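The looseness of the Markov bound discussed above can be checked numerically. The sketch below is not the authors' experiment. It assumes a toy transient process in which each step consumes a uniformly distributed amount of a single resource and the run continues with a fixed probability. The function names and all parameter values are hypothetical.

```python
import random

def sample_total_usage(rng: random.Random, stay_prob: float, max_cost: float) -> float:
    """Total cost of one run: each step costs Uniform(0, max_cost) and the
    run continues to another step with probability stay_prob."""
    total = 0.0
    while True:
        total += rng.uniform(0.0, max_cost)
        if rng.random() >= stay_prob:
            return total

def markov_vs_empirical(limit: float, runs: int = 20_000, stay_prob: float = 0.9,
                        max_cost: float = 10.0, seed: int = 0):
    """Return (Markov bound mean/limit, empirical fraction of runs with usage >= limit)."""
    rng = random.Random(seed)
    samples = [sample_total_usage(rng, stay_prob, max_cost) for _ in range(runs)]
    mean = sum(samples) / runs
    tail = sum(s >= limit for s in samples) / runs
    return mean / limit, tail

bound, tail = markov_vs_empirical(limit=250.0)
print(f"Markov bound: {bound:.3f}, empirical P(usage >= limit): {tail:.4f}")
```

Because the Markov inequality also holds for the empirical distribution of nonnegative samples, the empirical tail can never exceed the bound computed from the sample mean. In a process like this one it is typically far below it.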
^3 Generally speaking, the total cost is a sum of a random number of
differently-distributed random variables, and calculating its
probability distribution is a nontrivial task.
^4 This inequality only holds for nonnegative random variables.
However, for a transient system, we can make the assumption that
all costs are nonnegative, without any loss of generality.
5 Evaluation
To verify our hypotheses about the properties of the approximation described in
the previous section, we have performed a set of numerical experiments that
compare its behavior to a standard CMDP with constraints on the expected
resource amounts (section 3), and to an unconstrained MDP.
To reduce the bias that might arise from using a small number of hand-crafted
test cases, we have instead used a large number of randomly-generated
constrained MDPs. All of the generated problems shared some common properties,
among which the most interesting ones are the following (the values for our
main experiments are given in parentheses):
Problem size: The total number of states, actions, and resources, respectively. (20, 20, 2)
$M_R = \max(R)$: Maximum reward. Rewards are assigned from a uniform distribution on a range bounded above by $M_R$. (10)
$M_C = \max(C)$: Maximum action cost. Resource costs are assigned to state-action pairs from a range bounded above by $M_C$, based on a distribution of $R$ and the correlation between $C$ and $R$ (described below). (10)
Correlation between rewards and resource costs: better actions are typically more costly.
Range of resource limits: Upper bounds on resource amounts are assigned according to a uniform distribution from this range. ([200, 300])
Dissipation of probability: the probability that the agent exits the system at each time step. ([0.95, 0.99])
The last parameter was used to ensure a transient chain. Instead of providing a
small number of sink states, we have chosen to use a uniform dissipation of
probability, in order to avoid additional randomization in our experiments, as
the latter choice provides a more stable domain.
Our main concern was the behavior of our approximation, as a function of the
probability threshold $p_0$. Therefore, we have run a number of experiments for
various values of $p_0$. To be more precise, we have gradually increased $p_0$
from 0 to 1 in increments of 0.05, and for each value, generated 50 random
models and solved them using the three methods: 1) an unconstrained MDP without
any regard for resource limitations, 2) a standard CMDP with constraints on the
expected usage of resources, and 3) our CMDP with constraints on the
probability of overutilizing the resources (eq. 6). We then evaluated (using a
Monte-Carlo simulation) each of the three solutions (policies) in terms of
expected reward and probability of overutilizing the resources.
Figure 1 shows a plot of the actual probability of overutilizing the resources
for the policies obtained via the three methods as a function of the
probability threshold $p_0$. The data points are averaged over the runs for a
particular value of $p_0$. The curve that corresponds to the method that bounds
the overutilization probability also shows the standard deviation for the runs.
The other data have very similar variance, so we will use the plots of means
(averaged over the runs for a given $p_0$) for our analyses.
Obviously, $p_0$ has no effect on the unconstrained and the standard CMDP
(which maintain more or less a constant probability of overutilization), but it
does affect to a large extent the solutions to the problem with constraints on
overutilization probability. One can see that the overutilization probability
for the solutions produced by our approach is always below $p_0$ (as it should
be). Also, it is worth noting that when $p_0$ approaches 1, our approximation
yields the same results as the other methods, which is good, since setting
$p_0 = 1$ should not constrain the space of feasible solutions.
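The experimental setup described above (random models with uniform dissipation of probability, evaluated by Monte-Carlo simulation) can be sketched as follows. This is a simplified illustration under several assumptions that are ours, not the authors': a single resource, rewards and costs drawn independently (no correlation), a fixed start state 0, a uniformly random policy in place of an LP solution, and `stay_prob` interpreted as the probability mass retained at each step. The names `random_model` and `simulate` are hypothetical.

```python
import random

def random_model(rng, n_states, n_actions, stay_prob, max_reward, max_cost):
    """Random transitions scaled so every row sums to stay_prob (uniform dissipation)."""
    trans, rewards, costs = [], [], []
    for _ in range(n_states):
        t_row, r_row, c_row = [], [], []
        for _ in range(n_actions):
            weights = [rng.random() for _ in range(n_states)]
            total = sum(weights)
            t_row.append([stay_prob * w / total for w in weights])
            r_row.append(rng.uniform(0.0, max_reward))
            c_row.append(rng.uniform(0.0, max_cost))
        trans.append(t_row)
        rewards.append(r_row)
        costs.append(c_row)
    return trans, rewards, costs

def simulate(rng, trans, rewards, costs, policy, limit, runs):
    """Return (mean reward over runs that stayed within the limit,
    fraction of runs whose total resource usage exceeded the limit)."""
    n = len(trans)
    over, ok_runs, ok_reward = 0, 0, 0.0
    for _ in range(runs):
        state, used, earned = 0, 0.0, 0.0
        while state is not None:
            a = rng.choices(range(len(policy[state])), weights=policy[state])[0]
            used += costs[state][a]
            earned += rewards[state][a]
            # Leftover mass (1 minus the row sum) means the agent leaves the system.
            r, acc, nxt = rng.random(), 0.0, None
            for j in range(n):
                acc += trans[state][a][j]
                if r < acc:
                    nxt = j
                    break
            state = nxt
        if used > limit:
            over += 1
        else:
            ok_runs += 1
            ok_reward += earned
    mean_ok = ok_reward / ok_runs if ok_runs else 0.0
    return mean_ok, over / runs

rng = random.Random(1)
trans, rewards, costs = random_model(rng, 5, 3, 0.95, 10.0, 10.0)
uniform_policy = [[1.0 / 3.0] * 3 for _ in range(5)]
mean_ok, p_over = simulate(rng, trans, rewards, costs, uniform_policy, limit=250.0, runs=2000)
print(f"mean reward of successful runs: {mean_ok:.1f}, overutilization rate: {p_over:.3f}")
```

As in the evaluation described above, only runs that stay within the limit contribute to the reported average reward. Replacing `uniform_policy` with a policy derived from an LP solution would be the natural next step, and is omitted here to keep the sketch free of solver dependencies.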
Figure 1: Probability of resource overutilization.
Figure 2: Average rewards for solutions to the three problems.
The rewards obtained by these policies are shown in Fig-
ure 2. These actual rewards do not necessarily equal the ex-
pected rewards (which are used during the optimization pro-
cess). This is because only the runs that did not overutilize
the resources were included in the average, and policies were
not penalized for violating the resource constraints. This also
explains why the total rewards received by solutions to the
standard CMDPs were sometimes greater than the ones ob-
tained by solutions to the unconstrained problems.
Figure 3: Average rewards with penalties for overutilization.
However, realistically, an agent always incurs a penalty for overutilizing a
critical resource, where the penalty amount is based on the consequences of
overutilization. For example, if the agent is flying a plane, and the resource
in question is fuel, the consequences of trying to use too much of that
resource are catastrophic, so the penalty is very high. If we take this into
account by assigning a fixed penalty to policies that overutilize the
resources, we can update the graph in Figure 2 to get a more realistic picture.
Figure 3 shows the average rewards for solutions obtained via the three
methods, recalculated to reflect the penalties: new rewards are
$R' = (1 - p) R + p W$, where $p$ is the overutilization probability, $R$ is
the average reward for successful runs of the given policy (as in Figure 2),
and $W$ is the penalty for overutilization. We see that, if we take the penalty
into account, there is an interval of $p_0$ where the conservative policy
produced by our linear approximation outperforms the other policies. Moreover,
for large penalties ($W = -220$), the conservative policy outperforms the
others for all values of $p_0$. Note that here, the policy is re-evaluated
post factum. Section 6 briefly discusses an approach that explicitly models the
penalties in the optimization program.
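The penalty adjustment discussed above can be illustrated with a small calculation. The sketch assumes that the adjusted reward is the expectation over successful and failed runs, i.e. (1 - p) times the average reward of successful runs plus p times the penalty. The function name `penalized_reward` and the two example policies are hypothetical. Only the penalty value of -220 is taken from the text.

```python
def penalized_reward(p_over: float, mean_ok_reward: float, penalty: float) -> float:
    """Expected reward when runs that overutilize the resource receive `penalty`
    instead of their accumulated reward (assumed form of the adjustment)."""
    return (1.0 - p_over) * mean_ok_reward + p_over * penalty

PENALTY = -220.0  # large penalty, as in the text

# Hypothetical policies: (overutilization probability, average reward of successful runs).
risky = penalized_reward(0.30, 120.0, PENALTY)
careful = penalized_reward(0.01, 60.0, PENALTY)

print(f"risky policy:   {risky:.1f}")
print(f"careful policy: {careful:.1f}")
```

With these made-up numbers the careful policy earns half as much per successful run, yet comes out ahead once the penalty is included, which mirrors the qualitative effect reported for Figure 3.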
As we mentioned in section 4, the linear approximation should be no harder to
solve than the standard constrained MDP, because both are formulated as linear
programs with the same number of constraints. To experimentally verify this, we
have timed the runs of all our experiments. Figure 4 contains a plot of the
times that it took to solve the problems in all our experiments. One can see
that the running times for all three methods are not appreciably different.^5
In particular, the average ratio of the running time for the standard CMDP to
the running time of the unconstrained method is 1.25; the ratio of the running
time of our linear approximation with constraints on overutilization
probability to the running time of the unconstrained method is 1.06. The slight
downward curvature of the plot of the running time of our approximation method
appears to be a consequence of the specific implementation of the linear
programming algorithm that we used in our experiments.
6 Discussion and Future Work
Our experiments substantiate the claims that our approxima-
tion provides an effective and efficient method that agents
can use to formulate policies that not only consider limita-
tions on execution resources, but that also explicitly bound
the probability of resource overutilization. Our new approxi-
mation achieves the constraints on overutilization, and is es-
sentially no more expensive to use than more standard CMDP
and unconstrained MDP methods. Because our method con-
structs policies that are more careful about avoiding resource
overutilization, the rewards associated with its policies when
resources are not overutilized tend to be less than the re-
wards for the other methods' policies when resources are not
overutilized. However, as we illustrated in Figure 3, when overutilization
incurs penalties our new method can outperform previous techniques. Thus, our
new method is particularly suited to agents engaged in mission-critical domains.
^5 Here we have to note that we used the linear programming approach
to solving the unconstrained problem. A different algorithm
such as value or policy iteration might be more efficient.
Figure 4: Running time for the three methods
Furthermore, if penalties for resource overutilization are
known at design-time and can be expressed in the same units
as the rewards, an interesting modification of our LP formu-
lation is to include the penalties in the policy evaluation cri-
terion, as opposed to modeling the constraints on the overuti-
lization probability. This would yield an LP with the follow-
ing objective function:
(7)
where $W_k$ is the penalty (in units of $r_{ia}$) incurred for overutilizing
resource $k$. The maximization is subject to just the
standard "conservation of probability" constraints as in (eq.
3). A benefit of this formulation is that, for certain initial
probability distributions, deterministic policies are optimal.
A downside is that the formulation does not allow one to ex-
plicitly control the acceptable overutilization probabilities.
As can be seen in Figure 1, our linear approximation is in
general overly conservative. For example, given permission
to overutilize the resource 20% of the time (p0 = 0.2), the
method generates a policy that overutilizes the resource only
about 1% of the time. Since rewards and resource usage are
typically correlated, we would expect that a policy that under-
shoots the permitted resource overutilization probability by a
lower amount would also yield a higher expected reward. To-
ward this end, we have formulated a quadratic programming
approximation, based on the Chebyshev inequality, which al-
lows us to put a better upper bound on the probability of re-
source overutilization:
$$P(|C_k - E[C_k]| \ge \delta_k) \le \frac{\sigma_k^2}{\delta_k^2} \qquad (8)$$
where $\sigma_k^2$ is the variance in the amount of used resource $k$, and
$\delta_k$ specify the widths of the allowable regions for the amounts of used
resources.
We are also investigating another formulation of the optimization program that
should give a more accurate estimate of the resource overutilization
probability. The method is based on a polynomial approximation of the pdf of
the total resource-usage cost and uses the moments of the cost as the
optimization variables. This approximation and the one based on the Chebyshev
inequality are more costly to compute than the linear one. Our current efforts
are to encode the approximations and evaluate their strengths and weaknesses.
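To see why a variance-based bound can be tighter than the linear one, consider the following sketch, which uses the standard Chebyshev inequality. The authors' exact quadratic formulation may differ. The symbols $\mu_k$ (mean usage) and $\sigma_k$ (standard deviation of usage) and all numbers are assumptions made for illustration.

```latex
\begin{align*}
P(C_k \ge q_k) &\le P\big(|C_k - \mu_k| \ge q_k - \mu_k\big)
  \le \frac{\sigma_k^2}{(q_k - \mu_k)^2},
  \qquad \mu_k = E[C_k] < q_k . \\
\intertext{Illustrative numbers (not taken from the experiments):
  $q_k = 250$, $\mu_k = 50$, $\sigma_k = 40$.}
\text{Markov:}\quad & \frac{\mu_k}{q_k} = \frac{50}{250} = 0.2 , \\
\text{Chebyshev:}\quad & \frac{\sigma_k^2}{(q_k - \mu_k)^2} = \frac{1600}{40000} = 0.04 .
\end{align*}
```

In this example the same policy would be certified at a risk level of 0.04 rather than 0.2, so for a given threshold a less conservative policy could be admitted. The price is that the bound involves second moments of the usage, which fits the remark above that the resulting program is quadratic rather than linear.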
7 Acknowledgments
This paper is based upon work supported by DARPA/ITO and the Air Force Research
Laboratory under contract F30602-00-C-0017 as a subcontractor through Honeywell
Laboratories. The authors thank Kang Shin, Haksun Li, and David Musliner for
their valuable contributions, as well as one of the anonymous reviewers whose
comments inspired (eq. 7).
References
[Altman, 1999] Eitan Altman. Constrained Markov Decision Processes. Chapman and
Hall/CRC, 1999.
[Boutilier et al., 1999] Craig Boutilier, Thomas Dean, and Steve Hanks.
Decision-theoretic planning: Structural assumptions and computational leverage.
Journal of Artificial Intelligence Research, 11:1-94, 1999.
[Dean et al., 1993] Thomas Dean, Leslie Pack Kaelbling, Jak Kirman, and Ann
Nicholson. Planning with deadlines in stochastic domains. In Proc. of the
Eleventh National Conf. on Artificial Intelligence, pages 574-579, CA, 1993.
[Howard, 1960] R. Howard. Dynamic Programming and Markov Processes. MIT Press,
Cambridge, 1960.
[Huang and Kallenberg, 1994] Y. Huang and L.C.M. Kallenberg. On finding optimal
policies for Markov decision chains. Math. of Operations Research, 19:434-448,
1994.
[Kallenberg, 1983] L.C.M. Kallenberg. Linear Programming and Finite Markovian
Control Problems. Mathematisch Centrum, Amsterdam, 1983.
[Musliner et al., 1995] David John Musliner, James Hendler, Ashok K. Agrawala,
Edmund H. Durfee, Jay K. Strosnider, and C. J. Paul. The challenges of
real-time AI. IEEE Computer, 28(1), 1995.
[Puterman, 1995] M. L. Puterman. Markov decision processes. John Wiley & Sons,
New York, 1995.
[Ross and Chen, 1988] K.W. Ross and B. Chen. Optimal scheduling of interactive
and non-interactive traffic in telecommunication systems. IEEE Transactions on
Auto Control, 33:261-267, 1988.
[Ross and Varadarajan, 1989] K.W. Ross and R. Varadarajan. Markov decision
processes with sample path constraints: the communicating case. OR, 37:780-790,
1989.
[Ross and Varadarajan, 1991] K.W. Ross and R. Varadarajan. Multichain Markov
decision processes with a sample path constraint: A decomposition approach.
Math. of Operations Research, 16:195-207, 1991.
[Russell and Subramanian, 1995] S. J. Russell and D. Subramanian. Provably
bounded optimal agents. JAIR, 2:575-609, 1995.
[Sobel, 1985] M.J. Sobel. Maximal mean/standard deviation ratio in undiscounted
mdp. OR Letters, 4:157-188, 1985.