Approximating Optimal Policies for Agents with Limited Execution Resources

Dmitri A. Dolgov and Edmund H. Durfee

Department of Electrical Engineering and Computer Science

University of Michigan

Ann Arbor, MI 48109

{ddolgov, durfee}@umich.edu

Abstract

An agent with limited consumable execution re-

sources needs policies that attempt to achieve good

performance while respecting these limitations.

Otherwise, an agent (such as a plane) might fail

catastrophically (crash) when it runs out of re-

sources (fuel) at the wrong time (in midair). We

present a new approach to constructing policies for

agents with limited execution resources that builds

on principles of real-time AI, as well as research

in constrained Markov decision processes. Specif-

ically, we formulate, solve, and analyze the pol-

icy optimization problem where constraints are im-

posed on the probability of exceeding the resource

limits. We describe and empirically evaluate our

solution technique to show that it is computation-

ally reasonable, and that it generates policies that

sacrifice some potential reward in order to make the

kinds of precise guarantees about the probability of

resource overutilization that are crucial for mission-

critical applications.

1 Introduction

Optimality is the gold standard in rational decision making

(e.g., [Russell and Subramanian, 1995]), and, consequently,

the problem of constructing optimal policies for autonomous

agents has received considerable attention over the years.

Traditionally, this problem has been viewed separately from

the problem of actually carrying out the policies. However,

real agents have limitations as to what they can execute, and,

clearly, a policy is less useful if an agent might run out of

resources while carrying out the policy.

In this paper, we present a new approach for constructing policies for agents that have limited consumable resources where running out of the resources can have negative consequences. Whereas AI research has mostly focused on (PO)MDP [Boutilier et al., 1999; Dean et al., 1993;

Howard, 1960; Puterman, 1995] methods for formulating

policies for agents without emphasizing constraints on their

execution resources, the Operations Research literature has

developed constrained MDP (CMDP) [Altman, 1999; Puterman, 1995] techniques that can account for resource constraints. CMDP methods are particularly useful for domains

where the current resource amounts are unobservable and

cannot be easily estimated by the agent, or where modeling

resource amounts in the state description is computationally

infeasible. In an aircraft scenario, some resources and situations where such methods are beneficial could include an airplane with a broken fuel gauge (fuel amount is unobservable), pilot fatigue (attention is a resource that cannot be easily estimated), or a combination of non-critical resources (e.g., various refreshments) that should nevertheless not be exhausted,

but explicit modeling of which unnecessarily complicates the

optimization problem and increases policy size. In the rest of

the paper, we will use the fuel example, simply because it is

a very intuitive instance of a consumable resource.

However, as we will explain later, standard risk-neutral

CMDP optimization techniques are not applicable to prob-

lems where violating the constraints can have negative or, in

the limit, catastrophic¹ consequences. The main contribution

of this work is that it extends the standard CMDP techniques

to handle the types of hard constraints that naturally arise in

problems involving critical resources. In particular, we formulate an optimization problem where constraints are imposed on the probability of resource overutilization, and show

how the problem can be solved using standard linear pro-

gramming algorithms. Our formulation yields sub-optimal

solutions to the constrained problem because it sacrifices po-

tential reward to make guarantees about the probability of

violating the resource constraints. As we later show, when

violating constraints incurs dire costs, these guarantees are

worthwhile ([Musliner et al., 1995] and references therein).

We introduce our model in section 2, where we review

Markov models, introduce notation, and specify our assump-

tions about the problem domain. Section 3 describes and

compares several candidate solutions for addressing aspects

of our problem. We then present in section 4 our new method,

and empirically evaluate it in section 5. We conclude with a

discussion about the strengths and limitations of our results,

and about our future directions.

¹The term catastrophic is, of course, relative. We assume that the system designer is willing to accept certain risks to receive payoff.

RESOURCE-BOUNDED REASONING

2 The Model

The stochastic properties of the environments in the problems that we are addressing lead us to formulate our optimization problem as a stationary, discrete-time, Markov

decision process. In this section, we review some well-

known facts from the theory of standard [Puterman, 1995;

Boutilier et al., 1999] and constrained [Altman, 1999] fully-observable Markov decision processes and also discuss the

assumptions that are particular to the class of problems that

we address in this work. This section provides the neces-

sary background for the subsequent sections, where we dis-

cuss resource-constrained optimization problems.

2.1 Markov Decision Processes

A standard Markov decision process can be defined as a tuple $(S, A, P, R)$, where $S$ is a finite set of states that the agent can be in, $A$ is a finite set of actions that the agent can execute, $P : S \times A \times S \to [0, 1]$ defines the transition function ($p^a_{ij}$ is the probability that the agent will go to state $j$ if it executes action $a$ in state $i$), and $R : S \times A \to \mathbb{R}$ is the reward function (the agent receives a reward of $r_{ia}$ for executing action $a$ in state $i$).

Clearly, the total probability of transitioning out of a state, given a particular action, cannot be greater than 1, i.e. $\sum_j p^a_{ij} \le 1$. As we discuss below, we are actually interested in domains where there exist states for which $\sum_j p^a_{ij} < 1$.

A policy is defined as a procedure for selecting an action in each state. A policy that makes its choices according to a probability distribution over the set of actions is called randomized and can be described as a mapping of states to probability distributions over actions: $\pi : S \times A \to [0, 1]$, where $\pi_{ia}$ is the probability of choosing action $a$ in state $i$. A deterministic policy that always chooses the same action for a state is, of course, a special case of a randomized policy. It can be seen (similarly to the case of standard CMDPs, as described in [Kallenberg, 1983; Puterman, 1995]) that under our optimization criterion and constraints (section 4), deterministic policies can be suboptimal. Therefore, in this work, we focus on approximating optimal policies in the class of randomized ones.

If, at time 0, the agent has an initial probability distribution $\alpha = [\alpha_i]$ over the state space, a Markov system follows the following trajectory:

$$\rho_j(t+1) = \sum_i \sum_a \rho_i(t)\, \pi_{ia}\, p^a_{ij}, \qquad \rho_i(0) = \alpha_i \tag{1}$$

where $\rho(t) = [\rho_i(t)]$ is the probability distribution at time $t$.
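The trajectory update can be sketched in a few lines of Python. This is only an illustration: the function name `step`, the nested-list layout `trans[i][a][j]` for transition probabilities and `policy[i][a]` for a randomized policy, and the two-state model itself are assumptions made for the example, not part of the paper.

```python
def step(rho, policy, trans):
    """Advance the state distribution by one time step.

    rho[i]         -- probability of being in state i now
    policy[i][a]   -- probability of choosing action a in state i
    trans[i][a][j] -- probability of moving from i to j under action a
    """
    n = len(rho)
    nxt = [0.0] * n
    for i in range(n):
        for a, pi_ia in enumerate(policy[i]):
            for j in range(n):
                nxt[j] += rho[i] * pi_ia * trans[i][a][j]
    return nxt


# Hypothetical 2-state, 2-action model. Rows may sum to less than 1:
# the missing mass is the probability of leaving the system.
trans = [
    [[0.5, 0.4], [0.0, 0.9]],  # state 0: action 0, action 1
    [[0.0, 0.5], [0.2, 0.3]],  # state 1: action 0, action 1
]
policy = [[0.5, 0.5], [1.0, 0.0]]  # randomized in state 0, deterministic in state 1
rho = [1.0, 0.0]                   # the agent starts in state 0

history = [rho]
for _ in range(3):
    rho = step(rho, policy, trans)
    history.append(rho)
```

Because some rows of the hypothetical transition table sum to less than 1, the total probability mass in `history` shrinks at every step, which is the "leakage" that makes a process transient.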

2.2 Assumptions

Typically, Markov decision processes are divided into two categories: finite-horizon problems, where the total number of steps that the agent spends in the system is finite and is known a priori, and infinite-horizon problems, where the agent is assumed to stay in the system forever (see [Puterman, 1995] for a detailed discussion of both types of models). In this work we concentrate on dynamic real-time domains, where agents have tasks to accomplish. For example, consider an agent flying a plane, whose goal is to safely get to its destination and land there. This example does not naturally correspond to a finite-horizon problem, because the duration of executing various policies is not predetermined (unless we artificially impose such a finite duration, which is not easily

justifiable). On the other hand, the problem does not natu-

rally fit the definition of the infinite-horizon model, because

the plane obviously cannot keep on flying forever.

This leads us to make a slightly different and less common

(although, certainly, not novel) assumption about how much

time the agent spends executing its policy. We assume that

there is no predefined number of steps that the agent spends

in the system, but that optimal policies always yield transient

Markov processes (decision problems of this type were extensively studied by Kallenberg [1983]). A policy is said to yield

a transient Markov process if the agent executing that policy

will eventually leave the corresponding Markov chain, after

spending a finite number of time steps in it. Given a finite

state space, this assumption implies that there has to be some "leakage" of probability out of the system, i.e. there have to exist some state-action pairs $(i, a)$ for which $\sum_j p^a_{ij} < 1$.

One particular case where the above assumption holds is in a system in which all trajectories lead to absorbing states.² Once an agent enters an absorbing state, it has finished (or failed to finish) some task and has nowhere else to go, i.e. the probability of transitioning out of an absorbing state $i$ is zero: $p^a_{ij} = 0$ for all $a$ and $j$. In the plane-flying example, all trajectories lead to either a safe landing or a crash, and once the agent enters one of these states, the probability of transitioning to other states is zero.

The transient nature of our problems leads us to adopt the expected total reward as the policy evaluation criterion. Given that an agent receives a reward whenever it executes an action, the total expected utility of a policy can be expressed as:

$$U(\pi, \alpha) = \sum_{t=0}^{T} \sum_i \sum_a \rho_i(t)\, \pi_{ia}\, r_{ia} \tag{2}$$

where $T$ is the number of steps during which the agent accumulates utility. For a transient system with bounded rewards, the above sum converges for any $T$.
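The convergence of the total expected reward for a transient system can be checked numerically. The sketch below accumulates the expected reward step by step until almost no probability mass is left in the system. The function name, the stopping tolerance, and the one-state example are all assumptions chosen for illustration.

```python
def expected_total_reward(alpha, policy, trans, reward,
                          tol=1e-12, max_steps=100_000):
    """Sum the expected per-step reward along the state-distribution trajectory.

    alpha[i]       -- initial probability of state i
    policy[i][a]   -- probability of choosing action a in state i
    trans[i][a][j] -- transition probability (rows may sum to less than 1)
    reward[i][a]   -- reward for executing action a in state i
    """
    n = len(alpha)
    rho = list(alpha)
    total = 0.0
    for _ in range(max_steps):
        if sum(rho) < tol:  # essentially all mass has left the system
            break
        nxt = [0.0] * n
        for i in range(n):
            for a, pi_ia in enumerate(policy[i]):
                weight = rho[i] * pi_ia
                total += weight * reward[i][a]
                for j in range(n):
                    nxt[j] += weight * trans[i][a][j]
        rho = nxt
    return total


# Hypothetical one-state example: reward 1 per step, and the agent stays
# in the system with probability 0.9, so the sum is a geometric series
# with limit 1 / (1 - 0.9) = 10.
utility = expected_total_reward(
    alpha=[1.0], policy=[[1.0]], trans=[[[0.9]]], reward=[[1.0]]
)
```

For small models the same quantity can also be obtained in closed form by solving a linear system; the iterative version is shown only because it mirrors the sum over time steps directly.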

²An alternative way of handling these states is to treat them as infinitely-recurrent, i.e. once the agent gets there, it infinitely transitions back to itself. We do not adopt this model, because it is less natural for our domains and also leads to unnecessary complications in the optimization problems.

3 Related Work

In this section, we briefly survey several approaches (based on well-known techniques) to solving the problem of finding optimal policies for agents with limited resources, and point out the assumptions, strengths, and limitations of these methods. This establishes a landscape of solution algorithms, to which we can compare our method (presented in the next section) in terms of complexity, efficiency, and solution quality.

3.1 Fully Observable MDPs

The most straightforward way of handling resource constraints in an MDP framework is to explicitly model the resources by making their current amounts a part of the state description. This yields a standard FOMDP, which can be solved by a wide variety of efficient methods. The benefit of this approach is that it allows one to make use of all relevant information to construct the best possible policy, as the

agent is able to base its action choice on the current state and

resource amounts. The downside of this approach is that it

requires an a priori discretization of resource amounts to be

made when the world model is constructed. Also, there is an

additional burden of specifying the rewards and state transi-

tions as functions of current resource amounts. Furthermore,

in this model, the size of the state space and, consequently,

the policy size, explodes exponentially in the number of re-

sources, as compared to the state space where the resource

amounts are not folded into the state description.
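The growth of the state space is easy to quantify. As a rough sketch, assume each resource is discretized into a fixed number of levels and every combination of base state and resource levels becomes one augmented state. The function name and the numbers below are hypothetical and only illustrate the multiplication.

```python
def augmented_state_count(base_states, levels_per_resource):
    """Number of states once every resource level is folded into the state."""
    count = base_states
    for levels in levels_per_resource:
        count *= levels
    return count


# Hypothetical sizes: 20 base states and resources tracked in unit steps
# from 0 to 300 (301 levels each).
one_resource = augmented_state_count(20, [301])
two_resources = augmented_state_count(20, [301, 301])
```

Each additional resource multiplies the count by its number of levels, which is the exponential growth in the number of resources described above.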

The FOMDP approach relies on the fact that the agent can

observe the exact amounts of all resources at runtime. How-

ever, this may not hold, especially in multiagent domains with

shared resources, when an agent does not know what other

agents have been doing and how much of the shared resources

they have been consuming.

3.2 Constrained MDPs

An alternative to the FOMDP approach described above is to treat the problem as a constrained Markov decision process [Altman, 1999], where the resources are not explicitly modeled, but rather are treated as constraints that are imposed on the space of feasible solutions (policies).

A constrained MDP for a resource-limited agent differs from a standard MDP in that the agent, besides getting rewards for executing actions, also incurs resource costs. Consequently, a constrained MDP (CMDP) can be described as a tuple $(S, A, P, R, C, Q)$, where $C = [c_{iak}]$ defines a vector of actions' costs ($c_{iak}$ units of resource $k$ are used when action $a$ is executed in state $i$), and $Q = [q_k]$ is the vector of amounts of available resources (there are $q_k$ units of resource $k$ initially available).

The benefit of the CMDP is that it does not require one to explicitly model how the resources affect the world. Indeed, if all policies satisfy the resource constraints, one does not have to worry about what happens when the resources are overutilized. Consequently, the state space and the resulting policies are exponentially smaller than the ones in the FOMDP model. The standard CMDP formulation constrains the expected usage of all resources to be below a certain limit and can be formulated as the following linear program:

$$
\begin{aligned}
\max_{x} \quad & \sum_i \sum_a x_{ia}\, r_{ia} \\
\text{subject to} \quad & \sum_a x_{ja} - \sum_i \sum_a x_{ia}\, p^a_{ij} = \alpha_j \quad \forall j, \\
& \sum_i \sum_a c_{iak}\, x_{ia} \le q_k \quad \forall k, \\
& x_{ia} \ge 0 \quad \forall i, a,
\end{aligned} \tag{3}
$$

where the variables $x_{ia}$ have the interpretation of the expected number of times action $a$ is executed in state $i$.

A weakness of this approach is that it yields policies that can be suboptimal, as compared to the ones constructed by the FOMDP method, because the agent does not base its decision on the current resource amounts, but rather completely ignores that aspect of the state. However, as mentioned in the previous section, in domains where the resources are not observable, or if the policy size is of vital importance (for example if the agent's architecture imposes constraints on policies that it can store), this approach could prove fruitful.

However, the biggest problem with this approach (as pointed out, for example, by Ross and Chen in the telecommunication domain [1988]) is that a standard CMDP imposes constraints on the expected amounts. Clearly, this method does not work for critical resources, whose overutilization has negative consequences. Indeed, an agent that pilots an aircraft would not be satisfied with a policy that on average uses an amount of fuel that does not exceed the tank capacity.
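A tiny numerical example makes the gap between expected usage and actual risk concrete. The usage distribution and tank capacity below are invented for illustration; they do not come from the paper's experiments.

```python
# Hypothetical policy: total fuel use is 50 units with probability 0.6
# and 150 units with probability 0.4. The tank holds 100 units.
usage_distribution = {50: 0.6, 150: 0.4}
capacity = 100

# Constraint on the expectation, as in the standard CMDP.
expected_usage = sum(u * p for u, p in usage_distribution.items())
satisfies_expected_constraint = expected_usage <= capacity

# What actually matters for a critical resource.
overutilization_probability = sum(
    p for u, p in usage_distribution.items() if u > capacity
)
```

Here the expected usage is 90 units, so the expectation constraint holds, yet the policy runs out of fuel in 40% of the runs.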

3.3 Sample Path MDPs

As just mentioned, Ross and Chen pointed out the weakness of the CMDP approach with constraints on the expected amounts. As a possible solution, Ross and Varadarajan [1989; 1991] propose an approach where constraints are placed on the actual sample-path costs. In their work, the space of feasible solutions is constrained to the set of policies whose probability of violating the constraints (overutilizing the resources) in the long run is zero. However, their work concentrates on the average per-step costs and rewards, whereas we are interested in the total amounts, whose distributions are not easily calculable. The approach of Ross and Varadarajan has the same benefits as the standard CMDP method, i.e. no explicit modeling of resources is required, and the state space and policies are small. In addition, unlike the standard CMDP, this method is suitable for problems with critical resources. However, a weakness of this approach is that for some problems it might be too restrictive, in that it allows no possibility of overutilizing the resources. Indeed, policies produced by the sample path method might have significantly lower payoff, as compared to policies that have some probability of resource overutilization. Furthermore, for some domains it might be desirable for the system designer to be able to control the probability of resource overutilization, as a means of balancing optimality and risk.

3.4 MDPs With Constraints on Variance

Another approach to handling deviations from the expecta-

tion in Markov models is to impose additional constraints on

(or to assign additional cost to) the variance. Sobel [1985]

proposed to constrain the expected cost and to maximize the

mean-variance ratio of the reward. Huang and Kallenberg

[1994] proposed a unified approach to handling variance via

an algorithm based on parametric linear programming. These

approaches have the same benefits as the standard CMDP and

the sample-path methods (compared to the FOMDP formula-

tion) in terms of state space and policy size, as well as the

complexity of constructing the initial world model. Addi-

tionally, they allow one to somewhat balance payoff and the

deviation from the expected. However, these methods do not

allow one to make hard guarantees about the probability of

overutilizing the resources.

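4 Linear Approximation

As hinted at in the previous sections, we would like to be able to constrain the feasible solution space to the set of policies whose probability of overutilizing the resources is below a user-specified threshold $p_0$. In other words, we would like to be able to solve the following math program:

$$
\begin{aligned}
\max \quad & U(\pi, \alpha) \\
\text{subject to} \quad & P(\hat{c}_k > q_k) \le p_0 \quad \forall k,
\end{aligned} \tag{4}
$$

where $\hat{c}_k$ is the total amount of resource $k$ that is used by the policy, and $q_k$ is the upper bound on that resource (as before).

The trouble is that the optimization is in the space of $x_{ia}$, which can be interpreted as the expected number of times for executing the actions in the corresponding states. However, it is difficult to express $P(\hat{c}_k > q_k)$ as a simple function of the optimization variables $x_{ia}$, because the latter contain no information about the probability distribution of the random variable of the number of times the action is actually executed in the corresponding state - only the expected values.³ In this section, we present a linear approximation to the above program, based on the Markov inequality:⁴

$$P(\hat{c}_k \ge q_k) \le \frac{E[\hat{c}_k]}{q_k} \tag{5}$$

Using this inequality and the fact that the expected resource usage can be expressed as $E[\hat{c}_k] = \sum_i \sum_a c_{iak}\, x_{ia}$, the optimization problem can be formulated as a linear program:

$$
\begin{aligned}
\max_{x} \quad & \sum_i \sum_a x_{ia}\, r_{ia} \\
\text{subject to} \quad & \sum_a x_{ja} - \sum_i \sum_a x_{ia}\, p^a_{ij} = \alpha_j \quad \forall j, \\
& \sum_i \sum_a c_{iak}\, x_{ia} \le p_0\, q_k \quad \forall k, \\
& x_{ia} \ge 0 \quad \forall i, a.
\end{aligned} \tag{6}
$$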
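The step from the Markov inequality to a linear constraint can be spelled out as follows. This is a sketch under assumed notation: $\hat{c}_k$ denotes the total usage of resource $k$, $q_k$ its limit, $c_{iak}$ the per-execution cost, $x_{ia}$ the expected number of executions of action $a$ in state $i$, and $p_0$ the allowed overutilization probability.

```latex
\begin{align*}
P(\hat{c}_k \ge q_k) &\le \frac{E[\hat{c}_k]}{q_k}
  && \text{(Markov inequality; requires } \hat{c}_k \ge 0,\ q_k > 0\text{)} \\
E[\hat{c}_k] &= \sum_i \sum_a c_{iak}\, x_{ia}
  && \text{(linearity of expectation)} \\
\sum_i \sum_a c_{iak}\, x_{ia} \le p_0\, q_k
  &\;\Longrightarrow\; P(\hat{c}_k \ge q_k) \le p_0
  && \text{(sufficient, but not necessary)}
\end{align*}
```

For instance, with $p_0 = 0.2$ and $q_k = 250$, the constraint caps the expected usage at $0.2 \times 250 = 50$ units. The implication only runs one way, which is why the resulting policies can be more cautious than the threshold strictly requires.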

A potential weakness of this approach is that the Markov inequality gives a very rough upper bound, and the policies that correspond to solutions to this LP can be too conservative, in that their real probability of overutilizing the resources can be significantly lower than $p_0$. However, if making hard guarantees about the probability of overutilizing the resources is of vital importance, this method might prove valuable, as it yields policies that, in general, would have higher payoff than the ones obtained by the sample-path method (as the latter is often too restrictive). On the other hand, unlike the standard CMDPs that impose constraints on the expected resource usage, and the MDPs that constrain the variance, this method allows one to explicitly bound the allowable probability of resource overutilization, and to make precise guarantees about the behavior of the system in that respect.

It is also worth noting that, unlike the sample-path method or the methods that constrain the variance, this method relies on solving a standard linear program, whereas the former require solving quadratic or parametric linear programs. Therefore, the above formulation appears to be a reasonable approximation, because it should be no harder to solve than the standard CMDP (see section 5 for experimental results), while providing a means of balancing solution quality with the precisely quantifiable risk of resource overutilization.

³Generally speaking, the total cost is a sum of a random number of differently-distributed random variables, and calculating its probability distribution is a nontrivial task.

⁴This inequality only holds for nonnegative random variables. However, for a transient system, we can make the assumption that all costs are nonnegative, without any loss of generality.

5 Evaluation

To verify our hypotheses about the properties of the approximation described in the previous section, we have performed a set of numerical experiments that compare its behavior to a standard CMDP with constraints on the expected resource amounts (section 3), and to an unconstrained MDP.

To reduce the bias that might arise from using a small num-

ber of hand-crafted test cases, we have instead used a large

number of randomly-generated constrained MDPs. All of the

generated problems shared some common properties, among

which the most interesting ones are the following (the values

for our main experiments are given in parentheses):

The total number of states, actions, and resources, respectively. (20, 20, 2)

$M_r = \max(R)$: Maximum reward. Rewards are assigned from a uniform distribution on $[0, M_r]$. (10)

$M_c = \max(C)$: Maximum action cost. Resource costs are assigned to state-action pairs from $[0, M_c]$ based on a distribution of $R$ and the correlation between $C$ and $R$ (described below). (10)

Correlation between rewards and resource costs; better actions are typically more costly.

Upper bounds on resource amounts are assigned according to a uniform distribution from this range. ([200, 300])

Dissipation of probability - the probability that the agent exits the system at each time step. ([0.95, 0.99])

The last parameter was used to ensure a transient chain. Instead of providing a small number of sink states, we have chosen to use a uniform dissipation of probability, in order to avoid additional randomization in our experiments, as the latter choice provides a more stable domain.

Our main concern was the behavior of our approximation, as a function of the probability threshold $p_0$. Therefore, we have run a number of experiments for various values of $p_0$. To be more precise, we have gradually increased $p_0$ from 0 to 1 in increments of 0.05, and for each value, generated 50 random models and solved them using the three methods: 1) an unconstrained MDP without any regard for resource limitations, 2) a standard CMDP with constraints on the expected usage of resources, and 3) our CMDP with constraints on the probability of overutilizing the resources (eq. 6). We then evaluated (using a Monte-Carlo simulation) each of the three solutions (policies) in terms of expected reward and probability of overutilizing the resources.

Figure 1 shows a plot of the actual probability of overutilizing the resources for the policies obtained via the three methods as a function of the probability threshold $p_0$. The data points are averaged over the runs for a particular value of $p_0$. The curve that corresponds to the method that bounds the overutilization probability also shows the standard deviation for the runs. The other data have very similar variance, so we will use the plots of means (averaged over the runs for a given $p_0$) for our analyses.

Obviously, $p_0$ has no effect on the unconstrained and the standard CMDP (which maintain more or less a constant probability of overutilization), but it does affect to a large extent the solutions to the problem with constraints on overutilization probability. One can see that the overutilization probability for the solutions produced by our approach is always below $p_0$ (as it should be). Also, it is worth noting that when $p_0$ approaches 1, our approximation yields the same results as the other methods, which is good, since setting $p_0 = 1$ should not constrain the space of feasible solutions.
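A Monte-Carlo evaluation of this kind can be sketched as follows. The toy model is an assumption made purely for illustration and is far simpler than the randomly generated problems described above: a single state, two actions with invented rewards and fuel costs, a fixed randomized policy, and a fixed per-step exit probability. The sketch estimates the overutilization probability by simulation and compares it with the Markov bound computed from the mean usage.

```python
import random


def simulate_episode(rng, p_fast=0.5, p_exit=0.2):
    """Run one episode of the toy model; return (total_reward, total_fuel)."""
    reward = 0.0
    fuel = 0.0
    while True:
        if rng.random() < p_fast:  # hypothetical "fast" action
            reward += 2.0
            fuel += 3.0
        else:                      # hypothetical "slow" action
            reward += 1.0
            fuel += 1.0
        if rng.random() < p_exit:  # the agent leaves the system
            return reward, fuel


rng = random.Random(0)             # fixed seed for repeatability
capacity = 40.0                    # hypothetical fuel limit
episodes = [simulate_episode(rng) for _ in range(20000)]

mean_fuel = sum(f for _, f in episodes) / len(episodes)
p_over = sum(1 for _, f in episodes if f > capacity) / len(episodes)

# Average reward over the runs that stayed within the limit.
ok_rewards = [r for r, f in episodes if f <= capacity]
mean_reward_ok = sum(ok_rewards) / len(ok_rewards)

# Markov bound on the overutilization probability, from the mean alone.
markov_bound = mean_fuel / capacity
```

In this toy model the simulated overutilization probability comes out far below the Markov bound, which is the same kind of conservatism that the experiments report for the linear approximation.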

Figure 1: Probability of resource overutilization.

Figure 2: Average rewards for solutions to the three problems.

The rewards obtained by these policies are shown in Fig-

ure 2. These actual rewards do not necessarily equal the ex-

pected rewards (which are used during the optimization pro-

cess). This is because only the runs that did not overutilize

the resources were included in the average, and policies were

not penalized for violating the resource constraints. This also

explains why the total rewards received by solutions to the

standard CMDPs were sometimes greater than the ones ob-

tained by solutions to the unconstrained problems.

Figure 3: Average rewards with penalties for overutilization.

However, realistically, an agent always incurs a penalty for overutilizing a critical resource, where the penalty amount is based on the consequences of overutilization. For example, if the agent is flying a plane, and the resource in question is fuel, the consequences of trying to use too much of that resource are catastrophic, so the penalty is very high. If we take this into account by assigning a fixed penalty to policies that overutilize the resources, we can update the graph in Figure 2 to get a more realistic picture. Figure 3 shows the average rewards for solutions obtained via the three methods, recalculated to reflect the penalties: new rewards are $R' = (1 - p) R + p W$, where $p$ is the overutilization probability, $R$ is the average reward for successful runs of the given policy (as in Figure 2), and $W$ is the penalty for overutilization. We see that, if we take the penalty into account, there is an interval of $p_0$ where the conservative policy produced by our linear approximation outperforms the other policies. Moreover, for large penalties ($W = -220$), the conservative policy outperforms the others for all values of $p_0$. Note that here, the policy is re-evaluated post-factum. Section 6 briefly discusses an approach that explicitly models the penalties in the optimization program.
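The post-factum penalty adjustment amounts to a weighted average of the reward for successful runs and the penalty. A minimal sketch follows, assuming the penalty is expressed as a negative reward. The function name and the two example policies are hypothetical; only the penalty value of -220 is taken from the text.

```python
def penalized_reward(success_reward, p_over, penalty):
    """Average reward when overutilizing runs receive a fixed penalty.

    success_reward -- average reward over runs that respect the limits
    p_over         -- probability of overutilizing the resources
    penalty        -- reward assigned to an overutilizing run (negative)
    """
    return (1.0 - p_over) * success_reward + p_over * penalty


# Hypothetical policies: a risky one with higher reward when it succeeds,
# and a conservative one that rarely overutilizes.
risky = penalized_reward(success_reward=100.0, p_over=0.30, penalty=-220.0)
conservative = penalized_reward(success_reward=80.0, p_over=0.01, penalty=-220.0)
```

With these invented numbers the risky policy drops to about 4 while the conservative one keeps about 77, which illustrates how a large penalty can reverse the ranking of two policies.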

As we mentioned in section 4, the linear approximation should be no harder to solve than the standard constrained MDP, because both are formulated as linear programs with the same number of constraints. To experimentally verify this, we have timed the runs of all our experiments. Figure 4 contains a plot of the times that it took to solve the problems in all our experiments. One can see that the running times for all three methods are not appreciably different.⁵ In particular, the average ratio of the running time for the standard CMDP to the running time of the unconstrained method is 1.25; the ratio of the running time of our linear approximation with constraints on overutilization probability to the running time of the unconstrained method is 1.06. The slight downward curvature of the plot of the running time of our approximation method appears to be a consequence of the specific implementation of the linear programming algorithm that we used in our experiments.

Figure 4: Running time for the three methods

⁵Here we have to note that we used the linear programming approach to solving the unconstrained problem. A different algorithm such as value or policy iteration might be more efficient.

6 Discussion and Future Work

Our experiments substantiate the claims that our approximation provides an effective and efficient method that agents can use to formulate policies that not only consider limitations on execution resources, but that also explicitly bound the probability of resource overutilization. Our new approximation achieves the constraints on overutilization, and is essentially no more expensive to use than more standard CMDP and unconstrained MDP methods. Because our method constructs policies that are more careful about avoiding resource overutilization, the rewards associated with its policies when resources are not overutilized tend to be less than the rewards for the other methods' policies when resources are not overutilized. However, as we illustrated in Figure 3, when overutilization incurs penalties, our new method can outperform previous techniques. Thus, our new method is particularly suited to agents engaged in mission-critical domains.

Furthermore, if penalties for resource overutilization are

known at design-time and can be expressed in the same units

as the rewards, an interesting modification of our LP formu-

lation is to include the penalties in the policy evaluation cri-

terion, as opposed to modeling the constraints on the overuti-

lization probability. This would yield an LP with the follow-

ing objective function:

(7)

where $W_k$ is the penalty (in units of $r_{ia}$) incurred for overutilizing resource $k$. The maximization is subject to just the

standard "conservation of probability" constraints as in (eq.

3). A benefit of this formulation is that, for certain initial

probability distributions, deterministic policies are optimal.

A downside is that the formulation does not allow one to ex-

plicitly control the acceptable overutilization probabilities.

As can be seen in Figure 1, our linear approximation is in general overly conservative. For example, given permission to overutilize the resource 20% of the time ($p_0 = 0.2$), the method generates a policy that overutilizes the resource only about 1% of the time. Since rewards and resource usage are typically correlated, we would expect that a policy that undershoots the permitted resource overutilization probability by a lower amount would also yield a higher expected reward. Toward this end, we have formulated a quadratic programming approximation, based on the Chebyshev inequality, which allows us to put a better upper bound on the probability of resource overutilization:

$$P\left(|\hat{c}_k - E[\hat{c}_k]| \ge \delta_k\right) \le \frac{\sigma_k^2}{\delta_k^2} \tag{8}$$

where $\sigma_k^2$ is the variance in the amount $\hat{c}_k$ of used resource $k$ and $\delta_k$ specify the widths of the allowable regions for the amounts of used resources.

We are also investigating another formulation of the optimization program that should give a more accurate estimate of the resource overutilization probability. The method is based on a polynomial approximation of the pdf of the total resource-usage cost and uses the moments of the cost as the optimization variables. This approximation and the one based on the Chebyshev inequality are more costly to compute than the linear one. Our current efforts are to encode the approximations and evaluate their strengths and weaknesses.
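To give a feel for how much tighter a variance-based bound can be, the sketch below compares the two textbook inequalities for a fixed mean, variance, and limit. It is only a numerical illustration of the inequalities themselves, not of the quadratic program mentioned above. The function names and the numbers are assumptions.

```python
def markov_bound(mean, limit):
    """Markov: P(X >= limit) <= mean / limit, for nonnegative X and limit > 0."""
    return min(1.0, mean / limit)


def chebyshev_bound(mean, variance, limit):
    """Chebyshev applied to the upper tail.

    P(X >= limit) <= P(|X - mean| >= limit - mean)
                  <= variance / (limit - mean)**2,
    which is only informative when limit > mean.
    """
    if limit <= mean:
        return 1.0
    return min(1.0, variance / (limit - mean) ** 2)


# Hypothetical resource-usage statistics and limit.
mean_usage, var_usage, capacity = 10.0, 85.0, 40.0

markov = markov_bound(mean_usage, capacity)
chebyshev = chebyshev_bound(mean_usage, var_usage, capacity)
```

With these numbers the Markov bound is 0.25 while the Chebyshev bound is about 0.094. The Chebyshev bound is not always the smaller one, since it depends on the variance, and using it inside an optimization requires the variance as well as the mean, which is consistent with the higher computational cost noted above.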

7 Acknowledgments

This paper is based upon work supported by DARPA/ITO and the Air Force Research Laboratory under contract F30602-00-C-0017 as a subcontractor through Honeywell Laboratories. The authors thank Kang Shin, Haksun Li, and David Musliner for their valuable contributions, as well as one of the anonymous reviewers whose comments inspired (eq. 7).

References

[Altman, 1999] Eitan Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC, 1999.

[Boutilier et al., 1999] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.

[Dean et al., 1993] Thomas Dean, Leslie Pack Kaelbling, Jak Kirman, and Ann Nicholson. Planning with deadlines in stochastic domains. In Proc. of the Eleventh National Conf. on Artificial Intelligence, pages 574-579, CA, 1993.

[Howard, 1960] R. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, 1960.

[Huang and Kallenberg, 1994] Y. Huang and L.C.M. Kallenberg. On finding optimal policies for Markov decision chains. Math. of Operations Research, 19:434-448, 1994.

[Kallenberg, 1983] L.C.M. Kallenberg. Linear Programming and Finite Markovian Control Problems. Mathematisch Centrum, Amsterdam, 1983.

[Musliner et al., 1995] David John Musliner, James Hendler, Ashok K. Agrawala, Edmund H. Durfee, Jay K. Strosnider, and C. J. Paul. The challenges of real-time AI. IEEE Computer, 28(1), 1995.

[Puterman, 1995] M. L. Puterman. Markov decision processes. John Wiley & Sons, New York, 1995.

[Ross and Chen, 1988] K.W. Ross and B. Chen. Optimal scheduling of interactive and non-interactive traffic in telecommunication systems. IEEE Transactions on Auto Control, 33:261-267, 1988.

[Ross and Varadarajan, 1989] K.W. Ross and R. Varadarajan. Markov decision processes with sample path constraints: the communicating case. OR, 37:780-790, 1989.

[Ross and Varadarajan, 1991] K.W. Ross and R. Varadarajan. Multichain Markov decision processes with a sample path constraint: A decomposition approach. Math. of Operations Research, 16:195-207, 1991.

[Russell and Subramanian, 1995] S. J. Russell and D. Subramanian. Provably bounded optimal agents. JAIR, 2:575-609, 1995.

[Sobel, 1985] M.J. Sobel. Maximal mean/standard deviation ratio in undiscounted MDP. OR Letters, 4:157-188, 1985.