
MARKOV DECISION PROCESSES WITH AVERAGE-VALUE-AT-RISK CRITERIA

NICOLE BÄUERLE∗ AND JONATHAN OTT‡

Abstract. We investigate the problem of minimizing the Average-Value-at-Risk (AVaR_τ) of the discounted cost over a finite and an infinite horizon which is generated by a Markov Decision Process (MDP). We show that this problem can be reduced to an ordinary MDP with extended state space and give conditions under which an optimal policy exists. We also give a time-consistent interpretation of the AVaR_τ. At the end we consider a numerical example which is a simple repeated casino game. It is used to discuss the influence of the risk aversion parameter τ of the AVaR_τ-criterion.

Key words: Markov Decision Problem, Average-Value-at-Risk, Time-consistency, Risk aversion.

AMS subject classifications: 90C40, 91B06.

1. Introduction

Risk-sensitive optimality criteria for Markov Decision Processes (MDPs) have been considered

by various authors over the years. In contrast to risk neutral optimality criteria which simply

minimize expected discounted cost, risk-sensitive criteria often lead to non-standard MDPs which

cannot be solved in a straightforward way by using the Bellman equation. This property is

often called time-inconsistency. For example, Howard & Matheson (1972) introduced the notion of risk-sensitive MDPs by using an exponential utility function.

Jaquette (1973) considers moments of total discounted cost as an optimality criterion. Later, e.g., Wu & Lin (1999) investigated the target level criterion, where the aim is to maximize the probability that the total discounted reward exceeds a given target value. The related target hitting criterion is studied in Boda et al. (2004), where the aim is to minimize the probability that the total discounted reward does not exceed a given target value. A quite general problem is investigated in Collins & McNamara (1998). There the authors deal with a finite horizon problem which looks like an MDP; however, the classical terminal reward is replaced by a strictly concave functional of the terminal distribution. Other probabilistic criteria, mostly in combination with long-run performance measures, can be found in the survey of White (1988).

Another quite popular risk-sensitive criterion is the mean-variance criterion, where the aim is to minimize the variance, given that the expected reward exceeds a certain target. Since it is not possible to write down a straightforward Bellman equation, it took some time until Li & Ng (2000) managed to solve this kind of problem in a multiperiod setting using MDP methods. In the last decade risk measures have become popular, and the simple variance has been replaced by more complicated risk measures like the Value-at-Risk (VaR_τ) or the Average-Value-at-Risk (AVaR_τ). Clearly, when risk measures are used as optimization criteria, we cannot expect multiperiod problems to become time-consistent. In Bäuerle & Mundt (2009) a mean-AVaR_τ problem has been solved for an investment problem in a binomial financial market.


‡The underlying projects have been funded by the Bundesministerium für Bildung und Forschung of Germany under promotional reference 03BAPAC1. The authors are responsible for the content of this article.

∗Institute for Stochastics, Karlsruhe Institute of Technology, D-76128 Karlsruhe, Germany, e-mail: nicole.baeuerle@kit.edu.



Some authors have recently tackled the problem of formulating time-consistent risk-sensitive multiperiod optimization problems. For example, in Boda & Filar (2006) a time-consistent AVaR_τ problem has been given by restricting the class of admissible policies. Björk & Murgoci (2010) tackle the general problem of defining time-consistent controls using game-theoretic considerations. A different notion of time-consistency has been discussed in Shapiro (2009). He calls a policy time-consistent if the current optimal action does not depend on paths which one already knows cannot happen in the future. In Shapiro (2009) it is shown that the AVaR_τ is not time-consistent w.r.t. this definition, but an alternative formulation of a time-consistent criterion is given. Further time-consistency considerations for risk measures can e.g. be found in Artzner et al. (2007) or Bion-Nadal (2008).

In this paper we investigate the problem of minimizing the AVaR_τ of the discounted cost over a finite and an infinite horizon which is generated by a Markov Decision Process. We show that this problem can be reduced to an ordinary MDP with extended state space and give conditions under which an optimal policy exists. In particular it is seen that the optimal policy depends on the history only through a certain kind of 'sufficient statistic'. In the case of an infinite horizon we show that the minimal value can be characterized as the unique fixed point of a minimal cost operator. Further, we give a time-consistent interpretation of the AVaR_τ. At the end we also consider a numerical example which is a simple repeated casino game. It is used to discuss the influence of the risk aversion parameter τ of the AVaR_τ. For τ → 0 the AVaR_τ coincides with the risk neutral optimization criterion and for τ → 1 it coincides with the Worst-Case risk measure. We see that with increasing τ the distribution of the final capital narrows and the probability of getting ruined decreases.

The paper is organized as follows: in Section 2 we explain the joint state-cost process and the admissible policies. In Section 3 we solve the finite horizon AVaR_τ problem and give a time-consistent interpretation. Next, in Section 4, we consider and solve the infinite horizon problem, and Section 5 contains the numerical example.

2. A Markov Decision Process with Average-Value-at-Risk Criteria

We suppose that a controlled Markov state process (X_n) in discrete time is given with values in a Borel set E, together with a non-negative cost process (C_n). All random variables are defined on a common probability space (Ω, F, P). The evolution of the system is as follows: suppose that we are in state X_n = x at time n. Then we are allowed to choose an action a from an action space A which is an arbitrary Borel space. In general we assume that not all actions from the set A are admissible. We denote by D ⊂ E × A the set of all admissible state-action combinations. The set D(x) := {a ∈ A : (x, a) ∈ D} gives the admissible actions in state x for all states x ∈ E. When we choose the action a ∈ D(x) at time n, a random cost C_n ≥ 0 is incurred and a transition to the next state X_{n+1} takes place. The distribution of C_n and X_{n+1} is given by a transition kernel Q (see below). When we denote by A_n the (random) action which is chosen at time n, then we assume that A_n is F_n = σ(X_0, A_0, C_0, ..., X_n)-measurable, i.e. at time n we are allowed to use the complete history of the state process for our decision. Thus we introduce recursively the sets of histories:

    H_0 := E,    H_{k+1} := H_k × A × R × E,

where h_k = (x_0, a_0, c_0, x_1, ..., a_{k-1}, c_{k-1}, x_k) ∈ H_k gives a history up to time k. A history-dependent policy π = (g_k)_{k∈N_0} is given by a sequence of mappings g_k : H_k → A such that g_k(h_k) ∈ D(x_k). We denote the set of all such policies by Π. A policy π ∈ Π induces a probability measure P^π on (Ω, F). We suppose that there is a joint (stationary) transition kernel Q from E × A to E × R such that

    P^π(X_{n+1} ∈ B_x, C_n ∈ B_c | X_0, g_0(X_0), C_0, ..., X_n, g_n(X_0, A_0, C_0, ..., X_n))
        = P^π(X_{n+1} ∈ B_x, C_n ∈ B_c | X_n, g_n(X_0, A_0, C_0, ..., X_n))
        = Q(B_x × B_c | X_n, g_n(X_0, A_0, C_0, ..., X_n))


for measurable sets B_x ⊂ E and B_c ⊂ R. There is a discount factor β ∈ [0,1] and we will either consider a finite planning horizon N ∈ N_0 or an infinite planning horizon. Thus we will either consider the cost

    C_N := Σ_{k=0}^{N} β^k C_k    or    C_∞ := Σ_{k=0}^{∞} β^k C_k.

We will always assume that the random variables C_k are non-negative and bounded from above by a constant C̄. Instead of minimizing the expected cost we will now use the non-standard criterion of minimizing the so-called Average-Value-at-Risk, which is defined as follows (note that we assume here that large values of X are bad and small values of X are good):

Definition 2.1. Let X ∈ L^1(Ω, F, P) be a real-valued random variable and let τ ∈ (0,1).

a) The Value-at-Risk of X at level τ, denoted by VaR_τ(X), is defined by

    VaR_τ(X) = inf{x ∈ R : P(X ≤ x) ≥ τ}.

b) The Average-Value-at-Risk of X at level τ, denoted by AVaR_τ(X), is defined by

    AVaR_τ(X) = (1/(1-τ)) ∫_τ^1 VaR_t(X) dt.

Note that, if X has a continuous distribution, then AVaR_τ(X) can be written in the more intuitive form

    AVaR_τ(X) = E[X | X ≥ VaR_τ(X)],

see e.g. Acerbi & Tasche (2002). The aim now is to find, for fixed τ ∈ (0,1),

    inf_{π∈Π} AVaR^π_τ(C_N | X_0 = x),    (2.1)

    inf_{π∈Π} AVaR^π_τ(C_∞ | X_0 = x),    (2.2)

where AVaR^π_τ indicates that the AVaR_τ is taken w.r.t. the probability measure P^π. A policy π* is called optimal for the finite horizon problem if

    inf_{π∈Π} AVaR^π_τ(C_N | X_0 = x) = AVaR^{π*}_τ(C_N | X_0 = x),

and a policy π* is called optimal for the infinite horizon problem if

    inf_{π∈Π} AVaR^π_τ(C_∞ | X_0 = x) = AVaR^{π*}_τ(C_∞ | X_0 = x).
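As an aside, Definition 2.1 and the tail-expectation form above can be illustrated numerically. The following sketch is ours, not from the paper (the function names are made up): it estimates VaR_τ and AVaR_τ from a Monte Carlo sample and checks that, for a continuous distribution, AVaR_τ is approximately the mean of the τ-tail.

```python
import numpy as np

def var_avar(sample, tau, grid=20_000):
    """Empirical VaR_tau and AVaR_tau of a sample (cf. Definition 2.1)."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    # VaR_tau: smallest x with P(X <= x) >= tau (left-continuous quantile)
    var = s[max(int(np.ceil(tau * n)) - 1, 0)]
    # AVaR_tau = 1/(1 - tau) * integral_tau^1 VaR_t dt, via a Riemann sum in t
    ts = np.linspace(tau, 1.0, grid, endpoint=False) + (1.0 - tau) / (2 * grid)
    idx = np.minimum(np.ceil(ts * n).astype(int) - 1, n - 1)
    avar = s[idx].mean()
    return var, avar

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)      # a continuous distribution
v, a = var_avar(x, 0.95)
tail = x[x >= v].mean()               # E[X | X >= VaR_tau(X)]
# for a standard normal, v ~ 1.645 and a ~ tail ~ 2.06 at tau = 0.95
```

The Riemann sum over t mirrors the integral in Definition 2.1 b) directly; a closed-form tail average would of course be faster but less transparent.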

Note that this problem is no longer a standard Markov Decision Problem since the Average-Value-at-Risk is a convex risk measure. However, if we let τ → 0 then we obtain the usual expectation, i.e.

    lim_{τ→0} AVaR^π_τ(C_N | X_0 = x) = E^π_x[C_N],

where E^π_x is the expectation with respect to the probability measure P^π_x which is induced by policy π and conditioned on X_0 = x. On the other hand, if we let τ → 1, then we obtain in the limit the Worst-Case risk measure which is defined by

    WC(C_N) := sup_ω C_N(ω).
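The two limits can be observed on a finite sample. In this sketch (ours; the uniform cost distribution is chosen only for illustration) the empirical AVaR_τ moves from the sample mean towards the sample maximum as τ increases:

```python
import numpy as np

def avar(sample, tau, grid=5_000):
    """Empirical AVaR_tau = 1/(1 - tau) * integral_tau^1 VaR_t dt (Riemann sum)."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    ts = np.linspace(tau, 1.0, grid, endpoint=False) + (1.0 - tau) / (2 * grid)
    idx = np.minimum(np.ceil(ts * n).astype(int) - 1, n - 1)
    return s[idx].mean()

x = np.random.default_rng(3).uniform(0.0, 1.0, 10_000)
low  = avar(x, 1e-6)      # tau -> 0: approaches the expectation E[X]
high = avar(x, 1 - 1e-6)  # tau -> 1: approaches the worst case sup X
```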

Hence the parameter τ can be seen as a degree of risk aversion. For a discussion of the task of minimizing the Average-Value-at-Risk of the average cost limsup_{N→∞} (1/(N+1)) Σ_{k=0}^{N} C_k, see Ott (2010), Chapter 8.


3. Solution of the Finite Horizon Problem

For the solution of the problem it is important to note that the Average-Value-at-Risk can be represented as the optimal value of a convex optimization problem. More precisely, the following lemma is given in Rockafellar & Uryasev (2002).

Lemma 3.1. Let X ∈ L^1(Ω, F, P) be a real-valued random variable and let τ ∈ (0,1). Then it holds:

    AVaR_τ(X) = min_{s∈R} { s + (1/(1-τ)) E[(X - s)^+] },

and the minimum point is given by s* = VaR_τ(X).

Hence we obtain for the problem with finite time horizon:

    inf_{π∈Π} AVaR^π_τ(C_N | X_0 = x)
        = inf_{π∈Π} inf_{s∈R} { s + (1/(1-τ)) E^π_x[(C_N - s)^+] }
        = inf_{s∈R} inf_{π∈Π} { s + (1/(1-τ)) E^π_x[(C_N - s)^+] }
        = inf_{s∈R} { s + (1/(1-τ)) inf_{π∈Π} E^π_x[(C_N - s)^+] }.
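The representation in Lemma 3.1 is easy to verify numerically. The sketch below is ours; an Exp(1) cost is chosen purely for concreteness, since then VaR_0.9 = ln 10 and AVaR_0.9 = ln 10 + 1 in closed form. It minimizes the objective s + E[(X - s)^+]/(1 - τ) over a grid of s-values:

```python
import numpy as np

tau = 0.9
rng = np.random.default_rng(1)
x = rng.exponential(size=50_000)          # a non-negative "cost" X

# objective of Lemma 3.1: g(s) = s + E[(X - s)^+] / (1 - tau)
s_grid = np.linspace(0.0, 8.0, 801)
g = np.array([s + np.maximum(x - s, 0.0).mean() / (1.0 - tau) for s in s_grid])

s_star = s_grid[g.argmin()]   # minimum point, close to VaR_tau(X) = ln 10
avar   = g.min()              # minimal value, close to AVaR_tau(X) = ln 10 + 1
```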

In what follows we will investigate the inner optimization problem and show that it can be solved with the help of a suitably defined Markov Decision Problem. For this purpose let us denote for n = 0, 1, ..., N:

    w_n^π(x, s) := E^π_x[(C_n - s)^+],    x ∈ E, s ∈ R, π ∈ Π,
    w_n(x, s) := inf_{π∈Π} w_n^π(x, s),   x ∈ E, s ∈ R.    (3.1)

We consider a Markov Decision Model which is given by a 2-dimensional state space Ẽ := E × R, action space A and admissible actions in D. The interpretation of the second component of the state (x, s) ∈ Ẽ will become clear later. It captures the relevant information of the history of the process (see Remark 3.3). Further, there are disturbance variables Z_n = (Z^1_n, Z^2_n) = (X_n, C_{n-1}) with values in E × R_+ which influence the transition. If the state of the Markov Decision Process is (x, s) at time n and action a is chosen, then the distribution of Z_{n+1} is given by the transition kernel Q(· | x, a). The transition function F : Ẽ × A × E × R_+ → Ẽ which determines the new state is given by

    F((x, s), a, (z_1, z_2)) = (z_1, (s - z_2)/β).

The first component of the right-hand side is simply the new state of our original state process and the necessary information update takes place in the second component. There is no running cost and the terminal cost function is given by V_{-1}^π(x, s) := V_{-1}(x, s) := s^-. We consider here decision rules f : Ẽ → A such that f(x, s) ∈ D(x) and denote by Π^M the set of Markov policies σ = (f_0, f_1, ...) where the f_n are decision rules. Note that 'Markov' refers here to the fact that the decision at time n depends only on x and s. For convenience we denote for v ∈ M(Ẽ) := {v : Ẽ → R_+ : v is measurable} the operators

    Lv(x, s, a) := β ∫ v(x', (s - c)/β) Q(dx' × dc | x, a),    (x, s) ∈ Ẽ, a ∈ D(x),

and

    T_f v(x, s) := β ∫ v(x', (s - c)/β) Q(dx' × dc | x, f(x, s)),    (x, s) ∈ Ẽ.

The minimal cost operator of this Markov Decision Model is given by

    Tv(x, s) = inf_{a∈D(x)} Lv(x, s, a).
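The role of the second state component and of the terminal cost s^- can be seen in a short computation: iterating the update s_{k+1} = (s_k - c_k)/β from the transition function F and discounting the terminal cost gives back exactly the shortfall (Σ_k β^k c_k - s)^+. A minimal sketch of this identity (ours; the cost values are arbitrary):

```python
import numpy as np

beta, s0 = 0.95, 3.0
c = np.random.default_rng(2).uniform(0.0, 1.0, size=10)   # bounded costs c_0..c_9

# information-state recursion s_{k+1} = (s_k - c_k)/beta from F
s = s0
for ck in c:
    s = (s - ck) / beta

lhs = beta ** len(c) * max(-s, 0.0)       # beta^n * s_n^-  (discounted terminal cost)
rhs = max(sum(beta ** k * ck for k, ck in enumerate(c)) - s0, 0.0)
# lhs equals rhs: the augmented state carries exactly (discounted cost - s)^+
```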


For a policy σ = (f_0, f_1, f_2, ...) ∈ Π^M we will denote by \vec{σ} = (f_1, f_2, ...) the shifted policy. We define for σ ∈ Π^M and n = -1, 0, 1, ..., N:

    V_{n+1}^σ := T_{f_0} V_n^{\vec{σ}},
    V_{n+1} := inf_σ V_{n+1}^σ = T V_n.

A decision rule f*_n with the property that V_n = T_{f*_n} V_{n-1} is called a minimizer of V_n. Next note that we have Π^M ⊂ Π in the following sense: for every σ = (f_0, f_1, ...) ∈ Π^M we find a π = (g_0, g_1, ...) ∈ Π such that (the variable s is considered as a global variable)

    g_0(x_0) := f_0(x_0, s),
    g_1(x_0, a_0, c_0, x_1) := f_1(x_1, (s - c_0)/β),
    ...

With this interpretation, w_n^σ is also defined for σ ∈ Π^M. Note that a policy σ = (f_0, f_1, ...) ∈ Π^M also depends on the history of our process, however in a weak sense. The only necessary information at time n of the history h_n = (x_0, a_0, c_0, x_1, ..., a_{n-1}, c_{n-1}, x_n) is x_n and the value

    (s - c_0)/β^n - c_1/β^{n-1} - ... - c_{n-1}/β.

Also note that Π is strictly larger than Π^M: there are history-dependent policies π which cannot be represented as a Markov policy σ ∈ Π^M. However, it will be shown in Theorem 3.2 that the optimal policy π* of problem (3.1) (if it exists) can indeed be found among the smaller class Π^M.

The connection of the MDP to the optimization problem in (3.1) is stated in the next theorem.

Theorem 3.2. It holds for n = 0, 1, ..., N that

a) w_n^σ = V_n^σ for σ ∈ Π^M.
b) w_n = V_n.

If there exist minimizers f*_n of V_n on all stages, then the Markov policy σ* = (f*_N, ..., f*_0) is optimal for problem (3.1).
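Before the proof, part a) can be sanity-checked by brute force on a toy model (everything below is our own illustration, not from the paper): one dummy state, two actions with known cost distributions, and an s-dependent decision rule f. Backward induction with the operator T_f is compared against direct enumeration of all cost paths.

```python
beta = 0.9

def cost_dist(a):
    """Cost distribution per action: list of (cost, probability) pairs."""
    return [(0.0, 0.5), (2.0, 0.5)] if a == 0 else [(1.0, 1.0)]

def f(x, s):
    """An s-dependent Markov decision rule on the extended state (x, s)."""
    return 1 if s >= 1.0 else 0

def V(n, x, s):
    """Backward induction: (T_f v)(x, s) = beta * E[v(x', (s - c)/beta)]."""
    if n == -1:
        return max(-s, 0.0)                      # terminal cost s^-
    return beta * sum(p * V(n - 1, x, (s - c) / beta)
                      for c, p in cost_dist(f(x, s)))

def w(n, x, s, cs=(), prob=1.0):
    """Direct evaluation of E[(C_n - s)^+] by enumerating cost paths under f."""
    if len(cs) == n + 1:
        total = sum(beta ** k * c for k, c in enumerate(cs))
        return prob * max(total - s, 0.0)
    s_inf = s                                    # information state seen by f
    for c in cs:
        s_inf = (s_inf - c) / beta
    return sum(w(n, x, s, cs + (c,), prob * p)
               for c, p in cost_dist(f(x, s_inf)))

for s in (0.0, 0.5, 1.5, 3.0):
    assert abs(V(2, 0, s) - w(2, 0, s)) < 1e-12  # w_n^sigma = V_n^sigma
```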

Proof. We first prove that w_n^σ = V_n^σ for all σ ∈ Π^M. This is done by induction on n. For n = 0 we obtain

    V_0^σ(x, s) = T_{f_0} V_{-1}(x, s)
        = β ∫ V_{-1}(x', (s - c)/β) Q(dx' × dc | x, f_0(x, s))
        = β ∫ ((s - c)/β)^- Q(dx' × dc | x, f_0(x, s))
        = ∫ (c - s)^+ Q(dx' × dc | x, f_0(x, s))
        = E^σ_x[(C_0 - s)^+] = w_0^σ(x, s).

Next we assume that the statement is true for n and show that it also holds for n + 1. We obtain

    V_{n+1}^σ(x, s) = T_{f_0} V_n^{\vec{σ}}(x, s)
        = β ∫ V_n^{\vec{σ}}(x', (s - c)/β) Q(dx' × dc | x, f_0(x, s))
        = β ∫ E^{\vec{σ}}_{x'}[(C_n - (s - c)/β)^+] Q(dx' × dc | x, f_0(x, s))
        = ∫ E^{\vec{σ}}_{x'}[(c + βC_n - s)^+] Q(dx' × dc | x, f_0(x, s))
        = E^σ_x[(C_{n+1} - s)^+] = w_{n+1}^σ(x, s).

Histories of the Markov Decision Process h̃_n = (x_0, s_0, a_0, c_0, x_1, s_1, a_1, ..., x_n, s_n) contain the history h_n = (x_0, a_0, c_0, x_1, a_1, ..., x_n). We denote by Π̃ the history-dependent policies of the