MARKOV DECISION PROCESSES WITH AVERAGE-VALUE-AT-RISK
CRITERIA
NICOLE BÄUERLE∗ AND JONATHAN OTT‡
Abstract. We investigate the problem of minimizing the Average-Value-at-Risk (AVaRτ) of the discounted cost over a finite and an infinite horizon which is generated by a Markov Decision Process (MDP). We show that this problem can be reduced to an ordinary MDP with extended state space and give conditions under which an optimal policy exists. We also give a time-consistent interpretation of the AVaRτ. At the end we consider a numerical example which is a simple repeated casino game. It is used to discuss the influence of the risk aversion parameter τ of the AVaRτ-criterion.
Key words: Markov Decision Problem, Average-Value-at-Risk, Time-consistency,
Risk aversion.
AMS subject classifications: 90C40, 91B06.
1. Introduction
Risk-sensitive optimality criteria for Markov Decision Processes (MDPs) have been considered
by various authors over the years. In contrast to risk neutral optimality criteria which simply
minimize expected discounted cost, risk-sensitive criteria often lead to non-standard MDPs which
cannot be solved in a straightforward way by using the Bellman equation. This property is
often called time-inconsistency. For example, Howard & Matheson (1972) introduced the notion of risk-sensitive MDPs by using an exponential utility function. Jaquette (1973) considers moments of total discounted cost as an optimality criterion. Later, e.g. Wu & Lin (1999) investigated the target level criterion where the aim is to maximize the probability that the total discounted reward exceeds a given target value. The related target hitting criterion is studied in Boda et al. (2004) where the aim is to minimize the probability that the total discounted reward does not exceed a given target value. A quite general problem is investigated in Collins & McNamara (1998). There the authors deal with a finite horizon problem which looks like an MDP, however the classical terminal reward is replaced by a strictly concave functional of the terminal distribution. Other probabilistic criteria, mostly in combination with long-run performance measures, can be found in the survey of White (1988).
Another quite popular risk-sensitive criterion is the mean-variance criterion, where the aim is to minimize the variance, given the expected reward exceeds a certain target. Since it is not possible to write down a straightforward Bellman equation, it took some time until Li & Ng (2000) managed to solve this kind of problem in a multiperiod setting using MDP methods. In the last decade risk measures have become popular and the simple variance has been replaced by more sophisticated risk measures like the Value-at-Risk (VaRτ) or the Average-Value-at-Risk (AVaRτ). Clearly, when risk measures are used as optimization criteria, we cannot expect multiperiod problems to become time-consistent. In Bäuerle & Mundt (2009) a mean-AVaRτ problem has been solved for an investment problem in a binomial financial market.
‡ The underlying projects have been funded by the Bundesministerium für Bildung und Forschung of Germany under promotional reference 03BAPAC1. The authors are responsible for the content of this article.
∗ Institute for Stochastics, Karlsruhe Institute of Technology, D-76128 Karlsruhe, Germany, e-mail: nicole.baeuerle@kit.edu.
Several authors have since tackled the problem of formulating time-consistent risk-sensitive multiperiod optimization problems. For example, in Boda & Filar (2006) a time-consistent AVaRτ-problem has been given by restricting the class of admissible policies. Björk & Murgoci (2010) tackle the general problem of defining time-consistent controls using game theoretic considerations. A different notion of time-consistency has been discussed in Shapiro (2009). He calls a policy time-consistent if the current optimal action does not depend on paths which are known to be impossible in the future. In Shapiro (2009) it is shown that the AVaRτ is not time-consistent w.r.t. this definition, but an alternative formulation of a time-consistent criterion is given. Further time-consistency considerations for risk measures can e.g. be found in Artzner et al. (2007) or Bion-Nadal (2008).
In this paper we investigate the problem of minimizing the AVaRτ of the discounted cost over a finite and an infinite horizon which is generated by a Markov Decision Process. We show that this problem can be reduced to an ordinary MDP with extended state space and give conditions under which an optimal policy exists. In particular, it is seen that the optimal policy depends on the history only through a certain kind of ‘sufficient statistic’. In the case of an infinite horizon we show that the minimal value can be characterized as the unique fixed point of a minimal cost operator. Further, we give a time-consistent interpretation of the AVaRτ. At the end we also consider a numerical example which is a simple repeated casino game. It is used to discuss the influence of the risk aversion parameter τ of the AVaRτ. For τ → 0 the AVaRτ coincides with the risk neutral optimization problem and for τ → 1 it coincides with the Worst-Case risk measure. We see that with increasing τ the distribution of the final capital narrows and the probability of getting ruined decreases.
The paper is organized as follows: In Section 2 we explain the joint state-cost process and the admissible policies. In Section 3 we solve the finite horizon AVaRτ problem and give a time-consistent interpretation. Next, in Section 4 we consider and solve the infinite horizon problem, and Section 5 contains the numerical example.
2. A Markov Decision Process with Average-Value-at-Risk Criteria
We suppose that a controlled Markov state process (Xn) in discrete time is given with values in a Borel set E, together with a non-negative cost process (Cn). All random variables are defined on a common probability space (Ω, F, P). The evolution of the system is as follows: suppose that we are in state Xn = x at time n. Then we are allowed to choose an action a from an action space A which is an arbitrary Borel space. In general we assume that not all actions from the set A are admissible. We denote by D ⊂ E × A the set of all admissible state-action combinations. The set D(x) := {a ∈ A : (x, a) ∈ D} gives the admissible actions in state x for all states x ∈ E. When we choose the action a ∈ D(x) at time n, a random cost Cn ≥ 0 is incurred and a transition to the next state Xn+1 takes place. The distribution of Cn and Xn+1 is given by a transition kernel Q (see below). When we denote by An the (random) action which is chosen at time n, then we assume that An is Fn = σ(X0, A0, C0, ..., Xn)-measurable, i.e. at time n we are allowed to use the complete history of the state process for our decision. Thus we introduce recursively the sets of histories:

H0 := E,   Hk+1 := Hk × A × R × E,

where hk = (x0, a0, c0, x1, ..., ak−1, ck−1, xk) ∈ Hk gives a history up to time k. A history-dependent policy π = (gk)k∈N0 is given by a sequence of mappings gk : Hk → A such that gk(hk) ∈ D(xk). We denote the set of all such policies by Π. A policy π ∈ Π induces a probability measure Pπ on (Ω, F). We suppose that there is a joint (stationary) transition kernel Q from E × A to E × R such that
Pπ(Xn+1 ∈ Bx, Cn ∈ Bc | X0, g0(X0), C0, ..., Xn, gn(X0, A0, C0, ..., Xn))
= Pπ(Xn+1 ∈ Bx, Cn ∈ Bc | Xn, gn(X0, A0, C0, ..., Xn))
= Q(Bx × Bc | Xn, gn(X0, A0, C0, ..., Xn))
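To make the dynamics concrete, the following sketch simulates the joint state-cost process under a history-dependent policy for a small made-up instance; the kernel, the policy, and all numerical choices are our own illustrations, not part of the model above.

```python
import random

# A toy instance of the setup above (all concrete choices are illustrative
# assumptions): states E = {0, 1}, actions A = {0, 1}, every action
# admissible, and a hand-made joint kernel Q for (next state, cost).

def sample_Q(x, a, rng):
    """Sample (X_{n+1}, C_n) from a hypothetical joint kernel Q(. | x, a)."""
    p_stay = 0.7 if a == 0 else 0.4           # transition part of Q
    x_next = x if rng.random() < p_stay else 1 - x
    cost = rng.uniform(0.0, 1.0) + x          # cost part of Q, 0 <= C_n <= 2
    return x_next, cost

def simulate(policy, x0, N, beta, seed=0):
    """Run the state-cost process for periods k = 0, ..., N under a
    history-dependent policy g_k : H_k -> A and return the discounted
    cost sum_k beta^k * C_k."""
    rng = random.Random(seed)
    history = [x0]                            # h_k = (x_0, a_0, c_0, ..., x_k)
    x, total = x0, 0.0
    for k in range(N + 1):
        a = policy(history)                   # g_k may use the whole history
        x, c = sample_Q(x, a, rng)
        total += beta ** k * c
        history += [a, c, x]
    return total

def threshold_policy(history):
    """A hypothetical history-dependent policy: switch actions once the
    accumulated undiscounted cost exceeds a threshold."""
    costs = history[2::3]                     # the c-components of h_k
    return 0 if sum(costs) < 3.0 else 1

c_N = simulate(threshold_policy, x0=0, N=10, beta=0.95)
```

Since the costs in this toy kernel are bounded by 2, the simulated discounted cost is bounded by 2·Σ β^k, in line with the boundedness assumption below.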
for measurable sets Bx ⊂ E and Bc ⊂ R. There is a discount factor β ∈ [0, 1] and we will either consider a finite planning horizon N ∈ N0 or an infinite planning horizon. Thus we will either consider the cost

CN := Σ_{k=0}^{N} β^k Ck   or   C∞ := Σ_{k=0}^{∞} β^k Ck.

We will always assume that the random variables Ck are non-negative and bounded from above by a constant C̄. Instead of minimizing the expected cost we will now use the non-standard criterion of minimizing the so-called Average-Value-at-Risk which is defined as follows (note that we assume here that large values of X are bad and small values of X are good):
Definition 2.1. Let X ∈ L1(Ω, F, P) be a real-valued random variable and let τ ∈ (0, 1).
a) The Value-at-Risk of X at level τ, denoted by VaRτ(X), is defined by

VaRτ(X) = inf{x ∈ R : P(X ≤ x) ≥ τ}.

b) The Average-Value-at-Risk of X at level τ, denoted by AVaRτ(X), is defined by

AVaRτ(X) = (1/(1 − τ)) ∫_τ^1 VaRt(X) dt.
Note that, if X has a continuous distribution, then the AVaRτ(X) can be written in the more intuitive form

AVaRτ(X) = E[X | X ≥ VaRτ(X)],

see e.g. Acerbi & Tasche (2002). The aim now is to find for fixed τ ∈ (0, 1):
inf_{π∈Π} AVaR^π_τ(CN | X0 = x),   (2.1)
inf_{π∈Π} AVaR^π_τ(C∞ | X0 = x),   (2.2)
where AVaR^π_τ indicates that the AVaRτ is taken w.r.t. the probability measure Pπ. A policy π∗ is called optimal for the finite horizon problem if

inf_{π∈Π} AVaR^π_τ(CN | X0 = x) = AVaR^{π∗}_τ(CN | X0 = x),

and a policy π∗ is called optimal for the infinite horizon problem if

inf_{π∈Π} AVaR^π_τ(C∞ | X0 = x) = AVaR^{π∗}_τ(C∞ | X0 = x).
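Definition 2.1 translates directly into an estimator on simulated cost samples, which is one way to evaluate a fixed policy numerically. A minimal Python sketch (function names are ours; for an empirical distribution the map t ↦ VaRt(X) is piecewise constant, so the integral in part b) can be computed exactly):

```python
import math

def var_tau(samples, tau):
    """Empirical Value-at-Risk: the smallest sample point x with
    P(X <= x) >= tau, as in Definition 2.1 a)."""
    xs = sorted(samples)
    k = math.ceil(tau * len(xs))              # need >= tau*n observations <= x
    return xs[max(k, 1) - 1]

def avar_tau(samples, tau):
    """Empirical Average-Value-at-Risk, Definition 2.1 b): integrate the
    piecewise-constant map t -> VaR_t(X) exactly over (tau, 1)."""
    xs = sorted(samples)
    n = len(xs)
    total = 0.0
    for k, x in enumerate(xs, start=1):       # VaR_t = xs[k-1] on ((k-1)/n, k/n]
        lo, hi = max((k - 1) / n, tau), k / n
        if hi > lo:
            total += (hi - lo) * x
    return total / (1.0 - tau)
```

For the sample (1, 2, 3, 4) and τ = 0.5 this yields VaR = 2 and AVaR = 3.5, the mean of the worst half of the sample.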
Note that this problem is no longer a standard Markov Decision Problem since the Average-Value-at-Risk is a convex risk measure. However, if we let τ → 0 then we obtain the usual expectation, i.e.

lim_{τ→0} AVaR^π_τ(CN | X0 = x) = E^π_x[CN],

where E^π_x is the expectation with respect to the probability measure P^π_x which is induced by policy π and conditioned on X0 = x. On the other hand, if we let τ → 1, then we obtain in the limit the Worst-Case risk measure which is defined by

WC(CN) := sup_ω CN(ω).

Hence the parameter τ can be seen as a kind of degree of risk aversion. For a discussion of the task of minimizing the Average-Value-at-Risk of the average cost limsup_{N→∞} (1/(N+1)) Σ_{k=0}^{N} Ck, see Ott (2010), Chapter 8.
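These two limits can be checked numerically via the dual representation AVaRτ(X) = min_s { s + E[(X − s)^+]/(1 − τ) } of Rockafellar & Uryasev (2002); for a finite sample the minimum is attained at one of the sample points. A sketch (the cost values are made up):

```python
def avar_ru(samples, tau):
    """AVaR_tau via the Rockafellar & Uryasev (2002) representation
    AVaR_tau(X) = min_s { s + E[(X - s)^+] / (1 - tau) }; for an
    empirical distribution the minimizer is one of the sample points."""
    n = len(samples)
    def objective(s):
        return s + sum(max(x - s, 0.0) for x in samples) / (n * (1.0 - tau))
    return min(objective(s) for s in samples)

costs = [0.0, 1.0, 2.0, 10.0]         # made-up sample of discounted costs

nearly_mean = avar_ru(costs, 0.001)   # tau -> 0: approaches the mean 3.25
nearly_worst = avar_ru(costs, 0.999)  # tau -> 1: approaches the maximum 10.0
```

Small τ reproduces the risk neutral value, large τ the Worst-Case value, illustrating τ as a degree of risk aversion.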
[Figure 4 shows twelve histograms of the final capital (relative frequency vs. final capital) under AVaRτ-optimal policies, one panel for each of the values τ = 0.1230, 0.2845, 0.3770, 0.4920, 0.5840, 0.6610, 0.7455, 0.8145, 0.8530, 0.8760, 0.9205, 0.9750.]

Figure 4. Histograms of the final capital for AVaRτ-optimal policies.
References
Acerbi, C. & Tasche, D. (2002). On the coherence of expected shortfall. Journal of Banking and
Finance 26, 1487–1503.
Artzner, P., Delbaen, F., Eber, J., Heath, D. & Ku, H. (2007). Coherent multiperiod risk
adjusted values and Bellman’s principle. Annals of Oper. Res. 152, 5–22.
Bäuerle, N. & Mundt, A. (2009). Dynamic mean-risk optimization in a binomial model. Math. Methods Oper. Res. 70, 219–239.
Bäuerle, N. & Rieder, U. (2011). Markov Decision Processes with applications to finance. Springer.
Bertsekas, D. P. & Shreve, S. E. (1978). Stochastic optimal control. Academic Press, New York.
Bion-Nadal, J. (2008). Dynamic risk measures: Time consistency and risk measures from BMO
martingales. Finance and Stochastics 12, 219–244.
Björk, T. & Murgoci, A. (2010). A general theory of Markovian time inconsistent stochastic control problems. Available at SSRN: http://ssrn.com/abstract=1694759, 1–39.
Boda, K. & Filar, J. (2006). Time consistent dynamic risk measures. Mathematical Methods of
Operations Research 63, 169–186.
Boda, K., Filar, J. A., Lin, Y. & Spanjers, L. (2004). Stochastic target hitting time and the
problem of early retirement. IEEE Trans. Automat. Control 49, 409–419.
Collins, E. & McNamara, J. (1998). Finite-horizon dynamic optimisation when the terminal
reward is a concave functional of the distribution of the final state. Advances in Applied
Probability 30, 122–136.
Howard, R. & Matheson, J. (1972). Risk-sensitive Markov Decision Processes. Management
Science 18, 356–369.
Jaquette, S. (1973). Markov Decision Processes with a new optimality criterion: discrete time.
Ann. Statist. 1, 496–505.
Li, D. & Ng, W.-L. (2000). Optimal dynamic portfolio selection: multiperiod mean-variance
formulation. Math. Finance 10, 387–406.
Ott, J. (2010). A Markov decision model for a surveillance application and risk-sensitive Markov decision processes. Ph.D. thesis, Karlsruhe Institute of Technology, http://digbib.ubka.uni-karlsruhe.de/volltexte/1000020835.
Rockafellar, R. T. & Uryasev, S. (2002). Conditional Value-at-Risk for general loss distributions.
Journal of Banking and Finance 26, 1443–1471.
Shapiro, A. (2009). On a time consistency concept in risk averse multistage stochastic program-
ming. Operations Research Letters 37, 143–147.
White, D. J. (1988). Mean, variance, and probabilistic criteria in finite Markov Decision Pro-
cesses: a review. J. Optim. Theory Appl. 56, 1–29.
Wu, C. & Lin, Y. (1999). Minimizing risk models in Markov Decision Processes with policies
depending on target values. J. Math. Anal. Appl. 231, 47–67.
(N. Bäuerle) Institute for Stochastics, Karlsruhe Institute of Technology, D-76128 Karlsruhe, Germany
E-mail address: nicole.baeuerle@kit.edu
(J. Ott) Institute for Stochastics, Karlsruhe Institute of Technology, D-76128 Karlsruhe, Germany
E-mail address: jonathan.ott@kit.edu