All content in this area was uploaded by Ben Black on Nov 23, 2022
Content may be subject to copyright.
Available via license: CC BY-NC-ND 4.0
Robust Markov decision processes under parametric transition
distributions
Ben Black∗†, Trivikram Dokka‡, Christopher Kirkbride§
Abstract
This paper considers robust Markov decision processes under parametric transition dis-
tributions. We assume that the true transition distribution is uniquely specified by some
parametric distribution, and explicitly enforce that the worst-case distribution from the
model is uniquely specified by a distribution in the same parametric family. After formu-
lating the parametric robust model, we focus on developing algorithms for carrying out the
robust Bellman updates required to complete robust value iteration. We first formulate
the update as a linear program by discretising the ambiguity set. Since this model scales
poorly with problem size and requires large amounts of pre-computation, we develop two
additional algorithms for solving the robust Bellman update. Firstly, we present a cutting
surface algorithm for solving this linear program in a shorter time. This algorithm requires
the same pre-computation, but only ever solves the linear program over small subsets of the
ambiguity set. Secondly, we present a novel projection-based bisection search algorithm that
completely eliminates the need for discretisation and does not require any pre-computation.
We test our algorithms extensively on a dynamic multi-period newsvendor problem under bi-
nomial and Poisson demands. In addition, we compare our methods with the non-parametric
phi-divergence based methods from the literature. We show that our projection-based al-
gorithm completes robust value iteration significantly faster than our other two parametric
algorithms, and also faster than its non-parametric equivalent.
Keywords: Uncertainty modelling, Markov processes, robust Markov decision processes, newsvendor
problems.
1 Introduction
Markov decision processes (Puterman, 1994) (MDPs) are a mathematical framework for mod-
elling dynamic decision making problems under uncertainty. Under the MDP framework, at
each decision epoch in a finite or infinite time horizon, a decision maker utilises information
about the current state of a system in order to select an action that yields them a reward. The
∗STOR-i Centre for Doctoral Training, Lancaster University, United Kingdom. Email:
b.black1@lancaster.ac.uk
†Corresponding author.
‡Advanced Analytics Group, Air Products Plc, United Kingdom. Email: Trivikram.Dokka@yahoo.co.uk
§Department of Management Science, Lancaster University Management School, United Kingdom. Email:
c.kirkbride@lancaster.ac.uk.
action taken can affect the next state of the system, which is stochastically governed by a set
of transition probabilities. The goal of the decision maker is to make decisions at each epoch in
order to maximise the total (discounted) expected reward that they receive over the entire hori-
zon. A solution of an MDP is understood as a policy, which provides an action or a distribution
over the set of actions to be taken in each state of the system. The policy is found prior to
any decisions being made, and in practice the decision maker can instantaneously generate their
action from the policy at any given epoch. Policies are usually found from algorithms based
on dynamic programming and Bellman’s optimality equations (Bellman, 1966), which are based
around the concept of value functions. Value functions give the expected total future reward
from starting in each state and following an optimal policy thereafter.
In classical MDPs, it is assumed that all parameters of the model (rewards, transition
probabilities, etc.) are known exactly. In practice, however, it can be difficult to determine these
parameters exactly and they must often be replaced with estimates. Moreover, it has been found
that replacing true parameters with estimates thereof can lead to policies that fail drastically
when implemented, due to errors in estimation (Le Tallec, 2007, Wiesemann et al., 2013) and
that the resulting value function estimates can have large variance and bias (Mannor et al.,
2007). Due to these issues, robust MDPs (Wiesemann et al., 2013) (RMDPs) have been pro-
posed to explicitly represent uncertainty in model parameters. RMDPs do not assume that all
parameters are known, but that they are known to lie in some pre-determined set. The decision
maker then aims to find a policy with the best worst-case total reward over all parameters in
the set. This limits the potential hazards of poor estimation.
In the case where only the transition probabilities are not known, we refer to this set as an
ambiguity set. Ambiguity sets are designed so that the decision maker can be confident that
the true transition distribution lies within the set. There are many ways of constructing an
ambiguity set. Early sets placed bounds on each transition probability (Satia and Lave, 1973,
Givan et al., 2000). In more recent papers, it has become more common to bound the distance
between any distribution in the set and some nominal distribution. For example, one can use
the Kullback-Leibler divergence, modified χ²-distance or L1-norms (Iyengar, 2005), or more
general classes of distance measures such as ϕ-divergence functions (Ho et al., 2022). The choice
of ambiguity set strongly affects the tractability of the resulting RMDP model. For general
ambiguity sets, it is known that RMDPs are NP-hard. However, this paper considers a special
type of ambiguity set called s-rectangular ambiguity sets (Le Tallec, 2007). Such ambiguity sets
allow the transition distributions for each state to be chosen independently of one another. The
resulting RMDP is solvable in polynomial time via robust value iteration (Wiesemann et al.,
2013).
Robust value iteration starts with some initial estimate of the value functions, then iteratively
updates these estimates until Bellman’s optimality equations are satisfied. We refer to the pro-
cess of finding the next value function estimates as solving a robust Bellman update. Much of the
recent RMDP literature has focused on developing fast algorithms for solving the robust Bell-
man update in s-rectangular RMDPs. Due to the fact that only the value estimates themselves
(and not the optimal policies responsible) are required to complete robust value iteration, many
of these algorithms employ simple methods like bisection search (Grand-Clément and Kroer,
2021, Ho et al., 2022) to solve the update. In this paper, we will focus on developing algorithms
for carrying out robust Bellman updates, with one key difference from the existing literature.
In particular, we will focus on RMDPs where the true state transition distribution either lies
in some parametric family or is specified by some external random variable that lies in some
parametric family (e.g. demand, service times, failure rates). In the existing RMDP litera-
ture, transition distributions are assumed to be non-parametric, and ambiguity sets typically
contain non-parametric distributions. However, in the case where the transition distribution is
indeed parametric, non-parametric ambiguity sets necessarily contain distributions that cannot
be equal to the true distribution. Our models use parametric ambiguity sets to enforce that every
potential distribution in the set lies in the same parametric family as the true distribution.
Constructing an RMDP in this fashion has a number of benefits. Firstly, it means that we only
need to find the worst-case parameters and not the entire worst-case distribution. The worst-
case parameter is typically of much smaller dimension than the worst-case distribution, meaning
that finding it can be much less cumbersome. As such, instead of ambiguity sets for the true
distribution, we use ambiguity sets for the true parameters. Secondly, explicitly using ambiguity
sets for the true parameters and using the corresponding parametric distributions in the model
means that every worst-case distribution generated by the model will lie in the correct parametric
family. In addition, we can make use of maximum likelihood estimation to build confidence sets
for the true parameters and use these as ambiguity sets in our models. Finally, parametric dis-
tributions are natural models for random variables affecting MDP state transitions in a number
of problems. An example of such a problem is a dynamic multi-period newsvendor problem (Ar-
row et al., 1958). In newsvendor models, demand is often considered as a parametric random
variable. More specifically, newsvendor demand has been modelled as normal (Nahmias, 1994),
negative binomial (Agrawal and Smith, 1996), lognormal and gamma (Gallego et al., 2007), and
exponential (Siegel and Wagner, 2021). In addition, for such problems it has been shown that
assuming that parameter estimates are truth can lead to poor cost estimation (Rossi et al., 2014,
Siegel and Wagner, 2021). Hence, a parametric ambiguity set provides a way to hedge against
parameter uncertainty while ensuring the worst-case distribution is also parametric.
This paper extends the concept of parametric ambiguity sets from Black et al. (2022), who stud-
ied a static multi-period resource planning problem under binomial demand, into the RMDP
literature. We formulate s-rectangular parametric ambiguity sets and solve the resulting RMDP
via robust value iteration. Under such ambiguity sets, the robust Bellman update is a para-
metric distributionally robust optimisation problem (Black et al., 2022). As a benchmark, we
reformulate the robust Bellman update as a linear program (LP) by discretising the ambiguity
set. Since this LP can become very slow for large problems, we develop two additional algo-
rithms for carrying out the update. The first is a cutting surface (Mehrotra and Papp, 2014)
(CS) algorithm that iteratively solves the LP over increasing subsets of the ambiguity set. The
second algorithm is a parametric projection-based bisection search algorithm. This algorithm
does not rely on any discretisation of the ambiguity set, and we will show that this means that
it solves the robust Bellman updates orders of magnitude faster than both CS and LP.
In summary, the contributions of this paper are as follows:
1. We extend the concept of parametric ambiguity sets from Black et al. (2022) into the
RMDP literature. Such ambiguity sets have only been used for static distributionally
robust optimisation (DRO) problems in the past. Since the DRO model used in the
robust Bellman update must be solved multiple times in an iterative fashion, scalability
of algorithms and computation is even more of a challenge in RMDPs.
2. We develop a fast projection-based bisection search algorithm for solving a robust Bellman
update that does not rely on any discretisation of the parametric ambiguity set.
3. We apply our methods to a dynamic multi-period newsvendor model under binomial and
Poisson demands. The results show that the parametric robust value iteration is tractable
and can be solved faster than its non-parametric equivalent.
2 Literature review
2.1 Robust Markov decision processes
RMDPs are a framework for modelling MDPs with unknown parameters, which has become
common in recent years due to the fact that MDPs are extremely sensitive to small changes
in their parameters (Mannor et al., 2007). RMDPs have been studied in the literature since
the 1970s, where the first example of an RMDP used ambiguity sets based on assigning upper
and lower bounds to each transition probability (Satia and Lave, 1973). Such ambiguity sets
were common in the early MDP literature. Givan et al. (2000) also studied bounded parameter
RMDPs, which were solved by solving a collection of exact MDPs. Later, Bagnell et al. (2001)
generalised the concepts of RMDPs to a variety of other ambiguity sets. Their only assumption
was that the sets were convex and compact, meaning that the class they considered covered
interval ambiguity sets as a special case. In addition, finite horizon RMDPs were also studied,
bringing forth a robust version of dynamic programming (DP) (Nilim and El Ghaoui, 2005).
These authors further developed the ambiguity sets used to encompass distance-based sets,
such as those based on the Kullback-Leibler divergence. Using such ambiguity sets allowed the
Bellman optimality equations to be reformulated using dualisation and hence solved exactly
or via bisection. Iyengar (2005) formalised these concepts further, studying finite and infinite
horizon RMDPs with a variety of ambiguity sets. For example, they studied ambiguity sets built
using the Kullback-Leibler divergence, modified χ²-distance and L1-norm.
Since these early papers, s-rectangular ambiguity sets have become very common in RMDPs.
An s-rectangular ambiguity set (Le Tallec, 2007) is one arising from the situation in which
the state transitions for each state are independent of one another. Hence, the worst-case
distributions for each state can be extracted independently of one another. Solving an RMDP
with an s-rectangular ambiguity set is equivalent to finding a fixed point of the robust Bellman
operator (Wiesemann et al., 2013), hence allowing a robust value iteration algorithm to solve the
infinite horizon case. Many recent papers have developed fast algorithms for solving the robust
Bellman updates required by robust value iteration. For example, Behzadian et al. (2021) studied
s-rectangular ambiguity sets defined by the L∞-norm. They developed a homotopy method that
was implemented within a bisection search algorithm for solving the robust Bellman update. Ho
et al. (2021) applied a similar concept to weighted L1-norm ambiguity sets, although they used
a partial policy iteration algorithm instead of value iteration. For ellipsoidal and
Kullback-Leibler ambiguity sets, Grand-Clément and Kroer (2021) proposed a first order method that is
embedded in robust value iteration. Their algorithm is based on the observation that solving the
robust Bellman update is equivalent to solving S bilinear saddle point problems. Ho et al. (2022)
studied ϕ-divergence ambiguity sets, and showed that solving the update in this case corresponds
to solving a set of highly structured simplex projection problems. They used dualisation to
represent each projection problem as a univariate convex optimisation problem. In contrast to
these algorithms, Derman et al. (2021) showed that solving an s-rectangular RMDP with reward
uncertainty is equivalent to solving a regularised MDP.
In general, s-rectangular ambiguity sets are common due to the tractability of the resulting
RMDP. However, since the state transitions for different states are not always independent,
more general ambiguity sets have also been presented in the literature. Tirinzoni et al. (2018)
state that s-rectangular ambiguity sets can lead to conservative policies, and do not facilitate
knowledge transfer between states or across different decision processes. They instead use non-
rectangular ambiguity sets that bound the moments of state-action features, which are taken
over entire MDP trajectories and not just those for one state. These RMDPs are solved by
finding the optimal policy for a mixture of non-robust MDPs. Following a similar argument
with regards to the conservativeness of s-rectangular ambiguity sets, Goyal and Grand-Clément
(2022) develop a new class of non-rectangular ambiguity sets: factor matrix ambiguity sets.
Each distribution in such an ambiguity set is a convex combination of a set of common feature
vectors. This ambiguity set allows for the modelling of dependence across states and for the
RMDP to be efficiently solved by a hybrid value iteration-policy improvement algorithm.
This paper studies RMDPs with s-rectangular ambiguity sets, but with one key difference from
those discussed here. We study transition distributions that are parametric, and our ambiguity
sets contain only distributions that lie in the same parametric family as the true transition distri-
bution. This represents, for example, problems in which the state transitions are defined by some
external random variables such as demand, and that these random variables take parametric
distributions. Despite the fact that the underlying transition distributions may be parametric,
in RMDPs, ambiguity sets always contain non-parametric distributions. However, any distri-
bution in the set that is not part of the same family as the true transition distribution cannot
be equal to the true distribution. As a result of this, we formulate parametric RMDPs, where
the ambiguity sets used contain potential parameters of the transition distribution, not poten-
tial distributions. This allows us to ensure that the worst-case distribution lies in the correct
parametric family, and in addition we only need to find the worst-case parameter, not the entire
distribution. To solve the resulting RMDP, we present three algorithms that are used inside a
robust value iteration. The first two are based on discretising the ambiguity set of parameters
and formulating the update as a linear program with one constraint for each parameter. The
third is a fast bisection search algorithm that solves simplex projection problems to compute
the update, similar to the approach of Ho et al. (2022).
2.2 Newsvendor models
The model that we will use to illustrate our methods is the newsvendor model (Arrow et al.,
1951). The newsvendor model is a classical model in inventory and operations management that
describes a retailer deciding on how much stock to purchase in order to meet uncertain future
demand as closely as possible. The newsvendor model has the distinguishing feature that failing
to meet demand in any way is penalised. If too much stock is purchased, the newsvendor pays
a holding cost in order to keep that stock for future customers. If demand is not met by the
stock purchased, the newsvendor pays a backorder cost in order to meet the unmet demand.
Due to this, demand uncertainty and correctly modelling said uncertainty play a strong role in
maximising profits. Since the initial model of Arrow et al. (1951), the newsvendor model has
been extended in many ways. The extension most relevant to this paper is the multi-period
newsvendor model (Arrow et al., 1958). This is the natural extension of the problem to the
case where the newsvendor needs to meet demand in multiple time periods, and is able to make
separate orders for each.
Although some papers study static newsvendor models (Matsuyama, 2006, Chen et al., 2017,
Ullah et al., 2019), where the newsvendor must commit to their order quantities prior to the
selling period, it is more common in the literature to consider dynamic newsvendor models. In
a dynamic newsvendor model, at the start of each period in the horizon, the newsvendor selects
their order quantity for that day. This way, they have exact knowledge of the amount of inventory
remaining at the time of ordering, as opposed to static models where future inventory levels must
be estimated beforehand. Early dynamic models were finite horizon discrete DP models where
base-stock policies were optimal. An example of this comes from Bouakiz and Sobel (1992),
who considered the case where the demand random variables are independent and identically
distributed with a known distribution. A continuous time version of the dynamic model was later
solved by Kogan and Lou (2003), who showed it to be equivalent to solving a set of discrete-time
problems. Soon after its introduction, papers on the dynamic multi-period model considered
more complex situations with respect to demand behaviour and knowledge about its distribution.
Levi et al. (2007) developed policies based on only samples from the true demand distribution,
with no assumptions being made about the distribution itself. Other extensions include models
with partially observable demand (Bensoussan et al., 2007), non-stationary demand (Kim et al.,
2015) and service-dependent demand (Deng et al., 2014). These papers illustrate the importance
of accurate demand modelling in dynamic newsvendor models, and highlight that it is very
common in such problems for demand information to be incomplete.
Another important extension of the newsvendor problem is the distribution free (DF) newsvendor
model (Scarf, 1957). This model represents situations where the true demand distribution is
not known exactly, but some of its moments are known exactly. The model then maximises
the worst-case profit over the ambiguity set containing all distributions with said moments.
Since the work of Scarf (1957) for the single-period single-product DF model, the DF concept
has received significant attention in the newsvendor literature. Early extensions include models
with multiple products and random yield (Gallego and Moon, 1993), balking (Moon and Choi,
1995), uncertainty in cost parameters (Ouyang and Chang, 2002), shortage penalty costs and
budget constraints (Alfares and Elmorra, 2005). Later models included additional complexities
such as advertising and the costs thereof (Lee and Hsu, 2011), risk- and ambiguity-aversion (Han
et al., 2014) and carbon emissions (Liu et al., 2015). Due to the fact that these models are not
dynamic, they can usually be solved by either KKT conditions or Cauchy-Schwarz bounds on
the worst-case cost. Although much less common, the DF concept has also been applied to
the multi-period model. For example, Ahmed et al. (2007) studied a DF model arising from
using coherent risk measures in the objective function. The model was solved as a finite horizon
DP, and it was shown that a base-stock policy was optimal. Levina et al. (2010) considered a
model where the only distributional information came from aggregating the opinions of multiple
experts. This work was later extended to the case with shortage penalty costs by Zhang et al.
(2017). These authors framed the problem as online learning with expert advice, as opposed to
an MDP model. Ullah et al. (2019) found optimal policies for static multi-period distribution
free models with moment-based ambiguity sets. As far as we are aware, Ahmed et al. (2007) is
the only example of an MDP-based DF newsvendor model.
It is clear from the literature on the DF model that newsvendor models commonly lack distribu-
tional information, but many of the MDP models for multi-period newsvendor problems do not
account for this. With the recent advancements in RMDPs combined with the fact that many
multi-period newsvendor models are formulated as MDPs, this problem is a very appropriate
application of our methods. Our research differs from the existing newsvendor literature in two
key ways. Where the majority of the DF newsvendor literature focuses on the case where some
moments of the demand distribution are known, we do not make any such assumption. Using
DF methods usually entails estimating the moments that are assumed to be known, but studies
have found that this can lead to various complications. For example, it has been found that
this can lead to overly conservative solutions (Wang et al., 2016), suboptimal solutions (Lee
et al., 2021) and poor estimates of the true cost function (Rossi et al., 2014). As such, our
approach is closer to the more recent papers in RMDPs (Grand-Clément and Kroer, 2021, Ho
et al., 2022), where ambiguity sets contain all distributions that can be considered to be close
to some nominal distribution. In addition, unlike these two papers, we consider parametric
ambiguity sets. This allows us to model cases where the newsvendor demand is parametric, and
enforce that the worst-case distribution lies in the same parametric family as the true demand
distribution. As discussed earlier, it is very common to assume that demand distributions are
parametric (Nahmias, 1994, Agrawal and Smith, 1996, Gallego et al., 2007, Siegel and Wagner,
2021), but DF models do not incorporate this. Our methodology allows parametric distributions
to be used, but without the pitfalls of assuming that parameter estimates are truth.
3 Modelling and algorithms
In this section, we define our model and present the algorithms used to solve it. The general
robust formulation is presented in Section 3.1. The robust value iteration algorithm is presented
in Section 3.2. Following this, Section 3.3 presents ϕ-divergence based non-parametric ambiguity
sets and Section 3.4 details how the resulting RMDP is solved. We detail these methods since
they will act as benchmarks for our parametric methods. In Section 3.5 we formulate our
parametric ambiguity sets, and in Section 3.6 we detail our solution algorithms.
3.1 General robust model
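The RMDP we consider is formulated as follows. The state and action spaces are defined as
$\mathcal{S} = \{1, \dots, S\}$ and $\mathcal{A} = \{1, \dots, A\}$, respectively. Decisions are
made at each epoch $t \in \mathcal{T} = \mathbb{N}$. The state at time $t$ is a random variable,
denoted by $S_t$. Similarly, we denote by $a_t$ the action taken at time $t$. The reward for
selecting action $a \in \mathcal{A}$ when in state $s \in \mathcal{S}$ and transitioning to state
$s' \in \mathcal{S}$ is given by $r_{s,a,s'} \in \mathbb{R}_+$. We denote by $\Delta^n$ the
probability simplex in $\mathbb{R}^n$: $\Delta^n = \{P \in \mathbb{R}^n_+ : \sum_{i=1}^n P_i = 1\}$.
The distribution of the initial state $S_0$, i.e. the state at time $t = 0$, is denoted by
$Q \in \Delta^S$. The distribution of $S_{t+1}$ given that action $a$ is taken in state $s$ at
time $t$ is given by the unknown distribution
$P^0_{s,a} = (P^0_{s,a,1}, \dots, P^0_{s,a,S}) \in \Delta^S$. Here,
$P^0_{s,a,s'} = \mathbb{P}(S_{t+1} = s' \mid S_t = s, a_t = a)$ for any $t \ge 0$. Similarly, we
write $P^0_s$ to denote a matrix where the element on the $a$th row and $s'$th column is
$P^0_{s,a,s'}$. Denote by $\Pi = (\Delta^A)^S$ the set of all stationary, randomised policies. A
policy $\pi$ is a matrix $\pi = (\pi_{s,a})_{s \in \mathcal{S}, a \in \mathcal{A}} \in \Pi$ such
that $\pi_{s,a}$ gives the probability of taking action $a$ when in state $s$ for each
$(s, a) \in \mathcal{S} \times \mathcal{A}$ under policy $\pi$. Denote by
$\mathcal{P} \subseteq (\Delta^S)^{S \times A}$ an ambiguity set for $P^0$. Each
$P \in \mathcal{P}$ and $\pi \in \Pi$ induce a stochastic process $\{(s_t, a_t)\}_{t=0}^{\infty}$
on the space $(\mathcal{S} \times \mathcal{A})^{\infty}$ of sample paths, and
$\mathbb{E}^{P,\pi}$ refers to the expectation w.r.t. this process. Then, the robust MDP problem
is given by:
$$\max_{\pi \in \Pi} \min_{P \in \mathcal{P}} \mathbb{E}^{P,\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_{s_t, a_t, s_{t+1}} \,\middle|\, S_0 \sim Q \right],$$
where $\gamma \in (0, 1)$ is a discount factor. We consider s-rectangular ambiguity sets, which
are of the form:
$$\mathcal{P} = \mathcal{P}_1 \times \dots \times \mathcal{P}_S, \quad \mathcal{P}_s \subseteq (\Delta^S)^A \quad \forall s \in \mathcal{S}.$$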
3.2 Statewise Bellman equations and robust value iteration
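Given an initial estimate $v^0_s$ $\forall s \in \mathcal{S}$, robust value iteration is performed
by iteratively updating the estimates using the robust Bellman equation (1) for $n = 0, 1, \dots$:
$$v^{n+1}_s = \max_{\pi_s \in \Delta^A} \min_{P_s \in \mathcal{P}_s} \sum_{a \in \mathcal{A}} \pi_{s,a} \sum_{s' \in \mathcal{S}} P_{s,a,s'} \left( r_{s,a,s'} + \gamma v^n_{s'} \right) \quad \forall s \in \mathcal{S}. \tag{1}$$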
Adapting the pseudocode by Powell (2007), this leads to the following robust value iteration
algorithm:
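1. Initialise $n = 0$, $\Delta = \infty$, $v^0 = 0$, and select $\varepsilon > 0$.
2. While $\Delta \ge \frac{\varepsilon (1 - \gamma)}{2\gamma}$:
(a) For each $s \in \mathcal{S}$, solve (1) to find the value of $v^{n+1}_s$.
(b) Set $\Delta = \|v^{n+1} - v^n\|$ where $\|v\| = \max_{s \in \mathcal{S}} |v_s|$.
(c) Set $n = n + 1$.
3. Set $v^* = v^n$ and let the policy that solves (1) under $v^n = v^*$ be $\pi^*$.
4. Return $\pi^*$ and compute the optimal total reward under $\pi^*$ as $\sum_{s \in \mathcal{S}} Q_s v^*_s$.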
Step 2(a) is referred to as solving a robust Bellman update.
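The loop structure of this algorithm can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the paper: `bellman_update` is a hypothetical callable standing in for whichever method solves (1), the stopping threshold is assumed to be the standard value iteration rule ε(1−γ)/(2γ), and the demonstration data are invented. The demonstration uses a singleton ambiguity set (a nominal MDP), for which the maximum over randomised policies is attained by a single action, so no optimisation solver is needed.

```python
from typing import Callable, List, Sequence


def robust_value_iteration(
    n_states: int,
    bellman_update: Callable[[int, Sequence[float]], float],
    gamma: float,
    eps: float = 1e-6,
    max_iter: int = 100_000,
) -> List[float]:
    """Repeat v <- T(v) until the sup-norm change falls below the threshold."""
    v = [0.0] * n_states
    # Assumed stopping rule: the standard eps * (1 - gamma) / (2 * gamma).
    threshold = eps * (1.0 - gamma) / (2.0 * gamma)
    for _ in range(max_iter):
        v_next = [bellman_update(s, v) for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(v_next, v))
        v = v_next
        if delta < threshold:
            break
    return v


# Illustrative data (not from the paper): 2 states, 2 actions.
GAMMA = 0.9
P_HAT = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]   # P_HAT[s][a][s']
REWARD = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [3.0, 1.0]]]  # REWARD[s][a][s']


def nominal_update(s: int, v: Sequence[float]) -> float:
    """Update (1) when the ambiguity set is the singleton {P_HAT[s]}."""
    action_values = []
    for p_a, r_a in zip(P_HAT[s], REWARD[s]):
        action_values.append(
            sum(p * (r + GAMMA * v_to) for p, r, v_to in zip(p_a, r_a, v))
        )
    return max(action_values)


v_star = robust_value_iteration(2, nominal_update, GAMMA)
Q = [0.5, 0.5]  # illustrative initial state distribution
total_reward = sum(q * v_s for q, v_s in zip(Q, v_star))
```

Keeping the update as a separate callable mirrors the structure of the paper, where the same outer loop is reused with different solvers for step 2(a).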
3.3 ϕ-divergence ambiguity sets
The most common ambiguity sets in RMDPs are non-parametric, i.e. they do not make use of
any information about the family of distributions in which the true distribution lies. Common
non-parametric ambiguity sets are distance-based (Grand-Clément and Kroer, 2021, Ho et al.,
2022). Such ambiguity sets contain only distributions that lie within a prescribed maximum
distance from a nominal or estimated distribution $\hat{P}_s$. In other words, a non-parametric
distance-based ambiguity set is of the form given in (2).
$$\mathcal{P}_s = \left\{ P_s \in (\Delta^S)^A : \sum_{a \in \mathcal{A}} d_a(P_{s,a}, \hat{P}_{s,a}) \le \kappa \right\} \quad \forall s \in \mathcal{S}. \tag{2}$$
Here, $d_a : \Delta^S \times \Delta^S \to \mathbb{R}_+$ is a distance measure. We will consider
cases where $d_a$ is a ϕ-divergence, i.e. it satisfies:
$$d_a(P_{s,a}, \hat{P}_{s,a}) = \sum_{s'=1}^{S} \hat{P}_{s,a,s'} \, \phi\left( \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \right),$$
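As a small illustration of this definition, the sketch below evaluates a ϕ-divergence and checks the constraint in (2). The two ϕ functions are the standard forms for the Kullback-Leibler divergence and the modified χ²-distance (as tabulated, for example, by Ben-Tal et al. (2013)). The function names and the numerical data are illustrative assumptions, and every nominal probability is assumed to be strictly positive.

```python
import math
from typing import Callable, Sequence


def phi_kl(t: float) -> float:
    """phi(t) = t log t - t + 1 (Kullback-Leibler), with phi(0) = 1."""
    return 1.0 if t == 0.0 else t * math.log(t) - t + 1.0


def phi_mod_chi2(t: float) -> float:
    """phi(t) = (t - 1)^2 (modified chi-squared distance)."""
    return (t - 1.0) ** 2


def phi_divergence(
    p: Sequence[float], p_hat: Sequence[float], phi: Callable[[float], float]
) -> float:
    """d(P, P_hat) = sum over s' of P_hat[s'] * phi(P[s'] / P_hat[s'])."""
    # Assumes p_hat[s'] > 0 for all s', so no 0/0 conventions are needed.
    return sum(q * phi(x / q) for x, q in zip(p, p_hat))


def in_ambiguity_set(P_s, P_hat_s, phi, kappa: float) -> bool:
    """Check the constraint in (2): sum over actions of d_a is at most kappa."""
    total = sum(phi_divergence(p, q, phi) for p, q in zip(P_s, P_hat_s))
    return total <= kappa


# Illustrative data: one state, two actions, two successor states.
P_hat_s = [[0.5, 0.5], [0.25, 0.75]]   # nominal distributions, one row per action
P_s = [[0.6, 0.4], [0.25, 0.75]]       # candidate distributions

d_chi2 = phi_divergence(P_s[0], P_hat_s[0], phi_mod_chi2)
d_kl = phi_divergence(P_s[0], P_hat_s[0], phi_kl)
inside = in_ambiguity_set(P_s, P_hat_s, phi_mod_chi2, kappa=0.05)
```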
where $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ is a ϕ-divergence function. With different choices
of ϕ, the class of ϕ-divergences encompasses many distance measures, such as the
Kullback-Leibler divergence (KLD), χ² distance, and Burg entropy. As described by Ben-Tal et al.
(2013), one benefit of such ambiguity sets is that we can choose $\kappa$ such that
$\mathcal{P}_s$ is an approximate confidence set for the true distribution. Suppose that the true
distribution for state $s$, $P^0_s$, lies in a parameterised set
$\{P^\theta_s \mid \theta_s \in \Theta_s\}$, and let the true parameter be $\theta^0_s$. We will
assume that only $\theta_s$ is required to compute $P^\theta_s$ and that
$P^\theta = (P^\theta_1, \dots, P^\theta_S)$ is parameterised by
$\theta = (\theta_1, \dots, \theta_S)$. Also suppose that the distributions $P^0_{s,a}$ are
independent. Then, for each $(s, a) \in \mathcal{S} \times \mathcal{A}$, $P^0_{s,a}$ is a
distribution parameterised by $\theta^0_{s,a}$. Suppose that we take $N$ sample transitions from
each $P^0_{s,a}$ and use these to create a maximum likelihood estimate (MLE)
$\hat{\theta}_{s,a}$ of $\theta^0_{s,a}$. Then, if we choose $\kappa$ according to (3), the set
$\mathcal{P}_s$ is an approximate $100(1 - \alpha)\%$ confidence set for $P^0_s$ around
$\hat{P}_s = P^{\hat{\theta}}_s$.
$$\kappa = \frac{\phi''(1)}{2N} \chi^2_{oA, 1-\alpha}. \tag{3}$$
In (3), $o$ is the dimension of $\Theta_s$ and $\chi^2_{o, 1-\alpha}$ is the $100(1 - \alpha)$th
percentile of the χ² distribution with $o$ degrees of freedom. Note that, while Ben-Tal et al.
(2013) use $o$ degrees of freedom, we use $oA$ in (3) since
$\sum_{a \in \mathcal{A}} d_a(P_{s,a}, \hat{P}_{s,a})$ is, after scaling by $2N / \phi''(1)$,
asymptotically the sum of $A$ independent $\chi^2_o$ random variables; i.e. it is a
$\chi^2_{oA}$ random variable.
3.4 Solving the robust Bellman update
In this section, we detail the algorithms that we will use to solve the robust Bellman update
under non-parametric ambiguity sets, which will act as benchmarks for our methods. In
Section 3.4.1, we describe how to reformulate the update using the conjugate of the ϕ-divergence
function. In Section 3.4.2, we describe the projection-based bisection search algorithm of Ho
et al. (2022).
3.4.1 Solution via reformulation
The robust Bellman update problem under ϕ-divergence ambiguity sets can be reformulated
using the convex conjugate of a ϕ-divergence function:
$$\phi^*(z) = \sup_{\tau \ge 0} \{ z\tau - \phi(\tau) \}.$$
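Using this definition, following the steps given by Ben-Tal et al. (2013), we dualise the inner
problem of (1) to arrive at the following reformulation (with $v = v^n$):
$$\max_{\pi_s \in \Delta^A, \boldsymbol{\nu}, \eta} \left\{ \nu - \eta\kappa - \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \eta \hat{P}_{s,a,s'} \, \phi^*\left( \frac{\nu_a - \pi_{s,a}(r_{s,a,s'} + \gamma v_{s'})}{\eta} \right) : \boldsymbol{\nu} \in \mathbb{R}^A, \eta \in \mathbb{R}_+ \right\} \tag{4}$$
where $\nu_a \in \mathbb{R}$ is the Lagrange multiplier for the constraint
$\sum_{s' \in \mathcal{S}} P_{s,a,s'} = 1$ for each $a \in \mathcal{A}$,
$\nu = \sum_{a \in \mathcal{A}} \nu_a$ and $\eta \in \mathbb{R}_+$ is the Lagrange multiplier for
the constraint $\sum_{a \in \mathcal{A}} d_a(P_{s,a}, \hat{P}_{s,a}) \le \kappa$.
For a derivation of this reformulation, see Appendix A.1.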
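As an example of the conjugate that appears in (4), take the modified χ²-distance with ϕ(τ) = (τ − 1)². Setting the derivative of zτ − (τ − 1)² to zero gives τ = 1 + z/2, which is feasible when z ≥ −2 and yields ϕ*(z) = z + z²/4; for z < −2 the supremum is attained at τ = 0, giving ϕ*(z) = −1. The sketch below compares this closed form with a brute-force grid search. The function names and the grid settings are illustrative assumptions.

```python
def phi_mod_chi2(tau: float) -> float:
    """phi(tau) = (tau - 1)^2 (modified chi-squared distance)."""
    return (tau - 1.0) ** 2


def conjugate_mod_chi2(z: float) -> float:
    """Closed-form conjugate: z + z^2 / 4 if z >= -2, and -1 otherwise."""
    return z + z * z / 4.0 if z >= -2.0 else -1.0


def conjugate_numeric(phi, z: float, tau_max: float = 10.0, n: int = 100_001) -> float:
    """Approximate sup over tau >= 0 of z * tau - phi(tau) on a grid over [0, tau_max]."""
    step = tau_max / (n - 1)
    return max(z * (i * step) - phi(i * step) for i in range(n))


# Compare the two at a few illustrative points on both sides of z = -2.
z_values = [-5.0, -2.0, -1.0, 0.0, 1.5, 4.0]
gaps = [abs(conjugate_mod_chi2(z) - conjugate_numeric(phi_mod_chi2, z)) for z in z_values]
```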
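The model requires different approaches for different ϕ functions, due to the different forms
$\phi^*$ can take. As an example, for the modified χ² divergence, this model can be reformulated
as the following conic quadratic program:
$$
\begin{aligned}
\max_{\pi_s, \boldsymbol{\nu}, \eta, \zeta, u} \quad & \nu + \eta(A - \kappa) - \frac{1}{4} \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \hat{P}_{s,a,s'} u_{s,a,s'} \\
\text{s.t.} \quad & \sqrt{4\zeta_{s,a,s'}^2 + (\eta - u_{s,a,s'})^2} \le \eta + u_{s,a,s'} && \forall a \in \mathcal{A} \ \forall s' \in \mathcal{S} \\
& \zeta_{s,a,s'} \ge 2\eta + \nu_a - \pi_{s,a}(r_{s,a,s'} + \gamma v_{s'}) && \forall a \in \mathcal{A} \ \forall s' \in \mathcal{S} \\
& \zeta_{s,a,s'} \ge 0 && \forall a \in \mathcal{A} \ \forall s' \in \mathcal{S} \\
& \sum_{a \in \mathcal{A}} \pi_{s,a} = 1 \\
& \pi_{s,a} \ge 0 && \forall a \in \mathcal{A} \\
& \eta \ge 0 \\
& \boldsymbol{\nu} \in \mathbb{R}^A.
\end{aligned}
\tag{5}
$$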
For more details on the derivation of this reformulation, see Appendix A.2.
3.4.2 Projection-based bisection search algorithms
Model (4) can become large when $A$ and $S$ are large, and so it is not always reasonable to
solve it in every step of the value iteration algorithm. Hence, Ho et al. (2022) presented a fast
projection-based algorithm for solving the corresponding robust Bellman update. We define a
simplex projection problem as follows:
$$
\mathfrak{P}(\hat{P}_{s,a}; b, \beta) = \left[
\begin{array}{ll}
\min_{P_{s,a}} & d_a(P_{s,a}, \hat{P}_{s,a}) \\
\text{s.t.} & \sum_{s' \in \mathcal{S}} b_{s'} P_{s,a,s'} \le \beta \\
& P_{s,a} \in \Delta^S
\end{array}
\right]. \tag{6}
$$
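Then, the outline of the algorithm presented by Ho et al. (2022) is as follows. In each iteration
$n$ of the value iteration algorithm, for each $s \in \mathcal{S}$, the Bellman update is solved
via bisection search on the value of $v^{n+1}_s$. This is done via the following algorithm, which
we will call non-parametric bisection search (NBS):
1. Initialise $\epsilon$ and define
$\bar{v}^0_s = \bar{R}_s(v^n) = \frac{\max_{(a,s') \in \mathcal{A} \times \mathcal{S}} r_{s,a,s'}}{1 - \gamma}$,
$\underline{v}^0_s = \max_{a \in \mathcal{A}} \min_{s' \in \mathcal{S}} \{ r_{s,a,s'} + \gamma v^n_{s'} \}$
and $\delta = \frac{\epsilon\kappa}{2A + \bar{R}_s(v^n) + A\epsilon}$.
2. For each $i = 0, 1, \dots$:
(a) Set $\beta = \frac{\underline{v}^i_s + \bar{v}^i_s}{2}$.
(b) For each $a \in \mathcal{A}$,
i. If $\mathfrak{P}(\hat{P}_{s,a}; r_{s,a} + \gamma v^n, \beta)$ is infeasible, i.e.
$\min_{s'} \{ r_{s,a,s'} + \gamma v^n_{s'} \} > \beta$, then set $\underline{d}_a$ and $\bar{d}_a$
equal to $\kappa + 1$. Go to step 2(c).
ii. Otherwise, solve the projection problem to δ-optimality to obtain action-wise upper and
lower bounds $\bar{d}_a$, $\underline{d}_a$ on its objective value.
(c) Use these bounds to update $\underline{v}^i_s$ and $\bar{v}^i_s$:
$$
(\underline{v}^{i+1}_s, \bar{v}^{i+1}_s) =
\begin{cases}
(\underline{v}^i_s, \beta) & \text{if } \sum_{a \in \mathcal{A}} \bar{d}_a \le \kappa, \\
(\beta, \bar{v}^i_s) & \text{if } \sum_{a \in \mathcal{A}} \underline{d}_a > \kappa.
\end{cases}
$$
(d) If $\bar{v}^{i+1}_s - \underline{v}^{i+1}_s < \epsilon$ or
$\kappa \in [\sum_{a \in \mathcal{A}} \underline{d}_a, \sum_{a \in \mathcal{A}} \bar{d}_a)$, then
go to step 3.
3. Return $\beta = \frac{\underline{v}^{i+1}_s + \bar{v}^{i+1}_s}{2}$.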
This generates the updated value estimates $v^{n+1}_s$ for $s \in \mathcal{S}$. Now, in step 2(b), the projection problem is also usually solved via bisection search. The logic behind step 2(b)i is that, if the problem is infeasible then there is no $P_s \in \mathcal{P}_s$ that achieves an objective value of $\beta$. Hence, this scenario should be treated the same as when the problem is feasible and gives $\sum_{a \in \mathcal{A}} \underline{d}_a > \kappa$. Since the actual value of the objective function does not matter as long as this inequality holds, we set it to $\kappa + 1$. For ambiguity sets defined by $\phi$-divergences such as the Kullback-Leibler divergence and $\chi^2$-distance, Ho et al. (2022) showed how to solve the projection problem efficiently. For the modified $\chi^2$-distance, their method involves first dividing the projection problem into $S + 1$ subproblems, and then reformulating each one as a univariate optimisation problem with at most 3 potential solutions that can be found analytically. Solving the subproblem then corresponds to evaluating each of these potential solutions, and choosing the best of those that are feasible. Then, the subproblems' solutions are compared and the best one is selected. Details of this algorithm can be found in Appendix B.1. Following the completion of the value iteration algorithm, a policy must be retrieved. Since the algorithm of Ho et al. (2022) does not return a policy, it must be extracted from solving (4), using $v = v^*$.
3.5 Parametric ambiguity sets
We now present our formulation for the RMDP under parametric transition distributions. Suppose that the true transition distribution $P^0$ is uniquely defined by the probability mass function (PMF) and/or cumulative distribution function (CDF) of a parametric probability distribution. In this section, we detail how our model allows us to enforce that the worst-case distribution maintains this structure.
3.5.1 Formulation
Suppose that, for each $(s, a) \in \mathcal{S} \times \mathcal{A}$, $P^0_{s,a}$ is uniquely defined by the distribution of some exogenous random variable $X_{s,a}$ with support $\mathcal{X}_{s,a}$. Let $f_{X_{s,a}}$ and $F_{X_{s,a}}$ be the PMF and CDF of $X_{s,a}$, which are parameterised by the parameter $\theta^0_{s,a} = (\theta^0_{s,a,1}, \dots, \theta^0_{s,a,o})$. Assume that the current state $S_t = s$ and action $a_t = a$ are given. We assume that the next state $S_{t+1}$ is specified by some simple, known function $g$ of the exogenous random variable $X_{s,a}$:
$$
S_{t+1} = g(X_{s,a} \mid s, a).
$$
In other words, for a given realisation $x$ of $X_{s,a}$, we can compute the next state as $s_{t+1} = g(x \mid s, a)$. We define the set of all realisations of $X_{s,a}$ that lead to $S_{t+1} = s'$ as:
$$
\mathcal{X}_{s,a}(s') = \left\{ x \in \mathcal{X}_{s,a} : g(x \mid s, a) = s' \right\}.
$$
Then, the transition matrix corresponding to the parameter $\theta^0$ is given by:
$$
\begin{aligned}
P^0_{s,a,s'} &= \mathbb{P}(S_{t+1} = s' \mid S_t = s, a_t = a) \\
&= \mathbb{P}(g(X_{s,a} \mid s, a) = s') \\
&= \sum_{x \in \mathcal{X}_{s,a}(s')} f_{X_{s,a}}(x \mid \theta^0_{s,a}) \quad \forall s' \in \mathcal{S}.
\end{aligned}
$$
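Since $g$ is known, in this case the value of $P^0$ is uniquely specified by $\theta^0$. Therefore, the only unknown element required to find the true distribution is $\theta^0$. Hence, given that the worst-case distribution should maintain the structure of $P^0$, we can simply construct ambiguity sets for $\theta^0$. More specifically, we consider ambiguity sets of the form:
$$
\Theta_s \subseteq \mathbb{R}^o \ \ \forall s \in \mathcal{S}, \qquad \Theta = \Theta_1 \times \dots \times \Theta_S.
$$
We can then reformulate the RMDP as:
$$
\max_{\pi \in \Pi} \min_{\theta \in \Theta} \mathbb{E}_{\theta,\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_{s_t,a_t,s_{t+1}} \,\middle|\, S_0 \sim Q \right].
$$
Let $P^\theta$ represent the transition probabilities corresponding to $\theta$. Similarly, for any $s \in \mathcal{S}$ and $\theta_s \in \Theta_s$, write $P^\theta_s = (P^\theta_{s,a,s'})_{a \in \mathcal{A}, s' \in \mathcal{S}}$. Note that, although the superscript for $P^\theta_s$ is $\theta$, only $\theta_s$ is required to compute it and by rectangularity we can obtain $P^\theta$ simply by obtaining $P^\theta_s$ for all $s \in \mathcal{S}$. Similarly, only $\theta_{s,a}$ is required to compute $P^\theta_{s,a} = (P^\theta_{s,a,s'})_{s' \in \mathcal{S}}$. Now, using the information about $P^0$'s structure, we compute $P^\theta_s$ according to:
$$
P^\theta_{s,a,s'} = \sum_{x \in \mathcal{X}_{s,a}(s')} f_{X_{s,a}}(x \mid \theta_{s,a}) \quad \forall (a, s') \in \mathcal{A} \times \mathcal{S}.
$$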
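The mapping from a parameter to a row of the transition matrix can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the function names `transition_row` and `binom_pmf`, the finite support, and the example function $g(x) = \max\{0, 2 - x\}$ are hypothetical choices, not part of the model above.

```python
from math import comb


def transition_row(pmf, g, support, n_states):
    """Return [P(S_{t+1} = s') for s' in 0..n_states-1].

    Each realisation x contributes pmf(x) to the state g(x), i.e. the
    PMF is summed over the set of realisations leading to each s'.
    """
    row = [0.0] * n_states
    for x in support:
        row[g(x)] += pmf(x)
    return row


def binom_pmf(x, n=3, p=0.4):
    """PMF of a Bin(n, p) random variable (illustrative choice of X)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Hypothetical example: X ~ Bin(3, 0.4) and next state g(x) = max(0, 2 - x).
row = transition_row(binom_pmf, lambda x: max(0, 2 - x), range(4), n_states=3)
```

Changing the parameter `p` and recomputing the row is all that is needed to evaluate $P^\theta_{s,a}$ for a different candidate $\theta_{s,a}$.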
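The robust state-wise Bellman equation can then be written as:
$$
v^{n+1}_s = \max_{\pi_s \in \Delta_A} \min_{\theta_s \in \Theta_s} \sum_{a \in \mathcal{A}} \pi_{s,a} \sum_{s' \in \mathcal{S}} P^\theta_{s,a,s'}(r_{s,a,s'} + \gamma v^n_{s'}) \quad \forall s \in \mathcal{S}. \tag{7}
$$
As discussed by Black et al. (2022), the non-linearities of the PMFs as functions of the parameters mean that the above model is not tractable as a mathematical program if the parameters are treated as decision variables. One way to find $v^{n+1}_s$ approximately is to use a discretisation $\Theta'_s$ of the ambiguity set $\Theta_s$. This allows us to reformulate the problem in (7) as:
$$
v^{n+1}_s = \max_{\pi_s \in \Delta_A} \left\{ \vartheta : \vartheta \le \sum_{a \in \mathcal{A}} \pi_{s,a} \sum_{s' \in \mathcal{S}} P^\theta_{s,a,s'}(r_{s,a,s'} + \gamma v^n_{s'}) \ \ \forall \theta_s \in \Theta'_s \right\} \quad \forall s \in \mathcal{S}. \tag{8}
$$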
This problem can be solved as an LP with $|\Theta'_s| + 1$ constraints. Due to this, if a fine discretisation $\Theta'_s$ is used, this model can be very slow to solve. While this is the approach used in parametric DRO prior to this paper, in robust value iteration we only need to compute $v^{n+1}_s$ and not the optimal policy. Hence, in certain cases, no mathematical programming formulation is necessary. We will discuss this in more detail in Section 3.6.2. However, please note that solving (8) is currently the only way to extract the optimal policy and worst-case probabilities, to the best of our knowledge.
3.5.2 Confidence sets for the true parameter
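We assume that we have access to $N$ samples from the true distribution of $X$, i.e. the distribution that characterises $P^0$. This allows us to create an MLE $\hat{\theta}$ of the true parameter $\theta^0$. In addition, by standard results in maximum likelihood theory (Millar, 2011) we have:
$$
\left(\hat{\theta}_{s,a} - \theta^0_{s,a}\right)^T I_E\!\left(\theta^0_{s,a}\right) \left(\hat{\theta}_{s,a} - \theta^0_{s,a}\right) \sim \chi^2_o
$$
approximately, for large $N$. Here, $I_E(\theta^0_{s,a})$ is the expected Fisher information matrix, which is defined by (9). In (9), $\ell$ is the log-likelihood function for the observed data.
$$
I_E(\theta_{s,a}) = -\mathbb{E}_{X_{s,a}}\left[ \frac{\partial^2}{\partial \theta_{s,a,i}\, \partial \theta_{s,a,j}} \ell(\theta_{s,a}) \right]_{i,j = 1,\dots,o}. \tag{9}
$$
By independence of the random variables $X_{s,a}$ for $a \in \mathcal{A}$, we have that:
$$
\sum_{a \in \mathcal{A}} \left(\hat{\theta}_{s,a} - \theta^0_{s,a}\right)^T I_E\!\left(\theta^0_{s,a}\right) \left(\hat{\theta}_{s,a} - \theta^0_{s,a}\right) \sim \chi^2_{oA}.
$$
Since the two are asymptotically equivalent, we can replace $I_E(\theta^0_{s,a})$ with $I_E(\hat{\theta}_{s,a})$. Therefore, an approximate $100(1 - \alpha)\%$ confidence set for $\theta_s$ is given by:
$$
\Theta^\alpha_s = \left\{ \theta_s \in \mathbb{R}^A \times \mathbb{R}^o : \sum_{a \in \mathcal{A}} \left(\hat{\theta}_{s,a} - \theta_{s,a}\right)^T I_E\!\left(\hat{\theta}_{s,a}\right) \left(\hat{\theta}_{s,a} - \theta_{s,a}\right) \le \chi^2_{oA,1-\alpha} \right\}.
$$
In our experiments, we will use $\Theta^\alpha_s$ as an ambiguity set for our parametric model, for each $s \in \mathcal{S}$. We will refer to a discretisation of this set as $(\Theta^\alpha_s)'$.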
3.6 Solving the parametric robust Bellman update
It is often cited (e.g., by Ho et al. (2022)) that solving an infinite-horizon RMDP efficiently boils down to being able to solve the robust Bellman update efficiently. In our parametric formulation, if we use the LP approximation, then the model that we solve in each iteration for each state $s \in \mathcal{S}$ is the LP (8), which has $|(\Theta^\alpha_s)'| + 1$ constraints. However, depending on the fineness of the discretisation used to construct $(\Theta^\alpha_s)'$, this set can impose thousands of constraints on the model. Hence, (8) can be slow to solve. For this reason, we develop two algorithms for solving the robust Bellman update with parametric transition distributions.
3.6.1 A cutting surface algorithm
In our previous paper on DRO (Black et al., 2022), cutting surface (CS) algorithms performed very well at solving parametric DRO problems that are formulated using discrete ambiguity sets in the same way as (8). Hence, we now describe the CS algorithm that we will use for the RMDP. The idea behind the CS algorithm is as follows. Suppose we are at iteration $n$ of the value iteration algorithm and currently solving for state $s \in \mathcal{S}$. Start with some initial singleton subset $\Theta^1_s = \{\theta^{\text{init}}_s\}$. Solve (8) using $\Theta_s = \Theta^1_s$ to generate a policy $\pi^1_s$. Next, solve the distribution separation problem (10) with $k = 1$ to find the worst-case parameter $\theta^1_s$ for the policy $\pi^1_s$. Set $\Theta^2_s = \Theta^1_s \cup \{\theta^1_s\}$ and repeat until stopping criteria are met.
$$
\min_{\theta_s \in (\Theta^\alpha_s)'} \sum_{a \in \mathcal{A}} \pi^k_{s,a} \sum_{s' \in \mathcal{S}} P^\theta_{s,a,s'}(r_{s,a,s'} + \gamma v^n_{s'}) \tag{10}
$$
The appeal of this algorithm is that it only ever solves the approximate robust Bellman update (8) over some small subset $\Theta^k_s$ of $(\Theta^\alpha_s)'$, meaning the LP concerned only has $|\Theta^k_s| + 1 = k + 1$ constraints at iteration $k$. In our previous research, we found that this algorithm typically runs for no more than $k = 5$ iterations. A formal description of the algorithm for iteration $n$ of the value iteration algorithm for state $s$ is given below.
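1. Initialise $\Theta^1_s = \{\theta^{\text{init}}_s\}$ for some $\theta^{\text{init}}_s \in (\Theta^\alpha_s)'$, set $k = 1$.
2. While $k \le k^{\max}$:
   (a) Solve the LP (8) using $\Theta_s = \Theta^k_s$ to obtain policy $\pi^k_s$, which has a worst-case reward of $\tilde{R}^k$ over $\Theta^k_s$.
   (b) Evaluate the worst-case rewards:
$$
R(\pi^k_s \mid \theta_s) = \sum_{a \in \mathcal{A}} \pi^k_{s,a} \sum_{s' \in \mathcal{S}} P^\theta_{s,a,s'}(r_{s,a,s'} + \gamma v^n_{s'}) \quad \forall \theta_s \in (\Theta^\alpha_s)',
$$
   and find $\theta^k_s = \operatorname{argmin}_{\theta_s \in (\Theta^\alpha_s)'} R(\pi^k_s \mid \theta_s)$, i.e. the solution of the separation problem (10). Set $R^k = R(\pi^k_s \mid \theta^k_s)$.
   (c) If $\tilde{R}^k \le R^k + \frac{\varepsilon}{2}$ or $\theta^k_s \in \Theta^k_s$ then set $k = k^{\max} + 1$. Otherwise, set $\Theta^{k+1}_s = \Theta^k_s \cup \{\theta^k_s\}$ and $k = k + 1$.
3. Return $\pi^k_s$ with worst-case parameter $\theta^k_s$ and worst-case reward $R^k$.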
For this paper, this algorithm will serve as a method for solving the approximate robust Bellman
update (8). It will therefore be embedded into step 2(a) of the robust value iteration algorithm
in Section 3.2.
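The separation step of the CS algorithm is a plain enumeration over the discretised ambiguity set. A minimal sketch is given below, assuming that the transition matrices for every candidate parameter have been precomputed and stored in a dictionary; the names `expected_value`, `separation` and `candidates`, and the toy data, are hypothetical and used only for illustration.

```python
def expected_value(policy, P, r, gamma, v):
    """sum_a pi_a * sum_{s'} P[a][s'] * (r[a][s'] + gamma * v[s'])."""
    return sum(
        pi_a * sum(p * (r_as + gamma * v_s) for p, r_as, v_s in zip(P_a, r_a, v))
        for pi_a, P_a, r_a in zip(policy, P, r)
    )


def separation(policy, candidates, r, gamma, v):
    """Return (worst-case reward, worst-case parameter) for a fixed policy.

    `candidates` maps each parameter tuple in the discretised set to its
    precomputed transition matrix P[a][s'].
    """
    return min(
        (expected_value(policy, P, r, gamma, v), theta)
        for theta, P in candidates.items()
    )


# Toy data: 2 actions, 2 states, two candidate parameters.
r = [[1.0, 0.0], [0.0, 2.0]]
v = [0.0, 0.0]
candidates = {
    (0.2, 0.2): [[0.2, 0.8], [0.2, 0.8]],
    (0.8, 0.8): [[0.8, 0.2], [0.8, 0.2]],
}
worst_reward, worst_theta = separation([0.5, 0.5], candidates, r, 0.5, v)
```

The cost of this step grows linearly with the number of candidates, which is one reason the run time of CS depends on the fineness of the discretisation.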
3.6.2 A projection-based algorithm for single parameter distributions
The main algorithms of Ho et al. (2022) are based around solving the robust Bellman update using bisection search. Within each iteration of the bisection algorithm, a set of $|\mathcal{A}|$ simplex projection problems (6) are solved to generate the next upper and lower bounds on the value function. The benefit of this is that the projection problem, in the non-parametric case with $\phi$-divergence ambiguity sets, can often be reformulated as a univariate convex optimisation problem. Solving the projection problem $\mathfrak{P}(\hat{P}_{s,a}; b, \beta)$ corresponds to finding the closest distribution to $\hat{P}_{s,a}$ that yields an objective value that is no larger than $\beta$, when action $a$ is taken in state $s$. In the case of distributions where $P^0_{s,a}$ is parametrised by only one parameter (such as when $X_{s,a}$ is binomial with a fixed number of trials, or Poisson), the parametric equivalent of this problem can be stated as:
$$
\tilde{\mathfrak{P}}(\hat{\theta}_{s,a}; b, \beta) = \left[ \begin{array}{ll} \min_{\theta_{s,a}} & \left(\hat{\theta}_{s,a} - \theta_{s,a}\right)^2 I_E\!\left(\hat{\theta}_{s,a}\right) \\ \text{s.t.} & \sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} \le \beta \\ & \theta_{s,a} \in [\theta^{\min}_{s,a}, \theta^{\max}_{s,a}] \end{array} \right]. \tag{11}
$$
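If $\sum_{s' \in \mathcal{S}} b_{s'} P^{\hat{\theta}}_{s,a,s'} \le \beta$ then the model is trivially solved by $\theta_{s,a} = \hat{\theta}_{s,a}$ with an objective value of 0. Therefore, suppose that $\sum_{s' \in \mathcal{S}} b_{s'} P^{\hat{\theta}}_{s,a,s'} > \beta$. Without any type of reformulation, the model is a univariate optimisation problem that can be solved via bisection. The only complication in solving this problem via bisection is the constraint $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} \le \beta$. Note that, since $I_E^{-1}(\hat{\theta}_{s,a})$ is the asymptotic variance of the MLE $\hat{\theta}_{s,a}$, we have that $I_E(\hat{\theta}_{s,a}) \ge 0$. Hence, since $I_E(\hat{\theta}_{s,a})$ is constant in $\theta_{s,a}$, the objective of (11) is equivalent to:
$$
\min_{\theta_{s,a}} \left| \hat{\theta}_{s,a} - \theta_{s,a} \right|.
$$
Therefore, it is clear that the optimal solution to (11) is the closest $\theta_{s,a}$ to $\hat{\theta}_{s,a}$ in terms of absolute value that satisfies $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} \le \beta$. Since $\sum_{s' \in \mathcal{S}} b_{s'} P^{\hat{\theta}}_{s,a,s'} > \beta$, the optimal solution must satisfy $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} = \beta$. To see this, observe that any feasible solution with $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} < \beta$ must be further left or right of $\hat{\theta}_{s,a}$ than a solution with $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} = \beta$. Suppose that the problem is feasible and let $\theta^{\min}_{s,a}$ and $\theta^{\max}_{s,a}$ be global lower and upper bounds on $\theta_{s,a}$. Then, there must be at least one $\theta_{s,a} \in [\theta^{\min}_{s,a}, \theta^{\max}_{s,a}]$ such that $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} = \beta$. Based on this, we have three potential scenarios as discussed below: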
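1. There exists a root of $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} = \beta$ in $[\theta^{\min}_{s,a}, \hat{\theta}_{s,a}]$. Let $\theta^l_{s,a}$ be the closest root of $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} = \beta$ to $\hat{\theta}_{s,a}$ in the interval $[\theta^{\min}_{s,a}, \hat{\theta}_{s,a}]$.
2. There exists a root of $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} = \beta$ in $[\hat{\theta}_{s,a}, \theta^{\max}_{s,a}]$. Let $\theta^u_{s,a}$ be the closest root of $\sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'} = \beta$ to $\hat{\theta}_{s,a}$ in the interval $[\hat{\theta}_{s,a}, \theta^{\max}_{s,a}]$.
3. $\theta^l_{s,a}$ and $\theta^u_{s,a}$ both exist as defined above.
Solving the projection problem then amounts to finding $\theta^l_{s,a}$ and $\theta^u_{s,a}$, and checking which is closest to $\hat{\theta}_{s,a}$. Given this, we solve our projection problem to $\delta$-optimality for a given $(s, a)$ using the following algorithm: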
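1. Initialise a gap $\tilde{\epsilon}$, the set of root-containing intervals as $\rho = \emptyset$, and upper and lower bounds on $\theta_{s,a}$ as $\theta^{\min}_{s,a}, \theta^{\max}_{s,a}$.
2. Find the interval containing the closest left root:
   (a) Initialise $\theta_{s,a} = \hat{\theta}_{s,a}$, $E = \beta$.
   (b) While $E \ge \beta$ and $\theta_{s,a} \ne \theta^{\min}_{s,a}$:
      i. Set $\theta_{s,a} = \max\{\theta_{s,a} - \tilde{\epsilon}, \theta^{\min}_{s,a}\}$.
      ii. Compute $P^\theta_{s,a}$ and set $E = \sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'}$.
   (c) If $E \le \beta$ then set $\rho = \rho \cup \{[\theta_{s,a}, \theta_{s,a} + \tilde{\epsilon}]\}$.
3. Find the interval containing the closest right root:
   (a) Initialise $\theta_{s,a} = \hat{\theta}_{s,a}$, $E = \beta$.
   (b) While $E \ge \beta$ and $\theta_{s,a} \ne \theta^{\max}_{s,a}$:
      i. Set $\theta_{s,a} = \min\{\theta_{s,a} + \tilde{\epsilon}, \theta^{\max}_{s,a}\}$.
      ii. Compute $P^\theta_{s,a}$ and set $E = \sum_{s' \in \mathcal{S}} b_{s'} P^\theta_{s,a,s'}$.
   (c) If $E \le \beta$ then set $\rho = \rho \cup \{[\theta_{s,a} - \tilde{\epsilon}, \theta_{s,a}]\}$.
4. Carry out a bisection search in each interval in $\rho$ to find the roots $\theta^l_{s,a}$ and $\theta^u_{s,a}$, stopping once the difference between the upper and lower bounds on the objective function in the bisection interval is no larger than $\delta$. Store the intervals $[\underline{\theta}^x_{s,a}, \bar{\theta}^x_{s,a}]$ for $x \in \{l, u\}$.
5. Return the interval $[\underline{\theta}^*_{s,a}, \bar{\theta}^*_{s,a}]$ whose midpoint is closest to $\hat{\theta}_{s,a}$ in terms of absolute value.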
We use an iterative procedure starting from $\hat{\theta}_{s,a}$ in steps 2 and 3 in order to reduce the number of times we need to compute $P^\theta_{s,a}$. Since we are only interested in the closest roots to $\hat{\theta}_{s,a}$, there is no need to enumerate all intervals of width $\tilde{\epsilon}$. Note that, in some cases, $\theta_{s,a}$ may not have both a global lower and upper bound. For example, if $\theta_{s,a}$ is a Poisson parameter, then it has no upper bound. However, if the solution to the projection problem does not lie in the ambiguity set, then it is treated the same as if the problem is infeasible. Hence, we are only interested in roots inside the ambiguity set and so in such cases we can use the bounds from the ambiguity set. If $\theta_{s,a}$ is a binomial parameter then we can pick $(\theta^{\min}_{s,a}, \theta^{\max}_{s,a})$ to be either $(0, 1)$ or $(\min((\Theta^\alpha_s)'), \max((\Theta^\alpha_s)'))$. Since we will typically split the interval $[\theta^{\min}_{s,a}, \theta^{\max}_{s,a}]$ into the same number of sub-intervals under either choice, each choice results in the same amount of computation, and so the choice of upper and lower bounds is not of particular importance.
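The scan-then-bisect procedure above can be sketched as follows. This is a simplified illustration rather than the exact algorithm: it stops the bisection on a tolerance `tol` in $\theta$ instead of the $\delta$ gap on the objective, and it returns a single feasible point instead of an interval. The function names and the toy expectation `expected_b` (for which $\sum_{s'} b_{s'} P^\theta_{s,a,s'} = 3\theta$) are hypothetical.

```python
from math import comb


def scan_for_bracket(E, theta_hat, beta, step, bound):
    """Step from theta_hat towards `bound` until E(theta) <= beta.

    Returns (infeasible_end, feasible_end), or None if the bound is
    reached first. Assumes E(theta_hat) > beta on entry.
    """
    theta, prev, value = theta_hat, theta_hat, E(theta_hat)
    while value > beta and theta != bound:
        prev = theta
        if bound < theta_hat:
            theta = max(theta - step, bound)
        else:
            theta = min(theta + step, bound)
        value = E(theta)
    return (prev, theta) if value <= beta else None


def bisect_root(E, infeasible_end, feasible_end, beta, tol):
    """Shrink the bracket below width tol and return its feasible end."""
    while abs(feasible_end - infeasible_end) > tol:
        mid = 0.5 * (infeasible_end + feasible_end)
        if E(mid) <= beta:
            feasible_end = mid
        else:
            infeasible_end = mid
    return feasible_end


def project(E, theta_hat, beta, theta_min, theta_max, step=0.01, tol=1e-9):
    """Closest theta to theta_hat with E(theta) <= beta, or None."""
    if E(theta_hat) <= beta:
        return theta_hat  # trivial case: the estimate is already feasible
    roots = []
    for bound in (theta_min, theta_max):
        bracket = scan_for_bracket(E, theta_hat, beta, step, bound)
        if bracket is not None:
            roots.append(bisect_root(E, bracket[0], bracket[1], beta, tol))
    return min(roots, key=lambda t: abs(t - theta_hat)) if roots else None


def expected_b(p, b=(0.0, 1.0, 2.0, 3.0)):
    """Toy expectation: X ~ Bin(3, p), next state equal to x, so E = 3p."""
    n = len(b) - 1
    return sum(b_x * comb(n, x) * p**x * (1 - p) ** (n - x) for x, b_x in enumerate(b))


theta_star = project(expected_b, theta_hat=0.5, beta=0.9, theta_min=0.0, theta_max=1.0)
```

In this toy case only the left scan finds a bracket, since the expectation increases with the parameter, and the returned point is close to 0.3.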
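Given the above, we adapt the non-parametric bisection search algorithm from Section 3.4 into the following parametric algorithm, which we call parametric bisection search (PBS):
1. Initialise $\epsilon$ and define $\bar{v}^0_s = \bar{R}_s(v^n) = \frac{\max_{(a,s') \in \mathcal{A} \times \mathcal{S}} r_{s,a,s'}}{1-\gamma}$, $\underline{v}^0_s = \max_{a \in \mathcal{A}} \min_{s' \in \mathcal{S}} \{r_{s,a,s'} + \gamma v^n_{s'}\}$ and $\delta = \frac{\epsilon\kappa}{2A + \bar{R}_s(v^n) + A\epsilon}$.
2. For each $i = 0, \dots$:
   (a) Set $\beta = \frac{\underline{v}^i_s + \bar{v}^i_s}{2}$.
   (b) For each $a \in \mathcal{A}$,
      i. If $\tilde{\mathfrak{P}}(\hat{\theta}_{s,a}; r_{s,a} + \gamma v^n, \beta)$ is infeasible, i.e. $\min_{s'} \{r_{s,a,s'} + \gamma v^n_{s'}\} > \beta$, then set $\underline{c}_a$ and $\bar{c}_a$ equal to $\chi^2_{oA,1-\alpha} + 1$. Go to step 2(c).
      ii. Otherwise, solve the projection problem to $\delta$-optimality to obtain action-wise upper and lower bounds $\bar{c}_a$, $\underline{c}_a$ on its objective value. If the projection algorithm returns no solutions, set both to $\chi^2_{oA,1-\alpha} + 1$.
   (c) Use these bounds to update $\underline{v}^i_s$ and $\bar{v}^i_s$:
$$
(\underline{v}^{i+1}_s, \bar{v}^{i+1}_s) = \begin{cases} (\underline{v}^i_s, \beta) & \text{if } \sum_{a \in \mathcal{A}} \bar{c}_a \le \chi^2_{oA,1-\alpha}, \\ (\beta, \bar{v}^i_s) & \text{if } \sum_{a \in \mathcal{A}} \underline{c}_a > \chi^2_{oA,1-\alpha}. \end{cases}
$$
   (d) If $\bar{v}^{i+1}_s - \underline{v}^{i+1}_s < \epsilon$ or $\chi^2_{oA,1-\alpha} \in \left[\sum_{a \in \mathcal{A}} \underline{c}_a, \sum_{a \in \mathcal{A}} \bar{c}_a\right)$, then go to step 3.
3. Return $\beta = \frac{\underline{v}^{i+1}_s + \bar{v}^{i+1}_s}{2}$.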
Given this algorithm, we can efficiently carry out value iteration without ever needing a solver. However, after this is complete, the optimal policy will need to be retrieved by solving the approximate LP reformulation (8) of the robust Bellman update. This can be done using the cutting surface algorithm of Section 3.6.1.
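The outer bisection loop of PBS can be sketched independently of how the projection problems are solved. In the sketch below, `distance_bounds` is a hypothetical callable that returns, for a given $\beta$, one `(lower, upper)` pair of bounds per action; the toy model, in which the expected value of action $a$ equals $\theta_{s,a}$ and the Fisher information is 100, is an assumption made purely for illustration.

```python
def bisection_update(v_lo, v_hi, distance_bounds, budget, eps=1e-7, max_iter=200):
    """Bisection on the robust value, given per-action distance bounds.

    distance_bounds(beta) returns a list of (lower, upper) bounds on the
    objective of each action's projection problem. An infeasible action
    would be reported as (budget + 1, budget + 1).
    """
    for _ in range(max_iter):
        beta = 0.5 * (v_lo + v_hi)
        bounds = distance_bounds(beta)
        total_lo = sum(lo for lo, _ in bounds)
        total_hi = sum(hi for _, hi in bounds)
        if total_hi <= budget:
            v_hi = beta  # a value of at most beta is attainable within the set
        elif total_lo > budget:
            v_lo = beta  # a value of at most beta is not attainable
        else:
            break  # the budget lies between the bounds: stop
        if v_hi - v_lo < eps:
            break
    return 0.5 * (v_lo + v_hi)


THETA_HAT = (0.5, 0.7)    # hypothetical MLEs for two actions
FISHER = (100.0, 100.0)   # hypothetical Fisher information values


def toy_distance_bounds(beta):
    """Toy model: the value of action a equals theta_a, so the closest
    feasible parameter is min(theta_hat_a, beta) and the bounds are exact."""
    out = []
    for theta_hat, info in zip(THETA_HAT, FISHER):
        d = info * max(0.0, theta_hat - beta) ** 2
        out.append((d, d))
    return out


v_new = bisection_update(0.0, 1.0, toy_distance_bounds, budget=4.0)
```

With a budget of 4, the returned value is approximately 0.5, the point at which the summed distances $100(0.5 - \beta)_+^2 + 100(0.7 - \beta)_+^2$ equal the budget.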
4 A capacitated dynamic multi-period newsvendor problem
As an example problem, we consider a dynamic multi-period newsvendor problem. This version
of the problem has discrete demands and actions, and a capacity limiting the amount that can
be held in inventory for any given period. In Section 4.1, we describe the model in detail. Then,
in Section 4.2, we formulate the model under binomial demands and perform computational
experiments to test our algorithms in this case. Finally, in Section 4.3, we formulate the model
and test our algorithms under Poisson demands.
4.1 Model
Suppose that $S_t$ represents the amount of inventory in a system of some product affected by uncertain demand. Let $a_t$ be the amount of this product to order at the start of period $t$, to be sold during period $t$. Products are delivered immediately. We assume that there is a capacity $C$ for holding stock in inventory, so that $\mathcal{S} = \{0, \dots, C\}$. Given that action $a$ is taken in state $s$, if $s + a > C$ then any excess product is lost as it cannot be stored. Although the newsvendor could technically order infinite stock, they have no reason to. Hence, $\mathcal{A} = \{0, \dots, C\}$, and so $S = |\mathcal{S}| = C + 1$ and $A = |\mathcal{A}| = C + 1$. We assume that every unit of stock that must be held for a period incurs a holding cost of $h$, and if the newsvendor runs out of stock then they pay a stockout cost of $b'$. Furthermore, assume that one unit of stock sells for $c$ and is purchased for $w < c$. Let the demand for the product, $X_{s,a}$, be a random variable whose distribution is parameterised by the unknown parameter $\theta^0_{s,a}$ for each $(s, a) \in \mathcal{S} \times \mathcal{A}$. Then, given $S_t = s$ and $a_t = a$, we have:
$$
S_{t+1} = \max\{0, \min\{s + a, C\} - X_{s,a}\}.
$$
For shorthand, let $\bar{s} = \min\{s + a, C\}$ be the post-action pre-demand state. Then, we have that $g(x \mid s, a) = \max\{0, \bar{s} - x\}$, and therefore:
$$
\mathcal{X}_{s,a}(s') = \begin{cases} \{\bar{s}, \bar{s} + 1, \dots, C - 1, C\} & \text{if } s' = 0, \\ \{\bar{s} - s'\} & \text{if } s' > 0. \end{cases}
$$
Therefore, the transition distribution satisfies:
$$
P^0_{s,a,s'} = \begin{cases} \sum_{x = \bar{s}}^{C} f_{X_{s,a}}(x \mid \theta^0_{s,a}) & \text{if } s' = 0, \\ f_{X_{s,a}}(\bar{s} - s' \mid \theta^0_{s,a}) & \text{if } s' > 0 \end{cases}
= \begin{cases} 1 - \sum_{x = 0}^{\bar{s} - 1} f_{X_{s,a}}(x \mid \theta^0_{s,a}) & \text{if } s' = 0, \\ f_{X_{s,a}}(\bar{s} - s' \mid \theta^0_{s,a}) & \text{if } s' > 0. \end{cases}
$$
The reward for taking action $a$ in state $s$ and moving to state $s'$ is given by the following. Define the event of a stockout as $\mathbb{1}\{s' = 0\}$. Then, the rewards are given by:
$$
r_{s,a,s'} = c \max\{\bar{s} - s', 0\} - wa - h(\bar{s} - \max\{\bar{s} - s', 0\}) - b' \mathbb{1}\{s' = 0\}.
$$
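As an illustration of the transition and reward formulas of this section, the sketch below builds one row of the transition matrix and evaluates the reward for a small instance. The function names, the use of a binomial demand PMF and all numerical values are assumptions chosen only for this example.

```python
from math import comb


def binom_pmf(x, n, p):
    """PMF of Bin(n, p), used here as an example demand distribution."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def newsvendor_row(s, a, C, pmf):
    """Transition probabilities to s' = 0..C after ordering a in state s."""
    s_bar = min(s + a, C)  # post-action, pre-demand stock
    row = [0.0] * (C + 1)
    for s_next in range(1, s_bar + 1):
        row[s_next] = pmf(s_bar - s_next)            # demand of exactly s_bar - s'
    row[0] = 1.0 - sum(pmf(x) for x in range(s_bar))  # demand of at least s_bar
    return row


def reward(s, a, s_next, C, c, w, h, b_prime):
    """Sales revenue minus ordering, holding and stockout costs."""
    s_bar = min(s + a, C)
    sold = max(s_bar - s_next, 0)
    stockout = 1 if s_next == 0 else 0
    return c * sold - w * a - h * (s_bar - sold) - b_prime * stockout


# Hypothetical instance: C = 3, demand ~ Bin(3, 0.4), state 1, order 1.
row = newsvendor_row(1, 1, 3, lambda x: binom_pmf(x, 3, 0.4))
r0 = reward(1, 1, 0, C=3, c=10, w=5, h=1, b_prime=1)
r1 = reward(1, 1, 1, C=3, c=10, w=5, h=1, b_prime=1)
```

States above $\bar{s}$ receive probability zero, since stock cannot increase once the order has arrived.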
The term $b' \mathbb{1}\{s' = 0\}$ will charge the newsvendor a flat cost of $b'$ whenever they miss demand. It is more common in the newsvendor literature to incur a backorder cost for every unit of missed demand, representing the newsvendor paying an additional cost to meet this demand after initially not meeting it. This would involve adding a cost of $b' \max\{X_{s,a} - \bar{s}, 0\}$ instead of $b' \mathbb{1}\{s' = 0\}$. Hence, the rewards would depend on $X_{s,a}$ and we would need to formulate the robust Bellman update (1) in a different fashion. The main change would be that the distribution of $X_{s,a}$ would be required to calculate the expected one-stage rewards as opposed to simply the transition matrix. Hence, we would replace the inner minimisation over $P_s$ with a minimisation over candidates $P'_s \in \mathcal{P}'_s$ for the true distribution of demand $X_s$. We would then replace the inner expected value with respect to the next state with an expectation with respect to $X_{s,a}$. Since each $P'_{s,a}$ has dimension $|\mathcal{X}_{s,a}|$, this would remain tractable in the non-parametric case for finite support demand random variables. However, since we do not know any moments of the distribution of $X_{s,a}$, it would result in an infinite number of decision variables for infinite support demand random variables. This would not affect the parametric model, however, which would still find the worst-case parameter directly. For more details on the reformulations in the case of a backorder cost, see Appendix C.
Considering a stockout cost instead of a backorder cost means that the robust Bellman update
can be computed via an expectation over the finite set S, regardless of whether or not Xs,a
is finite. The downside of this formulation is that it can penalise the newsvendor for meeting
demand exactly. However, if this is a concern then one can set w+h > b′to ensure that the
newsvendor would still prefer to meet demand exactly and pay a stockout cost than to purchase
too much stock and hold one item for the following period. Also, it is important to note that
shortage costs are implicitly represented in this model via missed profits and the newsvendor
can see how much demand was lost after the period is complete.
4.2 Numerical experiments with binomial demands
To examine the efficacy of the algorithms described in this paper, we now carry out numerical
experiments on the dynamic newsvendor problem. Firstly, in Section 4.2.1, we describe the
binomial ambiguity sets used. Then, in Section 4.2.2, we describe the parameters used. Following
this, in Section 4.2.3, we discuss the times taken by each algorithm to finish value iteration and
compute the optimal policy. Finally, in Section 4.2.4, we compare the parametric and non-
parametric value functions and resulting policies.
4.2.1 Ambiguity sets
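Suppose that $X_{s,a} \sim \text{Bin}(C, p^0_{s,a})$ for $(s, a) \in \mathcal{S} \times \mathcal{A}$, and hence:
$$
f_{X_{s,a}}(x \mid p^0_{s,a}) = \binom{C}{x} (p^0_{s,a})^x (1 - p^0_{s,a})^{C - x} \quad (x \in \{0, \dots, C\}).
$$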
We set the number of trials as $C$ for the following reasons. Since a binomial random variable is bounded above by the number of trials, binomial demands might correspond to a scenario where a restriction is placed on demand by the newsvendor. In this case, the number of trials represents the maximum demand allowed by the newsvendor before no more orders are allowed. The number of trials provides a way for the newsvendor to limit the amount of unmet demand that is possible. Since any demand above $C$ is guaranteed to be unmet regardless of the current stock levels, it is not reasonable for the number of trials to be set above $C$. This would not have any benefit for the newsvendor or the customers. Another logical choice for the number of trials may be $\min\{s + a, C\}$. However, this would imply that the newsvendor would need to update the upper bound on demand after every order, and this information would need to be conveyed to customers. In addition, it suggests that the newsvendor is always able to meet demand exactly, which is not a realistic modelling assumption. Also, the newsvendor would be unable to observe how much demand was lost or if the demand met the capacity, which is inconvenient for improving their decision making and capacity levels. When the maximum demand is $C$, the newsvendor can infer whether or not more capacity is required from how often a demand of $C$ occurs. Similarly, they can decide if they have too much capacity if, for example, the demand is always less than the capacity.
Since the number of trials is fixed, the distribution of $X$ is uniquely parameterised by $p^0 = (p^0_{s,a})_{(s,a) \in \mathcal{S} \times \mathcal{A}}$. In the notation of Section 3.5.2, this means that $o = 1$. Suppose that we take a sample $x_{s,a} = (x^1_{s,a}, \dots, x^N_{s,a})$ from the distribution of $X_{s,a}$ for each $(s, a) \in \mathcal{S} \times \mathcal{A}$. Then, the MLE $\hat{p}_{s,a}$ of $p^0_{s,a}$ is given by:
$$
\hat{p}_{s,a} = \frac{\sum_{j=1}^{N} x^j_{s,a}}{NC} \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}.
$$
In addition, the Fisher information is given by:
$$
I_E(\hat{p}_{s,a}) = \frac{NC}{\hat{p}_{s,a}(1 - \hat{p}_{s,a})}.
$$
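As a check on this expression, the following short derivation recovers it from definition (9) with $o = 1$. It is a sketch that assumes the $N$ observations are independent draws from $\text{Bin}(C, p)$, drops the subscripts $s, a$ for readability, and uses $\mathbb{E}[x^j] = Cp$ in the last line:
$$
\begin{aligned}
\ell(p) &= \sum_{j=1}^{N} \left[ \log \binom{C}{x^j} + x^j \log p + (C - x^j) \log(1 - p) \right], \\
\frac{\partial^2 \ell}{\partial p^2} &= -\sum_{j=1}^{N} \frac{x^j}{p^2} - \sum_{j=1}^{N} \frac{C - x^j}{(1 - p)^2}, \\
I_E(p) &= \frac{NCp}{p^2} + \frac{NC(1 - p)}{(1 - p)^2} = \frac{NC}{p} + \frac{NC}{1 - p} = \frac{NC}{p(1 - p)}.
\end{aligned}
$$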
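Therefore, our approximate $100(1 - \alpha)\%$ confidence set for $p^0_s$ is given by:
$$
\Theta^\alpha_s = \left\{ p_s \in [0, 1]^A : \sum_{a \in \mathcal{A}} \frac{NC(p_{s,a} - \hat{p}_{s,a})^2}{\hat{p}_{s,a}(1 - \hat{p}_{s,a})} \le \chi^2_{A,1-\alpha} \right\} \tag{12}
$$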
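As discussed in Section 3.5.1, in order for the parametric robust Bellman update (7) to be tractable, we consider discrete ambiguity sets. Since (12) is a multivariate set, it is difficult to discretise directly. Therefore, we will construct a set $\Theta^{\text{base}}_s$ such that $\Theta^\alpha_s \subseteq \Theta^{\text{base}}_s$ and discretise $\Theta^{\text{base}}_s$ instead. Then, we construct a discretisation of $\Theta^\alpha_s$ by extracting all elements of the discretisation of $\Theta^{\text{base}}_s$ that also lie in $\Theta^\alpha_s$. Observe that the definition of $\Theta^\alpha_s$ implies that every $p_s \in \Theta^\alpha_s$ satisfies:
$$
p_{s,a} \in p^I_{s,a} = \left[ \max\left\{ 0,\ \hat{p}_{s,a} - \sqrt{\frac{\chi^2_{A,1-\alpha}\, \hat{p}_{s,a}(1 - \hat{p}_{s,a})}{NC}} \right\},\ \min\left\{ 1,\ \hat{p}_{s,a} + \sqrt{\frac{\chi^2_{A,1-\alpha}\, \hat{p}_{s,a}(1 - \hat{p}_{s,a})}{NC}} \right\} \right]
$$
for all $a \in \mathcal{A}$. Therefore, defining:
$$
\Theta^{\text{base}}_s = p^I_{s,1} \times \dots \times p^I_{s,A},
$$
we have $\Theta^\alpha_s \subseteq \Theta^{\text{base}}_s$. Furthermore, define $p^l_{s,a}$ and $p^u_{s,a}$ as the lower and upper bounds of $p^I_{s,a}$ for each $(s, a) \in \mathcal{S} \times \mathcal{A}$. We can then find discretisations of each $p^I_{s,a}$ containing $M$ points as follows:
$$
\tilde{p}^I_{s,a} = \left\{ p^l_{s,a} + m \frac{p^u_{s,a} - p^l_{s,a}}{M - 1} : m = 0, \dots, M - 1 \right\}.
$$
Then, a discretisation of $\Theta^{\text{base}}_s$ is given by $(\Theta^{\text{base}}_s)' = \tilde{p}^I_{s,1} \times \dots \times \tilde{p}^I_{s,A}$. Finally, a discretisation of $\Theta^\alpha_s$ is given by $(\Theta^\alpha_s)' = (\Theta^{\text{base}}_s)' \cap \Theta^\alpha_s$.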
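The construction of $(\Theta^\alpha_s)'$ can be sketched as a grid over the base box followed by a filter. The code below is a minimal illustration under assumptions: the function names are hypothetical, the chi-squared quantile is supplied as a number, $M \ge 2$, and a small relative tolerance (not part of the construction above) is added so that grid points lying exactly on the boundary are not lost to floating-point rounding.

```python
from itertools import product
from math import sqrt


def interval(p_hat, N, C, chi2_q):
    """Marginal interval p^I for one action, clipped to [0, 1]."""
    half = sqrt(chi2_q * p_hat * (1 - p_hat) / (N * C))
    return max(0.0, p_hat - half), min(1.0, p_hat + half)


def grid(lo, hi, M):
    """M equally spaced points from lo to hi inclusive (requires M >= 2)."""
    return [lo + m * (hi - lo) / (M - 1) for m in range(M)]


def discretise(p_hat, N, C, chi2_q, M, rel_tol=1e-9):
    """Grid the base box, then keep only the points inside the set (12)."""
    grids = [grid(*interval(ph, N, C, chi2_q), M) for ph in p_hat]
    kept = []
    for p in product(*grids):
        stat = sum(
            N * C * (pa - ph) ** 2 / (ph * (1 - ph)) for pa, ph in zip(p, p_hat)
        )
        if stat <= chi2_q * (1 + rel_tol):
            kept.append(p)
    return kept


# Hypothetical example: A = 2, N = 10, C = 3, 95% quantile 5.991, M = 3.
points = discretise((0.5, 0.5), N=10, C=3, chi2_q=5.991, M=3)
```

In this example the base grid has $3^2 = 9$ points, of which the four corners are discarded, which illustrates why the size of the discretised set depends on both $M$ and the number of actions.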
4.2.2 Experimental design
We now detail the experiments used to test our algorithms on the dynamic newsvendor problem. The parameters used were as follows. We considered $w, h, b', c \in \{1, 5, 10\}$ such that $w < c$. The capacities we considered were $C \in \{1, 2, 3, 7, 9, 14\}$. This leads to $|\mathcal{S}| = |\mathcal{A}| \in \{2, 3, 4, 8, 10, 15\}$. We used a discount parameter of $\gamma = 0.5$ in all cases. For each algorithm, the value iteration algorithm was run for a maximum of $n^{\max} = 1000$ iterations. With regard to ambiguity sets, we always used $\alpha = 0.05$, the discretisation parameter was $M \in \{3, 5, 10\}$ and we took $N \in \{10, 50\}$ samples to create the MLEs. Each algorithm was given a maximum time of 4 hours to complete value iteration and find the optimal policy after value iteration ended. In addition, the parametric algorithms were given a maximum of 4 hours to complete their precomputation, i.e. computing the discrete ambiguity set and corresponding transition probabilities. Note that this is not required for solving value iteration with PBS, but it is required to compute the optimal policy after value iteration ends. If an algorithm ran for 4 hours and the model was not solved, then the algorithm is said to have timed out for this instance. Both the parametric and non-parametric models used $100(1 - \alpha)\%$ confidence sets as ambiguity sets. The parametric model used (12) or a discretisation thereof, and the non-parametric model used (2) where $\kappa$ is defined by (3). In addition, we used a value iteration tolerance of $\varepsilon = 10^{-6}$ and we initialised the value functions as $v^0 = 0$. In PBS, we used a gap of $\tilde{\epsilon} = 0.01$ with $(\theta^{\min}_{s,a}, \theta^{\max}_{s,a}) = (0, 1)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Finally, the bisection search tolerance used for PBS and NBS was $\epsilon = 10^{-7}$.
The above inputs generated 810 instances. We ran value iteration on each instance using 5
different algorithms, where each one is defined by how it solves each robust Bellman update.
The algorithms and how they solve the update are as follows:
1. PBS: the parametric projection-based bisection search algorithm of Section 3.6.2.
2. CS: the cutting surface algorithm of Section 3.6.1.
3. LP: using Gurobi to solve the approximate LP reformulation (8) of the parametric update.
4. QP: using Gurobi to solve the CQP reformulation (5) of the non-parametric update (2).
5. NBS: the non-parametric projection-based bisection search algorithm of Section 3.4.2.
Value iteration was run until either nmax iterations had been completed, 4 hours of run time
had been used, or the algorithm converged. After value iteration ended, for LP, CS and QP
the policy was returned. For PBS, the policy was extracted using CS. For NBS, the policy was
extracted using QP.
4.2.3 Times taken
In this section, we summarise the times taken by the algorithms. We first present the number of
times that each algorithm timed out. Firstly, LP and CS timed out while running value iteration
56 and 54 times respectively. No other algorithm timed out while running value iteration.
Secondly, although PBS never timed out while running value iteration, it timed out twice while
computing the optimal policy. As we will show, PBS is a fast algorithm in itself, and these two
timeouts are a result of the slowness of CS in instances with large ambiguity sets.
Due to the above result, we present the times taken to run value iteration separately from the
times taken to compute the policy. Table 1 summarises the amount of time that each algorithm
spent running value iteration. This table shows that PBS took 31 seconds on average to finish
value iteration, while CS took 17 minutes 30 seconds and LP took 26 minutes. It is therefore
clear that PBS results in greatly reduced times to complete value iteration compared to these
solver-based algorithms. CS also saves approximately 12 minutes per iteration compared with
LP on average. Note that LP and CS’s average times per iteration are large because, when they
timed out, they usually timed out after only one iteration. In addition, NBS took 43 seconds on
average to complete value iteration, which is 33% slower than PBS. On average, NBS is much
faster than its solver-based equivalent QP, which took over 6 minutes on average to finish value
iteration. However, it is important to note that QP led to convergence issues in our experiments.
While all other algorithms always finished value iteration in around 31 iterations, when using
QP value iteration failed to converge in 378 instances. This was likely due to Gurobi being
unable to provide precise enough optimal objective values. In addition, QP was also the fastest
algorithm per iteration, and was only slow overall due to value iteration’s failure to converge
when using this algorithm.
Algorithm   Mean Time    Max Time     Mean Time Per Iteration
PBS         0:00:31.09   0:04:29.37   0:00:01.07
CS          0:17:30.90   4:00:00      0:04:25.77
LP          0:26:00.73   4:00:00      0:16:20.66
QP          0:06:16.24   0:55:59.19   0:00:00.41
NBS         0:00:43.52   0:09:21.02   0:00:01.48
Table 1: Summary of times taken to run value iteration (binomial). Times are given as h:mm:ss.
It is clear from this table that CS and LP can both become very slow. The main reason for this is $M$, the parameter defining the fineness of the discretisation of $\Theta^\alpha_s$ used by the parametric solver-based algorithms. We confirm this with Figure 1, which shows boxplots of CS and LP's value iteration run times by $M$. Figures 1a and 1b show that both CS and LP scale poorly with $M$ in terms of value iteration run times. However, the effect of $M$ is not particularly noticeable until $M = 10$, as was the case for PBS's policy times. CS scales better than LP, but it still becomes slow for instances with large $M$ or large $C$. However, since PBS does not return a policy, an algorithm like CS is required to generate the optimal policy. Please note that, unlike the CS algorithm of Black et al. (2022), this CS algorithm is the optimal version which finds the worst-case parameter over the entire ambiguity set in every iteration. This explains why it does not offer the same level of time savings when compared with LP as the CS algorithm of Black et al. (2022).
Figure 1: Boxplots of value iteration run times of (a) CS and (b) LP, by $M$.
Since PBS and NBS do not rely on a discrete ambiguity set, their value iteration times are not affected by $M$. Therefore, the main parameter affecting their value iteration times is $C$. We present boxplots of PBS and NBS's value iteration times by $C$ in Figure 2. Figures 2a and 2b show similar increases in times as $C$ increases, but it is clear that PBS generally scaled better with $C$ than NBS. For $C = 14$, PBS typically took no longer than 250 seconds to complete value iteration, while NBS typically took no longer than 450 seconds. On the other hand, NBS was slightly faster than PBS for small $C$. The reason for the difference in scaling is likely because NBS solves $S + 1 = C + 2$ sub-problems in order to solve a projection problem, whereas PBS always carries out a 3-step procedure to solve its projection problems. Hence, the number of steps involved in solving a projection problem increases with $C$ for NBS, but not for PBS.
Figure 2: Boxplots of value iteration run times of (a) PBS and (b) NBS, by $C$ (binomial).
Although PBS was faster than NBS in running value iteration, it generally took longer to compute the optimal policy in the parametric case than in the non-parametric case.
This is because computing the policy is not part of PBS, and so CS had to be used for this. On
average, it took 8 minutes 27 seconds for CS to compute the policy for PBS’s value function,
and only 0.43 seconds for QP to compute NBS's. However, the parametric average is greatly affected
by a small selection of very slow instances. Figure 3a shows a boxplot of the times taken to
compute the policy for PBS and NBS’s value functions after value iteration ended. Note that
this boxplot does not show outliers, which are defined as any data that are further than 1.5
times the interquartile range above the 75th percentile or below the 25th percentile. Figure 3a
shows that while the average time to compute the policy for PBS's values was 8 minutes 27
seconds, the median time was only 0.128 seconds. The 25th and 75th percentiles of the time
taken to compute the parametric policy are 0.0325 and 1.34 seconds.
Figure 3b explains why the average time to compute the policy for PBS was large. In particular, it shows that this starts to take a very long time when $M = 10$. This parameter defines the fineness of the discretisation of $\Theta^\alpha_s$ used by CS when computing the policy. Since PBS uses $\Theta^\alpha_s$ directly in value iteration, this parameter does not affect its value iteration run times. Hence, the slow times to compute a policy for PBS are a reflection on CS's scaling with respect to $M$, not PBS's. This can be confirmed by comparing Figure 3b with Figure 1a, and observing that the exact same pattern is present in both. Due to this, it is a downside that PBS does not provide an optimal policy. The same applies to NBS, but since QP is faster than CS in generating a policy, the effect is not so severe. Comparing the times taken to compute optimal policies is therefore not comparing the run times of PBS and NBS, but comparing the run times of CS and QP.
Figure 3: Boxplots of times taken to compute optimal policy for (a) both BS algorithms and (b) PBS by $M$ (binomial).
4.2.4 Comparison of value functions, distributions and policies
In this section, we compare the values, policies and worst-case transition distributions from the
five algorithms tested. We first discuss the effect of the discretisation of $\Theta^\alpha_s$ on
the value functions from LP and CS. Following this, we compare the outputs from PBS and NBS in
order to assess the benefits of incorporating additional distributional information into the model.
Let $v^y$ be the value function generated by running value iteration with algorithm
$y \in \mathcal{Y} = \{\text{PBS}, \text{CS}, \text{LP}, \text{QP}, \text{NBS}\}$. Similarly, define
$\pi^y$ as the policy and $P^y$ as the worst-case transition distribution from using algorithm $y$
in the value iteration algorithm. Then, we can summarise the differences between LP's approximate
value functions and PBS's optimal value functions via Figure 4. Figures 4a and 4b show boxplots
of the mean difference between $v^{\text{LP}}$ and $v^{\text{PBS}}$ over all instances where LP did
not time out, by $C$ and $M$, respectively. The mean differences are calculated as:
\[
\frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \left( v^{\text{PBS}}_s - v^{\text{LP}}_s \right).
\]
We see from Figure 4a that the average difference between $v^{\text{PBS}}$ and $v^{\text{LP}}$ was
always negative, with the magnitude of the difference growing larger as $C$ increases. This means
that LP's value function estimates are higher than PBS's optimal values. This is intuitive, since
LP uses a discrete subset of $\Theta^\alpha_s$ and therefore cannot, in general, find the true
worst-case parameter for any given $\pi$. Therefore, LP's value function estimates overestimate
the worst-case reward for a given policy. As is reflected in Figure 4b, the two value functions
get closer as $M$ increases, with average differences that are less than 2.5 in absolute value for
$M = 10$. These plots suggest two results. Firstly, the effect of the discretisation increases as
$C$ increases. In other words, for larger $C$, the value function estimates resulting from the
discrete approximations are less accurate. Secondly, the value function estimates resulting from
the discretised ambiguity set appear to converge, as $M$ increases, to their optimal values over
the complete (not discretised) ambiguity set.
Figure 4: Boxplots of mean difference between $v^{\text{LP}}$ and $v^{\text{PBS}}$ by (a) $C$ and (b) $M$ (binomial).
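We also compare the values from NBS with those from PBS in Figure 5. The quantities plotted
here are the mean differences between $v^{\text{PBS}}$ and $v^{\text{NBS}}$, calculated using:
\[
\frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \left( v^{\text{PBS}}_s - v^{\text{NBS}}_s \right).
\]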
Figure 5 shows that the value functions were generally quite close. The smallest and largest
mean differences between $v^{\text{PBS}}$ and $v^{\text{NBS}}$ were $-2.24$ and $1.56$, respectively.
However, there is a clear pattern in the value function differences as $C$ increases. For small
$C$, Figure 5 shows that $v^{\text{NBS}}$ and $v^{\text{PBS}}$ were very close, with
$v^{\text{PBS}}$'s mean value (taken over $s$) being slightly higher. However, as $C$ increases past
3 we see a clear pattern of $v^{\text{NBS}}$'s mean value becoming larger than $v^{\text{PBS}}$'s.
The magnitude of this difference grows as $C$ increases. This indicates that the worst-case
distributions resulting from the parametric ambiguity set can be worse than those from the
non-parametric set, i.e. they can lead to lower worst-case rewards. This result can be explained
by differences between the parametric and non-parametric ambiguity sets. Recall that the
non-parametric ambiguity set $\mathcal{P}_s$ is defined as containing all distributions $P_s$ that
satisfy the inequality $\sum_{a \in \mathcal{A}} d_\phi(P_{s,a}, \hat{P}_{s,a}) \le \kappa$. In
contrast, the parametric ambiguity set $\Theta^\alpha_s$ is defined using an inequality restricting
the distance from $\hat{\theta}_s$ that $\theta_s$ can take. This does not restrict the distance
from $\hat{P}_s$ that the parametric worst-case can take in the same way as the non-parametric
ambiguity set does. To understand this, we evaluate
$\max_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} d_\phi(P^y_{s,a}, \hat{P}_{s,a})$
for $y \in \{\text{PBS}, \text{NBS}\}$, for every instance solved. We provide boxplots of these
values in Figure 6.
Figure 6 shows that, for every value of $C$, the parametric worst-case distribution was allowed to
be further from $\hat{P}_s$ than the non-parametric worst-case. As $C$ increases, the difference
between the maximum distances for the parametric and non-parametric worst-case distributions
increases, explaining why the value functions are more different for large $C$. This happens since
larger $C$ occurs when $A$ is larger, meaning the left-hand sides of the inequalities defining the
ambiguity sets are sums of more terms, and also $\chi^2_{oA,1-\alpha}$ is increasing in $A$.
Figure 5: Boxplot of mean difference between $v^{\text{NBS}}$ and $v^{\text{PBS}}$ by $C$ (binomial).
Figure 6: Boxplots of $\max_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} d_\phi(P^y_{s,a}, \hat{P}_{s,a})$ for (a) $y = \text{PBS}$ and (b) $y = \text{NBS}$ (binomial).
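The distance statistic plotted in Figure 6 can be sketched in a few lines. This is a minimal illustration, assuming the modified $\chi^2$ divergence, for which $\phi(t) = (t-1)^2$ and hence $d_\phi(P, \hat{P}) = \sum_{s'} (P_{s'} - \hat{P}_{s'})^2 / \hat{P}_{s'}$. The function names and the nested-list layout `P[s][a][s']` are assumptions made for this example, not the authors' implementation.

```python
def modified_chi2(p, p_hat):
    """d_phi(p, p_hat) = sum_s' p_hat[s'] * phi(p[s'] / p_hat[s']), phi(t) = (t - 1)^2."""
    total = 0.0
    for p_i, q_i in zip(p, p_hat):
        if q_i <= 0.0:
            raise ValueError("nominal probabilities must be positive")
        total += (p_i - q_i) ** 2 / q_i
    return total


def max_state_distance(P, P_hat):
    """Maximum over states s of the sum over actions a of d_phi(P[s][a], P_hat[s][a])."""
    return max(
        sum(modified_chi2(P[s][a], P_hat[s][a]) for a in range(len(P[s])))
        for s in range(len(P))
    )


# Toy example with one state, two actions and two next states (hypothetical numbers).
P_hat = [[[0.5, 0.5], [0.25, 0.75]]]
P_worst = [[[0.6, 0.4], [0.25, 0.75]]]
distance = max_state_distance(P_worst, P_hat)
print(distance)
```

For a non-parametric worst case this quantity is bounded by $\kappa$ by construction, whereas a parametric worst case is only constrained through its parameter, so the same function can return a larger value.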
In general, this means that a distribution being binomial may lead to its inclusion in the
parametric confidence set even though it is further from $\hat{P}$ than any distribution in the
non-parametric confidence set. Similarly, distributions that are not binomial need to be much
closer to $\hat{P}$ in order to be considered as candidates for the true distribution. The result
in Figure 6 suggests that, for this problem, the parametric ambiguity sets are more risk-averse.
It is worth noting, however, that NBS's values being slightly higher on average does not
necessarily mean that more long-run reward would be obtained under the non-parametric model. This
depends on the initial distribution $Q$. For example, studying the value functions we see that
$v^{\text{PBS}}_0 \ge v^{\text{NBS}}_0$ was true in 66% of instances,
$v^{\text{PBS}}_1 \ge v^{\text{NBS}}_1$ was true in 65% of instances, and
$v^{\text{PBS}}_2 \ge v^{\text{NBS}}_2$ was true in 50.4% of the instances with $C \ge 2$. If the
initial distribution satisfied, for example, $Q_s > 0$ only for $s \le 2$, then the
non-parametric approach would typically not achieve more long-term expected reward.
We now compare the policies from PBS and NBS. The first characteristic we study is determinism.
In these experiments, we find that PBS's optimal policy was deterministic in 42% of instances.
NBS's policy was deterministic in 16% of instances. This indicates that the parametric model is
more likely to yield deterministic policies than the non-parametric model. In addition, we
studied the expected actions for each state in order to determine how conservative each model is.
Specifically, we calculated $\sum_{a \in \mathcal{A}} a\, \pi^y_{s,a}$ for each $s$, for
$y \in \{\text{PBS}, \text{NBS}\}$, for every instance that we solved. Generally speaking, the
expected actions under PBS and NBS were quite similar. However, we found two main results.
Firstly, when the inventory level $s$ is below 13, PBS will typically purchase slightly more stock.
For the very smallest states, PBS only ordered between 5% and 6% more stock than NBS. However,
for $s = 10$ and $s = 11$ PBS ordered 31% and 24% more, respectively. Secondly, when the stock
level is high, i.e. 13 or 14, NBS will typically purchase more stock. This was most noticeable for
$s = 14$, where NBS ordered 38% more than PBS on average. From this, we can conclude that PBS's
policies are typically less conservative for the majority of states, but that NBS's are slightly
less conservative for the largest 2 states.
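The two policy summaries used above (determinism and the expected action $\sum_{a} a\,\pi_{s,a}$) can be sketched as follows. This is an illustrative sketch only: it assumes that a policy is stored as a list of rows `policy[s][a]` and that the actions are the order quantities $a = 0, 1, \dots$; the tolerance `tol` is also an assumption.

```python
def expected_actions(policy):
    """Expected order quantity sum_a a * policy[s][a] for every state s."""
    return [sum(a * prob for a, prob in enumerate(row)) for row in policy]


def is_deterministic(policy, tol=1e-9):
    """True if every state puts (numerically) all probability on one action."""
    return all(max(row) >= 1.0 - tol for row in policy)


# Hypothetical randomised policy over 2 states and 3 order quantities.
policy = [
    [0.0, 1.0, 0.0],  # state 0: always order 1 unit
    [0.5, 0.0, 0.5],  # state 1: order 0 or 2 units with equal probability
]
print(expected_actions(policy))
print(is_deterministic(policy))
```

Two policies can share the same expected actions while differing in determinism, as the two rows of this toy policy show, so the two summaries capture different aspects of the solution.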
It may seem odd that either algorithm makes positive orders when $s = C = 14$, since any
stock above $C$ is lost. This occurs due to differences in the worst-case distributions for
different actions. For example, when in state $C$ it may be the case that ordering a non-zero
amount of stock leads to a much higher worst-case probability of then transitioning to state zero
(and hence selling all stock). This may happen when $\hat{p}_{C,a}$ is much larger for $a > 0$
than for $a = 0$, which can occur simply due to sampling variation. In some cases, spending some
additional multiple of $w$ to purchase wasted stock results in a higher expected reward due to the
fact that the newsvendor is then much more likely to sell all of their stock. Since NBS ordered
more in the higher states, clearly this was of more benefit under NBS's worst-case distributions
than PBS's. If this is not something that the newsvendor would like to allow, the policy can
always be constrained to enforce that $\pi_{s,a} = 0$ for all $a \in \mathcal{A}$ such that $s + a > C$.
4.3 Numerical experiments with Poisson demands
We now carry out the same experiments as in Section 4.2, but where
$X_{s,a} \sim \text{Pois}(\lambda^0_{s,a})$ for $(s, a) \in \mathcal{S} \times \mathcal{A}$. In
Section 4.3.1 we formulate the Poisson ambiguity sets. Following this, we describe the results of
the experiments in Sections 4.3.3 and 4.3.4.
4.3.1 Ambiguity sets
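Suppose that $X_{s,a} \sim \text{Pois}(\lambda^0_{s,a})$ for $(s, a) \in \mathcal{S} \times \mathcal{A}$, and therefore:
\[
f_{X_{s,a}}(x \mid \lambda^0_{s,a}) = \frac{(\lambda^0_{s,a})^x \exp(-\lambda^0_{s,a})}{x!} \quad (x \in \mathbb{N}_0).
\]
The distribution of $X$ is uniquely parameterised by
$\lambda^0 = (\lambda^0_{s,a})_{(s,a) \in \mathcal{S} \times \mathcal{A}}$. Similarly to
Section 4.2.1, we have $o = 1$ and suppose that we take the sample
$x_{s,a} = (x^1_{s,a}, \dots, x^N_{s,a})$ from $X_{s,a}$ for each
$(s, a) \in \mathcal{S} \times \mathcal{A}$. Then, the MLE $\hat{\lambda}_{s,a}$ of
$\lambda^0_{s,a}$ is given by:
\[
\hat{\lambda}_{s,a} = \frac{\sum_{j=1}^{N} x^j_{s,a}}{N} \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}.
\]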
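The Fisher information is now given by:
\[
I_E(\hat{\lambda}_{s,a}) = \frac{N}{\hat{\lambda}_{s,a}}.
\]
Hence, an approximate $100(1 - \alpha)\%$ confidence set for $\lambda^0_s$ (for large $N$) is given by:
\[
\Theta^\alpha_s = \left\{ \lambda_s \in \mathbb{R}^A_+ : \sum_{a \in \mathcal{A}} \frac{N (\lambda_{s,a} - \hat{\lambda}_{s,a})^2}{\hat{\lambda}_{s,a}} \le \chi^2_{A,1-\alpha} \right\}.
\]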
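As a minimal sketch of these formulas, the snippet below computes the Poisson MLE and tests membership of the confidence set. The function names are hypothetical, and the $\chi^2$ quantile is passed in as an argument rather than computed in general. For the special case $A = 2$ the quantile has the closed form $\chi^2_{2,1-\alpha} = -2\ln\alpha$, which is used here to keep the example free of external libraries. The tiny samples only illustrate the arithmetic; the confidence set itself is a large-$N$ approximation.

```python
import math


def poisson_mle(samples):
    """MLE of a Poisson rate: the sample mean."""
    return sum(samples) / len(samples)


def in_confidence_set(lam, lam_hat, n, chi2_quantile):
    """Check sum_a n * (lam_a - lam_hat_a)^2 / lam_hat_a <= chi2_quantile."""
    if any(lh <= 0.0 for lh in lam_hat):
        raise ValueError("each MLE must be positive for the statistic to be defined")
    stat = sum(n * (l - lh) ** 2 / lh for l, lh in zip(lam, lam_hat))
    return stat <= chi2_quantile


# Hypothetical demand samples for one state and A = 2 actions (N = 4 each).
samples = [[3, 5, 4, 4], [1, 2, 3, 2]]
lam_hat = [poisson_mle(x) for x in samples]

alpha = 0.05
chi2_q = -2.0 * math.log(alpha)  # exact chi-squared quantile for 2 degrees of freedom

print(lam_hat)
print(in_confidence_set([4.5, 2.5], lam_hat, 4, chi2_q))
print(in_confidence_set([7.0, 2.0], lam_hat, 4, chi2_q))
```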
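As in Section 4.2.1, we will construct a discretisation of $\Theta^\alpha_s$ by creating a set
$\Theta^{\text{base}}_s$ with $\Theta^\alpha_s \subseteq \Theta^{\text{base}}_s$ and discretising
this set. Then, we extract elements of this discrete set that also lie in $\Theta^\alpha_s$.
The definition of $\Theta^\alpha_s$ implies that every $\lambda_s \in \Theta^\alpha_s$ satisfies:
\[
\lambda_{s,a} \in \lambda^I_{s,a} = \left[ \max\left\{ 0,\; \hat{\lambda}_{s,a} - \sqrt{\frac{\chi^2_{A,1-\alpha} \hat{\lambda}_{s,a}}{N}} \right\},\; \hat{\lambda}_{s,a} + \sqrt{\frac{\chi^2_{A,1-\alpha} \hat{\lambda}_{s,a}}{N}} \right]
\]
for all $a \in \mathcal{A}$. Hence, we define
$\Theta^{\text{base}}_s = \lambda^I_{s,1} \times \dots \times \lambda^I_{s,A}$ and we have
$\Theta^\alpha_s \subseteq \Theta^{\text{base}}_s$. Furthermore, define $\lambda^l_{s,a}$ and
$\lambda^u_{s,a}$ as the lower and upper bounds of $\lambda^I_{s,a}$ for each
$(s, a) \in \mathcal{S} \times \mathcal{A}$. We calculate discretisations of each
$\lambda^I_{s,a}$, containing $M$ points, as follows:
\[
\tilde{\lambda}^I_{s,a} = \left\{ \lambda^l_{s,a} + m \, \frac{\lambda^u_{s,a} - \lambda^l_{s,a}}{M - 1} : m = 0, \dots, M - 1 \right\}.
\]
Then, a discretisation of $\Theta^{\text{base}}_s$ is given by
$(\Theta^{\text{base}}_s)' = \tilde{\lambda}^I_{s,1} \times \dots \times \tilde{\lambda}^I_{s,A}$.
Finally, a discretisation of $\Theta^\alpha_s$ is given by
$(\Theta^\alpha_s)' = (\Theta^{\text{base}}_s)' \cap \Theta^\alpha_s$.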
4.3.2 Experimental design
All of the main parameters for these experiments are the same as in Section 4.2.2. The only
differences are with respect to the parameters used in the parametric bisection search algorithm
of Section 3.6.2. We again use $\theta^{\min}_{s,a} = 0$ for all
$(s, a) \in \mathcal{S} \times \mathcal{A}$, but since $\lambda_{s,a}$ is technically not bounded
from above, there is no obvious value for $\theta^{\max}_{s,a}$. However, since any root of
$\sum_{s' \in \mathcal{S}} P^\theta_{s,a,s'} b_{s'} = \beta$ that has
$\lambda_{s,a} > \lambda^u_{s,a}$ cannot be an element of a $\lambda$ that lies in the ambiguity
set, we set $\theta^{\max}_{s,a} = \lambda^u_{s,a}$ for all
$(s, a) \in \mathcal{S} \times \mathcal{A}$. Since this creates a wider range for the potential
roots than for the binomial case, we set
$\tilde{\epsilon} = \frac{\theta^{\max}_{s,a} - \theta^{\min}_{s,a}}{100}$. This ensures that the
same number of intervals are used here as in the binomial case, where we used
$\tilde{\epsilon} = 0.01 = \frac{1 - 0}{100}$.
4.3.3 Times taken
We now present the results of our experiments for Poisson demands. Of the 810 instances run,
we found that LP and CS timed out in 54. PBS timed out in 2, although this was again during
the final step of finding the optimal policy with CS. PBS never timed out during value iteration.
QP and NBS did not time out in any instance, but QP again resulted in convergence issues.
When using QP to solve the Bellman updates, value iteration failed to converge in 384 instances.
Table 2 summarises the times taken to run value iteration. Similar results to the binomial case
can be found here, with PBS being faster than NBS, CS being slightly faster than LP, and QP
being fast per iteration.

Algorithm   Mean Time    Max Time     Mean Time Per Iteration
PBS         0:00:24.44   0:03:59.44   0:00:00.85
CS          0:17:26.57   4:00:00      0:04:20.96
LP          0:23:45.70   4:00:00      0:16:16.30
QP          0:05:37.69   0:58:13.20   0:00:00.38
NBS         0:00:42.01   0:10:52.69   0:00:01.45
Table 2: Summary of times taken to run value iteration (Poisson). Times are in h:mm:ss.
Figure 7 compares the value iteration run times of PBS and NBS more closely. It shows that,
while NBS was slightly faster for small $C$, PBS scales much better with large $C$. For $C = 14$,
PBS typically took no more than 3 minutes to finish value iteration. However, NBS took up to
10 minutes.
Figure 7: Boxplots of value iteration run times of (a) PBS and (b) NBS, by $C$ (Poisson).
While PBS is fast at finding the optimal values, it took much longer to find a policy after
running value iteration with PBS than with NBS. However, as before, this is due to CS being
slow for large instances, and has nothing to do with PBS itself. On average, it took 8 minutes
and 13 seconds for CS to find a policy for PBS's values, as opposed to 0.38 seconds for QP to
compute the policy for NBS. However, the parametric times were skewed by large instances;
the median time to compute the policy for PBS was 0.14 seconds. Figure 8 shows boxplots of
the times taken to find the policy for PBS's and NBS's values. While Figure 8a suggests that
the speeds were similar for PBS's and NBS's values for the majority of instances, Figure 8b shows
the drastic times taken under the parametric model for $M = 10$. As before, this is due to CS's
slowness, not PBS's. It is likely that if PBS were to be used in practice, a heuristic algorithm
could be used in CS's place that would drastically speed up these times.
Figure 8: Boxplots of times taken to compute optimal policy for (a) both BS algorithms and
(b) PBS by $M$ (Poisson).
4.3.4 Comparison of value functions, distributions and policies
We now compare the outputs from the parametric and non-parametric models. Firstly, we
compare the value functions from PBS with those from LP and NBS. Boxplots comparing
these values are shown in Figure 9. As a reminder, these plots show
$\frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \left( v^{\text{PBS}}_s - v^y_s \right)$ for
$y \in \{\text{LP}, \text{NBS}\}$. Figure 9a shows the convergence of LP's values to PBS's as $M$
increases. As for the binomial model, we see that LP's values are always higher, and they grow
closer to PBS's on average as $M$ increases. In addition, Figure 9b shows that PBS's and NBS's
values are similar for small $C$, but NBS's values are typically higher than PBS's for large $C$.
Figure 9: Boxplots of mean difference between $v^{\text{PBS}}$ and (a) $v^{\text{LP}}$ by $M$ and (b) $v^{\text{NBS}}$ by $C$
(Poisson).
In order to confirm that the same property of the ambiguity sets is responsible for this as for
the binomial model, we plot the maximum distances from $\hat{P}_s$ attained by
$P^{\text{PBS}}_s$ and $P^{\text{NBS}}_s$ in Figure 10. This plot again suggests that the
parametric worst-case distributions are much further from $\hat{P}_s$, especially for large $C$.
Finally, we compare the policies from PBS and NBS. Similarly to the binomial model, we
find that PBS's policies were more often deterministic. Specifically, 32% of PBS's policies were
deterministic, whereas only 14% of NBS's were. In addition, we also find that the parametric
policies were slightly less conservative in terms of purchasing, for most states. However, for the
largest 2 states, i.e. $s = 13, 14$, the non-parametric policies were less conservative. NBS's
policies ordered 30% more stock in these final two states than PBS's, whereas PBS's policies
ordered between 5% and 25% more in the lower states.
Figure 10: Boxplots of $\max_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} d_\phi(P^y_{s,a}, \hat{P}_{s,a})$ for (a) $y = \text{PBS}$ and (b) $y = \text{NBS}$ (Poisson).
5 Conclusions and further research
In this paper, we studied robust Markov decision processes under parametric transition distri-
butions. We focused on robust value iteration for s-rectangular ambiguity sets in particular.
Based on a fast projection-based bisection search algorithm found in the literature for robust
MDPs with ϕ-divergence ambiguity sets, we created a projection-based bisection algorithm for
the parametric model in the case where the transition distribution is parametrised by one pa-
rameter. We also presented two other algorithms for solving the robust Bellman update, a linear
programming algorithm and a cutting surface algorithm. These algorithms both discretise the
ambiguity set for the true parameter in order to create a linear programming reformulation of
the robust Bellman update. In addition, we showed how to use maximum likelihood estimation
to create confidence sets for use as ambiguity sets in the parametric model.
In order to test our algorithms, we presented a dynamic multi-period newsvendor model and
applied them to it. In particular, we carried out numerical experiments on the case where the
demands in the newsvendor problem are binomial and Poisson. In both cases, we solved the
non-parametric model in addition to the parametric model, in order to compare run times and
solutions. We found a number of main results. Firstly, our parametric bisection search algorithm
was very fast at finishing value iteration. In fact, it was faster than its non-parametric equivalent
and offered significant time savings in comparison with the linear programming and cutting
surface algorithms. This is due to the fact that the bisection algorithm does not rely on any
discretisation of the ambiguity set, and hence does not need to carry out the large amount of pre-
computation required by the solver-based algorithms. However, since our bisection algorithm
does not return an optimal policy, one of the two solver-based algorithms had to be used to
generate the policy after value iteration ended. This meant that using the parametric model
sometimes resulted in large overall run times. Comparing the solutions from the parametric
and non-parametric models, we found that the non-parametric value functions were typically
higher than the parametric ones on average. This was due to the result that the parametric
confidence sets allowed the corresponding worst-case distribution to be much further from the
nominal distribution than was allowed by the non-parametric set. However, we also found that
the parametric policies were less conservative than the non-parametric, for all states apart from
the largest two.
There are two main directions for future research arising from this paper. The most obvious
direction concerns computing the optimal policy. Since both of our solver-based algorithms
were very slow at this, it would be beneficial to find a faster way to extract the policy
once our bisection algorithm finishes value iteration. Another potential area for further research
concerns discretisation. As is the case in parametric distributionally robust optimisation,
discretisation of the ambiguity set is required to create a tractable linear programming
model that approximates the true problem. This means that the transition matrices for every
parameter in the ambiguity set must be computed prior to building the model, and also that the
resulting model becomes very large for large ambiguity sets. Hence, we would like to study ways
in which to circumvent the need for discretisation and reduce overall solution times.
Acknowledgements
We would like to acknowledge the support of the Engineering and Physical Sciences Research
Council funded (EP/L015692/1) STOR-i Centre for Doctoral Training. We would like to thank
BT for their funding, and Russell Ainslie, Mathias Kern and Gilbert Owusu from BT for their
support.
Bibliography
Agrawal, N. and Smith, S. A. (1996). Estimating negative binomial demand for retail inventory
management with unobservable lost sales. Naval Research Logistics (NRL), 43(6):839–861.
Ahmed, S., Çakmak, U., and Shapiro, A. (2007). Coherent risk measures in inventory problems.
European Journal of Operational Research, 182(1):226–238.
Alfares, H. K. and Elmorra, H. H. (2005). The distribution-free newsboy problem: Extensions
to the shortage penalty case. International Journal of Production Economics, 93-94:465–477.
Proceedings of the Twelfth International Symposium on Inventories.
Arrow, K. J., Harris, T., and Marschak, J. (1951). Optimal inventory policy. Econometrica,
19(3):250–272.
Arrow, K. J., Karlin, S., and Scarf, H. E. (1958). Studies in the mathematical theory of inventory
and production. The Mathematical Gazette, 44(348):156–156.
Bagnell, J. A., Ng, A. Y., and Schneider, J. G. (2001). Solving uncertain Markov decision
processes. Technical Report.
Behzadian, B., Petrik, M., and Ho, C. P. (2021). Fast algorithms for l∞-constrained s-rectangular
robust MDPs. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W.,
editors, Advances in Neural Information Processing Systems, volume 34, pages 25982–25992.
Curran Associates, Inc.
Bellman, R. (1966). Dynamic programming. Science, 153(3731):34–37.
Ben-Tal, A., den Hertog, D., Waegenaere, A. D., Melenberg, B., and Rennen, G. (2013). Robust
solutions of optimization problems affected by uncertain probabilities. Management Science,
59(2):341–357.
Bensoussan, A., Çakanyildirim, M., and Sethi, S. P. (2007). A multiperiod newsvendor problem
with partially observed demand. Mathematics of Operations Research, 32(2):322–344.
Black, B., Ainslie, R., Dokka, T., and Kirkbride, C. (2022). Distributionally robust resource
planning under binomial demand intakes. European Journal of Operational Research. Advance
online publication.
Bouakiz, M. and Sobel, M. J. (1992). Inventory control with an exponential utility criterion.
Operations Research, 40(3):603–608.
Chen, X. A., Wang, Z., and Yuan, H. (2017). Optimal pricing for selling to a static multi-period
newsvendor. Operations Research Letters, 45(5):415–420.
Deng, T., Shen, Z.-J. M., and Shanthikumar, J. G. (2014). Statistical learning of service-
dependent demand in a multiperiod newsvendor setting. Operations Research, 62(5):1064–
1076.
Derman, E., Geist, M., and Mannor, S. (2021). Twice regularized MDPs and the equivalence
between robustness and regularization. Advances in Neural Information Processing Systems,
34:22274–22287.
Gallego, G., Katircioglu, K., and Ramachandran, B. (2007). Inventory management under highly
uncertain demand. Operations Research Letters, 35(3):281–289.
Gallego, G. and Moon, I. (1993). The distribution free newsboy problem: Review and extensions.
The Journal of the Operational Research Society, 44(8):825–834.
Givan, R., Leach, S., and Dean, T. (2000). Bounded-parameter Markov decision processes.
Artificial Intelligence, 122(1):71–109.
Goyal, V. and Grand-Clement, J. (2022). Robust Markov decision processes: Beyond rectangu-
larity. Mathematics of Operations Research. Advance online publication.
Grand-Clément, J. and Kroer, C. (2021). Scalable first-order methods for robust MDPs.
Proceedings of the AAAI Conference on Artificial Intelligence, 35(13):12086–12094.
Han, Q., Du, D., and Zuluaga, L. F. (2014). Technical note—a risk- and ambiguity-averse
extension of the max-min newsvendor order formula. Operations Research, 62(3):535–542.
Ho, C. P., Petrik, M., and Wiesemann, W. (2021). Partial policy iteration for l1-robust Markov
decision processes. Journal of Machine Learning Research, 22:275–1.
Ho, C. P., Petrik, M., and Wiesemann, W. (2022). Robust phi-divergence MDPs. arXiv preprint
arXiv:2205.14202.
Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research,
30(2):257–280.
Kim, G., Wu, K., and Huang, E. (2015). Optimal inventory control in a multi-period newsvendor
problem with non-stationary demand. Advanced Engineering Informatics, 29(1):139–145.
Kogan, K. and Lou, S. (2003). Multi-stage newsboy problem: A dynamic model. European
Journal of Operational Research, 149(2):448–458. Sequencing and Scheduling.
Le Tallec, Y. (2007). Robust, risk-sensitive, and data-driven control of Markov decision processes.
PhD thesis, Massachusetts Institute of Technology.
Lee, C.-M. and Hsu, S.-L. (2011). The effect of advertising on the distribution-free newsboy
problem. International Journal of Production Economics, 129(1):217–224.
Lee, S., Kim, H., and Moon, I. (2021). A data-driven distributionally robust newsvendor model
with a Wasserstein ambiguity set. Journal of the Operational Research Society, 72(8):1879–1897.
Levi, R., Roundy, R. O., and Shmoys, D. B. (2007). Provably near-optimal sampling-based poli-
cies for stochastic inventory control models. Mathematics of Operations Research, 32(4):821–
839.
Levina, T., Levin, Y., McGill, J., Nediak, M., and Vovk, V. (2010). Weak aggregating algorithm
for the distribution-free perishable inventory problem. Operations Research Letters, 38(6):516–
521.
Liu, B., Holmbom, M., Segerstedt, A., and Chen, W. (2015). Effects of carbon emission regu-
lations on remanufacturing decisions with limited information of demand distribution. Inter-
national Journal of Production Research, 53(2):532–548.
Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. (2007). Bias and variance approximation
in value function estimates. Management Science, 53(2):308–322.
Matsuyama, K. (2006). The multi-period newsboy problem. European Journal of Operational
Research, 171(1):170–188.
Mehrotra, S. and Papp, D. (2014). A cutting surface algorithm for semi-infinite convex program-
ming with an application to moment robust optimization. arXiv preprint. arXiv:1306.3437.
Millar, R. B. (2011). Maximum Likelihood Estimation and Inference: With Examples in R, SAS
and ADMB, volume 112 of Statistics in Practice. Wiley, New York, 1st edition.
Moon, I. and Choi, S. (1995). The distribution free newsboy problem with balking. The Journal
of the Operational Research Society, 46(4):537–542.
Nahmias, S. (1994). Demand estimation in lost sales inventory systems. Naval Research Logistics
(NRL), 41(6):739–757.
Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain
transition matrices. Operations Research, 53(5):780–798.
Ouyang, L.-Y. and Chang, H.-C. (2002). A minimax distribution free procedure for mixed
inventory models involving variable lead time with fuzzy lost sales. International Journal of
Production Economics, 76(1):1–12.
Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimension-
ality. John Wiley & Sons.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Program-
ming. John Wiley & Sons.
Rossi, R., Prestwich, S., Tarim, S. A., and Hnich, B. (2014). Confidence-based optimisation for
the newsvendor problem under binomial, Poisson and exponential demand. European Journal
of Operational Research, 239(3):674–684.
Satia, J. K. and Lave, R. E. (1973). Markovian decision processes with uncertain transition
probabilities. Operations Research, 21(3):728–740.
Scarf, H. E. (1957). A Min-Max Solution of an Inventory Problem. RAND Corporation, Santa
Monica, CA.
Siegel, A. F. and Wagner, M. R. (2021). Profit estimation error in the newsvendor model under
a parametric demand distribution. Management Science, 67(8):4863–4879.
Tirinzoni, A., Petrik, M., Chen, X., and Ziebart, B. (2018). Policy-conditioned uncertainty sets
for robust Markov decision processes. In Bengio, S., Wallach, H., Larochelle, H., Grauman,
K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing
Systems, volume 31. Curran Associates, Inc.
Ullah, M., Khan, I., and Sarkar, B. (2019). Dynamic pricing in a multi-period newsvendor under
stochastic price-dependent demand. Mathematics, 7(6):520.
Wang, Z., Glynn, P., and Ye, Y. (2016). Likelihood robust optimization for data-driven problems.
Computational Management Science, 13(2):241–261.
Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathe-
matics of Operations Research, 38(1):153–183.
Zhang, Y., Yang, X., and Li, B. (2017). Distribution-free solutions to the extended multi-period
newsboy problem. Journal of Industrial and Management Optimization, 13(2):633–647.
Appendices
A Derivation of reformulation of robust Bellman update
A.1 General reformulation
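The inner problem of (1) can be written as:
\[
\begin{aligned}
\min_{P_s} \quad & \sum_{a \in \mathcal{A}} \pi_{s,a} \sum_{s' \in \mathcal{S}} P_{s,a,s'} B_{s,a,s'} \\
\text{s.t.} \quad & \sum_{a \in \mathcal{A}} d_a(P_{s,a}, \hat{P}_{s,a}) \le \kappa \\
& \sum_{s' \in \mathcal{S}} P_{s,a,s'} = 1 \quad \forall a \in \mathcal{A},
\end{aligned}
\]
with $B_{s,a} = r_{s,a} + \gamma v$ and $v = v^n$. The Lagrangian of this problem is given by:
\[
\begin{aligned}
L(\boldsymbol{\pi}, \boldsymbol{\nu}, \eta) &= \sum_{a \in \mathcal{A}} \pi_{s,a} \sum_{s' \in \mathcal{S}} P_{s,a,s'} B_{s,a,s'} + \eta \left( \sum_{a \in \mathcal{A}} d_a(P_{s,a}, \hat{P}_{s,a}) - \kappa \right) + \sum_{a \in \mathcal{A}} \nu_a \left( 1 - \sum_{s' \in \mathcal{S}} P_{s,a,s'} \right) \\
&= -\kappa \eta + \nu + \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \left[ \pi_{s,a} P_{s,a,s'} B_{s,a,s'} + \eta \hat{P}_{s,a,s'} \phi\left( \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \right) - \nu_a P_{s,a,s'} \right],
\end{aligned}
\]
with $\nu = \sum_{a \in \mathcal{A}} \nu_a$. Therefore, the objective of the dual of the inner problem is given by:
\[
\begin{aligned}
g(\boldsymbol{\nu}, \eta) &= \inf_{P_s \ge 0} \left\{ -\kappa \eta + \nu + \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \left[ \pi_{s,a} P_{s,a,s'} B_{s,a,s'} + \eta \hat{P}_{s,a,s'} \phi\left( \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \right) - \nu_a P_{s,a,s'} \right] \right\} \\
&= -\kappa \eta + \nu + \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \inf_{P_{s,a,s'} \ge 0} \left\{ \pi_{s,a} P_{s,a,s'} B_{s,a,s'} + \eta \hat{P}_{s,a,s'} \phi\left( \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \right) - \nu_a P_{s,a,s'} \right\} \\
&= -\kappa \eta + \nu + \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \eta \hat{P}_{s,a,s'} \inf_{P_{s,a,s'} \ge 0} \left\{ \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \frac{\pi_{s,a} B_{s,a,s'}}{\eta} + \phi\left( \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \right) - \frac{\nu_a}{\eta} \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \right\} \\
&= -\kappa \eta + \nu + \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \eta \hat{P}_{s,a,s'} \inf_{P_{s,a,s'} \ge 0} \left\{ \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \frac{\pi_{s,a} B_{s,a,s'} - \nu_a}{\eta} + \phi\left( \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \right) \right\} \\
&= -\kappa \eta + \nu - \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \eta \hat{P}_{s,a,s'} \sup_{P_{s,a,s'} \ge 0} \left\{ \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \frac{\nu_a - \pi_{s,a} B_{s,a,s'}}{\eta} - \phi\left( \frac{P_{s,a,s'}}{\hat{P}_{s,a,s'}} \right) \right\} \\
&= -\kappa \eta + \nu - \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \eta \hat{P}_{s,a,s'} \phi^*\left( \frac{\nu_a - \pi_{s,a} B_{s,a,s'}}{\eta} \right).
\end{aligned}
\]
Here, we used $\inf(A) = -\sup(-A)$ for any set $A$ in order to replace inf with sup. Therefore,
the dual of the inner problem is given by:
\[
\max_{\eta, \boldsymbol{\nu}} \left\{ -\kappa \eta + \nu - \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \eta \hat{P}_{s,a,s'} \phi^*\left( \frac{\nu_a - \pi_{s,a} B_{s,a,s'}}{\eta} \right) : \eta \in \mathbb{R}_+, \boldsymbol{\nu} \in \mathbb{R}^A \right\}.
\]
Combining with the outer problem, our reformulation is given by:
\[
\begin{aligned}
\max_{\pi_s, \eta, \boldsymbol{\nu}} \quad & \left\{ -\kappa \eta + \nu - \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \eta \hat{P}_{s,a,s'} \phi^*\left( \frac{\nu_a - \pi_{s,a} B_{s,a,s'}}{\eta} \right) \right\} \\
\text{s.t.} \quad & \sum_{a \in \mathcal{A}} \pi_{s,a} = 1 \\
& \eta \in \mathbb{R}_+, \boldsymbol{\nu} \in \mathbb{R}^A.
\end{aligned}
\]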
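A.2 Reformulation for modified $\chi^2$ divergence
For the modified $\chi^2$ distance, we have:
\[
\phi^*(z) = \max\left\{ 1 + \frac{z}{2}, 0 \right\}^2 - 1.
\]
Hence, we have:
\[
\begin{aligned}
\eta \phi^*\left( \frac{\nu_a - \pi_{s,a} B_{s,a,s'}}{\eta} \right) &= \eta \max\left\{ 1 + \frac{\nu_a - \pi_{s,a} B_{s,a,s'}}{2\eta}, 0 \right\}^2 - \eta \\
&= \frac{1}{4\eta} \max\left\{ 2\eta + \nu_a - \pi_{s,a} B_{s,a,s'}, 0 \right\}^2 - \eta.
\end{aligned}
\]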
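As a sanity check on the conjugate used here, the following sketch compares the closed form of $\phi^*$ with a brute-force evaluation of $\sup_{t \ge 0} \{ zt - \phi(t) \}$. It assumes that the modified $\chi^2$ divergence corresponds to $\phi(t) = (t - 1)^2$, and the grid search (with hypothetical parameters `t_max` and `steps`) is only a numerical illustration, not part of any algorithm in the paper.

```python
def phi(t):
    """Modified chi-squared divergence function, assumed to be (t - 1)^2."""
    return (t - 1.0) ** 2


def phi_conj(z):
    """Closed-form conjugate: max{1 + z/2, 0}^2 - 1."""
    return max(1.0 + z / 2.0, 0.0) ** 2 - 1.0


def phi_conj_numeric(z, t_max=10.0, steps=20000):
    """Brute-force sup over a grid of t in [0, t_max] of z * t - phi(t)."""
    return max(z * t - phi(t) for t in (t_max * k / steps for k in range(steps + 1)))


for z in (-4.0, -2.0, 0.0, 1.0, 3.0):
    print(z, phi_conj(z), phi_conj_numeric(z))
```

For $z \le -2$ the supremum is attained at $t = 0$ and equals $-1$, which is the flat branch of the max operator; for $z > -2$ the maximiser is $t = 1 + z/2$.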
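We can reformulate this using conic quadratic constraints as follows. Firstly, define the dummy
variables $\zeta_{s,a,s'}$ for $a \in \mathcal{A}$ and $s' \in \mathcal{S}$ using the following constraints:
\[
\begin{aligned}
\zeta_{s,a,s'} &\ge 2\eta + \nu_a - \pi_{s,a} B_{s,a,s'} \quad \forall a \in \mathcal{A} \;\forall s' \in \mathcal{S} \\
\zeta_{s,a,s'} &\ge 0 \quad \forall a \in \mathcal{A} \;\forall s' \in \mathcal{S}.
\end{aligned}
\]
Now define dummy variables $u_{s,a,s'}$ $\forall a \in \mathcal{A}$ $\forall s' \in \mathcal{S}$ using:
\[
u_{s,a,s'} \ge \frac{\zeta_{s,a,s'}^2}{\eta},
\]
which is equivalently represented by:
\[
\sqrt{4\zeta_{s,a,s'}^2 + (\eta - u_{s,a,s'})^2} \le \eta + u_{s,a,s'}.
\]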
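The equivalence between the hyperbolic constraint and its conic quadratic form follows from $(\eta + u)^2 - (\eta - u)^2 = 4\eta u$. The sketch below checks it numerically on random points with $\eta > 0$ and $u \ge 0$. The function names, sampling ranges and the boundary margin are assumptions made for this illustration.

```python
import math
import random


def hyperbolic_ok(zeta, eta, u):
    """u >= zeta^2 / eta, written as eta * u >= zeta^2 (valid for eta > 0)."""
    return eta * u >= zeta ** 2


def conic_ok(zeta, eta, u):
    """sqrt(4 * zeta^2 + (eta - u)^2) <= eta + u."""
    return math.hypot(2.0 * zeta, eta - u) <= eta + u


def count_disagreements(n_samples, seed=0, margin=1e-9):
    """Count random points on which the two forms of the constraint disagree."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(n_samples):
        zeta = rng.uniform(0.0, 5.0)
        eta = rng.uniform(0.01, 5.0)
        u = rng.uniform(0.0, 5.0)
        if abs(eta * u - zeta ** 2) < margin:
            continue  # skip points on the boundary, where rounding decides
        if hyperbolic_ok(zeta, eta, u) != conic_ok(zeta, eta, u):
            bad += 1
    return bad


print(count_disagreements(20000))
```

The conic form is the one that conic quadratic solvers accept directly, which is why the hyperbolic constraint is rewritten before the model is built.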
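Then, at optimality we will have
$\eta \phi^*\left( \frac{\nu_a - \pi_{s,a} B_{s,a,s'}}{\eta} \right) = \frac{1}{4} u_{s,a,s'} - \eta$.
Therefore, the CQP reformulation of (1) is given by:
\[
\begin{aligned}
\max_{\pi_s} \quad & \left\{ \nu + \eta (A - \kappa) - \frac{1}{4} \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \hat{P}_{s,a,s'} u_{s,a,s'} \right\} \\
\text{s.t.} \quad & \sqrt{4\zeta_{s,a,s'}^2 + (\eta - u_{s,a,s'})^2} \le \eta + u_{s,a,s'} \quad \forall a \in \mathcal{A} \;\forall s' \in \mathcal{S} \\
& \zeta_{s,a,s'} \ge 2\eta + \nu_a - \pi_{s,a} B_{s,a,s'} \quad \forall a \in \mathcal{A} \;\forall s' \in \mathcal{S} \\
& \zeta_{s,a,s'} \ge 0 \quad \forall a \in \mathcal{A} \;\forall s' \in \mathcal{S} \\
& u_{s,a,s'} \ge 0 \quad \forall a \in \mathcal{A} \;\forall s' \in \mathcal{S} \\
& \sum_{a \in \mathcal{A}} \pi_{s,a} = 1 \\
& \pi_{s,a} \ge 0 \quad \forall a \in \mathcal{A} \\
& \eta \ge 0 \\
& \boldsymbol{\nu} \in \mathbb{R}^A.
\end{aligned}
\]
Note that the term $A\eta$ comes from $\sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \hat{P}_{s,a,s'} \eta = A\eta$.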
B Solving modified $\chi^2$ distance projection problems
Since we focus on the modified $\chi^2$ divergence in this paper, we will describe the method for this
distance only.
B.1 Solution by sorting and subproblems
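The method used by Ho et al. (2022) consists of the following steps. Firstly, use Lagrangian
duality to reformulate the projection problem (6) as:
\[
\begin{aligned}
\max_{\xi, \psi} \quad & -\beta \xi + \psi - \sum_{s' \in \mathcal{S}} \hat{P}_{s,a,s'} \phi^*(-\xi b_{s'} + \psi) \\
\text{s.t.} \quad & \xi \in \mathbb{R}_+, \psi \in \mathbb{R}.
\end{aligned}
\]
Then, recalling that $\phi^*(z) = \max\left\{ 1 + \frac{z}{2}, 0 \right\}^2 - 1$, we wish to
eliminate the max operator in order to make the model tractable. In order to do so, we observe
that at optimality, we necessarily have that $\phi^*(-\xi b_{s'} + \psi) = -1$ holds for exactly
$\hat{S}$ values of $s'$, for some $\hat{S} \in \{0, \dots, S\}$. In order to find the optimal
solution, we can therefore solve the model resulting from enforcing each value of $\hat{S}$
explicitly, and select the solution with the best objective value. In order to do so, w.l.o.g. we
sort the elements of $b$ so that they are non-increasing. Then, for each
$\hat{S} \in \{0, \dots, S\}$ we create a subproblem of the reformulated projection problem by
constraining $\xi, \psi$ to enforce that $\phi^*(-\xi b_{s'} + \psi) = -1$ for each
$s' \in \{1, \dots, \hat{S}\}$. The final term in the objective function becomes:
\[
\begin{aligned}
-\sum_{s' \in \mathcal{S}} \hat{P}_{s,a,s'} \phi^*(-\xi b_{s'} + \psi) &= \sum_{s'=1}^{\hat{S}} \hat{P}_{s,a,s'} - \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} \left( \left( 1 + \frac{-\xi b_{s'} + \psi}{2} \right)^2 - 1 \right) \\
&= \sum_{s'=1}^{\hat{S}} \hat{P}_{s,a,s'} - \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} \left( (-\xi b_{s'} + \psi) + \frac{(-\xi b_{s'} + \psi)^2}{4} \right).
\end{aligned}
\]
Therefore, the subproblem is given by (13)-(16).
\[
\begin{aligned}
\max_{\xi, \psi} \quad & -\beta \xi + \psi + \sum_{s'=1}^{\hat{S}} \hat{P}_{s,a,s'} - \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} \left( (-\xi b_{s'} + \psi) + \frac{(-\xi b_{s'} + \psi)^2}{4} \right) && (13) \\
\text{s.t.} \quad & -\xi b_{s'} + \psi \le -2 \quad \forall s' \in \{1, \dots, \hat{S}\} && (14) \\
& -\xi b_{s'} + \psi \ge -2 \quad \forall s' \in \{\hat{S} + 1, \dots, S\} && (15) \\
& \xi \in \mathbb{R}_+, \psi \in \mathbb{R}. && (16)
\end{aligned}
\]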
Note that for $\hat{S} = 0$, constraint (14) is redundant and can be removed. Similarly, for $\hat{S} = S$, constraint (15) can be removed. Given this formulation, the solution of the subproblem is obtained from solving at most 3 problems, each with an analytical solution. By Ho et al. (2022), for a fixed $\hat{S}$ and $\xi$, the solution of this subproblem in $\psi$ is given by:
\[
\psi^* =
\begin{cases}
-2 + \xi b_{\hat{S}+1} & \text{if } H(\xi) \le -2 + \xi b_{\hat{S}+1} \\
-2 + \xi b_{\hat{S}} & \text{if } H(\xi) \ge -2 + \xi b_{\hat{S}} \\
H(\xi) & \text{otherwise,}
\end{cases}
\tag{17}
\]
where
\[
H(\xi) = \frac{2 \sum_{s'=1}^{\hat{S}} \hat{P}_{s,a,s'} + \xi \sum_{s'=\hat{S}+1}^{S} b_{s'} \hat{P}_{s,a,s'}}{\sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'}}.
\]
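Rule (17) amounts to clipping the unconstrained maximiser $H(\xi)$ to the interval $[-2 + \xi b_{\hat{S}+1},\, -2 + \xi b_{\hat{S}}]$ implied by (14) and (15). A small sketch of this, with hypothetical function names and data, 0-indexed states, and b assumed sorted in non-increasing order:

```python
def h_value(xi, s_hat, b, p_hat):
    """H(xi): unconstrained maximiser in psi; requires s_hat < len(b)."""
    head = sum(p_hat[:s_hat])
    tail = sum(p_hat[s_hat:])
    weighted = sum(bi * p for bi, p in zip(b[s_hat:], p_hat[s_hat:]))
    return (2.0 * head + xi * weighted) / tail


def psi_star(xi, s_hat, b, p_hat):
    """Optimal psi from (17) for fixed xi and s_hat, with 1 <= s_hat <= len(b) - 1."""
    lower = -2.0 + xi * b[s_hat]      # -2 + xi * b_{s_hat + 1} in the 1-indexed notation
    upper = -2.0 + xi * b[s_hat - 1]  # -2 + xi * b_{s_hat}
    h = h_value(xi, s_hat, b, p_hat)
    if h <= lower:
        return lower
    if h >= upper:
        return upper
    return h


# Hypothetical data.
b = [5.0, 3.0, 2.0, 1.0]
p_hat = [0.1, 0.2, 0.3, 0.4]
psi_clipped = psi_star(1.0, 2, b, p_hat)   # H(1) = 1.6 / 0.7 exceeds the upper bound 1
psi_interior = psi_star(3.0, 2, b, p_hat)  # H(3) = 3.6 / 0.7 lies inside [4, 7]
```

The border cases $\hat{S} = 0$ and $\hat{S} = S$ are deliberately excluded here, since one of the two bounds does not exist for them.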
For some border cases, we do not need to solve the problem in all 3 of these cases. In particular, we have the following special cases:
1. $\hat{S} = 0$. In this case, the second case is not defined as $b_{\hat{S}}$ does not exist.
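2. $\hat{S} = S - 1$. In this case we have:
\[
\begin{aligned}
H(\xi) &= \frac{2 \sum_{s'=1}^{S-1} \hat{P}_{s,a,s'} + \xi b_S \hat{P}_{s,a,S}}{\hat{P}_{s,a,S}} \\
&= \frac{2(1 - \hat{P}_{s,a,S})}{\hat{P}_{s,a,S}} + b_S \xi \\
&= \frac{2}{\hat{P}_{s,a,S}} + (-2 + b_S \xi) \\
&> -2 + \xi b_S \\
&= -2 + \xi b_{\hat{S}+1}.
\end{aligned}
\]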
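Hence, the first case in (17) is impossible. In addition, for $\hat{S} = S - 1$ the problem becomes:
\[
\begin{aligned}
\max_{\xi, \psi} \quad & -\beta\xi + \psi + \sum_{s'=1}^{S-1} \hat{P}_{s,a,s'} - \hat{P}_{s,a,S} \left( (-\xi b_S + \psi) + \frac{(-\xi b_S + \psi)^2}{4} \right) \\
\text{s.t.} \quad & -\xi b_{s'} + \psi \le -2 \quad \forall s' \in \{1, \dots, S-1\} \\
& -\xi b_S + \psi \ge -2 \\
& \xi \in \mathbb{R}_+,\ \psi \in \mathbb{R}.
\end{aligned}
\]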
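In the third case of (17), we have $\psi = \frac{2}{\hat{P}_{s,a,S}} + (-2 + b_S \xi)$ and so, up to the additive constant $\sum_{s'=1}^{S-1} \hat{P}_{s,a,s'}$, the objective function is given by:
\[
-\beta\xi + \frac{2}{\hat{P}_{s,a,S}} + (-2 + b_S \xi) - \hat{P}_{s,a,S} \left( \left( \frac{2}{\hat{P}_{s,a,S}} - 2 \right) + \frac{1}{4} \left( \frac{2}{\hat{P}_{s,a,S}} - 2 \right)^2 \right).
\]
The only terms that depend on $\xi$ are $(b_S - \beta)\xi$, so the derivative of the objective function with respect to $\xi$ is $b_S - \beta \le 0$, since $\beta \ge \min b = b_S$. Hence, $\xi$ should be set at zero if it is unconstrained.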
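3. $\hat{S} = S$. In this case, the problem becomes (18)-(20):
\begin{align}
\max_{\xi, \psi} \quad & -\beta\xi + \psi + 1 \tag{18} \\
\text{s.t.} \quad & -\xi b_{s'} + \psi \le -2 \quad \forall s' \in \{1, \dots, S\} \tag{19} \\
& \xi \in \mathbb{R}_+,\ \psi \in \mathbb{R}. \tag{20}
\end{align}
Constraint (19) implies that $\psi \le \xi b_S - 2$. Since the objective is increasing in $\psi$, this means $\psi^* = -2 + \xi b_S$. Hence, the second case in (17) is guaranteed. Furthermore, substituting $\psi^*$ gives the objective $\max_{\xi} (b_S - \beta)\xi - 1$. Since the assumption made by Ho et al. (2022) is that $\min b \le \beta$ and $b_S = \min b$, the objective is non-increasing in $\xi$ and so an optimal solution is $(\xi^*, \psi^*) = (0, -2)$. The optimal objective value is $-1$.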
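Now, in each case defined by (17), the problem can be reformulated as a univariate program with one constraint. In the first case, Ho et al. (2022) show that the model becomes:
\[
\begin{aligned}
\max_{\xi} \quad & -\beta\xi + \xi b_{\hat{S}+1} - 2 + \sum_{s'=1}^{\hat{S}} \hat{P}_{s,a,s'} - \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} \left( (-\xi b_{s'} + \xi b_{\hat{S}+1} - 2) + \frac{(-\xi b_{s'} + \xi b_{\hat{S}+1} - 2)^2}{4} \right) \\
\text{s.t.} \quad & \xi \ge 2 \left( \sum_{s'=\hat{S}+1}^{S} (b_{\hat{S}+1} - b_{s'}) \hat{P}_{s,a,s'} \right)^{-1}.
\end{aligned}
\]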
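Differentiating the objective function, we find that its derivative is given by:
\[
-\beta + b_{\hat{S}+1} - \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} \left( (b_{\hat{S}+1} - b_{s'}) + \frac{1}{2} (b_{\hat{S}+1} - b_{s'})(-\xi b_{s'} + \xi b_{\hat{S}+1} - 2) \right).
\]
Since $(b_{\hat{S}+1} - b_{s'}) + \frac{1}{2}(b_{\hat{S}+1} - b_{s'})\left(\xi (b_{\hat{S}+1} - b_{s'}) - 2\right) = \frac{\xi}{2}(b_{\hat{S}+1} - b_{s'})^2$, this can be written as:
\[
-\beta + b_{\hat{S}+1} - \frac{\xi}{2} \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} (b_{\hat{S}+1} - b_{s'})^2,
\]
which means the globally optimal $\xi$ is given by:
\[
\xi_1^* = \frac{2(-\beta + b_{\hat{S}+1})}{\sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} (b_{\hat{S}+1} - b_{s'})^2}.
\]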
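In the second case, the same steps apply with $b_{\hat{S}+1}$ replaced by $b_{\hat{S}}$. The model is therefore:
\[
\begin{aligned}
\max_{\xi} \quad & -\beta\xi + \xi b_{\hat{S}} - 2 + \sum_{s'=1}^{\hat{S}} \hat{P}_{s,a,s'} - \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} \left( (-\xi b_{s'} + \xi b_{\hat{S}} - 2) + \frac{(-\xi b_{s'} + \xi b_{\hat{S}} - 2)^2}{4} \right) \\
\text{s.t.} \quad & \xi \le 2 \left( \sum_{s'=\hat{S}+1}^{S} (b_{\hat{S}} - b_{s'}) \hat{P}_{s,a,s'} \right)^{-1}.
\end{aligned}
\]
The derivative of this objective is $-\beta + b_{\hat{S}} - \frac{\xi}{2} \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} (b_{\hat{S}} - b_{s'})^2$, so the corresponding globally optimal solution is given by:
\[
\xi_2^* = \frac{2(-\beta + b_{\hat{S}})}{\sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} (b_{\hat{S}} - b_{s'})^2}.
\]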
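In the final case, we note that:
\[
H'(\xi) = \frac{\sum_{s'=\hat{S}+1}^{S} b_{s'} \hat{P}_{s,a,s'}}{\sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'}}.
\]
The model then becomes:
\[
\begin{aligned}
\max_{\xi} \quad & -\beta\xi + H(\xi) + \sum_{s'=1}^{\hat{S}} \hat{P}_{s,a,s'} - \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} \left( (-\xi b_{s'} + H(\xi)) + \frac{(-\xi b_{s'} + H(\xi))^2}{4} \right) \\
\text{s.t.} \quad & \xi \le 2 \left( \sum_{s'=\hat{S}+1}^{S} (b_{\hat{S}+1} - b_{s'}) \hat{P}_{s,a,s'} \right)^{-1} \\
& \xi \ge 2 \left( \sum_{s'=\hat{S}+1}^{S} (b_{\hat{S}} - b_{s'}) \hat{P}_{s,a,s'} \right)^{-1}.
\end{aligned}
\]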
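The derivative of the objective is given by:
\[
-\beta + H'(\xi) - \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} \left( (H'(\xi) - b_{s'}) + \frac{1}{2} (H'(\xi) - b_{s'})(-\xi b_{s'} + H(\xi)) \right).
\]
Since $H$ is affine in $\xi$ and $\sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} (H'(\xi) - b_{s'}) = 0$, the terms that do not depend on $\xi$ cancel inside the sum and the derivative reduces to $-\beta + H'(\xi) - \frac{\xi}{2} \sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} (H'(\xi) - b_{s'})^2$. From the same steps as for the first case, this leads to:
\[
\xi_3^* = \frac{2(-\beta + H'(\xi))}{\sum_{s'=\hat{S}+1}^{S} \hat{P}_{s,a,s'} (H'(\xi) - b_{s'})^2}.
\]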
Then solving the problem in each case corresponds to checking if the optimal $\xi$ lies within the allowed range, and selecting one of the bounds if it does not.
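The overall enumeration over $\hat{S}$ can be sketched as follows. This is an illustrative stand-in rather than the closed-form procedure of Ho et al. (2022): for each $\hat{S}$ it sets $\psi$ by clipping $H(\xi)$ as in (17) and then maximises over $\xi$ with a ternary search, which is valid because the resulting function of $\xi$ is concave. All function names, the search bound `xi_max`, and the data are assumptions made for the example.

```python
def phi_conj(z):
    """Conjugate for the modified chi-squared divergence."""
    return max(1.0 + z / 2.0, 0.0) ** 2 - 1.0


def dual_objective(xi, psi, b, p_hat, beta):
    return -beta * xi + psi - sum(p * phi_conj(-xi * bi + psi) for bi, p in zip(b, p_hat))


def best_psi(xi, s_hat, b, p_hat):
    """Optimal psi of subproblem s_hat for fixed xi: H(xi) clipped to the bounds in (17)."""
    n = len(b)
    upper = -2.0 + xi * b[s_hat - 1] if s_hat > 0 else float("inf")
    if s_hat == n:  # border case: psi sits at its upper bound
        return upper
    lower = -2.0 + xi * b[s_hat]
    tail = sum(p_hat[s_hat:])
    h = (2.0 * sum(p_hat[:s_hat])
         + xi * sum(bi * p for bi, p in zip(b[s_hat:], p_hat[s_hat:]))) / tail
    return min(max(h, lower), upper)


def solve_subproblem(s_hat, b, p_hat, beta, xi_max=100.0, iters=200):
    """Ternary search over xi in [0, xi_max]; xi_max is an assumed search bound."""
    def value(xi):
        # On the subproblem's feasible set, (13) coincides with the dual objective.
        return dual_objective(xi, best_psi(xi, s_hat, b, p_hat), b, p_hat, beta)

    lo, hi = 0.0, xi_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if value(m1) < value(m2):
            lo = m1
        else:
            hi = m2
    xi = 0.5 * (lo + hi)
    return value(xi), xi, best_psi(xi, s_hat, b, p_hat)


def solve_projection_dual(b, p_hat, beta):
    """Sort b non-increasingly, solve every subproblem, and keep the best one."""
    order = sorted(range(len(b)), key=lambda i: -b[i])
    b_sorted = [b[i] for i in order]
    p_sorted = [p_hat[i] for i in order]
    return max(solve_subproblem(s_hat, b_sorted, p_sorted, beta)
               for s_hat in range(len(b) + 1))


# Hypothetical data, deliberately unsorted.
b = [2.0, 5.0, 1.0, 3.0]
p_hat = [0.3, 0.1, 0.4, 0.2]
value, xi_opt, psi_opt = solve_projection_dual(b, p_hat, beta=2.0)
```

For this data the nominal mean of b is 2.1 and its variance is 1.49, and no probability is driven to zero, so the optimum can be checked by hand: it is attained for $\hat{S} = 0$ with $\xi = 0.2 / 1.49$ and objective value $0.01 / 1.49$.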
B.2 Reformulation of projection problem
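As shown by Ho et al. (2022), a general projection problem can be reformulated as:
\[
\begin{aligned}
\max_{\psi, \xi} \quad & -\beta\xi + \psi - \sum_{s' \in \mathcal{S}} \hat{P}_{s,a,s'} \phi^*(-\xi b_{s'} + \psi) \\
\text{s.t.} \quad & \xi \in \mathbb{R}_+,\ \psi \in \mathbb{R}.
\end{aligned}
\]
For the modified $\chi^2$ distance, we have $\phi^*(z) = \max\{1 + z/2, 0\}^2 - 1$, or equivalently $\phi^*(z) = \frac{1}{4} \max\{z + 2, 0\}^2 - 1$. Hence, we can represent $\phi^*(-\xi b_{s'} + \psi)$ via:
\[
\begin{aligned}
\zeta_{s'} &\ge -\xi b_{s'} + \psi + 2 && \forall s' \in \mathcal{S} \\
\zeta_{s'} &\ge 0 && \forall s' \in \mathcal{S} \\
u_{s'} &\ge \tfrac{1}{4} \zeta_{s'}^2 && \forall s' \in \mathcal{S}.
\end{aligned}
\]
Then, the model becomes:
\[
\begin{aligned}
\max_{\xi, \psi, \zeta, u} \quad & -\beta\xi + \psi - \sum_{s' \in \mathcal{S}} \hat{P}_{s,a,s'} (u_{s'} - 1) \\
\text{s.t.} \quad & \zeta_{s'} \ge -\xi b_{s'} + \psi + 2 && \forall s' \in \mathcal{S} \\
& u_{s'} \ge \tfrac{1}{4} \zeta_{s'}^2 && \forall s' \in \mathcal{S} \\
& \zeta_{s'} \ge 0 && \forall s' \in \mathcal{S} \\
& \xi \in \mathbb{R}_+,\ \psi \in \mathbb{R}.
\end{aligned}
\]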
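The reformulation is exact because the objective penalises each $u_{s'}$ with a non-negative weight, so at optimality $u_{s'}$ and $\zeta_{s'}$ are pushed down to their smallest feasible values. A short sketch (hypothetical function names and sample points) that checks the two expressions for $\phi^*$ agree and that the tightest feasible pair $(\zeta, u)$ reproduces $\phi^*(z) = u - 1$:

```python
def phi_conj(z):
    """First form: max{1 + z/2, 0}^2 - 1."""
    return max(1.0 + z / 2.0, 0.0) ** 2 - 1.0


def phi_conj_alt(z):
    """Equivalent form: (1/4) max{z + 2, 0}^2 - 1."""
    return 0.25 * max(z + 2.0, 0.0) ** 2 - 1.0


def tightest_epigraph_point(z):
    """Smallest (zeta, u) with zeta >= z + 2, zeta >= 0 and u >= zeta^2 / 4."""
    zeta = max(z + 2.0, 0.0)
    u = 0.25 * zeta ** 2
    return zeta, u


# z plays the role of -xi * b_{s'} + psi; the sample points are arbitrary.
samples = [-5.0, -2.0, -1.0, 0.0, 0.5, 3.0]
form_gap = max(abs(phi_conj(z) - phi_conj_alt(z)) for z in samples)
epigraph_gap = max(abs(phi_conj(z) - (tightest_epigraph_point(z)[1] - 1.0)) for z in samples)
```

Both gaps are zero up to rounding, which is the property the conic model relies on.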
C A newsvendor model incorporating backorder costs
Suppose that action $a$ is taken when in state $s$ and assume that $b'$ now represents a backorder cost per unit of unmet demand. For a given realisation $x$ of the demand random variable $X_{s,a}$, we define the one-period reward incorporating backorder costs as:
\[
r'_{s,a,x} = c \min\{x, \bar{s}\} - wa - h(\bar{s} - \min\{x, \bar{s}\}) - b' \max\{x - \bar{s}, 0\}.
\]
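As a small illustration of this reward, the sketch below evaluates $r'_{s,a,x}$ for one shortage and one surplus realisation. The function name and all numerical values are hypothetical, and $\bar{s}$ is simply passed in as an input rather than derived from $s$ and $a$.

```python
def backorder_reward(x, s_bar, a, c, w, h, b_prime):
    """One-period reward r'_{s,a,x} with backorder cost b_prime per unit of unmet demand."""
    sales = min(x, s_bar)
    return c * sales - w * a - h * (s_bar - sales) - b_prime * max(x - s_bar, 0)


# Hypothetical parameter values.
reward_shortage = backorder_reward(x=6, s_bar=4, a=3, c=5.0, w=1.0, h=0.5, b_prime=2.0)
reward_surplus = backorder_reward(x=2, s_bar=4, a=3, c=5.0, w=1.0, h=0.5, b_prime=2.0)
```

With demand 6 against $\bar{s} = 4$ the backorder term applies to two units (reward 13), whereas with demand 2 only the holding term applies to the two leftover units (reward 6).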
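In addition, let $P'_{s,a} = (P'_{s,a,x})_{x \in \mathcal{X}_{s,a}}$ represent a (non-parametric) candidate for the distribution of $X_{s,a}$. We can then formulate the non-parametric robust Bellman update as:
\[
v^{n+1}_s = \max_{\pi_s \in \Delta_A} \min_{P'_s \in \mathcal{P}'_s} \sum_{a \in \mathcal{A}} \pi_{s,a} \sum_{x \in \mathcal{X}_{s,a}} P'_{s,a,x} \left( r'_{s,a,x} + \gamma v^n_{g(x|s,a)} \right) \quad \forall s \in \mathcal{S}, \tag{21}
\]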
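where $\mathcal{P}'_s$ is an ambiguity set for the true distribution of $X_{s,a}$ (not the true transition distribution). This set can be defined using $\phi$-divergences as follows:
\[
\mathcal{P}'_s = \left\{ P'_s \in \Delta_{|\mathcal{X}_{s,1}|} \times \dots \times \Delta_{|\mathcal{X}_{s,A}|} : \sum_{a \in \mathcal{A}} d_a(P'_{s,a}, \hat{P}'_{s,a}) \le \kappa \right\},
\]
where $\hat{P}'_{s,a} = \left( f_{X_{s,a}}(x \mid \hat{\theta}) \right)_{x \in \mathcal{X}_{s,a}}$, for example. Similarly, we can formulate the parametric update problem as:
\[
v^{n+1}_s = \max_{\pi_s \in \Delta_A} \min_{\theta_s \in \Theta_s} \sum_{a \in \mathcal{A}} \pi_{s,a} \sum_{x \in \mathcal{X}_{s,a}} f_{X_{s,a}}(x \mid \theta_{s,a}) \left( r'_{s,a,x} + \gamma v^n_{g(x|s,a)} \right) \quad \forall s \in \mathcal{S}.
\]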
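In these formulations, we could simplify the terms relating to backorder costs as follows:
\begin{align}
\sum_{x \in \mathcal{X}_{s,a}} P'_{s,a,x} \max\{x - \bar{s}, 0\} &= \sum_{x=\bar{s}+1}^{|\mathcal{X}_{s,a}|} P'_{s,a,x} (x - \bar{s}) \tag{22} \\
\sum_{x \in \mathcal{X}_{s,a}} f_{X_{s,a}}(x \mid \theta_{s,a}) \max\{x - \bar{s}, 0\} &= \sum_{x=\bar{s}+1}^{|\mathcal{X}_{s,a}|} f_{X_{s,a}}(x \mid \theta_{s,a}) (x - \bar{s}). \notag
\end{align}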
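If we have infinite support demands, i.e. $|\mathcal{X}_{s,a}| = \infty$, then this implies that an infinite number of decision variables are required for the non-parametric model. This means that a completely different treatment is required. In many cases, however, the parametric expression can be further simplified. For example, if the demand random variable is $X_{s,a} \sim \text{Pois}(\lambda_{s,a})$, then we have:
\[
\begin{aligned}
\sum_{x=\bar{s}}^{|\mathcal{X}_{s,a}|} f_{X_{s,a}}(x \mid \theta_{s,a})(x - \bar{s})
&= \sum_{x=\bar{s}+1}^{\infty} \frac{\lambda_{s,a}^{x} \exp(-\lambda_{s,a})}{x!} (x - \bar{s}) \\
&= \lambda_{s,a} \sum_{x=\bar{s}+1}^{\infty} \frac{\lambda_{s,a}^{x-1} \exp(-\lambda_{s,a})}{(x-1)!} - \bar{s} \left( 1 - \sum_{x=0}^{\bar{s}} \frac{\lambda_{s,a}^{x} \exp(-\lambda_{s,a})}{x!} \right) \\
&= \lambda_{s,a} \sum_{x=\bar{s}}^{\infty} \frac{\lambda_{s,a}^{x} \exp(-\lambda_{s,a})}{x!} - \bar{s} \left( 1 - F_{X_{s,a}}(\bar{s} \mid \lambda_{s,a}) \right) \\
&= \lambda_{s,a} \left( 1 - F_{X_{s,a}}(\bar{s} - 1 \mid \lambda_{s,a}) \right) - \bar{s} \left( 1 - F_{X_{s,a}}(\bar{s} \mid \lambda_{s,a}) \right),
\end{aligned}
\]
which only involves finite sums. Without incorporating further information on the true distribution of $X_{s,a}$ such as its moments, the expression in (22) cannot be simplified further. The infinite number of variables required means that the algorithms in this paper are not applicable to the robust Bellman update problem in (21).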
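The Poisson simplification above can be verified numerically. The sketch below compares the closed form with a long truncated sum; the function names, the truncation point `x_max`, and the parameter values are assumptions chosen for illustration.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability mass, computed in log space to avoid overflow."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def poisson_cdf(k, lam):
    """P(X <= k); an empty sum (zero) when k < 0."""
    return sum(poisson_pmf(x, lam) for x in range(k + 1))


def expected_shortage_closed_form(s_bar, lam):
    """lam * (1 - F(s_bar - 1)) - s_bar * (1 - F(s_bar)), using finite sums only."""
    return lam * (1.0 - poisson_cdf(s_bar - 1, lam)) - s_bar * (1.0 - poisson_cdf(s_bar, lam))


def expected_shortage_truncated(s_bar, lam, x_max=200):
    """Direct evaluation of sum_{x > s_bar} f(x) (x - s_bar), truncated at x_max."""
    return sum(poisson_pmf(x, lam) * (x - s_bar) for x in range(s_bar + 1, x_max + 1))


# Hypothetical values: s_bar = 3 and lambda = 4.
closed = expected_shortage_closed_form(3, 4.0)
truncated = expected_shortage_truncated(3, 4.0)
```

The two values agree to within the truncation error, and for $\bar{s} = 0$ the closed form reduces to the mean $\lambda_{s,a}$, as expected.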