Dominance Criteria and Solution Sets
for the Expected Scalarised Returns
Conor F. Hayes
School of Computer Science
National University of Ireland Galway
Ireland
c.hayes13@nuigalway.ie
Timothy Verstraeten
AI Lab
Vrije Universiteit Brussel
Belgium
timothy.verstraeten@vub.ac.be
Diederik M. Roijers
AI Lab, Vrije Universiteit Brussel (BE)
& Microsystems Technology,
HU Univ. of Appl. Sci. Utrecht (NL)
diederik.yamamoto-roijers@hu.nl
Enda Howley
School of Computer Science
National University of Ireland Galway
Ireland
enda.howley@nuigalway.ie
Patrick Mannion
School of Computer Science
National University of Ireland Galway
Ireland
patrick.mannion@nuigalway.ie
ABSTRACT
In many real-world scenarios, the utility of a user is derived from the single execution of a policy. In this case, to apply multi-objective reinforcement learning, the expected utility of the returns must be optimised. Various scenarios exist where a user's preferences over objectives (also known as the utility function) are unknown or difficult to specify. In such scenarios, a set of optimal policies must be learned. However, settings where the expected utility must be maximised have been largely overlooked by the multi-objective reinforcement learning community and, as a consequence, a set of optimal solutions has yet to be defined. In this paper we address this challenge by proposing first-order stochastic dominance as a criterion to build solution sets to maximise expected utility. We also propose a new dominance criterion, known as expected scalarised returns (ESR) dominance, that extends first-order stochastic dominance to allow a set of optimal policies to be learned in practice. Finally, we define a new solution concept called the ESR set, which is a set of policies that are ESR dominant.
KEYWORDS
multi-objective; decision making; distributional; reinforcement learning; stochastic dominance
1 INTRODUCTION
In multi-objective reinforcement learning (MORL), there are two classes of algorithms: single-policy and multi-policy [26, 31]. Each MORL algorithm has two phases: the learning phase and the execution phase [26]. When using single-policy methods, an agent learns a single optimal policy that maximises a user's utility function, where a user's preferences over objectives are represented by a utility function. The agent then executes the optimal policy during the execution phase. Single-policy methods require the utility function of a user to be known during the learning phase. In certain scenarios a user's preferences over objectives may be unknown; therefore, the utility function is unknown. In this case, a user is said to be in the unknown utility function or unknown weights scenario [13, 26]. In the unknown utility function scenario, multi-policy methods must be used to learn a set of optimal policies during the learning phase. We assume that the utility function of the user will become known during the execution phase. Once the utility function of the user is known, it is possible to select a policy, from the set of learned policies, that will maximise the user's utility function.
In contrast to single-objective reinforcement learning (RL), multiple optimality criteria exist for MORL [26]. In scenarios where the utility of the user is derived from multiple executions of a policy, the scalarised expected returns (SER) must be optimised. However, in scenarios where the utility of a user is derived from a single execution of a policy, the expected utility of the returns (or expected scalarised returns, ESR) must be optimised. The majority of MORL research focuses on the SER criterion and linear utility functions [22], which limits the applicability of MORL to real-world problems. In the real world, a user's utility function may be derived in a linear or non-linear manner. For known linear utility functions, single-objective methods can be used to learn an optimal policy [26]. Non-linear utility functions do not distribute across the sums of the immediate and future returns, which invalidates the Bellman equation [25]. Therefore, to learn optimal policies for non-linear utility functions, strictly multi-objective methods must be used.
For non-linear utility functions, a user can prefer significantly different policies depending on whether the SER or ESR criterion is optimised [22, 23]. Unfortunately, the ESR criterion has received very little attention, to date, in the MORL community. To learn optimal policies in many real-world scenarios where a policy will be executed only once, the ESR criterion must be optimised. For example, in a medical setting where a user has one opportunity to select a treatment, a user will want to maximise the expected utility of a single outcome. However, choosing the wrong optimisation criterion (SER) for such a scenario could potentially lead to a different policy than that which would be learned under ESR. In the real world, like in the aforementioned scenario, learning a sub-optimal policy could have catastrophic outcomes.
Therefore, it is crucial that the MORL community focuses on developing both single-policy and multi-policy methods that can learn optimal policies under the ESR criterion. Recently, a number of single-policy methods have been implemented that can learn optimal policies under the ESR criterion [12, 25]. Based on the findings of Hayes et al. [11, 12], a distribution over the expected utility of the returns must be used to learn an optimal policy under the ESR criterion in realistic settings where rewards are stochastic¹. Traditionally, a single expected value of the returns is used to make decisions. However, the expected value cannot account for the range of positive or adverse effects a decision might have [12]. In the current MORL literature, no multi-policy methods exist for the ESR criterion. In fact, a set of optimal policies for the ESR criterion has yet to be defined.

¹ We note that distributional methods also work well for simple problems with deterministic rewards. In such cases, the value distribution only has a single value vector per state-action pair that occurs with probability 1.0.
This paper aims to fill the aforementioned research gaps that exist for the ESR criterion. Due to the lack of existing research for the ESR criterion, a formal definition of the requirements to satisfy the ESR criterion has yet to be given. In Section 3, we define the requirements necessary to satisfy the ESR criterion. The applicability of MORL to many real-world scenarios under the ESR criterion is limited because no solution set has been defined for scenarios when a user's utility function is unknown. In Section 4, we show how first-order stochastic dominance (FSD) can be used to define sets of optimal policies under the ESR criterion. However, when using FSD in practice with an unknown utility function, determining a set of optimal policies is difficult because FSD relies on having access to a utility function. We address this challenge in Section 5 and expand first-order stochastic dominance to define a new dominance criterion, called expected scalarised returns (ESR) dominance. This work proposes that ESR dominance can be used to learn a set of optimal solutions, which we define as the ESR set.
2 BACKGROUND
2.1 Multi-Objective Reinforcement Learning
In multi-objective reinforcement learning, we deal with decision-making problems with multiple objectives, often modelled as a multi-objective Markov decision process (MOMDP). An MOMDP is represented by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma, \mathbf{R})$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $\mathcal{T} \colon \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is a probabilistic transition function, $\gamma$ is a discount factor determining the importance of future rewards, and $\mathbf{R} \colon \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^n$ is an $n$-dimensional vector-valued immediate reward function. In multi-objective reinforcement learning, $n > 1$.
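The tuple above can be written down directly as a container type. The following is a minimal Python sketch under assumed names (MOMDP, transition, reward) and with an invented two-state, two-objective example; it is illustrative only and not an implementation from this paper.

```python
# A minimal sketch of the MOMDP tuple M = (S, A, T, gamma, R).
# Field names and the toy two-objective example are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable
import numpy as np


@dataclass
class MOMDP:
    n_states: int                  # |S|
    n_actions: int                 # |A|
    transition: np.ndarray         # T[s, a, s'] in [0, 1]
    gamma: float                   # discount factor
    reward: Callable[[int, int, int], np.ndarray]  # R(s, a, s') -> n-dim vector
    n_objectives: int              # n > 1 for multi-objective RL


# Toy example: 2 states, 2 actions, 2 objectives.
T = np.full((2, 2, 2), 0.5)        # every transition is equally likely

momdp = MOMDP(
    n_states=2,
    n_actions=2,
    transition=T,
    gamma=0.95,
    reward=lambda s, a, s_next: np.array([float(s_next), 1.0 - float(s_next)]),
    n_objectives=2,
)
print(momdp.reward(0, 1, 1))       # vector-valued immediate reward, e.g. [1. 0.]
```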
2.2 Utility Functions
In MORL, utility functions are used to model a user's preferences, and are used in both single-objective and multi-objective RL. A utility function maps returns to a scalar value that represents the user's preferences over the returns:

$$u \colon \mathbb{R}^n \to \mathbb{R}, \tag{1}$$

where $u$ is a utility function and $\mathbb{R}^n$ is the space of $n$-dimensional return vectors. Linear utility functions are widely used to represent a user's preferences:

$$u = \sum_{i=1}^{n} w_i r_i, \tag{2}$$

where $w_i$ is the preference weight and $r_i$ is the value at position $i$ of the return vector. However, certain scenarios exist where linear utility functions cannot accurately represent a user's preferences. In this case, the user's preferences must be represented using a non-linear utility function.
In this paper, we consider monotonically increasing utility functions [26], i.e.,

$$\left(\forall i,\; V_i^\pi \geq V_i^{\pi'} \;\wedge\; \exists i,\; V_i^\pi > V_i^{\pi'}\right) \implies \left(\forall u,\; u(\mathbf{V}^\pi) > u(\mathbf{V}^{\pi'})\right), \tag{3}$$

where $\mathbf{V}^\pi$ and $\mathbf{V}^{\pi'}$ are the values of executing policies $\pi$ and $\pi'$ respectively. The class of monotonically increasing utility functions includes linear utility functions of the form in Equation 2. It is important to note that in certain scenarios the utility function may be unknown; therefore, we do not know the shape of the utility function. If we assume the utility function is monotonically increasing, we know that, if the value of one of the objectives in the return vector increases, then the utility also increases [26]. This assumption makes it possible to determine an ordering over policies when the shape of the utility function is unknown.
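As a small illustration of Equation 2 and of a monotonically increasing non-linear alternative, the sketch below evaluates both on two return vectors. The weights and the quadratic utility are assumptions chosen for illustration (the quadratic form reappears later in Equation 6).

```python
import numpy as np

def linear_utility(r: np.ndarray, w: np.ndarray) -> float:
    """Linear utility u = sum_i w_i * r_i (Equation 2)."""
    return float(np.dot(w, r))

def nonlinear_utility(r: np.ndarray) -> float:
    """A non-linear utility, monotonically increasing for r >= 0,
    e.g. u(x) = x_1^2 + x_2^2 (the form used later in Equation 6)."""
    return float(r[0] ** 2 + r[1] ** 2)

r_a, r_b = np.array([4.0, 3.0]), np.array([2.0, 3.0])
w = np.array([0.5, 0.5])                                   # assumed preference weights
print(linear_utility(r_a, w), linear_utility(r_b, w))      # 3.5 2.5
print(nonlinear_utility(r_a), nonlinear_utility(r_b))      # 25.0 13.0
```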
2.3 Scalarised Expected Returns and Expected
Scalarised Returns
For MORL, the ability to express a user's preferences over objectives as a utility function is essential when learning a single optimal policy. In MORL, different optimality criteria exist [26]: the utility function can be applied to the expectation of the returns, or the utility function can be applied directly to the returns before computing the expectation. Calculating the expected value of the return of a policy before applying the utility function leads to the scalarised expected returns (SER) optimisation criterion:

$$V_u^\pi = u\left(\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t \;\middle|\; \pi, \mu_0\right]\right), \tag{4}$$

where $\mu_0$ is the probability distribution over possible starting states. SER is the most commonly used criterion in the multi-objective (single-agent) planning and reinforcement learning literature [26]. For SER, a coverage set is defined as a set of optimal solutions for all possible utility functions. If the utility function is instead applied before computing the expectation, this leads to the expected scalarised returns (ESR) optimisation criterion [12, 25, 26]:

$$V_u^\pi = \mathbb{E}\left[u\left(\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t\right) \;\middle|\; \pi, \mu_0\right]. \tag{5}$$

ESR is the most commonly used criterion in the game theory literature on multi-objective games [22].
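The difference between Equations 4 and 5 can be made concrete with a short Monte-Carlo sketch: SER applies the utility function to the average return vector, whereas ESR averages the utility of each individual return. The sampled returns and the utility function below are assumptions used purely for illustration.

```python
import numpy as np

def utility(r: np.ndarray) -> float:
    # Assumed non-linear utility, monotonically increasing for non-negative returns.
    return float(r[0] ** 2 + r[1] ** 2)

# Assumed per-episode discounted return vectors from executing one policy.
returns = np.array([[4.0, 3.0],
                    [2.0, 3.0],
                    [2.0, 3.0],
                    [4.0, 3.0]])

ser = utility(returns.mean(axis=0))                   # u(E[sum_t gamma^t r_t])
esr = float(np.mean([utility(r) for r in returns]))   # E[u(sum_t gamma^t r_t)]
print(f"SER = {ser}, ESR = {esr}")                    # SER = 18.0, ESR = 19.0
```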
2.4 Stochastic Dominance
Stochastic dominance [3, 10] gives a partial order between distributions and can be used when making decisions under uncertainty. Stochastic dominance is particularly useful when a distribution, rather than an expected value, must be taken into consideration when making decisions. Stochastic dominance is a prominent dominance criterion in finance, economics and decision theory. When making decisions under uncertainty, stochastic dominance can be used to determine the most risk-averse decision. Various degrees of stochastic dominance exist; however, in this paper we focus on first-order stochastic dominance (FSD). FSD can be used to give a partial ordering over random variables or random vectors, yielding an FSD dominant set.

Figure 1: For random variables $X$ and $Y$, $X \succeq_{FSD} Y$, where $F_X$ and $F_Y$ are the cumulative distribution functions (CDFs) of $X$ and $Y$ respectively. In this case, $X$ is preferable to $Y$ because higher utilities occur with greater frequency under $F_X$. (The figure plots the two CDFs, with utility on the horizontal axis and probability on the vertical axis.)
In Denition 2.1 we present the necessary conditions for FSD and
in Theorem 2.2 we prove that if a random variable is FSD dominant
it has at least as high an expected value as another random variable
[34]. We use the work of Wolfstetter [34] to prove Theorem 2.2.
Denition 2.1. For random variables X and Y, X 𝐹𝑆 𝐷 Y if:
𝑃(𝑋>𝑧) ≥ 𝑃(𝑌>𝑧),𝑧
If we consider the cumulative distribution function (CDF) of X,
𝐹𝑋, and the CDF of Y, 𝐹𝑌, we can say that X 𝐹𝑆 𝐷 Y if:
𝐹𝑋(𝑧) ≤ 𝐹𝑌(𝑧),𝑧.
Theorem 2.2. If $X \succeq_{FSD} Y$, then $X$ has an expected value greater than or equal to that of $Y$:

$$X \succeq_{FSD} Y \implies \mathbb{E}(X) \geq \mathbb{E}(Y).$$

Proof. By a known property of expected values, the following is true for any non-negative random variable:

$$\mathbb{E}(X) = \int_0^{+\infty} \left(1 - F_X(x)\right) dx, \qquad \mathbb{E}(Y) = \int_0^{+\infty} \left(1 - F_Y(x)\right) dx.$$

Therefore, if $X \succeq_{FSD} Y$, then:

$$\int_0^{+\infty} \left(1 - F_X(x)\right) dx \geq \int_0^{+\infty} \left(1 - F_Y(x)\right) dx,$$

which gives

$$\mathbb{E}(X) \geq \mathbb{E}(Y). \qquad \square \; [34]$$
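Definition 2.1 can be checked empirically by comparing empirical CDFs on the pooled support of two samples. The sketch below is an assumed illustration (the sample data are invented); it also checks consistency with Theorem 2.2.

```python
import numpy as np

def first_order_stochastic_dominance(x: np.ndarray, y: np.ndarray) -> bool:
    """Empirical check of X >=_FSD Y: F_X(z) <= F_Y(z) for every z,
    evaluated on the pooled support of both samples."""
    grid = np.union1d(x, y)
    f_x = np.array([np.mean(x <= z) for z in grid])  # empirical CDF of X
    f_y = np.array([np.mean(y <= z) for z in grid])  # empirical CDF of Y
    return bool(np.all(f_x <= f_y))

# Assumed utility samples: X places more mass on higher utilities than Y.
x = np.array([4.0, 6.0, 8.0, 10.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(first_order_stochastic_dominance(x, y))  # True
print(np.mean(x) >= np.mean(y))                # True, consistent with Theorem 2.2
```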
3 EXPECTED SCALARISED RETURNS
In contrast to single-objective reinforcement learning, different optimality criteria exist for MORL. In scenarios where the utility of a user is derived from multiple executions of a policy, the agent should optimise over the SER criterion. In scenarios where the utility of a user is derived from a single execution of a policy, the agent should optimise over the ESR criterion. Let us consider, as an example, a power plant that generates electricity for a city and emits harmful CO₂ and greenhouse gases. City regulations have been imposed which limit the amount of pollution that the power plant can generate. If the regulations require that the emissions from the power plant do not exceed a certain amount over an entire year, the SER criterion should be optimised. In this scenario, the regulations allow for the pollution to vary day to day, as long as the emissions do not exceed the regulated level for a given year. However, if the regulations are much stricter and the power plant is fined every day it exceeds a certain level of pollution, it is beneficial to optimise under the ESR criterion.

The majority of MORL research focuses on linear utility functions. However, in the real world, a user's utility function can be non-linear. For example, a utility function is non-linear in situations where a minimum value must be achieved on each objective [20]. Focusing on linear utility functions limits the applicability of MORL in real-world decision making problems. For example, linear utility functions cannot be used to learn policies in concave regions of the Pareto front [32]. Furthermore, if a user's preferences are non-linear, they are fundamentally incompatible with linear utility functions. In this case, strictly multi-objective methods must be used to learn optimal policies for non-linear utility functions. In MORL, for non-linear utility functions, significantly different policies are preferred when optimising under the ESR criterion versus the SER criterion [23]. It is important to note that, for linear utility functions, the distinction between ESR and SER does not exist [22].
For example, a decision maker has to choose between the following lotteries, $L_1$ and $L_2$, which are presented in Table 1.

L1:   P(L1 = R)     R
      0.5           (4, 3)
      0.5           (2, 3)

L2:   P(L2 = R)     R
      0.9           (1, 3)
      0.1           (10, 2)

Table 1: Lottery $L_1$ has two possible returns, (4, 3) and (2, 3), each with probability $p = 0.5$. Lottery $L_2$ has two possible returns, (1, 3) with probability $p = 0.9$ and (10, 2) with probability $p = 0.1$.
The decision maker has the following non-linear utility function:

$$u(\mathbf{x}) = x_1^2 + x_2^2, \tag{6}$$

where $\mathbf{x}$ is a vector returned from $\mathbf{R}$ in Table 1, and $x_1$ and $x_2$ are the values of the two objectives. Note that this utility function is monotonically increasing for $x_1 \geq 0$ and $x_2 \geq 0$. Under the SER criterion, the decision maker will compute the expected value of each lottery, apply the utility function, and select the lottery that maximises their utility. Let us consider which lottery the decision maker will play under the SER criterion:

$$L_1\colon\; \mathbb{E}(L_1) = 0.5(4, 3) + 0.5(2, 3) = (2, 1.5) + (1, 1.5) = (3, 3)$$
$$L_1\colon\; u(\mathbb{E}(L_1)) = 3^2 + 3^2 = 18$$
$$L_2\colon\; \mathbb{E}(L_2) = 0.9(1, 3) + 0.1(10, 2) = (0.9, 2.7) + (1, 0.2) = (1.9, 2.9)$$
$$L_2\colon\; u(\mathbb{E}(L_2)) = 1.9^2 + 2.9^2 = 3.61 + 8.41 = 12.02$$

Therefore, a decision maker with the utility function in Equation 6 will prefer to play lottery $L_1$ under the SER criterion.
Under the ESR criterion, the decision maker will first apply the utility function to the return vectors, compute the expectation, and select the lottery that maximises their expected utility. Let us consider how a decision maker will choose which lottery to play under the ESR criterion:

$$L_1\colon\; u(4, 3) = 4^2 + 3^2 = 25, \qquad u(2, 3) = 2^2 + 3^2 = 13$$
$$L_1\colon\; \mathbb{E}(u(L_1)) = 0.5(25) + 0.5(13) = 12.5 + 6.5 = 19$$
$$L_2\colon\; u(1, 3) = 1^2 + 3^2 = 10, \qquad u(10, 2) = 10^2 + 2^2 = 104$$
$$L_2\colon\; \mathbb{E}(u(L_2)) = 0.9(10) + 0.1(104) = 9 + 10.4 = 19.4$$

Therefore, a decision maker with the utility function in Equation 6 will prefer to play lottery $L_2$ under the ESR criterion. From this example, it is clear that users with the same non-linear utility function can prefer different policies, depending on which multi-objective optimisation criterion is selected. Therefore, it is critical that the distinction between ESR and SER is taken into consideration when selecting a MORL algorithm to learn optimal policies in a given scenario.
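The worked example above can be reproduced with a few lines of Python; only the lotteries from Table 1 and the utility function from Equation 6 are used, and the helper names are assumptions.

```python
import numpy as np

def u(x: np.ndarray) -> float:
    return float(x[0] ** 2 + x[1] ** 2)                # Equation 6

# Table 1 as (probability, return vector) pairs.
L1 = [(0.5, np.array([4.0, 3.0])), (0.5, np.array([2.0, 3.0]))]
L2 = [(0.9, np.array([1.0, 3.0])), (0.1, np.array([10.0, 2.0]))]

def ser_value(lottery):
    expected_return = sum(p * r for p, r in lottery)   # E[R]
    return u(expected_return)                          # u(E[R])

def esr_value(lottery):
    return sum(p * u(r) for p, r in lottery)           # E[u(R)]

print(ser_value(L1), ser_value(L2))   # 18.0, ~12.02 -> L1 preferred under SER
print(esr_value(L1), esr_value(L2))   # 19.0, ~19.4  -> L2 preferred under ESR
```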
The majority of MORL research focuses on the SER criterion [22]. By comparison, the ESR criterion has received very little attention from the MORL community [12, 22, 25, 26]. Many of the traditional MORL methods cannot be used when optimising under the ESR criterion. The fact that non-linear utility functions in MOMDPs do not distribute across the sum of immediate and future returns invalidates the Bellman equation [25]:

$$\max_\pi \mathbb{E}\left[u\left(\mathbf{R}_t^- + \sum_{i=t}^{\infty} \gamma^i \mathbf{r}_i\right) \;\middle|\; \pi, s_t\right] \neq u\left(\mathbf{R}_t^-\right) + \max_\pi \mathbb{E}\left[u\left(\sum_{i=t}^{\infty} \gamma^i \mathbf{r}_i\right) \;\middle|\; \pi, s_t\right], \tag{7}$$

where $u$ is a non-linear utility function and $\mathbf{R}_t^- = \sum_{i=0}^{t-1} \gamma^i \mathbf{r}_i$ is the vector of returns accrued up to timestep $t$.
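A small numeric check makes the failure of distributivity concrete. Using the utility function from Equation 6 as an assumed example, the utility of the sum of accrued and future returns differs from the sum of the separately applied utilities:

```python
import numpy as np

def u(x: np.ndarray) -> float:
    return float(x[0] ** 2 + x[1] ** 2)   # non-linear utility (Equation 6)

accrued = np.array([2.0, 0.0])            # R_t^- : returns accrued so far (assumed)
future = np.array([1.0, 3.0])             # discounted future returns (assumed)

lhs = u(accrued + future)                 # u(R_t^- + future) = 3^2 + 3^2 = 18
rhs = u(accrued) + u(future)              # u(R_t^-) + u(future) = 4 + 10 = 14
print(lhs, rhs, lhs == rhs)               # 18.0 14.0 False
```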
Hayes et al. [12] implement a Distributional Monte Carlo Tree Search (DMCTS) algorithm, which learns a posterior distribution over the expected utility of individual policy executions. DMCTS achieves state-of-the-art performance under the ESR criterion. Hayes et al. [12] demonstrate that, when optimising under the ESR criterion, making decisions based on a distribution over the expected utility of the returns is crucial to learning optimal policies in realistic problems where rewards are stochastic. Traditional RL approaches use the expected value of the future returns to make decisions. The expected value cannot provide the agent with sufficient critical information to avoid adverse outcomes and exploit positive outcomes when making a decision [12].

To understand why it is critical to make decisions using a distribution over the expected utility of the returns when optimising under the ESR criterion, let us consider the following example in Table 2 regarding a human decision maker.
L3:   P(L3 = R)     R
      0.5           (-20, 1)
      0.5           (20, 3)

L4:   P(L4 = R)     R
      0.9           (0, 2)
      0.1           (10, 2)

Table 2: Lottery $L_3$ has two possible returns, (-20, 1) and (20, 3), each with probability 0.5. Lottery $L_4$ has two possible returns, (0, 2) with probability 0.9 and (10, 2) with probability 0.1.
The decision maker has the following non-linear utility function:

$$u(\mathbf{x}) = x_1 + x_2^2, \tag{8}$$

where $\mathbf{x}$ is a vector returned from $\mathbf{R}$ in Table 2, and $x_1$ and $x_2$ are the values of the two objectives. Note that this utility function is monotonically increasing for all values of $x_1$ and for values of $x_2 \geq 0$.
For the non-linear utility function in Equation 8, under the ESR criterion, both $L_3$ and $L_4$ have the same expected utility value of 5. It is important to note that if an agent plays lottery $L_3$, there is a 0.5 chance of receiving a utility of -19 and a 0.5 chance of receiving a utility of 29. For a human decision maker, receiving a utility of 29 is an ideal outcome. However, receiving a utility of -19 might represent a severely negative outcome that the decision maker would want to avoid, e.g. going into debt. Instead, the decision maker may prefer lottery $L_4$. As shown in this example, it is crucial that a distribution over the expected utility of the returns is used when making decisions under the ESR criterion.
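The sketch below reproduces the Table 2 example with the utility function in Equation 8: both lotteries have the same expected utility of 5, while the underlying utility distributions are very different. Only the data from Table 2 are used; the helper names are assumptions.

```python
import numpy as np

def u(x: np.ndarray) -> float:
    return float(x[0] + x[1] ** 2)        # Equation 8

L3 = [(0.5, np.array([-20.0, 1.0])), (0.5, np.array([20.0, 3.0]))]
L4 = [(0.9, np.array([0.0, 2.0])), (0.1, np.array([10.0, 2.0]))]

for name, lottery in [("L3", L3), ("L4", L4)]:
    utilities = [(p, u(r)) for p, r in lottery]
    expected = sum(p * v for p, v in utilities)
    print(name, "outcomes:", utilities, "expected utility:", expected)
# L3 outcomes: [(0.5, -19.0), (0.5, 29.0)]  expected utility: 5.0
# L4 outcomes: [(0.9, 4.0), (0.1, 14.0)]    expected utility: 5.0
```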
The current MORL literature on the ESR criterion assumes a scalar expected utility (see Section 2.3) [12, 22, 25, 26]. As demonstrated above, using a single expected value to make decisions under the ESR criterion is not sufficient to avoid choosing policies with undesirable outcomes. Therefore, it is necessary to adopt a distributional approach to ESR problems.
Firstly, we define a multi-objective version of the value distribution [6], $\mathbf{Z}^\pi$, which gives the distribution over the returns of a random vector [30] when a policy $\pi$ is executed, such that

$$\mathbb{E}\left[\mathbf{Z}^\pi\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t \;\middle|\; \pi, \mu_0\right]. \tag{9}$$

Moreover, a value distribution can be used to represent a policy. Under the ESR criterion, the utility of the value distribution, $Z_u^\pi$, is defined as a distribution over the scalar utilities received from applying the utility function to each vector in the value distribution, $\mathbf{Z}^\pi$. Therefore, $Z_u^\pi$ is a distribution over the scalar utility of the vector returns of a random vector received from executing a policy, $\pi$, such that

$$\mathbb{E}\left[Z_u^\pi\right] = \mathbb{E}\left[u\left(\sum_{t=0}^{\infty} \gamma^t \mathbf{r}_t\right) \;\middle|\; \pi, \mu_0\right]. \tag{10}$$

The utility of the value distribution can only be calculated when the utility function is known a priori.
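In practice, $\mathbf{Z}^\pi$ and $Z_u^\pi$ can be approximated empirically from repeated policy executions. The sketch below assumes a small set of sampled return vectors and an assumed utility function; it is an illustration of Equations 9 and 10, not an implementation from this paper.

```python
import numpy as np

def utility(r: np.ndarray) -> float:
    return float(r[0] + r[1] ** 2)        # an assumed monotonically increasing utility

# Z^pi: empirical distribution over vector returns from repeated policy executions.
# Each row is the discounted return vector of one episode (assumed data).
z_pi = np.array([[-20.0, 1.0],
                 [20.0, 3.0],
                 [20.0, 3.0],
                 [-20.0, 1.0]])

# Z_u^pi: distribution over the scalar utilities of the same episodes.
z_u_pi = np.array([utility(r) for r in z_pi])

print("E[Z^pi]   =", z_pi.mean(axis=0))   # expected return vector (Equation 9)
print("E[Z_u^pi] =", z_u_pi.mean())       # expected utility (Equation 10)
print("Z_u^pi    =", z_u_pi)              # full utility distribution
```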
In the examples used in Section 3, the utility function of the user is known. However, many scenarios exist where the user's utility function is unknown at the time of learning [26]. In this scenario, a set of policies that are optimal for all monotonically increasing utility functions must be learned. However, for the ESR criterion, a set of optimal solutions has yet to be defined. To learn a set of optimal policies under the ESR criterion, we must develop new methods.

To address this challenge, in Section 4 we apply first-order stochastic dominance to determine a partial ordering over value distributions that satisfies the ESR criterion.
4 STOCHASTIC DOMINANCE FOR ESR
For MORL there are two classes of algorithms: single-policy and multi-policy algorithms [26, 31]. When the user's utility function is known a priori, it is possible to use a single-policy algorithm [12, 25] to learn an optimal solution. However, when the user's utility function is unknown, we aim to learn a set of policies that are optimal for all monotonically increasing utility functions. The current literature on the ESR criterion focuses only on scenarios where the utility function of a user is known [12, 25], overlooking scenarios where the utility function of a user is unknown. Moreover, a set of solutions under the ESR criterion for the unknown utility function scenario [26] has yet to be defined.
Various algorithms have been proposed to learn solution sets under the SER criterion (see Section 2.3), for example [18, 27, 33]. Under the SER criterion, multi-policy algorithms determine optimality by comparing policies based on the utility of vector-valued expectations (Equation 4). In contrast, under the ESR criterion it is crucial to maintain a distribution over the utility of possible vector-valued outcomes. SER multi-policy algorithms cannot be used to learn optimal policies under the ESR criterion because they compute expected value vectors. It is necessary to develop new methods that can generate solution sets for the ESR criterion with unknown utilities. The development of methods that determine an optimal partial ordering over value distributions is a promising avenue to address this challenge.

First-order stochastic dominance (see Section 2.4) is a method which gives a partial ordering over random variables [15, 34]. FSD compares the cumulative distribution functions of the underlying probability distributions of random variables to determine optimality. To satisfy the ESR criterion, it is essential that the expected utility is maximised. To use FSD for the ESR criterion, we must show that the FSD conditions presented in Section 2.4 also hold when optimising the expected utility for unknown monotonically increasing utility functions.
For the single-objective case, Theorem 4.1 proves that for random variables $X$ and $Y$, if $X \succeq_{FSD} Y$, the expected utility of $X$ is greater than, or equal to, the expected utility of $Y$ for monotonically increasing utility functions. In Theorem 4.1, random variables $X$ and $Y$ are considered, with their corresponding CDFs $F_X$ and $F_Y$. The work of Mas-Colell et al. [17] is used as a foundation for Theorem 4.1.

Theorem 4.1. A random variable, $X$, is preferred to a random variable, $Y$, by all decision makers with a monotonically increasing utility function if, and only if, $X \succeq_{FSD} Y$:

$$X \succeq_{FSD} Y \implies \mathbb{E}(u(X)) \geq \mathbb{E}(u(Y)).$$
Proof. If $X \succeq_{FSD} Y$, then

$$F_X(z) \leq F_Y(z), \quad \forall z$$

(CDFs with lower probability values for a given $z$ are preferable; Figure 1 illustrates why this is the case). Since

$$\mathbb{E}(u(X)) = \int_{-\infty}^{\infty} u(z)\, dF_X(z), \qquad \mathbb{E}(u(Y)) = \int_{-\infty}^{\infty} u(z)\, dF_Y(z),$$

integrating both $\mathbb{E}(u(X))$ and $\mathbb{E}(u(Y))$ by parts gives:

$$\mathbb{E}(u(X)) = \Big[u(z) F_X(z)\Big]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} u'(z) F_X(z)\, dz,$$
$$\mathbb{E}(u(Y)) = \Big[u(z) F_Y(z)\Big]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} u'(z) F_Y(z)\, dz.$$

Given $F_X(-\infty) = F_Y(-\infty) = 0$ and $F_X(\infty) = F_Y(\infty) = 1$, the first terms in $\mathbb{E}(u(X))$ and $\mathbb{E}(u(Y))$ are equal, and thus

$$\mathbb{E}(u(X)) - \mathbb{E}(u(Y)) = \int_{-\infty}^{\infty} u'(z) F_Y(z)\, dz - \int_{-\infty}^{\infty} u'(z) F_X(z)\, dz.$$

Since $F_X(z) \leq F_Y(z)$ and $u'(z) \geq 0$ for all monotonically increasing utility functions, then

$$\mathbb{E}(u(X)) - \mathbb{E}(u(Y)) \geq 0,$$

and thus

$$\mathbb{E}(u(X)) \geq \mathbb{E}(u(Y)). \qquad \square$$
A utility function maps an input (a scalar or vector return) to an output (a scalar utility). Since the probability of receiving some utility is equal to the probability of receiving the corresponding return for a random variable, $X$, we can write the following:

$$P(X > c) = P(u(X) > u(c)), \tag{11}$$

where $c$ is a constant. Using the results shown in Theorem 4.1 and Equation 11, the FSD conditions highlighted in Section 2.4 can be rewritten to include monotonically increasing utility functions:

$$P(u(X) > u(z)) \geq P(u(Y) > u(z)). \tag{12}$$

Definition 4.2. Let $X$ and $Y$ be random variables. $X$ dominates $Y$ for all decision makers with a monotonically increasing utility function if the following is true:

$$X \succeq_{FSD} Y \iff \forall u \colon \forall v \colon P(u(X) > u(v)) \geq P(u(Y) > u(v)).$$
In MORL, the return from the reward function is a vector, where each element in the return vector represents an objective. To apply FSD to MORL under the ESR criterion, random vectors must be considered. In this case, a random vector (or multi-variate random variable) is a vector whose components are scalar-valued random variables on the same probability space. For simplicity, this paper focuses on the case in which a random vector has two random variables, known as the bi-variate case. FSD conditions have been proven to hold for random vectors with $n$ random variables in the works of Sriboonchitta et al. [29], Levhari et al. [14], Nakayama et al. [19] and Scarsini [28]. In Theorem 4.3, the work of Atkinson and Bourguignon [2] is distilled into a suitable theorem for MORL. Theorem 4.3 highlights how the conditions for FSD hold for random vectors while satisfying the ESR criterion for a monotonically increasing utility function, $u$, with $\frac{\partial^2 u}{\partial x_1 \partial x_2} \leq 0$ [24]. It is important to note that Atkinson and Bourguignon [2] have proven Theorem 4.3 for utility functions where $\frac{\partial^2 u}{\partial x_1 \partial x_2} \leq 0$. We plan to extend these conditions for MORL in future work. In Theorem 4.3, $\mathbf{X}$ and $\mathbf{Y}$ are random vectors where each random vector consists of two random variables, $\mathbf{X} = [X_1, X_2]$ and $\mathbf{Y} = [Y_1, Y_2]$. $F_{X_1 X_2}$ and $F_{Y_1 Y_2}$ are the corresponding CDFs.
Theorem 4.3. A random vector, $\mathbf{X}$, is preferred to a random vector, $\mathbf{Y}$, by all decision makers with a monotonically increasing utility function if, and only if, $\mathbf{X} \succeq_{FSD} \mathbf{Y}$:

$$\mathbf{X} \succeq_{FSD} \mathbf{Y} \implies \mathbb{E}(u(\mathbf{X})) \geq \mathbb{E}(u(\mathbf{Y})).$$
Proof. $\mathbf{X} \succeq_{FSD} \mathbf{Y}$ means

$$F_{X_1 X_2}(t, z) \leq F_{Y_1 Y_2}(t, z), \quad \forall (t, z).$$

The expected utilities can be written as follows:

$$\mathbb{E}(u(\mathbf{X})) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(t, z)\, f_{X_1 X_2}(t, z)\, dt\, dz, \qquad \mathbb{E}(u(\mathbf{Y})) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(t, z)\, f_{Y_1 Y_2}(t, z)\, dt\, dz,$$

where $f_{X_1 X_2}$ and $f_{Y_1 Y_2}$ are the probability density functions of $\mathbf{X}$ and $\mathbf{Y}$, respectively. Only the steps for the integration of $\mathbb{E}(u(\mathbf{X}))$ are shown below; the steps for the integration of $\mathbb{E}(u(\mathbf{Y}))$ are the same. Integrating by parts, first with respect to $z$ and then with respect to $t$, and using

$$\lim_{z \to -\infty} F_{X_1 X_2}(t, z) = 0, \qquad \lim_{z \to \infty} F_{X_1 X_2}(t, z) = F_{X_1}(t), \qquad F_{X_1 X_2}(-\infty, z) = 0, \qquad F_{X_1 X_2}(\infty, z) = F_{X_2}(z),$$

together with $F_{X_1}(-\infty) = 0$ and $F_{X_1}(\infty) = 1$, gives

$$\mathbb{E}(u(\mathbf{X})) = u(\infty, \infty) - \int_{-\infty}^{\infty} \frac{\partial u}{\partial t}(t, \infty)\, F_{X_1}(t)\, dt - \int_{-\infty}^{\infty} \frac{\partial u}{\partial z}(\infty, z)\, F_{X_2}(z)\, dz + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{\partial^2 u}{\partial t\, \partial z}(t, z)\, F_{X_1 X_2}(t, z)\, dt\, dz.$$

The leading term $u(\infty, \infty)$ is identical for $\mathbf{X}$ and $\mathbf{Y}$ and cancels when the difference is taken. Therefore,

$$\mathbb{E}(u(\mathbf{X})) - \mathbb{E}(u(\mathbf{Y})) = \int_{-\infty}^{\infty} \frac{\partial u}{\partial t}(t, \infty)\, \big(F_{Y_1}(t) - F_{X_1}(t)\big)\, dt + \int_{-\infty}^{\infty} \frac{\partial u}{\partial z}(\infty, z)\, \big(F_{Y_2}(z) - F_{X_2}(z)\big)\, dz + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{\partial^2 u}{\partial t\, \partial z}(t, z)\, \big(F_{X_1 X_2}(t, z) - F_{Y_1 Y_2}(t, z)\big)\, dt\, dz.$$

For a monotonically increasing utility function $\frac{\partial u}{\partial t} \geq 0$ and $\frac{\partial u}{\partial z} \geq 0$, and for the class of utility functions considered here $\frac{\partial^2 u}{\partial t\, \partial z} \leq 0$. Given that the utility function $u$ is assumed to be monotonically increasing, and that FSD gives $F_{X_1 X_2}(t, z) \leq F_{Y_1 Y_2}(t, z)$ (and hence $F_{X_1} \leq F_{Y_1}$ and $F_{X_2} \leq F_{Y_2}$ for the marginals), every term above is non-negative, which gives

$$\mathbb{E}(u(\mathbf{X})) - \mathbb{E}(u(\mathbf{Y})) \geq 0.$$

Finally,

$$\mathbb{E}(u(\mathbf{X})) \geq \mathbb{E}(u(\mathbf{Y})). \qquad \square$$
Using the results from Theorem 4.3, Equation 12 can be updated to include random vectors:

$$P(u(\mathbf{X}) > u(\mathbf{z})) \geq P(u(\mathbf{Y}) > u(\mathbf{z})). \tag{13}$$

Definition 4.4. For random vectors $\mathbf{X}$ and $\mathbf{Y}$, $\mathbf{X}$ is preferred over $\mathbf{Y}$ by all decision makers with a monotonically increasing utility function if, and only if, the following is true:

$$\mathbf{X} \succeq_{FSD} \mathbf{Y} \iff \forall u \colon \left(\forall \mathbf{v} \colon P(u(\mathbf{X}) > u(\mathbf{v})) \geq P(u(\mathbf{Y}) > u(\mathbf{v}))\right).$$

Using the results from Theorem 4.3 and Definition 4.4, it is possible to extend FSD to MORL. For MORL, under the ESR criterion, the value distribution, $\mathbf{Z}^\pi$, is considered to be the full distribution of the returns of a random vector received when executing a policy, $\pi$ (see Section 3). Value distributions can be used to represent policies. In this case, it is possible to use FSD to obtain a partial ordering over policies. For example, consider two policies, $\pi$ and $\pi'$, where each policy has an underlying value distribution, $\mathbf{Z}^\pi$ and $\mathbf{Z}^{\pi'}$ respectively. If $\mathbf{Z}^\pi \succeq_{FSD} \mathbf{Z}^{\pi'}$ then $\pi$ will be preferred over $\pi'$.
Denition 4.5. Policies
𝜋
and
𝜋
have value distributions
Z𝜋
and
Z𝜋
. Policy
𝜋
is preferred over policy
𝜋
by all decision makers with
a utility function,
𝑢
, that is monotonically increasing if, and only if,
the following is true:
Z𝜋𝐹𝑆 𝐷 Z𝜋.
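For finite samples, the FSD check of Definition 4.5 can be carried out by comparing the empirical joint CDFs of two value distributions at the points where they change. The sketch below, with invented bivariate sample data and assumed helper names, is a minimal illustration only.

```python
import numpy as np

def joint_cdf(samples: np.ndarray, point: np.ndarray) -> float:
    """Empirical joint CDF F(t, z) = P(X1 <= t, X2 <= z)."""
    return float(np.mean(np.all(samples <= point, axis=1)))

def fsd_dominates(z_a: np.ndarray, z_b: np.ndarray) -> bool:
    """Z_a >=_FSD Z_b: F_a(v) <= F_b(v) everywhere, checked on the grid of
    pooled sample coordinates, where the empirical step CDFs change value."""
    pooled = np.vstack([z_a, z_b])
    for t in np.unique(pooled[:, 0]):
        for z in np.unique(pooled[:, 1]):
            v = np.array([t, z])
            if joint_cdf(z_a, v) > joint_cdf(z_b, v):
                return False
    return True

# Assumed bivariate value distributions of two policies (rows = sampled returns).
z_pi = np.array([[4.0, 3.0], [2.0, 3.0]])         # policy pi
z_pi_prime = np.array([[1.0, 2.0], [2.0, 1.0]])   # policy pi'

print(fsd_dominates(z_pi, z_pi_prime))   # True: pi is preferred over pi'
print(fsd_dominates(z_pi_prime, z_pi))   # False
```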
Now that a partial ordering over policies has been defined under the ESR criterion for the unknown utility function scenario, it is possible to define a set of optimal policies.
5 SOLUTION SETS FOR ESR
Section 4 defines a partial ordering over policies under the ESR criterion for unknown utility using FSD. In the unknown utility function scenario it is infeasible to learn a single optimal policy [26]. When a user's utility function is unknown, multi-policy MORL algorithms must be used to learn a set of optimal policies. To apply MORL to the ESR criterion in scenarios with unknown utility, a set of optimal policies under the ESR criterion must be defined. In Section 5, FSD is used to define multiple sets of optimal policies for the ESR criterion.

Firstly, a set of optimal policies, known as the undominated set, is defined. The undominated set is defined using FSD, where each policy in the undominated set has an underlying value distribution that is FSD dominant. The undominated set contains at least one optimal policy for all possible monotonically increasing utility functions.
Denition 5.1. The undominated set,
𝑈(Π)
, is a sub-set of all
possible policies for where there exists some utility function,
𝑢
,
where a policy’s value distribution is FSD dominant.
𝑈(Π)=n𝜋Π𝑢, 𝜋Π:Z𝜋𝐹𝑆 𝐷 Z𝜋o
However, the undominated set may contain excess policies. For example, under FSD, if two dominant policies have value distributions that are equal, then both policies will be in the undominated set. Given that both value distributions are equal, a user with a monotonically increasing utility function will not prefer one policy over the other. In this case, both policies have the same expected utility. To reduce the number of policies that must be considered at execution time, for each possible utility function we can keep just one corresponding FSD dominant policy; such a set of policies is called a coverage set (CS).
Denition 5.2. The coverage set,
𝐶𝑆 (Π)
, is a subset of the un-
dominated set,
𝑈(Π)
, where, for every utility function,
𝑢
, the set
contains a policy that has a FSD dominant value distribution,
𝐶𝑆 (Π) 𝑈(Π) ∧ 𝑢, 𝜋𝐶𝑆 (Π),𝜋Π:Z𝜋𝐹𝑆 𝐷 Z𝜋
In practice, for scenarios where the utility function is unknown, it is difficult to compute the undominated set or the coverage set using FSD, because FSD relies on having a user's utility function available to calculate dominance. To address this challenge, expected scalarised returns (ESR) dominance is defined. Multi-policy algorithms can use ESR dominance as a method under the ESR criterion to learn a set of optimal policies.

Definition 5.3. For random vectors $\mathbf{X}$ and $\mathbf{Y}$, $\mathbf{X} >_{ESR} \mathbf{Y}$ for all decision makers with a monotonically increasing utility function if, and only if, the following is true:

$$\mathbf{X} >_{ESR} \mathbf{Y} \iff \forall u \colon \Big(\forall \mathbf{v} \colon P(u(\mathbf{X}) > u(\mathbf{v})) \geq P(u(\mathbf{Y}) > u(\mathbf{v})) \;\wedge\; \exists \mathbf{v} \colon P(u(\mathbf{X}) > u(\mathbf{v})) > P(u(\mathbf{Y}) > u(\mathbf{v}))\Big).$$
ESR dominance (Definition 5.3) extends FSD; however, ESR dominance is a stricter dominance criterion. Under FSD, policies that have equal value distributions are considered dominant policies, which is not the case under ESR dominance. Therefore, if a random vector is ESR dominant, the random vector has a greater expected utility than all ESR dominated random vectors. Theorem 5.4 proves that ESR dominance satisfies the ESR criterion when the utility function of the user is unknown, for all monotonically increasing utility functions. Theorem 5.4 focuses on random vectors $\mathbf{X}$ and $\mathbf{Y}$ where each random vector has two random variables, such that $\mathbf{X} = [X_1, X_2]$ and $\mathbf{Y} = [Y_1, Y_2]$. $F_{\mathbf{X}}$ and $F_{\mathbf{Y}}$ are the corresponding CDFs and $\mathbf{v} = [t, z]$. However, Theorem 5.4 can easily be extended to random vectors with $n$ random variables ($\mathbf{X} = [X_1, X_2, \ldots, X_n]$).
Theorem 5.4. A random vector, $\mathbf{X}$, is preferred to a random vector, $\mathbf{Y}$, by all decision makers with a monotonically increasing utility function if, and only if, $\mathbf{X} >_{ESR} \mathbf{Y}$:

$$\mathbf{X} >_{ESR} \mathbf{Y} \implies \mathbb{E}(u(\mathbf{X})) > \mathbb{E}(u(\mathbf{Y})).$$
Proof. $\mathbf{X}$ and $\mathbf{Y}$ are random vectors with $n$ random variables. If $\mathbf{X} >_{ESR} \mathbf{Y}$, the following two conditions must be met for all $u$:

(1) $\forall \mathbf{v} \colon P(u(\mathbf{X}) > u(\mathbf{v})) \geq P(u(\mathbf{Y}) > u(\mathbf{v}))$;
(2) $\exists \mathbf{v} \colon P(u(\mathbf{X}) > u(\mathbf{v})) > P(u(\mathbf{Y}) > u(\mathbf{v}))$.

From Definition 4.4, if $\mathbf{X} \succeq_{FSD} \mathbf{Y}$ then the following is true:

$$\forall u \colon \forall \mathbf{v} \colon P(u(\mathbf{X}) > u(\mathbf{v})) \geq P(u(\mathbf{Y}) > u(\mathbf{v})).$$

If $\mathbf{X} \succeq_{FSD} \mathbf{Y}$ then, from Theorem 4.3, the following is true:

$$\mathbb{E}(u(\mathbf{X})) \geq \mathbb{E}(u(\mathbf{Y})).$$

If condition (1) is satisfied, the expected utility of $\mathbf{X}$ is at least equal to the expected utility of $\mathbf{Y}$, where

$$\mathbb{E}(u(\mathbf{X})) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(t, z)\, f_{\mathbf{X}}(t, z)\, dt\, dz, \qquad \mathbb{E}(u(\mathbf{Y})) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(t, z)\, f_{\mathbf{Y}}(t, z)\, dt\, dz.$$

The minimum requirement to satisfy condition (1) is:

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(t, z)\, f_{\mathbf{X}}(t, z)\, dt\, dz = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(t, z)\, f_{\mathbf{Y}}(t, z)\, dt\, dz.$$

If condition (1) is satisfied, then to also satisfy condition (2) some limits $a, b, c, d$ must exist such that:

$$\int_a^b \int_c^d u(t, z)\, f_{\mathbf{X}}(t, z)\, dt\, dz > \int_a^b \int_c^d u(t, z)\, f_{\mathbf{Y}}(t, z)\, dt\, dz.$$

Therefore,

$$\int_{-\infty}^{a}\int_{-\infty}^{c} u(t,z)\, f_{\mathbf{X}}(t,z)\, dt\, dz + \int_a^b\int_c^d u(t,z)\, f_{\mathbf{X}}(t,z)\, dt\, dz + \int_b^{\infty}\int_d^{\infty} u(t,z)\, f_{\mathbf{X}}(t,z)\, dt\, dz$$
$$> \int_{-\infty}^{a}\int_{-\infty}^{c} u(t,z)\, f_{\mathbf{Y}}(t,z)\, dt\, dz + \int_a^b\int_c^d u(t,z)\, f_{\mathbf{Y}}(t,z)\, dt\, dz + \int_b^{\infty}\int_d^{\infty} u(t,z)\, f_{\mathbf{Y}}(t,z)\, dt\, dz.$$

Finally,

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(t, z)\, f_{\mathbf{X}}(t, z)\, dt\, dz > \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(t, z)\, f_{\mathbf{Y}}(t, z)\, dt\, dz,$$

i.e., if $\mathbf{X} >_{ESR} \mathbf{Y}$, then

$$\mathbb{E}(u(\mathbf{X})) > \mathbb{E}(u(\mathbf{Y})). \qquad \square$$
In the ESR dominance criterion defined in Definition 5.3, the utility of different vectors is compared. However, it is not possible to calculate the utility of a vector when the utility function is unknown. In this case, Pareto dominance [21] can be used instead to determine the relative utility of the vectors being compared.

Definition 5.5. $\mathbf{A}$ Pareto dominates ($>_P$) $\mathbf{B}$ if the following is true:

$$\mathbf{A} >_P \mathbf{B} \iff (\forall i \colon \mathbf{A}_i \geq \mathbf{B}_i) \wedge (\exists i \colon \mathbf{A}_i > \mathbf{B}_i). \tag{14}$$

For monotonically increasing utility functions, if the value of an element of the vector increases, then the scalar utility of the vector also increases. Therefore, using Definition 5.5, if vector $\mathbf{A}$ Pareto dominates vector $\mathbf{B}$ then, for a monotonically increasing utility function, $\mathbf{A}$ has a higher utility than $\mathbf{B}$. To make ESR comparisons between value distributions, Pareto dominance can be used.

Definition 5.6. For random vectors $\mathbf{X}$ and $\mathbf{Y}$, $\mathbf{X} >_{ESR} \mathbf{Y}$ for all monotonically increasing utility functions if, and only if, the following is true:

$$\mathbf{X} >_{ESR} \mathbf{Y} \iff \forall \mathbf{v} \colon P(\mathbf{X} >_P \mathbf{v}) \geq P(\mathbf{Y} >_P \mathbf{v}) \;\wedge\; \exists \mathbf{v} \colon P(\mathbf{X} >_P \mathbf{v}) > P(\mathbf{Y} >_P \mathbf{v}).$$
Therefore, as per Definition 5.7, ESR dominance can be used to give a partial ordering over policies.

Definition 5.7. For value distributions $\mathbf{Z}^\pi$ and $\mathbf{Z}^{\pi'}$ of policies $\pi$ and $\pi'$, $\pi$ is preferred over $\pi'$ by all decision makers with a monotonically increasing utility function if, and only if, the following is true:

$$\mathbf{Z}^\pi >_{ESR} \mathbf{Z}^{\pi'}.$$

Using ESR dominance, it is possible to define a set of optimal policies, known as the ESR set.

Definition 5.8. The ESR set, $ESR(\Pi)$, is the subset of all policies where each policy in the ESR set is not ESR dominated by any other policy:

$$ESR(\Pi) = \left\{ \pi \in \Pi \mid \nexists \pi' \in \Pi \colon \mathbf{Z}^{\pi'} >_{ESR} \mathbf{Z}^{\pi} \right\}. \tag{15}$$

The ESR set is a set of non-dominated policies. The ESR set can be considered a coverage set, given that no excess policies exist in the ESR set. It is viable for a multi-policy MORL method to use ESR dominance to construct the ESR set, given that Pareto dominance can be used to determine ESR dominance when the utility function of a user is unknown.
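Assuming each policy's value distribution is represented by a finite set of equally likely sampled return vectors, the sketch below strings Definitions 5.5, 5.6 and 5.8 together: Pareto dominance between vectors, an approximate ESR dominance check evaluated on a finite set of test points, and pruning to the ESR set. The function names, the epsilon-shifted test points and the example data are all assumptions; this is an illustrative sketch, not the paper's algorithm.

```python
import numpy as np

def pareto_dominates(a: np.ndarray, b: np.ndarray) -> bool:
    """A >_P B (Definition 5.5)."""
    return bool(np.all(a >= b) and np.any(a > b))

def p_pareto_greater(samples: np.ndarray, v: np.ndarray) -> float:
    """Empirical P(X >_P v) for a set of equally likely return vectors."""
    return float(np.mean([pareto_dominates(x, v) for x in samples]))

def esr_dominates(z_a: np.ndarray, z_b: np.ndarray, eps: float = 1e-9) -> bool:
    """Approximate check of Z_a >_ESR Z_b (Definition 5.6). The conditions are
    evaluated on the pooled sample vectors and on the same vectors shifted
    slightly downward, which is where the empirical probabilities change."""
    pooled = np.vstack([z_a, z_b])
    test_points = np.vstack([pooled, pooled - eps])
    p_a = np.array([p_pareto_greater(z_a, v) for v in test_points])
    p_b = np.array([p_pareto_greater(z_b, v) for v in test_points])
    return bool(np.all(p_a >= p_b) and np.any(p_a > p_b))

def esr_set(value_distributions: dict) -> list:
    """ESR(Pi): policies whose value distributions are not ESR dominated
    by the value distribution of any other policy (Definition 5.8)."""
    return [name for name, z in value_distributions.items()
            if not any(esr_dominates(z_other, z)
                       for other, z_other in value_distributions.items()
                       if other != name)]

# Assumed value distributions (rows are equally likely sampled return vectors).
policies = {
    "pi_1": np.array([[4.0, 3.0], [2.0, 3.0]]),
    "pi_2": np.array([[1.0, 3.0], [10.0, 2.0]]),
    "pi_3": np.array([[1.0, 2.0], [2.0, 1.0]]),   # ESR dominated by pi_1
}
print(esr_set(policies))   # ['pi_1', 'pi_2']
```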
6 RELATED WORK
The various orders of stochastic dominance have been used extensively as a method to determine the optimal decision when making decisions under uncertainty in economics [7], finance [1, 4], game theory [9], and various other real-world scenarios [5]. However, stochastic dominance has largely been overlooked in systems that learn. Cook and Jarrett [8] use various orders of stochastic dominance and Pareto dominance with genetic algorithms to compute optimal solution sets for an aerospace design problem with multiple objectives when constrained by a computational budget. Martin et al. [16] use second-order stochastic dominance (SSD) with a single-objective distributional RL algorithm [6]. Martin et al. [16] use SSD to determine the optimal action to take at decision time, and this approach is shown to learn good policies during experimentation.
7 CONCLUSION & FUTURE WORK
The ESR criterion has largely been ignored by the MORL community, with the exception of the work of Roijers et al. [25, 26] and Hayes et al. [11, 12]. While these works present single-policy algorithms that are suitable for learning policies under the ESR criterion, a formal definition of the necessary requirements to satisfy the ESR criterion had not previously been given. In Section 3, we outline, through examples and definitions, the necessary requirements to satisfy the ESR criterion. The formal definitions outlined in Section 3 ensure that an optimal policy can be learned under the ESR criterion when the utility function of the user is known. However, in the real world, a user's preferences over objectives (or utility function) may be unknown at the time of learning.

Prior to this paper, a suitable solution set for the unknown utility function scenario under the ESR criterion had not been defined. This long-standing research gap has restricted the applicability of MORL in real-world scenarios under the ESR criterion. In Section 4 and Section 5 we define the necessary solution sets required for multi-policy algorithms to learn a set of optimal policies under the ESR criterion when the utility function of a user is unknown. This work aims to answer some of the existing research questions regarding the ESR criterion. Moreover, we aim to highlight the importance of the ESR criterion when applying MORL to real-world scenarios. In order to successfully apply MORL to the real world, we must implement new single-policy and multi-policy algorithms that can learn solutions for non-linear utility functions in various scenarios.

A promising starting point for future work would be to learn a set of optimal solutions under the ESR criterion in a multi-objective multi-armed bandit setting. Learning an optimal set of policies in a bandit setting is a natural starting point for any new multi-policy algorithm and would require implementing the new dominance criteria outlined in this paper.
ACKNOWLEDGEMENTS
Conor F. Hayes is funded by the National University of Ireland Galway Hardiman Scholarship. This research was supported by funding from the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" program.
REFERENCES
[1] Mukhtar M. Ali. 1975. Stochastic dominance and portfolio analysis. Journal of Financial Economics 2, 2 (1975), 205–229. https://doi.org/10.1016/0304-405X(75)90005-7
[2] A. B. Atkinson and F. Bourguignon. 1982. The Comparison of Multi-Dimensioned Distributions of Economic Status. The Review of Economic Studies 49, 2 (1982), 183–201. https://doi.org/10.2307/2297269
[3] Vijay S. Bawa. 1975. Optimal rules for ordering uncertain prospects. Journal of Financial Economics 2, 1 (1975), 95–121. https://doi.org/10.1016/0304-405X(75)90025-2
[4] Vijay S. Bawa. 1978. Safety-First, Stochastic Dominance, and Optimal Portfolio Choice. The Journal of Financial and Quantitative Analysis 13, 2 (1978), 255–271. http://www.jstor.org/stable/2330386
[5] Vijay S. Bawa. 1982. Stochastic Dominance: A Research Bibliography. Management Science 28, 6 (1982), 698–712. https://doi.org/10.1287/mnsc.28.6.698
[6] Marc G. Bellemare, Will Dabney, and Rémi Munos. 2017. A distributional perspective on reinforcement learning. In International Conference on Machine Learning. PMLR, Sydney, 449–458.
[7] E. Choi and Stanley Johnson. 1988. Stochastic Dominance and Uncertain Price Prospects. Center for Agricultural and Rural Development (CARD) Publications 55 (1988). https://doi.org/10.2307/1059583
[8] Laurence Cook and Jerome Jarrett. 2018. Using Stochastic Dominance in Multi-Objective Optimizers for Aerospace Design Under Uncertainty. https://doi.org/10.2514/6.2018-0665
[9] Peter C. Fishburn. 1978. Non-cooperative stochastic dominance games. International Journal of Game Theory 7, 1 (1978), 51–61.
[10] Josef Hadar and William R. Russell. 1969. Rules for Ordering Uncertain Prospects. The American Economic Review 59, 1 (1969), 25–34. http://www.jstor.org/stable/1811090
[11] Conor F. Hayes, Mathieu Reymond, Diederik M. Roijers, Enda Howley, and Patrick Mannion. 2021. Risk-Aware and Multi-Objective Decision Making with Distributional Monte Carlo Tree Search. arXiv preprint arXiv:2102.00966 (2021). https://arxiv.org/abs/2102.00966
[12] Conor F. Hayes, Mathieu Reymond, Diederik M. Roijers, Enda Howley, and Patrick Mannion. 2021 (In Press). Distributional Monte Carlo Tree Search for Risk-Aware and Multi-Objective Reinforcement Learning. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems. IFAAMAS.
[13] Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, and Diederik M. Roijers. 2021. A Practical Guide to Multi-Objective Reinforcement Learning and Planning. arXiv:2103.09568 [cs.AI]
[14] David Levhari, Jacob Paroush, and Bezalel Peleg. 1975. Efficiency Analysis for Multivariate Distributions. The Review of Economic Studies 42, 1 (1975), 87–91. http://www.jstor.org/stable/2296822
[15] Haim Levy. 1992. Stochastic Dominance and Expected Utility: Survey and Analysis. Management Science 38, 4 (1992), 555–593. http://www.jstor.org/stable/2632436
[16] John Martin, Michal Lyskawinski, Xiaohu Li, and Brendan Englot. 2020. Stochastically Dominant Distributional Reinforcement Learning. In International Conference on Machine Learning. PMLR, 6745–6754.
[17] Andreu Mas-Colell, Michael Dennis Whinston, and Jerry R. Green. 1995. Microeconomic Theory. Vol. 1. Oxford University Press, New York.
[18] Kristof Van Moffaert and Ann Nowé. 2014. Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies. Journal of Machine Learning Research 15, 107 (2014), 3663–3692. http://jmlr.org/papers/v15/vanmoffaert14a.html
[19] H. Nakayama, T. Tanino, and Y. Sawaragi. 1981. Stochastic Dominance for Decision Problems with Multiple Attributes and/or Multiple Decision-Makers. IFAC Proceedings Volumes 14, 2 (1981), 1397–1402. https://doi.org/10.1016/S1474-6670(17)63673-5
[20] David O'Callaghan and Patrick Mannion. 2021. Exploring the Impact of Tunable Agents in Sequential Social Dilemmas. arXiv preprint arXiv:2101.11967 (2021). https://arxiv.org/abs/2101.11967
[21] Vilfredo Pareto. 1896. Manuel d'Economie Politique. Vol. 1. Giard, Paris.
[22] Roxana Rădulescu, Patrick Mannion, Diederik M. Roijers, and Ann Nowé. 2020. Multi-objective multi-agent decision making: a utility-based analysis and survey. Autonomous Agents and Multi-Agent Systems 34, 10 (2020).
[23] Roxana Rădulescu, Patrick Mannion, Yijie Zhang, Diederik M. Roijers, and Ann Nowé. 2020. A utility-based analysis of equilibria in multi-objective normal-form games. The Knowledge Engineering Review 35 (2020).
[24] Scott F. Richard. 1975. Multivariate Risk Aversion, Utility Independence and Separable Utility Functions. Management Science 22, 1 (1975), 12–21. http://www.jstor.org/stable/2629784
[25] Diederik M. Roijers, Denis Steckelmacher, and Ann Nowé. 2018. Multi-objective Reinforcement Learning for the Expected Utility of the Return. In Proceedings of the Adaptive and Learning Agents Workshop at FAIM 2018.
[26] Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. 2013. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48 (2013), 67–113.
[27] Diederik M. Roijers, Shimon Whiteson, and Frans A. Oliehoek. 2014. Linear Support for Multi-Objective Coordination Graphs. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS '14). IFAAMAS, Richland, SC, 1297–1304.
[28] Marco Scarsini. 1988. Dominance Conditions for Multivariate Utility Functions. Management Science 34, 4 (1988), 454–460. http://www.jstor.org/stable/2631934
[29] Songsak Sriboonchitta, Wing-Keung Wong, S. Dhompongsa, and Hung Nguyen. 2009. Stochastic Dominance and Applications to Finance, Risk and Economics. https://doi.org/10.1201/9781420082678
[30] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA.
[31] Peter Vamplew, Richard Dazeley, Adam Berry, Rustam Issabekov, and Evan Dekker. 2011. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning 84 (2011), 51–80. https://doi.org/10.1007/s10994-010-5232-5
[32] Peter Vamplew, John Yearwood, Richard Dazeley, and Adam Berry. 2008. On the Limitations of Scalarisation for Multi-objective Reinforcement Learning of Pareto Fronts. In AI 2008: Advances in Artificial Intelligence, Wayne Wobcke and Mengjie Zhang (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 372–378.
[33] Weijia Wang and Michèle Sebag. 2012. Multi-objective Monte-Carlo Tree Search. In Proceedings of Machine Learning Research, Vol. 25, Steven C. H. Hoi and Wray Buntine (Eds.). PMLR, Singapore, 507–522.
[34] Elmar Wolfstetter. 1999. Topics in Microeconomics: Industrial Organization, Auctions, and Incentives. Cambridge University Press. https://doi.org/10.1017/CBO9780511625787