Fair Deep Reinforcement Learning with
Preferential Treatment
Guanbao Yu^a, Umer Siddique^b and Paul Weng^a,*
^a Shanghai Jiao Tong University, Shanghai, China
^b University of Texas, San Antonio, USA
ORCiD ID: Paul Weng https://orcid.org/0000-0002-2008-4569
Abstract. Learning fair policies in reinforcement learning (RL) is
important when the RL agent may impact many users. We investigate
a variant of this problem where equity is still desired, but some users
may be entitled to preferential treatment. In this paper, we formalize
this more sophisticated fair optimization problem in deep RL using
generalized fair social welfare functions (SWF), provide a theoretical
discussion to justify our approach, explain how deep RL algorithms
can be adapted to tackle it, and empirically validate our propositions
on several domains. Our contributions are both theoretical and algo-
rithmic, notably: (1) We obtain a general bound on the suboptimality
gap in terms of SWF-optimality using average reward of a policy
SWF-optimal for the discounted reward, which notably justifies us-
ing standard deep RL algorithms, even for the average reward; (2)
Our algorithmic innovations include a state-augmented DQN-based
method for learning either deterministic or stochastic policies, which
also applies to the usual fair optimization setting without any prefer-
ential treatment.
1 Introduction
We consider autonomous systems based on deep reinforcement
learning (RL). When they operate in real applications (e.g., traffic
lights, software-defined networking, data centers), they may impact
many end-users. Hence, for these deployed systems to be accepted by
those users, fairness needs to be taken into account in their design.
Fairness is rooted in the equal treatment of equals principle, which
informally speaking means that individuals with similar characteris-
tics should be treated in a similar way. Previous work [51, 12, 62] on learning fair policies in RL focuses on this notion under the additional assumption that all individuals are equal, which may not be
suitable for all applications. For instance, it is customary for service
providers (in e.g., software-defined networking, data centers) to pro-
vide different levels of QoS (quality of service) to different user tiers.
In such cases, although the principle of “equal treatment of equals”
is still a desired objective, higher-paying users should arguably be
entitled to higher priority or better services.
In our work, we relax the assumption of equal individuals, i.e.,
different users may have different rights. We study this more sophis-
ticated fair optimization problem in deep RL, where efficient policies
should be learned such that while some users may receive preferen-
tial treatment, users with similar rights are still equitably treated.
* Corresponding Author. Email: paul.weng@sjtu.edu.cn.
Contributions We formalize this novel problem in deep RL as a
fair optimization problem using generalized fair social welfare func-
tions (SWF) (Section 4). We discuss its difficulties and provide some
theoretical results to justify our algorithms (Section 4). We notably
extend a performance bound, which now holds for a general class
of fair SWFs. Based on this discussion, we propose several adapta-
tions of deep RL algorithms to solve our problem (Section 5). In par-
ticular, we design a novel state-augmented DQN-based method for
learning fair stochastic policies. Finally, we experimentally validate
our propositions (Section 6).
2 Related Work
Fairness has recently become an important and active research direc-
tion [20, 60, 49, 52, 15, 9, 2, 34, 61]. While the majority of this lit-
erature in machine learning focuses on the equal treatment of equals
principle, other aspects of fairness have been considered in AI, e.g.,
proportionality [55, 4] or envy-freeness [14] and its multiple variants
(e.g., [5, 11]). In contrast, our work is based on studies in distributive
justice [46, 7, 33]. We aim at optimizing a social welfare function that
encodes fairness. This principled approach has also been recently ad-
vocated in several recent papers [22, 54, 58, 18, 19].
In mathematical optimization, such an approach is called fair op-
timization [40]. Many continuous and combinatorial optimization
problems in various application domains [3, 50, 37, 38, 41] have
been extended to optimize for fairness. In this direction, the closest
work [41] regards fair optimization in Markov decision processes.
However, the methods proposed in this direction typically assume
that the model is known and therefore, they do not require learning.
Fairness also starts to be considered in RL, e.g., fairness constraint
to reduce discrimination [57], fairness with respect to state visita-
tion [23, 21], the usual case of fairness with respect to agents [24],
or the more general case of fairness with respect to users [51, 12, 62,
30]. This last direction can be understood as an extension of fair op-
timization to (deep) RL. Our work follows this principled approach,
but investigates a more general setting. While previous work assumes
all users to be equal, we relax this assumption. In particular, com-
pared to the most related work [51], we extend their work to the
fair optimization setting with preferential treatment, generalize their
bound to be valid for continuous concave SWFs, and propose more
efficient value-based algorithms.
State augmentation (used in our DQN variants) has been exploited
in various previous work, e.g., in MDPs [26] or more recently in safe
RL [53], risk-sensitive RL [17], RL with delays [36], and partially
observable path planning [35]. However, to the best of our knowl-
edge, this technique has not been applied in fair optimization. More-
over, our technique to learn stochastic policies in DQN is also novel.
3 Background
We introduce our notations and recall the necessary background in
sequential decision making and fairness modeling.
Notations Matrices are denoted in uppercase and vectors in lowercase. Both are written in bold. For any integer $D>0$, the $(D-1)$-simplex is denoted by $\Delta_D = \{w \in [0,1]^D \mid \sum_i w_i = 1\}$. We denote $S_D$ the symmetric group of degree $D$ (i.e., the set of permutations over $\{1,\dots,D\}$). For any permutation $\sigma \in S_D$ and vector $v \in \mathbb{R}^D$, vector $v_\sigma$ denotes $(v_{\sigma(1)},\dots,v_{\sigma(D)})$. For any vector $v \in \mathbb{R}^D$, $v^{\uparrow}$ corresponds to the vector with the components of $v$ sorted in increasing order (i.e., $v^{\uparrow}_1 \le \dots \le v^{\uparrow}_D$).
3.1 Sequential Decision-Making
Markov Decision Process (MDP) We consider sequential decision-making problems that can be modeled as an MDP [45], which is defined by the following elements: a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition model $P$, a reward function $r$, a probability distribution $d_0$ over initial states, and a discount factor $\gamma \in [0,1)$. The usual goal in this model is to learn a policy $\pi$ that maximizes some performance measure, e.g., the expected discounted reward criterion or the expected average reward criterion.

A (stationary and Markov) policy $\pi$ selects which action to take in any state $s$: it can be deterministic ($\pi(s) = a \in \mathcal{A}$) or stochastic ($\pi(a \mid s) = \Pr(a \mid s)$). With the discounted reward, the (state) value function of a policy $\pi$ is defined by:

$$v^\pi(s) = \mathbb{E}_{P,\pi}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\Big|\, S_0 = s\Big]. \qquad (1)$$

Similarly, its action-value function $Q^\pi(s,a)$ is given by $Q^\pi(s,a) = \mathbb{E}_{P,\pi}\big[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid S_0 = s, A_0 = a\big]$. The problem for the discounted reward can be formally defined as follows: $\operatorname{argmax}_\pi \sum_{s\in\mathcal{S}} d_0(s)\, v^\pi(s)$. The action-value function of a solution to this problem is called the optimal Q-function.
With the average reward criterion, the (state) value function of a policy $\pi$ is usually referred to as the gain, denoted $g^\pi$:

$$g^\pi(s) = \lim_{h\to\infty} \frac{1}{h}\, \mathbb{E}_{P,\pi}\Big[\sum_{t=1}^{h} r_t \,\Big|\, S_0 = s\Big]. \qquad (2)$$

The problem here is given by $\operatorname{argmax}_\pi \mu^\pi$, where $\mu^\pi = \sum_{s\in\mathcal{S}} d_0(s)\, g^\pi(s)$.
We assume that the MDP is weakly communicating¹. Recall that for such MDPs, the optimal gain is constant (state-independent). Note that without such an assumption, which is weaker than ergodicity, learning is hopeless. In the theoretical discussion, we also assume for simplicity that $\mathcal{S}$ and $\mathcal{A}$ are finite. In that case, all the functions in this model can be seen as vectors or matrices (all written in bold).

¹ $\mathcal{S}$ can be partitioned into two sets $T$ and $C$: a set $T$ in which all states are transient under every stationary policy, and a set $C$ in which any two states can be reached from each other under some stationary policy.
Multiobjective MDP (MOMDP) Since in our setting an RL agent's actions can impact several users, we consider MOMDPs [47], an extension of MDPs in which the goal is to optimize multiple objectives (i.e., utilities of users) instead of a single one. Thus, the rewards in MOMDPs are vectors, where each component can be interpreted as a scalar reward allocated to one user in our setting. The reward function of an MOMDP is denoted $R(s,a) \in \mathbb{R}^D$, where $D$ is the number of objectives (i.e., users).

The value functions in MDPs can be naturally extended to MOMDPs, e.g., the (state) value function becomes:

$$V^\pi(s) = \mathbb{E}_{P,\pi}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\Big|\, S_0 = s\Big], \qquad (3)$$

where $R_t \in \mathbb{R}^D$ is the vector reward obtained at time step $t$ and all operations (addition, product) are component-wise. Vectors can be compared according to Pareto dominance². Since Pareto dominance is only a partial order, finding all Pareto-optimal solutions in MOMDPs is infeasible in general [44].
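As a quick illustration of Pareto dominance (defined in footnote 2), here is a minimal NumPy sketch; the example vectors are the ones reused later in Section 3.2:

```python
import numpy as np

def pareto_dominates(v, v_prime):
    """v Pareto-dominates v' iff v >= v' component-wise and v > v' on some component."""
    v, v_prime = np.asarray(v), np.asarray(v_prime)
    return bool(np.all(v >= v_prime) and np.any(v > v_prime))

# (5, 2) and (3, 3) are Pareto-incomparable, hence the need for a social welfare function
print(pareto_dominates([5, 2], [3, 3]), pareto_dominates([3, 3], [5, 2]))  # False False
```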
Deep RL Since we are interested in domains where the
state/action spaces may be large or even continuous, we consider
deep RL algorithms where value functions or policies are approxi-
mated by neural networks.
Deep Q-Network (DQN) [31] is an example of a deep RL algorithm where the optimal Q-function is approximated by a neural network with parameters $\theta$. This Q-network takes a state $s$ as input and outputs an estimate $\hat{Q}_\theta(s,a)$ for all actions. It is trained to minimize the following L2 loss for transitions $(s, a, r, s')$ sampled from a replay buffer storing experiences generated from online interactions with the environment:

$$\big(r + \gamma \max_{a'\in\mathcal{A}} \hat{Q}_{\theta'}(s', a') - \hat{Q}_\theta(s, a)\big)^2,$$

where $\theta'$ parametrizes a target network that enables stabler training. The term $r + \gamma \max_{a'\in\mathcal{A}} \hat{Q}_{\theta'}(s', a')$ is called the target Q-value.
3.2 Fairness
Since the agent in an MOMDP can be seen as allocating rewards
to users, it is natural to resort to a notion of fairness discussed in
distributive justice [33], which studies fair distribution of wealth. We
recall this notion next and its extension to preferential treatment.
Fairness with Equal Users The notion of fairness that is adopted
in fair optimization enforces three natural principles: impartiality,
equity, and efficiency. Impartiality corresponds to the “equal treat-
ment of equals” principle. Under the assumption that all users are
equal, which is a common assumption in past work, this principle im-
plies that reordering a reward distribution leads to an equivalent one.
Equity is based on the Pigou-Dalton principle [33], which states that
a Pigou-Dalton transfer (i.e., a small transfer of reward from a better-
off user to a worse-off user) results in a fairer solution. This princi-
ple enforces that more balanced reward distributions are preferred.
Efficiency requires a fair solution to be Pareto-optimal. It is natural
because selecting a Pareto-dominated solution would be irrational.
These three principles provide some guidance about how to select among reward allocations: they induce a binary relation between vectors. For instance, assume that there are only two users ($D=2$) and that the possible solutions are $(0,5)$, $(5,2)$, and $(3,3)$. These principles imply that $(5,2)$ is preferred to $(0,5)$. Indeed, by impartiality, $(5,2)$ is equivalent to $(2,5)$. By efficiency, $(2,5)$ is preferred to $(2,3)$. By equity, $(2,3) = (0,5) + (+2,-2)$, which results from a Pigou-Dalton transfer of 2 from the better-off user to the worse-off one, is preferred to $(0,5)$. However, using these principles alone, we cannot decide whether $(3,3)$ or $(5,2)$ should be preferred. Thus, although this relation refines Pareto dominance, it is still only a partial order.

² For $v, v' \in \mathbb{R}^D$, $v$ weakly Pareto-dominates $v'$ iff $\forall i,\ v_i \ge v'_i$. Besides, $v$ Pareto-dominates $v'$ iff $\forall i,\ v_i \ge v'_i$ and $\exists j,\ v_j > v'_j$. A non-Pareto-dominated solution is called Pareto-optimal.
Since using a partial order is not practical in autonomous decision-making, we rely on the concept of a social welfare function (SWF) to enforce a total order. An SWF $\varphi: \mathbb{R}^D \to \mathbb{R}$ evaluates how good a solution is for a group of users by aggregating their utilities. Interestingly, the three previous principles translate into three properties for an SWF that encodes this notion of fairness. Impartiality implies that $\varphi$ is symmetric (i.e., independent of the order of its arguments). Equity means that $\varphi$ is Schur-concave (i.e., monotonic with respect to Pigou-Dalton transfers). This implies that $\varphi$ cannot be linear, which formally shows that the utilitarian approach (i.e., $\sum_i v_i$) is not suitable for fairness. Efficiency enforces that $\varphi$ is monotonic with respect to Pareto dominance.
In the literature, two main families of SWFs satisfying those three conditions have been considered. The first family is the generalized Gini SWF (GGF) [59]:

$$\mathrm{GGF}_{w}(v) = \sum_{i=1}^{D} w_i\, v^{\uparrow}_i, \qquad (4)$$

where $v \in \mathbb{R}^D$ and $w \in \Delta_D$ is a fixed positive weight vector whose components are strictly decreasing (i.e., $w_1 > \dots > w_D > 0$). The second family has the following form:

$$\varphi_u(v) = \sum_{i=1}^{D} u(v_i), \qquad (5)$$

where $u: \mathbb{R} \to \mathbb{R}$ is a strictly concave function. Both families can encode various well-known specific types of fairness. For instance, $\mathrm{GGF}_w$ can represent lexicographic maxmin fairness [46] when the components of $w$ decrease sufficiently fast, and $\varphi_u$ includes $\alpha$-fairness ($\alpha > 0$) [32] with $u_\alpha(x) = \frac{x^{1-\alpha}}{1-\alpha}$ if $\alpha \neq 1$ and $u_\alpha(x) = \log(x)$ otherwise. Note that both families define functions that are concave and sub-differentiable. We provide more details about these SWFs and explain why they satisfy the three fairness principles in Appendix A (the appendix is available in the extended version of the paper).
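To make these two families concrete, here is a minimal NumPy sketch of $\mathrm{GGF}_w$ and $\alpha$-fairness; the weight vector in the toy check is an illustrative choice, not taken from the paper:

```python
import numpy as np

def ggf(v, w):
    """Generalized Gini SWF (Eq. 4): decreasing weights w applied to v sorted increasingly."""
    return float(np.dot(w, np.sort(v)))

def alpha_fairness(v, alpha):
    """alpha-fairness SWF (Eq. 5 with u_alpha), defined for positive utilities."""
    v = np.asarray(v, dtype=float)
    if alpha == 1.0:
        return float(np.sum(np.log(v)))
    return float(np.sum(v ** (1.0 - alpha) / (1.0 - alpha)))

# toy check on the example of this section: the more balanced (3, 3) beats (5, 2)
w = np.array([0.75, 0.25])  # strictly decreasing, sums to 1
print(ggf([5, 2], w), ggf([3, 3], w))  # 2.75 < 3.0
```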
Fairness with Preferential Treatment Assuming that all users
are equal is inadequate in some domains. For instance, in situations
where an autonomous system manages the operations of a service
provider, as discussed in the introduction, different tiers of users need
to be taken into account. Another example is in the medical domain,
where priority may be given to patients with specific needs (e.g., the elderly or children). In those cases, the three fairness principles should still be enforced, while allowing preferential treatment.

One typical approach to achieve this more general notion of fairness is via user duplication [8]: if a user is more important, s/he should be counted more times (via an importance weight) than other users. Since these weights can naturally be normalized, formally, each user $i$ receives some fractional entitlement $p_i$ (i.e., an importance weight); when two users are equally important, they receive equal weights, whereas a user entitled to preferential treatment receives a larger share of the total importance weight. This technique allows us to naturally extend the previous families of SWFs.
Assume given some importance weights $p \in \Delta_D$. The first family, which we call generalized GGF (G3F) [39], is defined as follows:

$$\mathrm{G3F}_{p,w}(v) = \sum_{i} \omega_i\, v^{\uparrow}_i, \quad \text{for any } v \in \mathbb{R}^D, \qquad (6)$$

where $w$ is defined as in (4) and the weight $\omega_i$ is defined as:

$$\omega_i = \bar{w}\Big(\sum_{k=1}^{i} p_{\sigma(k)}\Big) - \bar{w}\Big(\sum_{k=1}^{i-1} p_{\sigma(k)}\Big), \qquad (7)$$

with $\bar{w}$ being a monotone increasing function that linearly interpolates the points $(i/D, \sum_{k=1}^{i} w_k)$ together with the point $(0,0)$, and $\sigma$ the permutation sorting the components of vector $v$ in increasing order, i.e., $v_{\sigma(i)} = v^{\uparrow}_i$ for all $i$. This formula amounts to averaging each portion (in terms of cumulated $p$) of the $i/D$-th smallest values of $v$ and applying the standard GGF to these $D$ averages.

The second generalized family can be obtained more simply as follows: for a given fixed strictly concave $u: \mathbb{R} \to \mathbb{R}$,

$$\varphi_{p,u}(v) = \sum_{i=1}^{D} p_i\, u(v_i). \qquad (8)$$
We call a fair SWF with preferential treatment a generalized fair SWF and denote it ψ. For instance, ψ can be any instance of the two previous families. Such an SWF enforces how trade-offs are made between efficiency and equity via the choice of $w$ or $u$. The choice of $p$ depends on what entitlement to give to users. Overall, the choice of a suitable SWF for a task is domain and problem dependent.
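The sketch below evaluates Eqs. (6)-(8) under the reading of Eq. (7) given above (with $\bar{w}$ implemented by piecewise-linear interpolation); the entitlement values in the usage example are illustrative:

```python
import numpy as np

def g3f(v, p, w):
    """Generalized GGF with entitlements p (Eqs. 6-7), a sketch.
    w: strictly decreasing GGF weights summing to 1; p: entitlements in the simplex."""
    v, p, w = map(np.asarray, (v, p, w))
    D = len(v)
    sigma = np.argsort(v)                              # sorts v in increasing order
    cum_w = np.concatenate(([0.0], np.cumsum(w)))
    w_bar = lambda x: np.interp(x, np.arange(D + 1) / D, cum_w)  # interpolates (i/D, sum w_k)
    cum_p = np.concatenate(([0.0], np.cumsum(p[sigma])))
    omega = w_bar(cum_p[1:]) - w_bar(cum_p[:-1])       # Eq. (7)
    return float(np.dot(omega, v[sigma]))              # Eq. (6)

def phi_pu(v, p, u):
    """Second generalized family (Eq. 8): entitlement-weighted concave utilities."""
    return float(np.sum(np.asarray(p) * u(np.asarray(v, dtype=float))))

# toy usage with hypothetical entitlements p = (0.9, 0.1), as in the SC experiments
print(g3f([0.4, 0.8], p=[0.9, 0.1], w=[0.75, 0.25]))
print(phi_pu([0.4, 0.8], p=[0.9, 0.1], u=np.log))
```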
4 Problem Statement and Discussions
We formulate the novel problem of fair optimization in deep RL with
preferential treatment, then recall its difficulties and present some
theoretical discussion justifying our approach.
This fair optimization problem can be formally stated as optimizing a generalized fair SWF ψ in an MOMDP. It corresponds to determining a policy that generates a fair distribution of rewards subject to the importance weights $p$. Since we focus on deep RL, we directly write it with a parametrized policy $\pi_\theta$:

$$\operatorname{argmax}_{\pi_\theta}\ \psi\big(J(\pi_\theta)\big), \qquad (9)$$

where $J(\pi_\theta)$ corresponds to the vectorial version of the standard RL objective, e.g., the discounted or average reward. Note that this formulation is general, since it accepts any generalized fair SWF (e.g., $\mathrm{G3F}_{p,w}$ or the generalized α-fairness mentioned in Section 3.2). A solution to this problem is called a ψ-γ-optimal policy if the discounted reward is used, a ψ-average-optimal policy if the average reward is used, or simply a ψ-optimal policy to include both cases.
Standard multi-objective approaches aim at finding the set of
Pareto optimal solutions (or an approximation), which may be in-
feasible in general. In contrast, our formulation directly aims for a
Pareto-optimal ψ-fair policy. Moreover, the usual approach of opti-
mizing a weighted sum of the objectives is insufficient, because such
aggregation functions do not favor equitable solutions (i.e., violation
of the Pigou-Dalton principle). In addition, note that directly applying ψ on immediate rewards (instead of on their expectation, as we do) provides no guarantee in terms of fairness, because this naive approach does not allow compensation over time and in expectation to reach a more equitable reward distribution.
Recall that the choices of ψand pdepend on the problem and what
the system designer aims to achieve. In practice, one may run the
algorithms (see Section 5) with varying pand select the parameters
that provide suitable trade-offs. Note that our formulation does not
assume the presence of any malicious actor, which may possibly try
to manipulate the system. We leave such questions to future work.
Difficulties As an extension of the problem investigated by [51],
Problem (9) also faces the same difficulties, notably state-dependent
optimality and optimality of stochastic policies. For the first point, in
contrast to standard RL, a policy that is ψ-γ-optimal in an initial state
may not be ψ-γ-optimal in another initial state. Fortunately, ψ-average-optimality is state-independent, which may be another argument for using this criterion. For the second point, searching for fair solutions among deterministic policies becomes insufficient. However, a ψ-optimal policy exists among stationary stochastic policies, because intuitively, stochastic choices allow fairer compensation.
4.1 Theoretical Discussion
We provide some useful new results that justify our approach.
Bounds Although the average reward criterion may be more suit-
able, notably for a service provider, the discounted reward criterion
may still be desirable because optimizing the latter is usually con-
sidered easier than the former in RL. An important question is then
to bound how far a ψ-γ-optimal policy $\pi^*_\gamma$ is from a ψ-average-optimal policy $\pi^*_1$ in terms of ψ-average-optimality. The next theorem, stated informally for legibility's sake (see Appendix B.1 for the formal statement), provides such a bound, which extends a previous result [51], only valid for GGF, to any continuous concave ψ.
Theorem 1 For any weakly-communicating MOMDP and any $\gamma$ close to 1, there exist constants $C$ and $K$ such that:

$$\psi\big(\mu^{\pi^*_\gamma}\big) \;\ge\; \psi\big(\mu^{\pi^*_1}\big) - (1-\gamma)\, C K.$$
Proof 1 The proof technique is similar to [51]. The more general
result is achieved by resorting to the Fenchel dual of ψ. We also cor-
rected a sign issue in the previous result. More details can be found
in Appendix B.1.
Constant $C$ depends on both the reward function and ψ, while constant $K$ depends on the transition function. This result provides a
performance guarantee for using the discounted reward criterion in-
stead of the average reward, thus motivating the use of the former in
the algorithm design, even if one aims to optimize the latter.
Concavity of G3F Although [43] has already proved the concavity of G3F, here we provide another straightforward proof as an alternative proof technique.
Lemma 1 For any $p \in \Delta_D$ and any $w \in \Delta_D$ such that its components are decreasing, the function $\mathrm{G3F}_{p,w}$ is concave.

Proof 2 We prove that $\mathrm{G3F}_{p,w}$ is a Choquet integral with respect to a super-modular capacity. By [29], such integrals are concave. See Appendix B.2 for details.
While $\varphi_{p,u}$ in (8) clearly defines a concave function, it was not completely obvious for G3F. Our lemma confirms that both generalized families yield concave functions. The concavity of ψ suggests that our fair optimization (9) should enjoy some nice properties. For
instance, with a linear approximation scheme, the overall problem
would be a convex optimization problem. In deep RL, the overall
problem is not convex anymore, but from the point of view of the
last layer of a neural network (usually linear, e.g., in DQN), the op-
timization problem is still convex. In addition, concavity is required
to justify the DQN variants, as discussed next.
5 Proposed Algorithms
We explain how deep RL methods for the discounted reward can be adapted to solve Problem (9). For space reasons, we focus here on the adaptation of DQN [31], because the extensions of the actor-critic (AC) methods (i.e., A2C and PPO), called ψ-A2C and ψ-PPO respectively, are more straightforward. Notably, the policy gradient for these methods can easily be obtained via the chain rule. We explain next how the critics can be adapted and trained. Other AC methods could be extended in a similar way.
Following [51], the architectures of these critics, but also of DQN, are modified such that their outputs are vectorial (e.g., for DQN, $\mathbb{R}^{|\mathcal{A}| \times D}$ instead of $\mathbb{R}^{|\mathcal{A}|}$). Although the Bellman principle of optimality does not hold anymore due to the state-dependent ψ-optimality, vector value functions can still be temporally decomposed. Interestingly, therefore, the computational cost of the algorithms depends linearly on the number of users (i.e., the vector dimension), which suggests that our approach could scale to a large number of users. We discuss next our three extensions of DQN. Note that the critics are trained like in DQN but without maximizing over future actions.
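A minimal PyTorch sketch of such a vector-output Q-network (the class name, sizes, and architecture below are our illustrative choices, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class VectorQNetwork(nn.Module):
    """Q-network with vector-valued outputs: one D-dimensional Q-value per action."""
    def __init__(self, state_dim, n_actions, n_users, hidden=64):
        super().__init__()
        self.n_actions, self.n_users = n_actions, n_users
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_users),
        )

    def forward(self, state):
        q = self.net(state)                              # (batch, |A| * D)
        return q.view(-1, self.n_actions, self.n_users)  # (batch, |A|, D)

# toy usage
net = VectorQNetwork(state_dim=4, n_actions=3, n_users=2)
print(net(torch.zeros(1, 4)).shape)  # torch.Size([1, 3, 2])
```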
ψ-DQN  ψ-DQN is the direct extension of GGF-DQN, proposed by [51] to optimize GGF. Since ψ-DQN aims to optimize ψ, the target Q-value is changed to:

$$r + \gamma \hat{Q}_{\theta'}(s', a^*),$$

where $a^* = \operatorname{argmax}_{a'\in\mathcal{A}} \psi\big(r + \gamma \hat{Q}_{\theta'}(s', a')\big)$. The best next action is chosen such that the immediate reward plus the discounted future rewards (both vectorial) is fair. For execution in a state $s$, an action in $\operatorname{argmax}_{a\in\mathcal{A}} \psi\big(\hat{Q}_\theta(s, a)\big)$ is chosen.
The performance of ψ-DQN is limited for several reasons. First, it learns a stationary, Markov, and deterministic policy, which is known to be insufficient (see Section 4). Moreover, since ψ is assumed to be concave, by Jensen's inequality, $\mathbb{E}_{s'}\big[\psi\big(r + \gamma \hat{Q}_\theta(s', a')\big)\big] \le \psi\big(\mathbb{E}_{s'}\big[r + \gamma \hat{Q}_\theta(s', a')\big]\big)$. Therefore, ψ-DQN implicitly maximizes a lower bound of an approximation of the objective function (9).

Since DQN is known to be more sample-efficient than AC methods, it is useful to design better algorithms than ψ-DQN for our problem. Next, we propose two such novel extensions of DQN that can achieve better performance.
ψ-CDQN  One simple approach to improve the performance of ψ-DQN is to relax the assumption that the learned policy is Markov, which can be achieved via state augmentation. In ψ-CDQN, the agent observes the current state and the past cumulated vector reward. Intuitively, using this additional information, the agent can better balance the reward distribution over users.

Formally, an original state $s_t$ is augmented as follows: $\bar{s}_t = \big(s_t, \tfrac{1}{\lambda} R_{1:t}\big)$, where $\lambda = \sum_{\tau=1}^{t-1} \gamma^{\tau-1}$ acts as a scaling factor and $R_{1:t} = \sum_{\tau=1}^{t-1} \gamma^{\tau-1} r_\tau$ denotes the discounted total reward received so far, which is reset to zero at the beginning of an episode. The target Q-value is changed as follows:

$$r_t + \gamma \hat{Q}_{\theta'}(\bar{s}_{t+1}, a^*),$$

where $a^* = \operatorname{argmax}_{a'\in\mathcal{A}} \psi\big(\hat{Q}_{\theta'}(\bar{s}_{t+1}, a')\big)$. Here, the immediate reward $r_t$ is removed from the optimal action computation since this signal is already included in the augmented state as part of the discounted total reward. For execution in a state $\bar{s}$, an action in $\operatorname{argmax}_{a\in\mathcal{A}} \psi\big(\hat{Q}_\theta(\bar{s}, a)\big)$ is chosen.
ψ-CSDQN  Since stochastic policies may dominate deterministic ones (Section 4), the performance of ψ-DQN (and possibly ψ-CDQN) can be improved by considering stochastic policies. We describe next how to achieve this.

With stochastic policies, the target Q-value is changed to:

$$r + \gamma \hat{Q}^{\pi^*}_{\theta'}(s', \cdot),$$

where $\hat{Q}^{\pi^*}_{\theta'}(s', \cdot) = \sum_{a'\in\mathcal{A}} \pi^*(a' \mid s')\, \hat{Q}_{\theta'}(s', a')$ denotes an estimated Q-value achieved at a next state by a policy $\pi^*$, defined as:

$$\pi^*(\cdot \mid s') = \operatorname{argmax}_{\pi}\ \psi\Big(r + \gamma \sum_{a'\in\mathcal{A}} \pi(a' \mid s')\, \hat{Q}_{\theta'}(s', a')\Big). \qquad (10)$$

This reformulation assumes that in the next state, the best stochastic policy is applied (in contrast to a deterministic greedy policy in DQN or ψ-DQN). For execution in a state $s$, an action is sampled from a policy $\pi^*(\cdot \mid s)$ in $\operatorname{argmax}_{\pi} \psi\big(\sum_{a\in\mathcal{A}} \pi(a \mid s)\, \hat{Q}_\theta(s, a)\big)$.
Problem (10) is an easy optimization problem. Since ψ is concave, it is a convex optimization problem with $|\mathcal{A}|$ variables corresponding to $\pi(\cdot \mid s') \in \Delta_{|\mathcal{A}|}$. As a general approach, it can be solved by projected gradient ascent, which consists in repeatedly updating the current $\pi(\cdot \mid s')$ and projecting the updated variables onto the simplex $\Delta_{|\mathcal{A}|}$. Recall that projection onto a simplex can be done efficiently [13]. Interestingly, for specific ψ, more specialized algorithms can be used. For instance, when choosing $\mathrm{G3F}_{p,w}$ as ψ, one can obtain the optimal stochastic policy by solving a linear program (see Appendix C.3).
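A minimal NumPy sketch of this projected gradient ascent for Problem (10); for simplicity it uses a finite-difference gradient of ψ instead of an analytical subgradient, and the toy ψ is again an illustrative two-user GGF:

```python
import numpy as np

def project_to_simplex(x):
    """Euclidean projection onto the probability simplex (cf. [13])."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(x)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(x + theta, 0.0)

def best_stochastic_policy(q_next, psi, r_vec=0.0, gamma=0.99,
                           lr=0.1, n_steps=200, eps=1e-4):
    """Maximize psi(r + gamma * sum_a pi(a) Q(s', a)) over pi in the simplex."""
    q_next = np.asarray(q_next, dtype=float)          # shape (|A|, D)
    pi = np.full(len(q_next), 1.0 / len(q_next))      # start from the uniform policy
    obj = lambda p: psi(r_vec + gamma * p @ q_next)
    for _ in range(n_steps):
        grad = np.array([(obj(pi + eps * e) - obj(pi)) / eps for e in np.eye(len(pi))])
        pi = project_to_simplex(pi + lr * grad)       # ascent step + projection
    return pi

# toy usage: two actions, each favoring one of the two users
psi = lambda v: 0.75 * np.min(v) + 0.25 * np.max(v)
print(best_stochastic_policy([[1.0, 0.0], [0.0, 1.0]], psi))  # close to [0.5, 0.5]
```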
Finally, by augmenting states like in ψ-CDQN, we can formulate the last novel algorithm, called ψ-CSDQN, which can learn fair stochastic policies with augmented states. Although one may expect a better performance from this new algorithm, ψ-CDQN may still be useful in domains (e.g., robotics) where deterministic policies are favored. Note that all these DQN variants follow an ε-greedy policy during training: with probability $1-\varepsilon$, the best action is chosen in the way corresponding to the specific variant, as explained above.
6 Experimental Results
Our proposed generic algorithms can be instantiated with different SWFs ψ. We have performed experiments with the two families of SWFs discussed in Section 3.2: G3F and the generalized α-fairness. We mainly discuss here our results for G3F, for two reasons. First, it is an extension of GGF, a well-studied SWF in economics [59]. Second, it is a more general SWF than α-fairness, which only applies when rewards are positive. Moreover, we emphasize here our evaluation of the DQN variants, which constitute our main algorithmic contribution. Additional experimental results with α-fairness or the AC methods (A2C, PPO) can be found in the appendix. All the results are averaged over 10 runs with different seeds.
Our algorithms³ (with relevant baselines) are evaluated in the same
three domains as in [51] to help with comparability. In roughly in-
creasing problem sizes (i.e., number of users, state/action spaces),
they are: (i) Species conservation (SC), (ii) Traffic light control (TL),
and (iii) Data center control (DC). We briefly describe them next (see
appendix of [51] for more details).
The first domain (SC) [10] simulates an ecological conservation
problem in which two species—an endangered species (sea otters)
and its prey (northern abalone)—interact with one another, poten-
tially leading to the extinction of some species. An observed state
includes the population levels of the two species. The size of the ac-
tion set is 5. Fairness is expressed over the two species (D=2) and
can be understood as both species remaining alive and having a bal-
anced population density. Because population densities may not be
comparable directly, using equal weights for pmay not be suitable.
In that case, a fair SWF with importance weights may be beneficial.
The second domain (TL) is a traffic light control problem [28]
where an agent controls the traffic lights at a single intersection to
optimize traffic flow. A state includes the waiting times and densi-
ties of cars waiting in each lane. An action amounts to selecting the
next traffic-light phase among four phases: NSL, NSSR, EWL, and
EWSR, where NSL (north-south left) represents the phase with the
green light assigned to the left lanes of the roads approaching from
the north and south, NSSR (north-south straight and right) represents
the phase with the green light assigned to the straight and right lanes
of roads approaching from the north and south, and so on. Fairness
is defined over each direction at the intersection (i.e., D=4). We as-
sume that some lanes will be given preferential treatment (e.g., due
to morning rush, traffic flows are unbalanced) and that the waiting
times for cars in these lanes will be optimized with higher priorities,
while other lanes with equal preferences will be treated fairly.
The last domain (DC) is a data center traffic control problem [48],
which involves connecting a large number of computers according
to some network topology. In particular, we consider a network with
a fat-tree topology, which connects 16 computers via 20 switches.
A state is composed of each computer's network information. A continuous action corresponds to the allocation of bandwidth for each host. The vector reward is calculated by penalizing the bandwidths per host by the sum of queue lengths. In this domain, fairness can be expressed with respect to the number of hosts (e.g., D = 16). Since the action space is continuous, this domain is incompatible with DQN-like algorithms. We thus only run the AC methods and their GGF/G3F extensions.
On these domains, we have run an extensive set of experiments to
answer a series of questions. Additional experimental results can be
found in Appendix F.
Do our algorithms learn fairer (w.r.t. G3F) solutions than their respective counterparts? This is a sanity check to verify that our methods perform as intended. All the algorithms are run in SC with weight p set to (0.9, 0.1), where the first component corresponds to sea otters. The G3F scores are obtained by applying G3F on the empirical average vector returns of trajectories sampled with the learned policies. As expected, Figure 1a shows that all the G3F algorithms reach higher scores than their GGF and original counterparts, indicating that fairness with the priority set by p was better achieved.

Figure 1b shows the corresponding empirical average vector returns before aggregating with G3F.

³ To help reproducibility, our implementations can be found in this repository: https://github.com/AdaptiveAutonomousAgents/ecai2023.
Figure 1: Performances of DQN, A2C, PPO, and their GGF or G3F counterparts in SC, with weight p = (0.9, 0.1) for the G3F algorithms. (a) G3F score during the testing phase. (b) Population densities (sea otters, abalones) during testing. (c) CV, min, and max densities during testing.
Figure 2: Additional experimental results in the SC domain: (Left) G3F with p = (0.9, 0.1), (Center, Right) G3F with varying p. (a) Accumulated densities during training. (b) Population densities during testing. (c) CV, min, and max densities during testing.
Recall that optimizing GGF (i.e., G3F with p = (0.5, 0.5)) would lead to a much larger density of abalones [51]. However, optimizing G3F with a higher priority given to sea otters achieves more balanced individual densities than the corresponding counterparts. A non-uniform p may help correct advantages conferred to some users by the environment.
How much control over solutions does p provide? To answer this question, we evaluate the G3F algorithms with varying importance weights p in the SC and TL domains. Figure 2c shows the testing performance of CDQN and CSDQN in SC in terms of the Coefficient of Variation (CV) and the minimum and maximum density. Recall that CV is a simple inequality measure defined as the ratio of the standard deviation to the mean. Lower CV values imply more balanced solutions. For the experiments in SC, we increase the preference weight p0 of the first objective from 0.1 to 0.9 (i.e., p1 decreases from 0.9 to 0.1, correspondingly). As a result, the density of the sea otters increases, resulting in a lower CV, a higher minimum density, and a lower maximum density. This observation is further confirmed by the corresponding non-aggregated densities with varying p (Figure 2b).
In the TL domain, we vary the weight p0 (assigned to North), while
the remaining weight is assigned uniformly over the remaining three
components (directions) of p. Figure 4a shows that waiting times of
cars coming from lanes with higher weights are shorter than those
coming from lanes with lower weights. Interestingly, the waiting
times of cars coming from north and south are close, although they
are assigned different weights. This is because the agent’s action can
affect two lanes at the same time in this case. For example, an ac-
tion NSL corresponds to the phase when the left lanes of north and
south are given a green light and cars can only turn left during this
phase. As a result, optimizing the waiting time in one lane will have
an effect on the opposite lane as well.
The above results show that by appropriately adjusting weights p,
we can achieve desired control over multiple users, up to constraints
imposed by the problem structure.
When more weight is given to one objective and equal weights
are assigned to the other objectives, does the “equal treatment
of equals” principle still hold? The previous discussion in the TL
domain suggests that this may not always be the case, due to the
inherent structure of the control problem. It is however interesting to
answer this question when less or no dependence between objectives
is expected. We therefore turn to the DC domain where there are 16
objectives in total.
In this domain, we compare the performances of the G3F extensions of the AC methods with their original and GGF counterparts under the following setting: the first objective is given a weight of 1/4, and the other objectives are assigned equal weights (i.e., 1/20 each).
The distributions of G3F scores for the different algorithms are depicted in Figure 3b. Not surprisingly, the G3F algorithms achieve higher scores compared to their standard/GGF counterparts. In addition, the standard RL algorithms and their GGF adaptations have similar performance for this G3F measure. This is because the GGF adaptations normally have lower total rewards than their standard methods, and optimizing GGF does not necessarily guarantee good G3F scores.
Figure 3a also illustrates the performances of the standard deep RL algorithms and their GGF/G3F counterparts in terms of CV, minimum, and maximum bandwidths w.r.t. the objectives with identical weights (i.e., the first objective is excluded from the statistics). Compared to the standard or GGF versions of A2C and PPO, the G3F counterparts have lower minimum and maximum bandwidths since more weight is given to the first objective. While the GGF algorithms have a lower CV than the standard RL algorithms, as expected, we notice that the objectives with identical weights are also treated fairly by the G3F methods, as indicated by lower CVs than the standard RL algorithms, which validates that the “equal treatment of equals” principle still holds.
Does considering the past discounted reward or learning a stochastic policy help in DQN-based algorithms? For this question, we compare all our DQN variants in the SC and TL domains to investigate the benefits of considering the past discounted reward or stochastic policies. Figures 1a, 1c, and 2a show the performances of those algorithms in the SC domain. Note that while Figure 2a plots the training curves within 60k interactions, the AC methods are in fact trained with 600k interactions for convergence before testing.
Figure 3: Performances of A2C, PPO and their GGF, G3F counterparts during testing in the DC domain. (a) CV, minimum and maximum bandwidths. (b) G3F score during the testing phase.
Figure 4: Experimental results in the TL domain: (Left) Effects of using different weights for p, (Center, Right) Waiting times during training. (a) Individual waiting times during testing. (b) Waiting times in a stationary environment. (c) Waiting times in a non-stationary environment.
Moving from DQN, G3F-DQN, and G3F-CDQN to G3F-CSDQN nearly always yields an increase in terms of average density (more efficient), a decrease in terms of CV (more equitable), an increase in terms of minimum density (more equitable), and an increase in terms of G3F (fairer). This latter point experimentally confirms the theoretical discussion about the optimality of stochastic policies in Section 4.
Focusing on GGF (i.e., uniform p), our proposed G3F-CDQN
(i.e., GGF-CDQN) can find better solutions than GGF-DQN (see Ap-
pendix F.1). This shows that our novel DQN-based algorithms out-
perform the one proposed by [51].
Interestingly, G3F-CDQN and G3F-CSDQN outperform DQN in
terms of average density, which is exactly what is optimized by DQN.
This is explained by the fact that this domain is actually partially ob-
servable. In such situations, state augmentation and stochastic poli-
cies are known to be beneficial. Similar conclusions can be drawn for
the TL domain as well (Figure 4b, Appendix F.2).
When is it preferable to resort to our DQN-based variants?
Figures 2a, 4b, and 4c show the training performances. Note that the x-axis corresponds to the number of interactions, which may not correspond to the number of timesteps in an environment (e.g., A2C simultaneously uses several environments to generate training data).

In the SC domain, Figure 2a shows that the DQN-based methods enjoy much better sample efficiency than the AC methods. This is confirmed in the TL domain. Our experiments in that domain are usually run in a stationary environment (i.e., the probabilities of cars entering the intersection are fixed). We also performed some experiments in a non-stationary environment (i.e., simulating different time periods during the day: morning/evening rush hours or low traffic). In the stationary environment case, Figure 4b shows again that the DQN-based algorithms learn faster than the AC methods in terms of the number of interactions. The results in a non-stationary environment, shown in Figure 4c, further strengthen the case for the DQN-based algorithms: they can adapt faster to environmental changes than the AC methods.
In conclusion, if sample efficiency is important, one should choose
CSDQN if learning stochastic policies is acceptable; otherwise
CDQN should be preferred if deterministic policies are required
(e.g., in robotics).
7 Conclusion
We investigated the fair optimization problem with preferential treat-
ment in (deep) RL. For this novel problem, we presented both the-
oretical and algorithmic contributions. For the theory, we extended
an existing bound to justify the use of the discounted reward instead
of the average reward in the algorithm design. For the algorithms,
we presented several extensions of deep RL algorithms and notably
proposed a novel state-augmented DQN-based method, which can be
adapted to learn either deterministic (CDQN) or stochastic policies
(CSDQN). Extensive experimental results on several domains were
provided for validation.
The novel algorithmic idea of CSDQN could be adapted to
other RL problems with sophisticated objective functions (e.g., safe
RL [27] or risk-sensitive RL [17]) or with constraints [1]. In contrast
to existing work based on policy gradient, our technique could tackle
those problems with a DQN-based method. In addition, our approach
could also be extended to the fair multi-agent setting [62]. We leave
these directions to future work.
Acknowledgements
This work is supported in part by the program of the Shanghai NSF
(No. 19ZR1426700).
References
[1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel, ‘Con-
strained policy optimization’, in ICML, (2017).
[2] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach,
‘A reductions approach to fair classification’, in ICML, (2018).
[3] Edoardo Amaldi, Stefano Coniglio, Luca G. Gianoli, and Can Umut
Ileri, ‘On single-path network routing subject to max-min fair flow al-
location’, Electronic Notes in Discrete Math.,41, 543–550, (2013).
[4] X. Bei, S. Liu, C.K. Poon, and H. Wang, ‘Candidate selections with
proportional fairness constraints’, in AAMAS, (2022).
[5] Aurélie Beynier, Yann Chevaleyre, Laurent Gourvès, Ararat Harutyun-
yan, Julien Lesca, Nicolas Maudet, and Anaëlle Wilczynski, ‘Local
envy-freeness in house allocation problems’, AAMAS, (2019).
[6] Stephen Boyd and Lieven Vandenberghe, Convex Optimization, Cam-
bridge university press, 2004.
[7] Steven J. Brams and Alan D. Taylor, Fair Division: From Cake-Cutting
to Dispute Resolution, Cambridge University Press, March 1996.
[8] Steven J. Brams and Alan D. Taylor, Fair Division: From Cake-Cutting
to Dispute Resolution, Cambridge University Press, 1996.
[9] R. Busa-Fekete, B. Szörényi, P. Weng, and S. Mannor, ‘Multi-objective
bandits: Optimizing the generalized Gini index’, in ICML, (2017).
[10] Iadine Chadès, Janelle MR Curtis, and Tara G Martin, ‘Setting realistic
recovery targets for two interacting endangered species, sea otter and
northern abalone’, Conservation Biology,26(6), 1016–1025, (2012).
[11] M. Chakraborty, A. Igarashi, W. Suksompong, and Y. Zick, ‘Weighted
envy-freeness in indivisible item allocation’, TEAC,9(3), 1–39, (2021).
[12] J. Chen, Y. Wang, and T. Lan, ‘Bringing fairness to actor-critic rein-
forcement learning for network utility optimization’, in INFOCOM,
(2021).
[13] Yunmei Chen and Xiaojing Ye, ‘Projection onto a simplex’, arXiv
preprint arXiv:1101.6081, (2011).
[14] Yann Chevaleyre, Paul E Dunne, Michel Lemaître, Nicolas Maudet,
Julian Padget, Steve Phelps, and Juan A Rodríguez-aguilar, ‘Issues in
Multiagent Resource Allocation’, Computer,30, 3–31, (2006).
[15] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvit-
skii, ‘Fair clustering through fairlets’, in NeurIPS, (2017).
[16] Gustave Choquet, ‘Theory of capacities’, in Annales de l’institut
Fourier, volume 5, pp. 131–295, (1954).
[17] Yinlam Chow and Mohammad Ghavamzadeh, ‘Algorithms for CVaR
optimization in MDPs’, in NeurIPS, (2014).
[18] Cyrus Cousins, ‘An axiomatic theory of provably-fair welfare-centric
machine learning’, in NeurIPS, (2021).
[19] Virginie Do and Nicolas Usunier, ‘Optimizing generalized gini indices
for fairness in rankings’, in SIGIR, (2022).
[20] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and
Richard Zemel, ‘Fairness through awareness’, in Innovations in Theo-
retical Computer Science Conf., (2012).
[21] Ganesh Ghalme, Vineet Nair, Vishakha Patil, and Yilun Zhou, ‘Long-
term resource allocation fairness in average markov decision process
(amdp) environment’, in AAMAS, (2022).
[22] Hoda Heidari, Claudio Ferrari, Krishna Gummadi, and Andreas
Krause, ‘Fairness behind a veil of ignorance: A welfare analysis for
automated decision making’, in NeurIPS, (2018).
[23] S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, and A. Roth, ‘Fair-
ness in reinforcement learning’, in ICML, (2017).
[24] Jiechuan Jiang and Zongqing Lu, ‘Learning fairness in multi-agent sys-
tems’, in NeurIPS, (2019).
[25] Bernard F. Lamond and Martin L. Puterman, ‘Generalized Inverses in
Discrete Time Markov Decision Processes’, SIAM J. on Matrix Analy-
sis and Appl.,10(1), 118–134, (jan 1989).
[26] Yaxin Liu and Sven Koenig, ‘Risk-sensitive planning with one-switch
utility functions: Value iteration’, in AAAI, pp. 993–999, (2005).
[27] Yongshuai Liu, Jiaxin Ding, and Xin Liu, ‘IPO: Interior-point policy
optimization under constraints’, AAAI, (2020).
[28] Pablo Alvarez Lopez, Michael Behrisch, Laura Bieker-Walz, Jakob
Erdmann, Yun-Pang Flötteröd, Robert Hilbrich, Leonhard Lücken, Jo-
hannes Rummel, Peter Wagner, and Evamarie Wießner, ‘Microscopic
traffic simulation using SUMO’, in IEEE ITSC, (2018).
[29] László Lovász, ‘Submodular functions and convexity’, in Mathematical
programming the state of the art, 235–257, Springer, (1983).
[30] Debmalya Mandal and Jiarui Gan, ‘Socially fair reinforcement learn-
ing’, arXiv preprint arXiv:2208.12584, (2022).
[31] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Belle-
mare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Pe-
tersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran,
D. Wierstra, S. Legg, and D. Hassabis, ‘Human-level control through
deep reinforcement learning’, Nature,518, 529–533, (2015).
[32] Jeonghoon Mo and Jean Walrand, ‘Fair end-to-end window-based con-
gestion control’, Transactions on Networking,8(5), 556–567, (2000).
[33] Hervé Moulin, Fair Division and Collective Welfare, MIT Press, 2004.
[34] Razieh Nabi, Daniel Malinsky, and Ilya Shpitser, ‘Learning optimal fair
policies’, in ICML, (2019).
[35] L. Nardi and C. Stachniss, ‘Uncertainty-aware path planning for navi-
gation on road networks using augmented MDPs’, in ICRA, (2019).
[36] Somjit Nath, Mayank Baranwal, and Harshad Khadilkar, ‘Revisiting
state augmentation methods for reinforcement learning with stochastic
delays’, in CIKM, (2021).
[37] Arnie Neidhardt, Hanan Luss, and K. R. Krishnan, ‘Data fusion and
optimal placement of fixed and mobile sensors’, in IEEE Sensors Ap-
plications Symposium, (2008).
[38] Viet Hung Nguyen and Paul Weng, ‘An efficient primal-dual algorithm for fair combinatorial optimization problems’, in COCOA, (2017).
[39] W. Ogryczak, ‘On principles of fair resource allocation for importance
weighted agents’, in Intl. Workshop on Social Informatics, (2009).
[40] Wlodzimierz Ogryczak, Hanan Luss, Michał Pióro, Dritan Nace, and
Artur Tomaszewski, ‘Fair optimization and networks: A survey’, J. of
Applied Mathematics,2014, (2014).
[41] Wlodzimierz Ogryczak, Patrice Perny, and Paul Weng, ‘A compromise programming approach to multiobjective Markov decision processes’, Intl. J. of Info. Tech. & Decision Making, 12(05), 1021–1053, (2013).
[42] Włodzimierz Ogryczak and Tomasz Śliwiński, ‘On optimization of the importance weighted OWA aggregation of multiple criteria’, in Intl. Conf. on Computational Sc. and Its Applications, pp. 804–817, (2007).
[43] Włodzimierz Ogryczak and Tomasz Śliwiński, ‘On solving optimization problems with ordered average criteria and constraints’, Fuzzy Optimization: Recent Advances and Applications, 209–230, (2010).
[44] Patrice Perny and Paul Weng, ‘On finding compromise solutions in
multiobjective Markov decision processes’, in ECAI, (2010).
[45] Martin L. Puterman, Markov decision processes: discrete stochastic dy-
namic programming, Wiley, 1994.
[46] John Rawls, The Theory of Justice, Havard university press, 1971.
[47] D.M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley, ‘A survey of
multi-objective sequential decision-making’, JAIR,48, 67–113, (2013).
[48] Fabian Ruffy, Michael Przystupa, and Ivan Beschastnikh, ‘Iroko: A
framework to prototype reinforcement learning for data center traffic
control’, in Workshop on ML for Systems at NeurIPS, (2019).
[49] S. Sharifi-Malvajerdi, M. Kearns, and A. Roth, ‘Average Individual Fairness: Algorithms, Generalization and Experiments’, in NeurIPS, (2019).
[50] Huaizhou Shi, R. Venkatesha Prasad, Ertan Onur, and I. G. M. M.
Niemegeers, ‘Fairness in wireless networks: Issues, measures and chal-
lenges’, IEEE Com. Surveys & Tutorials,16(1), 5–24, (2014).
[51] Umer Siddique, Paul Weng, and Matthieu Zimmer, ‘Learning fair poli-
cies in multi-objective (deep) reinforcement learning with average and
discounted rewards’, in ICML, (2020).
[52] Ashudeep Singh and Thorsten Joachims, ‘Policy Learning for Fairness
in Ranking’, in NeurIPS, (2019).
[53] A. Sootla, A.I. Cowen-Rivers, T. Jafferjee, Z. Wang, D.H. Mguni,
J. Wang, and H. Ammar, ‘Sauté rl: Almost surely safe reinforcement
learning using state augmentation’, in ICML, (2022).
[54] Till Speicher, Hoda Heidari, Nina Grgic-Hlaca, Krishna P Gummadi,
Adish Singla, Adrian Weller, and Muhammad Bilal Zafar, ‘A unified approach to quantifying algorithmic unfairness: Measuring individual & group unfairness via inequality indices’, in KDD, (2018).
[55] Ankang Sun, Bo Chen, and Xuan Vinh Doan, ‘Connections between
fairness criteria and efficiency for allocating indivisible chores’, arXiv
preprint arXiv:2101.07435, (2021).
[56] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Man-
sour, ‘Policy gradient methods for reinforcement learning with function
approximation’, in NIPS, (2000).
[57] Min Wen, Osbert Bastani, and Ufuk Topcu, ‘Algorithms for fairness in sequential decision making’, in ICML, (2021).
[58] Paul Weng, ‘Fairness in reinforcement learning’, in AI for Social Good
Workshop at IJCAI, (2019).
[59] J.A. Weymark, ‘Generalized Gini inequality indices’, Mathematical So-
cial Sciences,1, 409–430, (1981).
[60] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, Kr-
ishna P. Gummadi, and Adrian Weller, ‘From Parity to Preference-
based Notions of Fairness in Classification’, in NeurIPS, (2017).
[61] Xueru Zhang and Mingyan Liu, ‘Fairness in learning-based sequential
decision algorithms: A survey’, in Handbook of Reinforcement Learn-
ing and Control, 525–555, Springer, (2021).
[62] Matthieu Zimmer, Claire Glanois, Umer Siddique, and Paul Weng,
‘Learning fair policies in decentralized cooperative multi-agent rein-
forcement learning’, in ICML, (2021).