Effort-aware Fairness: Measures and Mitigations in
AI-assisted Decision Making
Donald Braman (GWU Law), Hal Daumé III (PI, UMD CS), Furong Huang (UMD CS),
Zubin Jelveh (UMD Criminology/iSchool), Tin Nguyen (UMD CS PhD Student)
Abstract
Although popular AI fairness metrics, e.g., demographic parity, have uncovered bias in
AI-assisted decision-making outcomes, they do not consider how much effort one has spent to
get to where one is today in the input feature space. However, the notion of effort is important
in how Philosophy and the Law understand fairness. We propose a new way to conceptualize,
evaluate, and achieve Effort-aware Fairness. Our sequential Research Questions (RQs) follow:
RQ1. How to develop an Effort-aware Fairness metric compatible with Philosophy and Law?
RQ2. How to evaluate Effort-aware Fairness on fairness datasets with time-series features?
RQ3. How to design novel algorithm(s) to achieve long-term Effort-aware Fairness?
Rigorous answers to these RQs may enable AI model auditors to uncover and potentially cor-
rect for unfair decisions against individuals who have spent significant effort to improve but are still
stuck with undesirable feature values due to systematic discrimination against their groups.
1 Scope of Work and Proposed Activities
AI-infused tools have assisted humans in making high-stakes decisions such as recidivism risk
assessment [e.g., 1] and loan approval [e.g., 13], leading to potentially increased efficiency and
consistency. However, the AI models used in such tools are almost always biased, and this bias
can have serious implications for people's lives. In response, the CS literature has deeply explored
two notions of fairness: group fairness (ensuring that a quantity of interest, such as accuracy,
should be equalized across demographic groups like gender) [e.g., 14] and individual fairness
(ensuring that similar individuals get similar outcomes) [e.g., 6].
Both group fairness and individual fairness consider decisions to be fair based, typically, on
immutable characteristics of an individual. Neither captures any notion of effort: that some individuals have gotten where they are by the exertion of significant effort. This leads to the “identical
CVs” problem: two people from very different backgrounds may have the same CV—and therefore
be treated equivalently per group or individual fairness—but one of them may have had to work a
lot harder to achieve that. The notion that effort is an important factor in fairness has been well
studied in philosophy [e.g., the free rider problem 9] but has been largely, though not completely
(see related work below), absent from the machine learning fairness literature.
Objectives and Hypotheses. (Intellectual Merit) We will develop a new measure of effort-aware
fairness, grounded in philosophical literature and legal reasoning. We propose to conduct human
studies to validate this measure (against more standard group and individual fairness measures)
to understand whether a lay audience considers outcomes that are measured to be more “effort-
aware fair” to be more fair than those that are measured to be more “group fairness fair” (or
“individual fairness fair”). Second, we will use our measure of effort-aware fairness to develop
mitigations: namely, we will derive algorithms for optimizing machine learning systems against
this measure. We will evaluate our approaches on two datasets in the credit and recidivism risk
assessment settings. (Broader Impacts) In the long term, measures of fairness that better align
with philosophical, legal, and human notions of fairness are anticipated to lead to greater trust in
AI-infused systems. We may lead a project for either Technica or TRAILS’ AI summer program
related to our goal of validating this measure, aiming to broaden participation in computing.
Connection with TRAILS. Our project bridges Trail II (Methods and Metrics for Participatory AI) and Trail III (Participant Sense-Making of AI Trustworthiness). In particular, our metric development and algorithm development fit within the goals of Trail II, while our human measurement of fairness and its impact on trust fits within Trail III. Two of the PIs are TRAILS PIs and Senior Personnel
(Profs. Braman and Daumé) and two are not currently (Profs. Huang and Jelveh).
2 Related Work
Philosophy has generally used four main approaches to characterize effort [12, i.a.], of which two
are directly related to effort itself (rather than the perception of effort): Effort as Forces (how much
force would it take to “move” an object), and Effort as Energy (spending effort towards a goal
means allocating part of a finite amount of resources, or energy tank, up until the energy tank is
depleted). Effort as Forces is considered to address several problems left unresolved by the Effort as Energy approach, such as capturing the concept of “resistance” to effort, distinguishing between “fatigue” and “effort”, and explaining why effort is unpleasant but praiseworthy.
The legal literature, particularly around tort law, has broadly sought to quantify “effort” into discrete thresholds [15], ranging from high to low: “best effort”, “reasonable best effort”, “reasonable efforts”, “commercially reasonable efforts”, and “good faith efforts”. These thresholds are used to determine whether a party has spent sufficient effort in the past to avoid contractual liability in litigation. This is consistent with the philosophical literature in terms of considering the actual
effort, not some counterfactual effort, that each subject has exerted.
A recent line of AI fairness work addresses this gap, conceptualizing “effort” as the relationship
between changes in task-relevant, mutable input features (e.g., number of prior arrests) and the
probability predicted by a model of receiving a favorable outcome (e.g., granting bail). Approaches
differ in terms of whether they take outcome parity as a requirement and attempt to equalize effort across groups [10, 7], or whether they take equalized effort as a requirement and attempt to achieve outcome parity [8, 11, 17]. These dual styles of approach are closely related to work
on algorithmic recourse [16], in that they ask how much effort an individual would have to put in
in order to achieve a favorable outcome. Both approaches can be loosely considered to be in the space of Effort as Forces; in contrast, in the current proposal we are interested in how much effort an individual has already put in, and we use that quantity as the value against which we seek parity.
3 Methodology
Both the philosophical literature and the legal literature around fairness, and particularly around
fairness-as-effort, conceptualize effort in terms of an individual’s past behavior. We propose to
adopt the Effort as Forces approach [12], building on their analogy to Newton’s Second Law.
Building on this idea, we propose to develop new fairness notions and to plan how to evaluate and
algorithmically achieve Effort-aware Fairness on real-world datasets.
3.1 Conceptualizing Effort-aware Fairness
To formulate Effort ($E$), we draw an analogy to Newton’s Second Law of Motion: $F_{\text{net}} = ma$, where $F_{\text{net}}$ is the net force on an object of mass (or more generally, inertia) $m$ to give it an acceleration $a$ (the second-order time derivative of position $x$). Analogously, assuming no friction, a person with inertia $m$ (characterized by, for instance, demographics-based disadvantage) has to apply a force $F_E$, or effort $E = |F_E| = |F_{\text{net}}|$, to make their task-relevant input features (e.g., number of prior arrests) move in a desirable direction with an (observable) acceleration $a$. We quantify $m$ and $a$ to infer $E$.¹
We propose to formalize inertia $m$ based on an individual’s immutable characteristics (which overlap significantly with “protected” features in the AI fairness literature) as a “holding-back” effect and/or historical disadvantage due to an individual’s membership in one or more marginalized group(s). For example, we might compute each feature-specific inertia (e.g., for race) as the historical False Positive Rate (or False Negative Rate, whichever works against the individual’s interest) for that individual’s group (e.g., Black), and then average feature-specific inertia values across all immutable features to get an individual-level inertia value $m$. Next, we propose to compute acceleration $a$ as the second-order time derivative of mutable, task-relevant input features (e.g., criminal history features in recidivism risk assessment).
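To make these quantities concrete, the following is a minimal Python sketch of the no-friction Effort computation under the assumptions above; the group-level error-rate table, the feature names, and the yearly sampling interval are hypothetical placeholders rather than final design choices.

```python
import numpy as np
import pandas as pd

def inertia(row, group_error_rates, immutable_features):
    """Average, over immutable features, the historical error rate
    (FPR or FNR, whichever works against the individual's interest)
    of the group the individual belongs to."""
    rates = [group_error_rates[feat][row[feat]] for feat in immutable_features]
    return float(np.mean(rates))

def acceleration(series):
    """Second-order finite difference of one mutable, task-relevant
    feature observed at evenly spaced time steps (e.g., yearly arrest
    counts); returns the most recent second difference."""
    second_diff = np.diff(np.asarray(series, dtype=float), n=2)
    return float(second_diff[-1]) if second_diff.size else 0.0

def effort(row, history, group_error_rates, immutable_features, mutable_features):
    """No-friction effort E = m * |a|, averaging the acceleration
    magnitudes across mutable features."""
    m = inertia(row, group_error_rates, immutable_features)
    a = np.mean([abs(acceleration(history[f])) for f in mutable_features])
    return m * a

# Hypothetical usage for one individual
group_error_rates = {"race": {"Black": 0.45, "White": 0.23},
                     "gender": {"male": 0.30, "female": 0.20}}
row = pd.Series({"race": "Black", "gender": "male"})
history = {"n_prior_arrests": [5, 4, 2, 1]}  # yearly counts
E = effort(row, history, group_error_rates,
           immutable_features=["race", "gender"],
           mutable_features=["n_prior_arrests"])
```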
Given an Effort ($E$) metric as described above, the challenge is to turn this metric into a measure of fairness that can be evaluated or optimized for. To do so, we draw analogies to both group and individual fairness, both of which are deemed, in different legal settings, relevant to the US constitutional equal protection clause.²
Inspired by the Sufficiency criterion from group fairness—which asserts that $Y \perp A \mid R$ (conditioned on risk score $R$, output $Y$ should be independent of protected feature $A$) [2]—we develop our Effort-aware Group Fairness criterion:

$$Y \perp A \mid E:\quad \forall\, m, n \in \mathbb{R},\ \forall\, a, b \in \mathbb{N}:\quad P(Y = 1 \mid E \in [m, n], A = a) \;=\; P(Y = 1 \mid E \in [m, n], A = b)$$
This criterion states that outcomes should be independent of the sensitive feature among individuals who have expended similar effort, which aligns with the main rationale of affirmative action: to correct
for historical disadvantage and thus higher expected efforts from a group to achieve an outcome.
Therefore, our Effort-aware Group Fairness criterion may relax the tension between procedural
fairness (i.e. excluding the sensitive feature from the predictive input space) and affirmative action.
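As an illustration of how this criterion could be audited empirically, the sketch below bins individuals by effort and reports the worst favorable-outcome-rate gap across groups within any bin; the column names, quantile binning, and max-gap summary are assumptions made for exposition, not part of the formal criterion.

```python
import pandas as pd

def effort_aware_group_gap(df, effort_col="E", group_col="A",
                           outcome_col="Y", n_bins=5):
    """Within each effort bin [m, n], compute the largest gap in
    P(Y = 1 | E in [m, n], A = a) across groups; return the worst
    gap over all bins.  A value near zero suggests the Effort-aware
    Group Fairness criterion approximately holds on this sample."""
    binned = df.assign(effort_bin=pd.qcut(df[effort_col], q=n_bins,
                                          duplicates="drop"))
    gaps = []
    for _, bin_df in binned.groupby("effort_bin", observed=True):
        rates = bin_df.groupby(group_col, observed=True)[outcome_col].mean()
        if len(rates) > 1:
            gaps.append(rates.max() - rates.min())
    return max(gaps) if gaps else 0.0
```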
To develop an Effort-aware Individual Fairness metric, we aim to enforce that similar individuals who have spent similar effort should get similar outcomes. We propose a pairwise individual similarity function defined as a weighted combination of the current values of task-relevant features and effort $E$. The weights might reflect normative values, e.g., a higher/lower weight on effort when the recent number of priors is low/high, to prioritize Criminal Law’s Rehabilitation/Incapacitation purpose.
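A minimal sketch of one possible instantiation of this similarity function follows; the convex-combination form, the scalar weight lam, and the rule shrinking the effort weight when recent priors are high are illustrative assumptions, not a settled normative choice.

```python
import numpy as np

def effort_aware_similarity(x_i, x_j, e_i, e_j,
                            recent_priors_i, recent_priors_j, lam=0.5):
    """Negative weighted distance combining (i) the distance between the
    current task-relevant feature vectors and (ii) the difference in
    effort E.  The effort weight is shrunk when either individual has
    many recent priors -- a hypothetical rule nodding to the
    Incapacitation vs. Rehabilitation trade-off."""
    feature_dist = np.linalg.norm(np.asarray(x_i, dtype=float)
                                  - np.asarray(x_j, dtype=float))
    effort_dist = abs(e_i - e_j)
    effort_weight = lam / (1.0 + 0.1 * max(recent_priors_i, recent_priors_j))
    return -((1.0 - effort_weight) * feature_dist + effort_weight * effort_dist)
```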
3.2 Evaluating Effort-aware Fairness
We will compute our Effort ($E$) and Effort-aware Fairness metrics on two real-world datasets. The first, more ideal dataset is the Client Legal Utility Engine (CLUE) dataset for recidivism risk assessment, which includes demographic and criminal history features of over 40 million records in Maryland from the mid-1980s to 2021. Prof. Zubin Jelveh and Tin Nguyen completed extensive cleaning and record linkage on this dataset in Spring 2023 (report) to obtain individual-level data and their time-series, mutable criminal history features (to compute $a$), e.g., numbers of misdemeanor/felony arrests and convictions. The immutable features (to compute $m$) are race, gender, and age. The second dataset is Credit Score Classification, which includes only one immutable feature (age) and nearly 20 mutable features.
¹ One complication in practical Physics problems is a friction force $f$ that often scales with the magnitude of the applied force $F_E$ (up to $f_{\max}$) but goes in the opposite direction. In this more nuanced setting, which we will explore in the later phases of this project, we might quantify $m$, $a$, and $f$ to infer $F_E = F_{\text{net}} + f = ma + f$. We will decide which features are relevant to quantify $f$ after we complete a no-friction pilot.
² The US Supreme Court ruled in Washington v. Davis (1976) that: “Disproportionate impact is not irrelevant, but it is not the sole touchstone of an invidious racial discrimination forbidden by the Constitution.” [4]. Older US Supreme Court decisions such as F.S. Royster Guano Co. v. Commonwealth of Virginia (1920) interpret equal protection consistently with individual fairness [3]. Interestingly, a synergy of both group and individual fairness is found in City of Cleburne, Tex. v. Cleburne Living Center (1985) [5].
To evaluate whether our proposed philosophy-grounded metric of Effort ($E$) aligns with humans’ mental models, we will conduct a human study, showing records of pairs of defendants whose efforts are ranked in contradictory ways by our philosophy-grounded metric vs. by a baseline (e.g., distance to the decision boundary in the input space, or a metric from the related CS literature), and letting crowdworkers normatively decide which defendant spent more effort. To evaluate whether our Effort-aware Fairness aligns with human perceptions of fairness, we will conduct a second human study, showing a test set of CLUE defendants with input features, their philosophy-grounded Effort ($E$) scores, and the outputs from several AI models, and letting crowdworkers rank those models in terms of perceived fairness. We can then assess how well our Effort-aware (Group/Individual) Fairness scores correlate with the perceived-fairness rankings, compared to baselines such as effort fairness metrics that ignore time.
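As a sketch of the planned analysis for the second study, we could compare metric-induced model rankings against aggregated crowdworker rankings using a rank correlation; the model names, scores, and ranks below are hypothetical.

```python
from scipy.stats import spearmanr

# Hypothetical per-model Effort-aware unfairness scores (lower = fairer)
metric_scores = {"model_a": 0.04, "model_b": 0.11, "model_c": 0.27}
# Hypothetical aggregated crowdworker ranking (1 = perceived as most fair)
human_ranks = {"model_a": 1, "model_b": 3, "model_c": 2}

models = sorted(metric_scores)
rho, p_value = spearmanr([metric_scores[m] for m in models],
                         [human_ranks[m] for m in models])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")  # positive rho = alignment
```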
For mitigation, we will pursue long-term fairness by optimizing effort-aware fairness at each time step. This will build on work that optimizes, per time step, the ratio between Supply (of resources) and Demand (based on input features) for individuals from different groups [18].
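One plausible way to operationalize this, sketched below under the assumption of a differentiable classifier, is to add a per-time-step penalty to the training loss that softly enforces the Effort-aware Group Fairness criterion via group-mean predicted probabilities within effort bins; this is an illustrative regularizer, not the final algorithm, which we expect to adapt from the Supply/Demand formulation of [18].

```python
import torch

def effort_fairness_penalty(probs, groups, effort_bins):
    """Soft surrogate for the Effort-aware Group Fairness gap: within
    each effort bin, penalize squared differences between group-wise
    mean predicted probabilities of the favorable outcome."""
    penalty = probs.new_zeros(())
    for b in effort_bins.unique():
        in_bin = effort_bins == b
        means = [probs[in_bin & (groups == g)].mean()
                 for g in groups[in_bin].unique()]
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                penalty = penalty + (means[i] - means[j]) ** 2
    return penalty

def step_loss(model, x_t, y_t, groups_t, effort_bins_t, lam=1.0):
    """Per-time-step objective: prediction loss + lam * fairness penalty."""
    probs = torch.sigmoid(model(x_t)).squeeze(-1)
    bce = torch.nn.functional.binary_cross_entropy(probs, y_t.float())
    return bce + lam * effort_fairness_penalty(probs, groups_t, effort_bins_t)
```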
4 Project Timeline with Deliverables
Jan 2024: Implement metrics on CLUE and Credit Score Classification datasets. Apply for IRB.
Feb 2024: Submit argumentative manuscript on Effort-aware Fairness conceptualization to FAccT’24.
Mar-May 2024: Design and evaluate algorithms to achieve long-term Effort-aware Fairness.
Jun-Aug 2024: Run human study evaluations, e.g., with the TRAILS bootcamp or Technica hackathons.
Sep 2024: Submit manuscript on human study evaluation and new algorithm to ICLR’25.
Oct-Dec 2024: Revise manuscript and/or do additional proofs or experiments per ICLR reviews.
Jan 2025: If rejected by ICLR, submit the manuscript to ICML’25. Finish the post-project TRAILS report.
5 Project Personnel
Donald Braman is a GWU Associate Professor of Law, teaching Criminal Law and Evidence.
Before joining GWU, he was the Irving S. Ribicoff Fellow at the Yale Law School.
Hal Daumé III (PI) is a Volpi-Cupal endowed UMD CS Professor and a Senior Principal Re-
searcher at Microsoft Research. His research focuses on developing ML and NLP systems that interact naturally with people and promote their self-efficacy, while mitigating societal harms.
Furong Huang is a UMD CS Assistant Professor. Through principled methods that address the
challenges in applying ML to real-world applications, her research expands the scope of deep
learning model design for learning in constrained edge clients and on graph data.
Zubin Jelveh is an Assistant Professor in the College of Information Studies and the Department of Criminology and Criminal Justice at UMD. His work spans the intersection of data science, applied
microeconomics and social policy, particularly in criminal justice.
Tin Nguyen is a second-year UMD CS PhD student working on law-informed AI fairness. He received
the NIJ Graduate Research Fellowship ($166,500, to be disbursed after dissertation proposal
defense in Spring 2025). He first-authored a paper on fairness evaluation of NLP explanations
(accepted to EMNLP 2023, main track) and co-authored another fairness paper (CHI TRAIT 2023).
Collaboration and Management Plan Tin will continue to meet weekly with his PhD advisors
(Profs. Daumé and Jelveh) to update and get feedback on low-level progress. Furthermore, we
plan to meet remotely every two weeks as a whole project team (additionally including Profs.
Huang and Braman) to discuss high-level ideas on the philosophical/legal and algorithmic design
aspects to shape our Effort-aware Fairness metric and optimization algorithm so that they align
with theoretical and legal standards. We find no conflicts of interest among the project personnel.
References
[1] Michelle Bao, Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron
Horowitz, Kristian Lum, and Suresh Venkatasubramanian. 2021. It’s COMPASlicated: The
Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks. Confer-
ence on Neural Information Processing Systems (NeurIPS) (2021).
[2] Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2019. Fairness and Machine Learning:
Limitations and Opportunities. fairmlbook.org. http://www.fairmlbook.org.
[3] US Supreme Court. 1920. FS Royster Guano Co. v. Virginia. No. 165 (1920).
[4] US Supreme Court. 1976. Washington v. Davis. No. 74-1492 (1976).
[5] US Supreme Court. 1985. Cleburne v. Cleburne Living Center, Inc. No. 84-468 (1985).
[6] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012.
Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer
science conference. 214–226.
[7] Ozgur Guldogan, Yuchen Zeng, Jy-yong Sohn, Ramtin Pedarsani, and Kangwook Lee. 2023.
Equal improvability: A new fairness notion considering the long-term impact. Proceedings of
the International Conference on Learning Representations 2023.
[8] Vivek Gupta, Pegah Nokhiz, Chitradeep Dutta Roy, and Suresh Venkatasubramanian. 2019.
Equalizing recourse across groups. arXiv preprint arXiv:1909.03166 (2019).
[9] Russell Hardin and Garrett Cullity. 2003. The free rider problem. (2003).
[10] Hoda Heidari, Vedant Nanda, and Krishna P Gummadi. 2019. On the long-term impact of al-
gorithmic decision policies: Effort unfairness and feature segregation through social learning.
Proceedings of the International Conference on Machine Learning 2019.
[11] Wen Huang, Yongkai Wu, Lu Zhang, and Xintao Wu. 2020. Fairness through equality of
effort. In Companion Proceedings of the Web Conference 2020. 743–751.
[12] Olivier Massin. 2017. Towards a definition of efforts. Motivation Science 3, 3 (2017), 230.
[13] Anne-Sophie Mayer, Franz Strich, and Marina Fiedler. 2020. Unintended Consequences of
Introducing AI Systems for Decision Making. MIS Quarterly Executive 19, 4 (2020).
[14] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. 2008. Discrimination-aware data min-
ing. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discov-
ery and data mining. 560–568.
[15] Charles Thau. 2020. Is This Really the Best We Can Do? American Courts’ Irrational Efforts
Clause Jurisprudence and How We Can Start to Fix It. Geo. LJ 109 (2020), 665.
[16] Berk Ustun, Alexander Spangher, and Yang Liu. 2019. Actionable recourse in linear clas-
sification. In Proceedings of the conference on fairness, accountability, and transparency.
10–19.
[17] Julius Von Kügelgen, Amir-Hossein Karimi, Umang Bhatt, Isabel Valera, Adrian Weller, and
Bernhard Schölkopf. 2022. On the fairness of causal algorithmic recourse. In Proceedings of
the AAAI conference on artificial intelligence, Vol. 36. 9584–9594.
[18] Yuancheng Xu, Chenghao Deng, Yanchao Sun, Ruijie Zheng, Xiyao Wang, Jieyu Zhao, and
Furong Huang. 2023. Equal Long-term Benefit Rate: Adapting Static Fairness Notions to
Sequential Decision Making. In AdvML-Frontiers workshop, ICML.