
Operationalizing the Individual v. Group Fairness Dichotomy for Recidivism Risk Assessment: US Legal Challenges and Technical Proposals

NIJ GRF 2023 - Research Narrative
Funding opportunity number: O-NIJ-2023-171521
PhD Student: Tin Trung Nguyen
Work Address: IRB 4104, 8125 Paint Branch Dr, College Park, MD 20742
Phone: (240) 927 7255
Email: tintn@umd.edu
ORCID: 0009-0008-5041-3627
Enrolled Program: PhD in Computer Science, University of Maryland, College Park
Advisors: Hal Daumé III and Zubin Jelveh
Keywords: individual fairness, group fairness, equal protection, human study,
AI, recidivism risk assessment, criminal justice, US law, scrutiny, reliance, HCI
Submission Date: May 02, 2023
Contents
1 Statement of Problem and Research Questions
1.1 Limited Adaptation of Individual Fairness from High to Low-level Legal Sources
1.2 Disagreement on Range of Features for Predictive Use
1.3 Human-AI Interaction as a Shield against Fairness Concerns
2 Project Design and Implementation
2.1 At which Technical Stage should Individual v. Group Fairness be Enforced?
2.2 How much Scrutiny and Which Fairness Criteria does each Feature Warrant?
2.2.1 Backward Perceived Fairness Human Study for the Highest Threshold
2.2.2 Expert and Stakeholder Interviews for the Second-highest and Third-highest Thresholds
2.3 How does AI Reliance Relate to the Fairness of Criminal Risk Assessment by Humans?
2.3.1 Humans’ AI Reliance v. Individual/Group Fairness Metrics on Risk Assessment Results
2.3.2 Mental Investment as Alternative Output for Individual and Group Fairness Evaluation
2.4 Plan for Deliverables and Publications
3 Capabilities and Competencies
1 Statement of Problem and Research Questions
Recidivism risk assessment is among the most studied high-stakes use cases of Artificial Intelligence (AI) in
the algorithmic fairness literature [16,18,37,7,32]. Due to its cross-disciplinary nature, research about fairness
in recidivism faces numerous challenges, e.g. whether existing datasets are suited to serve as a benchmark
to standardize progress in the field, and whether a benchmark dataset in such an intrinsically socio-technical
context as criminal justice, if one exists, will ensure beneficial and ethical use [5]. Despite the need for inter-
disciplinary collaboration to address such challenges, the legal literature and the technical AI literature on
this topic have evolved mostly on their own without much exchange. For instance, in addition to recognizing
the process v. outcome fairness dichotomy popular in the legal and social science literature [19], the techni-
cal AI literature also contrasts two technical sub-categories of outcome fairness: group (outcome) fairness
and individual (outcome) fairness [6]. However, a recent popular law review article on AI recidivism risk
assessment fairness maps ‘disparate impact’ to outcome fairness, and ‘disparate treatment’ both to process
(i.e. procedural) fairness and (though more loosely) to individual fairness [25], implying firstly that individ-
ual fairness is not a subcategory of outcome fairness, and secondly that individual fairness is more related
to procedural fairness than to group fairness. Both are inconsistent with the technical literature.
Motivated by such communication challenges between the legal and technical communities on AI fair-
ness research, this proposal aims to narrow the gap by surveying various sources of law in the United
States (US) to identify whether and how the characteristics of individual or group fairness criteria have been
adopted into both high- and low-level legal requirements for bail, probation, sentencing, or parole decisions.
We then investigate operational complications in the process of adopting individual and/or group fairness
into the AI recidivism pipeline in terms of sources of data (which features to use and how) as well as the hu-
man in the loop (potential overreliance by court officials on the risk assessment tool). After reviewing each
legal problem, we propose a corresponding Research Question (RQ). Details of our law review process are
in the Second Appendix (please find ‘Second_Appendix_Detailed_Law_Review.pdf’).
1.1 Limited Adaptation of Individual Fairness from High to Low-level Legal Sources
At a high level, the US Constitution promotes fairness in criminal justice and other legal areas via two con-
cepts: ‘due process’ and ‘equal protection’. Relevant excerpts are ‘No person shall be [...] deprived of life,
liberty, or property, without due process of law’ (the Fifth Amendment) and ‘...nor shall any State deprive any
person of life, liberty, or property, without due process of law; nor deny to any person within its jurisdiction
the equal protection of the laws’ (the Fourteenth Amendment). Due process can be mapped to procedural
(or process) fairness, which is difficult to quantify and evaluate, especially in the context of recidivism risk
assessment tools as many models are proprietary, e.g. strict confidentiality constraints imposed by COM-
PAS owners on the expert witness Dr. Rudin in Flores v. Stanford (2021) [15]. Therefore, in the current
recidivism landscape where the majority of tools used by state governments come from private companies1,
it is more practical to evaluate outcome fairness, which might correspond to the ‘equal protection’ clause
[4]. Within outcome fairness, the AI Fairness literature covers two main schools of fairness: group fairness
and individual fairness. Group fairness or group parity is achieved when a statistical metric of interest, e.g.
positive outcome rate, is equalized across different groups with respect to a sensitive feature, e.g. race or
gender [31]. As a more mathematically involved and therefore harder-to-quantify notion, individual fairness
was first introduced by Dwork et al. [17] as a technical concept based on the intuition that similar individuals
should receive similar outcomes. The most intuitive advantage of individual fairness compared to
group fairness is that it is less subject to arbitrary manipulation of the model, e.g. assigning the same 50%
chance of receiving a high-risk label to everyone regardless of their race only to satisfy a parity metric.
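To make these two notions concrete, the minimal Python sketch below (using hypothetical column names and a user-supplied similarity function; it illustrates the definitions above rather than any particular tool’s implementation) computes a group parity gap and counts Dwork-style violations, i.e. pairs deemed similar whose predictions nevertheless differ.

import pandas as pd

def group_parity_gap(df, group_col, pred_col):
    """Gap in positive-prediction rates across groups (0 = perfect parity)."""
    rates = df.groupby(group_col)[pred_col].mean()
    return rates.max() - rates.min()

def individual_fairness_violations(X, preds, similarity, sim_threshold, out_threshold):
    """Count pairs deemed similar whose predictions nevertheless differ too much."""
    violations = 0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if similarity(X[i], X[j]) >= sim_threshold and abs(preds[i] - preds[j]) > out_threshold:
                violations += 1
    return violations

# Toy usage with hypothetical columns.
df = pd.DataFrame({"race": ["A", "A", "B", "B"], "high_risk": [1, 0, 1, 1]})
print(group_parity_gap(df, "race", "high_risk"))  # 0.5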
Delving back to high-level legal sources, case law interprets the ‘equal protection’ clause as covering
both group fairness and individual fairness. Regarding group fairness, the US Supreme Court rules in Wash-
ington v. Davis (1976) that group fairness (disproportionate impact) matters but that there should be another
relevant school of fairness: ‘We have not held that a law [...] is invalid under the Equal Protection Clause
simply because it may affect a greater proportion of one race than of another. Disproportionate impact is
not irrelevant, but it is not the sole touchstone of an invidious racial discrimination forbidden by the Con-
stitution.’ [11]. As group fairness is ‘not the sole’ criterion here, we trace back to older US Supreme Court
decisions to find the missing piece. For instance, F.S. Royster Guano Co. v. Commonwealth of Virginia
(1920) interprets ‘equal protection’ consistently with today’s individual fairness if we define ‘similarly situ-
ated’ as having a high similarity function score: ‘The Equal Protection Clause of the Fourteenth Amendment
commands [...] essentially a direction that all persons similarly situated should be treated alike.’ [10] Inter-
estingly, a synthesis recognizing the relevance of the ‘equal protection’ clause to both group and individual fairness is
provided in City of Cleburne, Tex. v. Cleburne Living Center (1985), where the US Supreme Court rules
that ‘Discrimination, in the Fourteenth Amendment sense, connotes a substantive constitutional judgment
1Legal Tech News (2020). The Most Widely Used Risk Assessment Tool in Each U.S State. https://www.law.com/legaltech-
news/2020/07/13/the-most-widely-used-risk-assessment-tool-in-each-u-s-state/
that two individuals or groups are entitled to be treated equally with respect to something.’ [12]. Therefore,
the ‘equal protection’ clause in the US Constitution has been interpreted as promoting both fairness notions.
Moving to lower-level operationalization of fairness criteria, although some states such as California
have developed elaborate legislation to control the quality of state-wide AI prediction tools with respect
to group fairness metrics, we find limited evidence of individual fairness enforcement. Take, for example,
Section 1320.35 of the California Penal Code: most low-level fairness-related requirements in this statute
are about group fairness only, requiring information about ‘risk levels aggregated by race or ethnicity, gen-
der, offense type, ZIP Code of residency, and release or detention decision’, ‘the predictive accuracy of the
tool by gender, race or ethnicity, and offense type’, and ‘any disparate effect in the tools based on income
level’ [8]. The evaluation of these metrics shows that California cares about mitigating not only direct bias but
also indirect bias. For example, even if an AI model does not use the sensitive feature race, real-world
bias can still penetrate into the model’s prediction via features that highly correlate with race, e.g.
ZIP Code of residence [9]. However, the closest requirement we find that is even remotely related to individual
(outcome) fairness is the option to share individual-level data conditioned on research contracts. There-
fore, individual fairness is not directly relevant to this statute. Another less detailed example is Section 725
ILCS 5/110-6.4 of the Illinois Compiled Statutes, which requires their state-wide risk assessment tool to
‘not discriminate on the basis of race, gender, educational level, socio-economic status, or neighborhood’,
which even under broad interpretation can only be attributed to group fairness, not individual fairness, because
the statute makes no reference to how similar two individuals are [33]. In summary,
this mismatch between the relative importance of individual fairness v. group fairness at different levels of
law motivates RQ1: At which Technical Stage should Individual v. Group Fairness be Enforced?
1.2 Disagreement on Range of Features for Predictive Use
An important first implementation step in enforcing either individual or group fairness criteria is to determine
which features are appropriate for use, either to compute the similarity function between any two individuals
(for individual fairness) or to evaluate a metric that matters across different groups with respect to each of
those features (for group fairness). In Clark v. Jeter (1988), the US Supreme Court acknowledges a common
law hierarchy of features (classes or group memberships) on which the government may discriminate to
varying degrees for the sake of public interest, from the most to the least stringent amount of justification
the government must provide: strict scrutiny (race and national origin), intermediate scrutiny (gender and
legitimacy, i.e. out-of-wedlock status), and rational basis (other features, e.g. age and disability) [13].
Although this hierarchy is not legally binding for the recidivism context, there seems to be general
consensus to not use features in the strict scrutiny range as model inputs. Regarding race, the consensus is
clear. Although most risk assessment tools, e.g. COMPAS, do not use race as a feature, many works such
as Johnson [20] go much further to criticize the tool for indirectly perpetuating racial bias through proxy
features. Regarding national origin, a sample questionnaire2shows that COMPAS does not collect this
feature, and we do not find any literature discussing the use of or unfairness with respect to this feature by any
recidivism tools. Therefore, we assume an implicit consensus that national origin should not be used. For the
lower two ranges (intermediate scrutiny and rational basis), there remains disagreement on whether features
at those ranges might be used. The first example of disagreement is gender (at the intermediate scrutiny
range). On the one hand, the Wisconsin Supreme Court in its State v. Loomis (2016) decision strongly
advocates the use of gender as a feature: "COMPAS’s use of gender promotes accuracy that ultimately
inures to the benefit of the justice system including defendants." [14]. On the other hand, Section 2A:162-
25(2) of the New Jersey Statutes adopts a clear stance against using gender as a feature for AI recidivism
risk assessment: ‘Recommendations for pretrial release shall not be discriminatory based on race, ethnicity,
gender, or socio-economic status.’ [34]. The second example of disagreement is age (at the rational basis
range with least scrutiny). More specifically, in the same jurisdiction of New York, while Section 168-l of
the Consolidated Laws of New York explicitly requires sex offense recidivism risk assessment to take into
account ‘the age of the sex offender at the time of the commission of the first sex offense’ [29], in its Flores
v. Stanford (2021) decision, the US District Court for the Southern District of New York indicates an implicit
stance against the use of age in recidivism risk assessment by allowing expert inspection of the data used to
train COMPAS to ‘help Plaintiffs substantiate their allegations that COMPAS punishes juvenile offenders
for their youth, such that Defendants’ reliance on this tool is constitutionally problematic’ [15].
In summary, while there is a consensus that strict scrutiny features should be excluded from model
inputs, legal sources disagree whether intermediate scrutiny or rational basis features should be excluded,
motivating RQ2: How much Scrutiny and Which Fairness Criteria does each Feature Warrant?
1.3 Human-AI Interaction as a Shield against Fairness Concerns
Moving to the human aspect of fairness evaluation, a typical use case of AI recidivism risk assessment is
that an AI risk score for a defendant is presented to a human (a bond court official, a sentencing judge, or
a parole officer) who is supposed to take into account that risk score but still consider the defendant’s file
2https://www.documentcloud.org/documents/2702103-Sample-Risk-Assessment-COMPAS-CORE
holistically before making a decision. One legal problem is that this Human-AI interaction element becomes
an automatic shield to protect the judge’s decisions against accusations of (individual/group) fairness viola-
tion in future appeals, even though there is no attempt to measure the explicit reliance function of judges on
the AI score. For example, in Brooks v. Commonwealth, even though there is evidence that the trial judge
favors the AI risk assessment-based sentencing recommendation and dismisses the shorter active sentence
in the non-AI recommendation, the Court of Appeals of Virginia rules that ‘the trial court properly exercised
its discretion’ without inquiring into the level of reliance by the trial court on the AI risk assessment [28].
Concerningly, the burden of proof to show that judges over-rely on AI is on the defendants. In People v.
Younglove (2019), when the defendants contend that their right to an ‘individualized sentencing decision’ is
violated by the COMPAS risk score presented to the sentencing judge, the Michigan Court of Appeals rules
that ‘defendants offer no evidence that their sentencing courts actually placed significant (or any) weight
on the COMPAS assessments in crafting their sentences. Defendants have failed to carry their burden of
showing that the inclusion of the information affected their substantial rights.’ [27]. Defendants have almost
no way to reconstruct the mental model of the court officials’ decisions, especially when most judges are not
required to disclose how much weight they give to the AI risk score, which may hide demographic-based
bias. For example, Stevenson and Doleac [35] show that, conditioned on the same recidivism risk level,
judges in Virginia give black defendants sentences that are 15-20% longer than those given to white defendants.
In conclusion, the concerns about judges’ unknown but legally protected reliance on AI tools moti-
vate RQ3: How does AI Reliance Relate to the Fairness of Criminal Risk Assessment by Humans?
2 Project Design and Implementation
We plan to use two bail-related defendant datasets. The first is a standard, publicly available dataset of
roughly 18,610 defendants in Florida between 2013 and 2014 with recidivism risk scores returned by
COMPAS, a popular tool used by many state courts. This COMPAS dataset was used by ProPublica
to uncover the tool’s potential racial bias.3 The second is a larger dataset: the Client Legal Utility Engine
(CLUE) with over 4.4 million District Court and 700,000 Circuit Court cases scraped by Maryland Volunteer
Lawyers Service (MVLS) from the publicly available Maryland Judiciary Case Search4and other websites
from the mid-1980s to 2021. Both datasets share many common demographic features, e.g. race, gender,
and age, as well as criminal history features, e.g. number of priors, degree of current charge. One feature
3https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
4https://casesearch.courts.state.md.us/casesearch/
that is only available in the CLUE dataset is ZIP Code, which some might argue has recidivism
predictive power because some neighborhoods are more prone to crime than others, but others might worry that
this feature often highly correlates with race, a protected feature. Both datasets have high ecological validity
and relevance since they come from real courts (in Broward County, Florida, and most counties in Maryland)
and concern real defendants. The raw CLUE dataset has been cleaned meticulously by a group led by Prof. Zubin
Jelveh thanks to two consecutive grants from the Maryland Governor’s Office since 2021 (DRCE-2022-0001 and
PIGF-2023-0021, totaling 580,260 USD) and will be ready for our proposed fairness research.
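As a minimal sketch of how we expect to load and align the shared features (the COMPAS file and column names follow ProPublica’s public release; the CLUE file name and column names are placeholders, since that dataset is not public):

import pandas as pd

# ProPublica COMPAS data (public); column names follow the published CSV.
compas = pd.read_csv("compas-scores-two-years.csv",
                     usecols=["race", "sex", "age", "priors_count",
                              "c_charge_degree", "decile_score", "two_year_recid"])

# CLUE data (not public); the file name and original column names below are placeholders.
clue = pd.read_csv("clue_cases.csv")
clue = clue.rename(columns={"defendant_race": "race", "defendant_sex": "sex",
                            "defendant_age": "age", "prior_count": "priors_count",
                            "charge_degree": "c_charge_degree"})

shared_features = ["race", "sex", "age", "priors_count", "c_charge_degree"]
print(compas[shared_features].head())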
2.1 At which Technical Stage should Individual v. Group Fairness be Enforced?
The issue that individual fairness is promoted at a high level (the US Constitution and the Supreme Court’s inter-
pretation of the ‘equal protection’ clause) but not yet operationalized in lower-level sources of law, i.e. state
criminal risk assessment statutes, resembles the observation by Dwork et al. [17] that individual fairness
is more desirable in theory but difficult to enforce in practice. Therefore, in the criminal justice setting,
one feasible contribution from the technical AI Fairness community might be to determine which compo-
nent of the risk assessment pipeline is best suited for individual or group fairness. One example pipeline
is illustrated in the technical group fairness review by Alikhademi et al. [2] for the predictive policing con-
text, including four components (with references to technical works which have incorporated group fairness
criteria into each component): dataset pre-processing (changing the input), algorithm design, output post-
processing (changing the output), and result analysis. Delving into the technical literature on individual fairness,
we find several examples of how individual fairness criteria might be incorporated into different components
of the pipeline: the pairwise fair representation of the input data by Lahoti et al. [21] as example for dataset
pre-processing, the Sensitive Subspace Robustness (SenSR) algorithm by Yurochkin et al. [38] as example
for algorithm design, the Individual+Group Debiasing (IGD) Post-Processing method by Lohia et al. [24]
as example for output post-processing, and the two individual metric learning methods named Factor Anal-
ysis of Comparable Embeddings (FACE) and Embedded Xenial Pairs Logistic Regression (EXPLORE) by
Mukherjee et al. [26] as example for result analysis. Once we have identified whether individual or group
fairness criteria should be enforced in each component of the pipeline, e.g. via follow-up legal research
or human study, those aforementioned technical individual fairness works, together with the group fairness
works cited in Alikhademi et al. [2], reassure us that there exist technical methods to enforce such criteria.
For the normative question of whether individual or group fairness should be prioritized in each part of
the technical four-step pipeline, we propose semi-structured interviews with domain experts and stakeholders, e.g. victim
service providers, public defenders, and retired prosecutors or judges, to determine the appropriate fairness
criteria for the first three stages. The "Result Analysis" stage will be addressed in Sub-section 2.2. To
analyze the interview transcripts from each of those three stages, we will use both deductive coding (to
decide which part of an opinion supports individual or group fairness) and inductive coding (to identify
and cluster the main arguments for and against each fairness criterion). One challenge for our qualitative
approach is the fast-moving nature of the intersection between algorithmic fairness research and criminal law, and
the geographic diversity of US laws, of which even legal experts may not be fully aware. One
mitigation is to ground our interviews in concrete legal questions by regularly reviewing recent state and
federal bills related to AI and algorithmic decision-making with an online tracking tool.5
2.2 How much Scrutiny and Which Fairness Criteria does each Feature Warrant?
The controversy on whether features in the two lower ranges of scrutiny can be used as input for recidivism
risk assessment models indicates room for future research that links to the individual versus group fairness
dichotomy. In the same framework of three continuous scrutiny ranges, we propose three discrete scrutiny
thresholds: the third-highest or ‘exclusion from model inputs’ threshold, the second-highest or ‘group parity
required’ threshold, and the highest or ‘ignorance in individual similarity function’ threshold. The more de-
scriptive name of each threshold corresponds to a pass condition, i.e. the amount of scrutiny a feature warrants
exceeds a given threshold if people believe that the feature should satisfy the corresponding threshold’s
pass condition.6 Making a connection to fairness, the second-highest and highest thresholds are related to
group and individual fairness, respectively. If we take the view that excluding a feature from the model input
space corresponds to procedural fairness [22,1], the third-highest threshold is related to procedural fairness.
An illustration of our three proposed scrutiny thresholds is given in Figure 1.
To justify the order of our proposed scrutiny thresholds, an example of why the second-highest threshold
(‘group parity required’) is higher than the third-highest threshold (‘exclusion from model inputs’) is the
recurrent debate on whether COMPAS should achieve racial parity even though the tool does not use race
as a predictive feature (i.e. race passes the third-highest threshold) [36]. An intuition for why the highest
threshold (‘ignorance in individual similarity function’) is higher than the second-highest threshold (‘group
5https://openstates.org/
6Firstly, any features that people believe should be excluded from the input space of the AI criminal risk assessment model will
have scrutiny scores above the third-highest threshold. Secondly, for a feature, if people believe it should be the reference feature so
that they can compute a group parity metric and require that parity metric to be good enough, then the scrutiny score of that feature
will exceed the second-highest threshold. Finally, if people believe that for the purpose of evaluating an individual fairness metric,
the individual similarity function must ignore a certain feature, then this feature’s scrutiny score will pass the highest threshold.
parity required’) is a thought experiment: even if one believes that a model should not have disparate racial
impact (i.e. race passes the second-highest threshold), one is likely to believe that all else equal, two same-
race individuals are more similar than two different-race individuals (i.e. race fails the highest threshold).
Contextualizing our proposed thresholds back into the framework of scrutiny ranges, we only know
the third-highest threshold must be below the strict scrutiny range because all features in this range (race,
national origin) are excluded from model inputs. As legal sources have not agreed on whether features
in the two lower scrutiny ranges, e.g. gender and age, should be excluded from model inputs, we do not
know whether our third-highest threshold should go between the strict scrutiny and intermediate scrutiny
ranges, within the intermediate scrutiny range, between the intermediate scrutiny and rational basis ranges,
or within the rational basis range. This open question is illustrated in Figure 2. If legal review and
empirical work, e.g. human studies with crowdworkers or court officials, can confirm that our proposed
three scrutiny thresholds are consistent with humans’ fairness perceptions, then our framework might help identify
whether group or individual fairness criteria should be evaluated with respect to each feature.
2.2.1 Backward Perceived Fairness Human Study for the Highest Threshold
With respect to the highest scrutiny threshold (‘ignorance in individual similarity function’), we propose a
backward evaluation method to indirectly decide whether a feature should be included or excluded in an
individual similarity function, based on human study participants’ perceptions of whether the outcomes for
each pair of individuals whom the similarity function determines as highly similar are fair or unfair.
Assuming an individual similarity function is good, the intuition behind individual fairness is that,
given two individuals with a high similarity score (S=high) according to the individual similarity function,
the judgment is fair (J=fair) if those two individuals receive the same outputs (O=same) and unfair
(J=unfair) if they receive different outputs (O=diff). If the similarity score for two individuals is low
(S=low), individual fairness cannot make any judgement (J=null). We formalize these intuitions with the
logical and (∧) as well as the logical or (∨) as follows:
Assumption: the similarity function is good.
(S=high ∧ O=same) → J=fair
(S=high ∧ O=diff) → J=unfair
Note that A → B is equivalent to ¬A ∨ B, where ¬ is logical negation. We can rewrite the two statements:
¬(S=high ∧ O=same) ∨ (J=fair) ≡ S=low ∨ O=diff ∨ J=fair (1)
¬(S=high ∧ O=diff) ∨ (J=unfair) ≡ S=low ∨ O=same ∨ J=unfair (2)
Assuming a similarity function is good means that both Statements 1 and 2 must be true. If either
Statement 1 or 2 is false, then we know the similarity function is bad. To quantify the ‘badness’ of a
similarity function, we use the pairs of individuals that the similarity function determines as highly similar
(S=high). We can conduct an online human study, e.g. via Prolific, so that crowdworkers can make
human judgements (J=fair or J=unfair) on the recidivism outputs (O=same if two individuals are
classified into the same level of recidivism risk, e.g. both Low, both Medium, or both High, and O=diff
otherwise, e.g. one Low and one Medium). The between-subjects variable is the similarity function, i.e. each
crowdworker will be shown a set of features that only one of the individual similarity functions uses.
A pair of defendants with (O=same ∧ J=unfair) violates Statement 1 while a pair with (O=diff ∧
J=fair) violates Statement 2, incrementing the ‘badness’ score of the similarity function by one unit.
Based on the crowdworkers’ fairness judgements on the same/different recidivism risk scores for all pairs
of highly similar defendants, we can calculate an overall ‘badness’ score for the similarity function.
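A minimal sketch of this ‘badness’ scoring, assuming the crowdworker judgments on highly similar pairs have already been collected into a table with hypothetical columns outcome (‘same’/‘diff’) and judgment (‘fair’/‘unfair’); we normalize the count to a rate so that similarity functions evaluated on different numbers of pairs remain comparable:

import pandas as pd

def badness_score(judgments):
    """Fraction of highly similar pairs (S=high) violating Statement 1 or 2.

    Each row is one pair: `outcome` is 'same' or 'diff', and `judgment`
    is the crowdworker's 'fair' or 'unfair' label for that pair.
    """
    violates_1 = (judgments["outcome"] == "same") & (judgments["judgment"] == "unfair")
    violates_2 = (judgments["outcome"] == "diff") & (judgments["judgment"] == "fair")
    return (violates_1 | violates_2).mean()

# Toy usage with three hypothetical pairs.
pairs = pd.DataFrame({"outcome": ["same", "diff", "same"],
                      "judgment": ["fair", "fair", "unfair"]})
print(badness_score(pairs))  # 2/3 of the pairs violate a statement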
Back to the question of how to empirically determine whether a feature should be excluded from the
similarity function, we can repeat the backward evaluation procedure explained above with the same type
of similarity function but with different sets of features. For example, suppose there are three
features we need to put on the scrutiny spectrum with respect to the highest scrutiny threshold: race, gender,
and age. From the case laws-based scrutiny ranges, we know the order of decreasing scrutiny is race (strict
scrutiny), gender (intermediate scrutiny), and age (rational basis). If we can show that race is below the
highest scrutiny threshold, it will imply that gender and age are also below the highest scrutiny threshold.
So we start with race. Once we decide on a simple type of similarity function, we can apply it to a set
of features which are surely below the highest scrutiny threshold, e.g. criminal history features, degree of
charge(s), and use this similarity function to run the ‘badness’ score estimation procedure outlined above to
get a baseline ‘badness’ score. Then we can evaluate a new similarity function that has the same form but
additionally takes into account a feature in dispute, e.g. race, and repeat the procedure to get a ‘badness’
score for this new similarity function. At the final step, if the ‘badness’ score of the similarity function
with race is (with statistical significance) higher than that of the baseline similarity function (without any disputed
features), we may conclude that based on humans’ fairness perception of the recidivism risk assessment
task, race should be ignored in the individual similarity function, thus above the highest scrutiny threshold.
A limitation of this proposed study is that its procedure is more complex than asking humans directly
whether they think a feature should be included in the similarity function. We give two justifications for
this backward approach. Firstly, the line between similarity in everyday life v. similarity with respect to
recidivism risk assessment is blurry. For example, most agree that two persons of the same race are more
likely to be similar than two persons from different races, but many will doubt this similarity assessment for
the purpose of a bail decision. Secondly, most people not trained in AI will not be familiar with the technical
individual fairness concept, so it is hard to bring the fairness component into their direct normative assess-
ment of whether a feature should be included in the similarity function. Our approach, which incorporates
humans’ fairness perception with concrete defendant examples, tackles those issues.
One challenge in conducting a quantitative human study is that we may not find enough participants with domain
expertise to reach statistical significance. Our mitigation is to select a sample of online crowdworkers that is
representative of the US population. Looking at the bigger picture of why we believe fairness judgments
from people outside the legal profession are relevant, Awad et al. [3] argue that policymakers should be
aware of and prepare for societal pushback against potentially beneficial but high-stakes decision-making AI
tools, and Liu et al. [23] maintain that a discrepancy between the human-AI legal responsibility allocation
framework set by laws and the public’s expectations may hinder the adoption of and trust in AI tools at large.
2.2.2 Expert and Stakeholder Interviews for the Second-highest and Third-highest Thresholds
Depending on which features (if any) the human study proposed in Sub-section
2.2.1 empirically shows to be below the highest threshold, we will conduct qualitative interviews with domain experts and
stakeholders, grounded in recent AI bills, to address the normative question of whether group parity should
be required for those remaining features and if not, whether those features should be excluded from the
model’s input feature space. We will follow similar qualitative methodology proposed in Sub-section 2.1.
Another difficulty is to find people who understand group fairness/procedural fairness and have time to hear
us explain our scrutiny framework. One potential solution (if this proposal is accepted) is to leverage the
NIJ Fellows’ network to find more committed experts and stakeholders.
2.3 How does AI Reliance Relate to the Fairness of Criminal Risk Assessment by Humans?
Even though we do not know whether judges rely appropriately on AI risk assessment tools, Corbett-Davies
and Goel [9] empirically show that there is a causal effect of COMPAS on bail decisions by courts in Florida.
Noting the technical definition of ‘overreliance’ (when humans agree with incorrect AI predictions), ‘un-
derreliance’ (when humans disagree with correct AI predictions), and ‘appropriate reliance’ (when humans
agree with correct AI predictions) [30], the causal influence of AI on bail highlights the need to empir-
ically investigate the relationship between reliance and fairness criteria. For example, if a criminal risk
assessment tool causes judges to over-rely on it too often, the Human-AI Interaction shield against future
fairness-grounded appeals should be nullified, at least for that tool, to discourage courts from using it. Once
reliable methods exist to quantify judges’ overreliance on AI recidivism risk scores and to establish its relationship
with fairness criteria, they may motivate shifting the burden of proof from defendants to judges.
For example, judges might be encouraged by default to disclose which factors about the defendants’ records
they consider in addition to the recidivism risk score or else higher-level courts may assume that the AI risk
score is the only substantive factor considered, which may serve as fairness grounds for appeal.
2.3.1 Humans’ AI Reliance v. Individual/Group Fairness Metrics on Risk Assessment Results
We can simulate the AI-assisted decision-making process of judges by training a simple machine learning
model, e.g. logistic regression, on bail data and showing the model’s predictions to human study participants,
e.g. Prolific online crowdworkers by default, or judicial professionals if the NIJ Fellows’ network gives
us access to them, so that they can make their own recidivism predictions for each defendant based on the
defendant’s features and potentially the model’s prediction. By comparing predictions from the participants,
the model, and ground truths, we may calculate the appropriate reliance, under-reliance, and over-reliance
rate per participant. Based on each participant’s predictions, we can also calculate group fairness and indi-
vidual fairness scores per participant. With each participant
as a data point on several fairness scores (e.g. group fairness or individual fairness scores) v. reliance
(e.g. appropriate reliance, under-reliance, and over-reliance rate) scatterplots, we can study the relationship
between a human decision maker’s degree of reliance on the AI model and the fairness of their decisions.
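A minimal sketch of the per-participant computation, with hypothetical column names and toy values; here the group fairness score is a demographic parity gap over the participant’s own predictions, and an individual fairness score could reuse the ‘badness’-style pair check from Sub-section 2.2.1:

import pandas as pd

def reliance_rates(trials):
    """Per-trial columns (hypothetical): human_pred, ai_pred, truth (all 0/1)."""
    ai_correct = trials["ai_pred"] == trials["truth"]
    agree = trials["human_pred"] == trials["ai_pred"]
    return pd.Series({
        "appropriate_reliance": (agree & ai_correct).mean(),  # agree with correct AI
        "overreliance": (agree & ~ai_correct).mean(),         # agree with incorrect AI
        "underreliance": (~agree & ai_correct).mean(),        # disagree with correct AI
    })

def parity_gap(trials, group_col="race"):
    """Gap in the participant's positive-prediction rates across groups."""
    rates = trials.groupby(group_col)["human_pred"].mean()
    return rates.max() - rates.min()

# One row per (participant, defendant); values and column names are hypothetical.
data = pd.DataFrame({
    "participant_id": [1, 1, 1, 2, 2, 2],
    "race": ["A", "B", "B", "A", "A", "B"],
    "human_pred": [1, 0, 1, 1, 1, 0],
    "ai_pred":    [1, 1, 1, 0, 1, 0],
    "truth":      [1, 0, 1, 1, 1, 1],
})
per_participant = data.groupby("participant_id").apply(
    lambda t: pd.concat([reliance_rates(t), pd.Series({"parity_gap": parity_gap(t)})]))
print(per_participant)  # one point per participant for the fairness-v.-reliance scatterplots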
2.3.2 Mental Investment as Alternative Output for Individual and Group Fairness Evaluation
We propose a study similar to that in Section 2.3.1, but we change the output quantity used for fairness
metric calculation: from the risk score a participant assigns to a defendant to the amount of time a participant
spends on that defendant. The motivation behind this modified set-up is that if certain individuals or certain
groups are given significantly less thought (with time as a proxy) than others for such a high-stakes decision
as bail, it might be a sign of existing prejudices among decision makers.
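Under this modified set-up, the same kind of group-level comparison could be run on decision time rather than on the predicted label (a sketch with hypothetical values; the choice of Welch’s t-test is an assumption):

import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical per-defendant log: group label and seconds a participant spent deciding.
log = pd.DataFrame({
    "race": ["A", "A", "A", "B", "B", "B"],
    "seconds_spent": [42.0, 55.3, 47.8, 21.5, 30.2, 26.9],
})
time_a = log.loc[log["race"] == "A", "seconds_spent"]
time_b = log.loc[log["race"] == "B", "seconds_spent"]
stat, p_value = ttest_ind(time_a, time_b, equal_var=False)  # Welch's t-test
print(f"Mean time: {time_a.mean():.1f}s vs {time_b.mean():.1f}s (p={p_value:.3f})")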
2.4 Plan for Deliverables and Publications
For the conceptual work mapping technical fairness concepts to legal equivalents and developing a scrutiny-
based framework for fairness-related thresholds and ranges of features, we can submit it to annual Fairness-
related ACM conferences for the CS community, e.g. FAccT or AIES, and tech-oriented law review journals,
e.g. STLR, for the legal community. Insights from domain experts and stakeholders in Sections 2.1 and 2.2.2
may also go into those fairness or law venues. For the quantitative human studies proposed in Sections 2.2.1
and 2.3.1, we can submit to a technical Human-Computer Interaction (HCI) conference, e.g. CHI or IUI, or
a conference at the intersection of AI and Law, e.g. ICAIL, if we want to reach a larger legal audience. To
reach legal practitioners and the larger public, we also plan to publish our results as preprints on arXiv and
advertise those versions via social media like Twitter and informal remote groups, e.g. the CS+Law Workshop.
3 Capabilities and Competencies
I (Tin) have taken both technical AI and legal research methods graduate coursework while working on AI
Fairness and Legal NLP research projects, with one accepted manuscript at the CHI TRAIT 2023 work-
shop. Back in Summer 2020, I interned with an IP law firm to better understand the mindset of practitioners.
Regarding my leadership skills, I am a co-founder and former Director of Saigon International Model UN
(SIMUN), a non-profit academic organization in Vietnam that organizes MUN conferences with over 100
participants annually. Therefore, I have demonstrated both technical/communication skills and continued
commitment to this AI and Law intersection. My CS advisor, Prof. Hal Daumé III, has over 22,000 citations
and publishes extensively on Fairness and Human-AI Interaction (central themes of this proposal). Although
my co-advisor, Prof. Zubin Jelveh, is a relatively new UMD Criminology faculty member, he has secured two con-
secutive grants from the MD Governor’s Office since 2021 to lead pre-trial risk assessment projects with the
CLUE dataset, which our proposal extends with novel fairness questions. Since Spring 2023, the three
of us have collaborated on this CLUE project to explore fairness aspects of the record linkage phase, so I be-
lieve we are a highly complementary team where Prof. Daumé contributes the technical fairness and human
study insights, Prof. Jelveh provides the criminologist’s perspective and practical experience with Maryland
court data issues, and I bring my unique legal experience and, of course, the technical AI implementation
skills. We will continue our weekly meetings so that my advisors can help me make regular progress on the
proposed research. Regarding our rigorous academic environment, the UMD CS department is ranked 11th
nationwide by CSRankings 2023 and the UMD Criminology department is ranked 1st nationwide by US
News 2021. Contextualizing our proposal with recent NIJ grants, our research is most related to ‘Applying
Artificial Intelligence to Person-Based Policing Practices’ (Award 2018-75-CX-0002) as both works seek to
improve AI-based criminal risk assessment, but our proposal provides a unique law-informed fairness angle.
References
[1] Amanda Agan and Sonja Starr. 2018. Ban the box, criminal records, and racial discrimination: A field
experiment. The Quarterly Journal of Economics 133, 1 (2018), 191–235.
[2] Kiana Alikhademi, Emma Drobina, Diandra Prioleau, Brianna Richardson, Duncan Purves, and Juan E
Gilbert. 2022. A review of predictive policing from the perspective of fairness. Artificial Intelligence
and Law (2022), 1–17.
[3] Edmond Awad, Sohan Dsouza, Jean-François Bonnefon, Azim Shariff, and Iyad Rahwan. 2020.
Crowdsourcing moral machines. Commun. ACM 63, 3 (2020), 48–55.
[4] Jack M Balkin and Reva B Siegel. 2003. The American civil rights tradition: Anticlassification or
antisubordination. Issues in Legal Scholarship 2, 1 (2003).
[5] Michelle Bao, Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron Horowitz,
Kristian Lum, and Suresh Venkatasubramanian. 2021. It’s COMPASlicated: The Messy Relationship
between RAI Datasets and Algorithmic Fairness Benchmarks. Conference on Neural Information
Processing Systems (NeurIPS) (2021).
[6] Reuben Binns. 2020. On the apparent conflict between individual and group fairness. In Proceedings
of the 2020 conference on fairness, accountability, and transparency. 514–524.
[7] Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism
prediction instruments. Big data 5, 2 (2017), 153–163.
[8] California Penal Code. 2021. Pretrial risk assessment tools; legislative intent; definitions; validation;
public information; report on outcomes and potential biases in pretrial release. § 1320.35 (2021).
[9] Sam Corbett-Davies and Sharad Goel. 2018. The measure and mismeasure of fairness: A critical
review of fair machine learning. arXiv preprint arXiv:1808.00023 (2018).
[10] US Supreme Court. 1920. FS Royster Guano Co. v. Virginia. No. 165 (1920).
[11] US Supreme Court. 1976. Washington v. Davis. No. 74-1492 (1976).
[12] US Supreme Court. 1985. Cleburne v. Cleburne Living Center, Inc. No. 84-468 (1985).
[13] US Supreme Court. 1988. Clark v. Jeter. No. 87-5565 (1988).
[14] Wisconsin Supreme Court. 2016. State v. Loomis. No. 2015AP157-CR (2016).
[15] SD New York District Court. 2021. Flores v. Stanford. No. 18 Civ. 02468 (VB)(JCM) (2021).
[16] Julia Dressel and Hany Farid. 2018. The accuracy, fairness, and limits of predicting recidivism. Science
advances 4, 1 (2018).
[17] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness
through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference.
214–226.
[18] James R Foulds, Rashidul Islam, Kamrun Naher Keya, and Shimei Pan. 2020. An intersectional
definition of fairness. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE,
1918–1921.
[19] Stephen W Gilliland and David Chan. 2002. Justice in organizations: Theory, methods, and applica-
tions. (2002).
[20] Gabbrielle M Johnson. 2021. Algorithmic bias: on the implicit biases of social technology. Synthese
198, 10 (2021), 9941–9961.
[21] Preethi Lahoti, Krishna P. Gummadi, and Gerhard Weikum. 2019. Operationalizing Individual Fairness
with Pairwise Fair Representations. Proc. VLDB Endow. 13, 4 (Dec. 2019), 506–518.
[22] Min Kyung Lee, Anuraag Jain, Hea Jin Cha, Shashank Ojha, and Daniel Kusbit. 2019. Procedu-
ral justice in algorithmic fairness: Leveraging transparency and outcome control for fair algorithmic
mediation. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–26.
[23] Peng Liu, Manqing Du, and Tingting Li. 2021. Psychological consequences of legal responsibility
misattribution associated with automated vehicles. Ethics and information technology 23, 4 (2021),
763–776.
[24] Pranay K Lohia, Karthikeyan Natesan Ramamurthy, Manish Bhide, Diptikalyan Saha, Kush R Varsh-
ney, and Ruchir Puri. 2019. Bias mitigation post-processing for individual and group fairness. In ICASSP
2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2847–2851.
[25] Sandra G Mayson. 2019. Bias in, bias out. The Yale Law Journal 128, 8 (2019), 2218–2300.
[26] Debarghya Mukherjee, Mikhail Yurochkin, Moulinath Banerjee, and Yuekai Sun. 2020. Two simple
ways to learn individual fairness metrics from data. In International Conference on Machine Learning.
PMLR, 7097–7107.
[27] Michigan Court of Appeals. 2019. People v. Younglove. No. 341901 (2019).
[28] Court of Appeals of Virginia. 2004. Brooks v. Commonwealth. No. 2540-02-3 (2004).
[29] Consolidated Laws of New York. 2011. Board of examiners of sex offenders. § 168-l (2011).
[30] Samir Passi and Mihaela Vorvoreanu. 2022. Overreliance on AI: Literature review. (2022).
[31] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. 2008. Discrimination-aware data mining. In
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data
mining. 560–568.
[32] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. 2017. On fairness
and calibration. Advances in neural information processing systems 30 (2017).
[33] Illinois Compiled Statutes. 2023. Statewide risk-assessment tool. § 725 ILCS 5/110-6.4 (2023).
[34] New Jersey Statutes. [n. d.]. Statewide Pretrial Services Program; establishment; risk assessment in-
strument; monitoring of eligible defendants on conditional release (Proposed Legislation). § 2A:162-
25 ([n. d.]).
[35] Megan T Stevenson and Jennifer L Doleac. 2022. Algorithmic risk assessment in the hands of humans.
Available at SSRN 3489440 (2022).
[36] Anne L Washington. 2018. How to argue with an algorithm: Lessons from the COMPAS-ProPublica
debate. Colo. Tech. LJ 17 (2018), 131.
[37] Pak-Hang Wong. 2020. Democratizing algorithmic fairness. Philosophy & Technology 33 (2020),
225–244.
[38] Mikhail Yurochkin, Amanda Bower, and Yuekai Sun. 2020. Training individually fair ML models with
sensitive subspace robustness. In International Conference on Learning Representations.
First Appendix - Figures
[Figure 1 shows a vertical axis labeled ‘Scrutiny (amount warranted)’ with three proposed thresholds, from highest to lowest: the highest threshold (individual outcome fairness) with pass condition ‘ignorance in individual similarity function’, the second-highest threshold (group outcome fairness) with pass condition ‘group parity required’, and the third-highest threshold (procedural fairness) with pass condition ‘exclusion from model inputs’. A feature’s warranted scrutiny is at or above a threshold if people believe the feature should satisfy that threshold’s pass condition.]
Figure 1: Illustration of Proposed Scrutiny Thresholds
[Figure 2 shows the same vertical ‘Scrutiny (amount warranted)’ axis divided into the strict scrutiny, intermediate scrutiny, and rational basis ranges, and poses the open question of where the third-highest (‘exclusion from model inputs’) threshold should sit relative to these ranges. We only know this threshold must be below the strict scrutiny range because all features in that range (race, national origin) are excluded from model inputs.]
Figure 2: Contextualization of Third-highest Scrutiny Threshold relative to Scrutiny Ranges