
The accuracy, fairness, and limits of predicting recidivism


Abstract

Algorithms for predicting recidivism are commonly used to assess a criminal defendant’s likelihood of committing a crime. These predictions are used in pretrial, parole, and sentencing decisions. Proponents of these systems argue that big data and advanced machine learning make these analyses more accurate and less biased than humans. We show, however, that the widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice expertise. We further show that a simple linear predictor provided with only two features is nearly equivalent to COMPAS with its 137 features.
RESEARCH METHODS
Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).
Julia Dressel and Hany Farid*
We are the frequent subjects of predictive algorithms that determine
music recommendations, product advertising, university admission,
job placement, and bank loan qualification. In the criminal justice sys-
tem, predictive algorithms have been used to predict where crimes will
most likely occur, who is most likely to commit a violent crime, who is
likely to fail to appear at their court hearing, and who is likely to reoffend
at some point in the future (1).
One widely used criminal risk assessment tool, Correctional Offender Management Profiling for Alternative Sanctions (COMPAS; Northpointe, which rebranded itself to “equivant” in January 2017), has been used to assess more than 1 million offenders since it was developed in 1998. The recidivism prediction component of COMPAS, the recidivism risk scale, has been in use since 2000. This software predicts a defendant’s risk of committing a misdemeanor or felony within 2 years of assessment from 137 features about an individual and the individual’s past criminal record.
Although the data used by COMPAS do not include an individual’s race, other aspects of the data may be correlated with race, which can lead to racial disparities in the predictions. In May 2016, writing for ProPublica, Angwin et al. (2) analyzed the efficacy of COMPAS on more than 7000 individuals arrested in Broward County, Florida between 2013 and 2014. This analysis indicated that the predictions were unreliable and racially biased. COMPAS’s overall accuracy for white defendants is 67.0%, only slightly higher than its accuracy of 63.8% for black defendants. The mistakes made by COMPAS, however, affected black and white defendants differently: Black defendants who did not recidivate were incorrectly predicted to reoffend at a rate of 44.9%, nearly twice as high as their white counterparts at 23.5%; and white defendants who did recidivate were incorrectly predicted to not reoffend at a rate of 47.7%, nearly twice as high as their black counterparts at 28.0%. In other words, COMPAS scores appeared to favor white defendants over black defendants by underpredicting recidivism for white and overpredicting recidivism for black defendants.
In response to this analysis, Northpointe argued that the ProPublica analysis overlooked other more standard measures of fairness that the COMPAS score satisfies (3) [see also the studies of Flores et al. (4) and Kleinberg et al. (5)]. Specifically, it is argued that the COMPAS score is not biased against blacks because the likelihood of recidivism among high-risk offenders is the same regardless of race (predictive parity), it can discriminate between recidivists and nonrecidivists equally well for white and black defendants as measured with the area under the curve of the receiver operating characteristic, AUC-ROC (accuracy equity), and the likelihood of recidivism for any given score is the same regardless of race (calibration). The disagreement amounts to different definitions of fairness. In an eloquent editorial, Corbett-Davies et al. (6) explain that it is impossible to simultaneously satisfy all of these definitions of fairness because black defendants have a higher overall recidivism rate (in the Broward County data set, black defendants recidivate at a rate of 51% as compared with 39% for white defendants, similar to the national averages).
While the debate over algorithmic fairness continues, we consider the more fundamental question of whether these algorithms are any better than untrained humans at predicting recidivism in a fair and accurate way. We describe the results of a study that shows that people from a popular online crowdsourcing marketplace, who, it can reasonably be assumed, have little to no expertise in criminal justice, are as accurate and fair as COMPAS at predicting recidivism. In addition, although Northpointe has not revealed the inner workings of their recidivism prediction algorithm, we show that the accuracy of COMPAS on one data set can be explained with a simple linear classifier. We also show that although COMPAS uses 137 features to make a prediction, the same predictive accuracy can be achieved with only two features. We further show that more sophisticated classifiers do not improve prediction accuracy or fairness. Collectively, these results cast significant doubt on the entire effort of algorithmic recidivism prediction.
We compare the overall accuracy and bias in human assessment with
the algorithmic assessment of COMPAS. Throughout, a positive predic-
tion is one in which a defendant is predicted to recidivate, whereas a
negative prediction is one in which they are predicted to not recidivate.
We measure overall accuracy as the rate at which a defendant is correctly
predicted to recidivate or not (that is, the combined true-positive and
true-negative rates). We also report on false positives (a defendant is pre-
dicted to recidivate but they do not) and false negatives (a defendant is
predicted to not recidivate but they do).
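As a minimal sketch of these definitions (not the authors' code), the three quantities can be computed from paired lists of predictions and outcomes. The function name `rates` and the toy data are hypothetical; the false-positive and false-negative rates are taken relative to the defendants who did not and did recidivate, respectively.

```python
def rates(predicted, actual):
    """Return (accuracy, false-positive rate, false-negative rate).

    predicted, actual: equal-length sequences of booleans, where True
    means "recidivates". Assumes both outcomes occur at least once.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    tn = sum((not p) and (not a) for p, a in pairs)
    fp = sum(p and (not a) for p, a in pairs)
    fn = sum((not p) and a for p, a in pairs)
    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn)  # share of non-recidivists predicted to recidivate
    fnr = fn / (fn + tp)  # share of recidivists predicted to not recidivate
    return accuracy, fpr, fnr


# Toy data (hypothetical): six defendants
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]
accuracy, fpr, fnr = rates(predicted, actual)
```

Applying the same function separately to the records of black and white defendants would give the per-race rates discussed in the fairness comparisons.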
Human assessment
Participants saw a short description of a defendant that included the defendant’s sex, age, and previous criminal history, but not their race. Participants predicted whether this person would recidivate within 2 years of their most recent crime. We used a total of 1000 defendant descriptions that were randomly divided into 20 subsets of 50 each. To make the task manageable, each participant was randomly assigned to see one of these 20 subsets. The mean and median accuracy for these predictions is 62.1 and 64.0%, respectively.
6211 Sudikoff Laboratory, Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA.
*Corresponding author. Email:
Dressel and Farid, Sci. Adv. 2018; 4: eaao5580 17 January 2018 1 of 5
We compare these results with the performance of COMPAS on
this subset of 1000 defendants. Because groups of 20 participants
judged the same subset of 50 defendants, the individual judgments
are not independent. However, because each participant judged only
one subset of the defendants, the median accuracies of each subset can
reasonably be assumed to be independent. Therefore, the participant
performance on the 20 subsets can be directly compared to the
COMPAS performance on the same 20 subsets. A one-sided t test reveals
that the average of the 20 median participant accuracies of 62.8%
[with a standard deviation (SD) of 4.8%] is, just barely, lower than the
COMPAS accuracy of 65.2% (P = 0.045).
To determine whether there is “wisdom in the crowd” (7) (in our
case, a small crowd of 20 per subset), participant responses were pooled
within each subset using a majority rules criterion. This crowd-based
approach yields a prediction accuracy of 67.0%. A one-sided t test re-
Prediction accuracy can also be assessed using the AUC-ROC.
The AUC-ROC for our participants is 0.71 ± 0.03, nearly identical
to COMPAS’s 0.70 ± 0.04.
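The majority-rules pooling and the AUC-ROC mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the tie-breaking rule, the use of the fraction of "yes" votes as a score for the AUC-ROC, and all data are hypothetical.

```python
def majority_vote(votes):
    """Pool yes/no votes (True = "will recidivate") by majority rule.

    Ties are broken toward "no" here; that choice is an assumption.
    """
    return sum(votes) > len(votes) / 2


def auc_roc(scores, labels):
    """AUC-ROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical votes from 5 participants on 4 defendants
votes = [
    [True, True, True, False, True],
    [False, False, True, False, False],
    [True, False, True, True, False],
    [False, True, False, False, False],
]
actual = [True, False, False, True]

crowd = [majority_vote(v) for v in votes]
scores = [sum(v) / len(v) for v in votes]  # fraction of "yes" votes
crowd_accuracy = sum(c == a for c, a in zip(crowd, actual)) / len(actual)
crowd_auc = auc_roc(scores, actual)
```

In the study each subset was judged by 20 participants, so ties are possible and the tie-breaking rule would matter.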
Prediction accuracy can also be assessed using tools from signal detection theory in which accuracy is expressed in terms of sensitivity (d′) and bias (β). Higher values of d′ correspond to greater participant sensitivity. A value of d′ = 0 means that the participant has no information to make reliable identifications no matter what bias he or she might have. A value of β = 1.0 indicates no bias, a value of β > 1 indicates that participants are biased toward classifying a defendant as not being at risk of recidivating, and β < 1 indicates that participants are biased toward classifying a defendant as being at risk of recidivating. With a d′ of 0.86 and a β of 1.02, our participants are slightly more sensitive and slightly less biased than COMPAS with a d′ of 0.77 and a β of 1.08.
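For readers unfamiliar with these signal detection measures, the sketch below computes sensitivity and bias from a hit rate and a false-alarm rate under the standard equal-variance Gaussian model. The paper does not state which formulation it used, so the model, the function name `sensitivity_and_bias`, and the example rates are assumptions.

```python
from math import exp
from statistics import NormalDist


def sensitivity_and_bias(hit_rate, false_alarm_rate):
    """Return (d_prime, beta) under the equal-variance Gaussian model.

    hit_rate: fraction of recidivists predicted to recidivate.
    false_alarm_rate: fraction of non-recidivists predicted to recidivate.
    Both rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa
    beta = exp((z_fa ** 2 - z_hit ** 2) / 2)  # likelihood ratio at the criterion
    return d_prime, beta


# Hypothetical rates, not values reported in the paper
d_prime, beta = sensitivity_and_bias(0.70, 0.30)
```

With symmetric rates such as these, the bias comes out at 1 (no bias); lowering both rates, that is, answering "yes" less often, pushes it above 1, in line with the interpretation given in the text.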
With considerably less information than COMPAS (only 7 features compared to COMPAS’s 137), a small crowd of nonexperts is as accurate as COMPAS at predicting recidivism. In addition, our participants’ and COMPAS’s predictions were in agreement for 692 of the 1000 defendants.
We measure the fairness of our participants with respect to a defendant’s race based on the crowd predictions. Our participants’ accuracy on black defendants is 68.2% compared with 67.6% for white defendants. An unpaired t test reveals no significant difference across race (P = 0.87). This is similar to COMPAS, which has an accuracy of 64.9% for black defendants and 65.7% for white defendants, which is also not significantly different (P = 0.80, unpaired t test). By this measure of fairness, our participants and COMPAS are fair to black and white defendants.
Our participants’ false-positive rate for black defendants is 37.1% compared with 27.2% for white defendants. An unpaired t test reveals a significant difference across race (P = 0.027). Our participants’ false-negative rate for black defendants is 29.2% compared with 40.3% for white defendants. An unpaired t test reveals a significant difference across race (P = 0.034). These discrepancies are similar to those of COMPAS, which has a false-positive rate of 40.4% for black defendants and 25.4% for white defendants, which are significantly different (P = 0.002, unpaired t test). COMPAS’s false-negative rate for black defendants is 30.9% compared with 47.9% for white defendants, which are significantly different (P = 0.003, unpaired t test). By this measure of fairness, our participants and COMPAS are similarly unfair to black defendants, despite the fact that race is not explicitly specified. See Table 1 [columns (A) and (C)] and Fig. 1 for a summary of these results.
Prediction with race
In this second condition, a newly recruited set of 400 participants
repeated the same study but with the defendant’s race included. We
wondered whether including a defendant’s race would reduce or
exaggerate the effect of any implicit, explicit, or institutional racial
bias. In this condition, the mean and median accuracy on predicting
whether a defendant would recidivate is 62.3 and 64.0%, nearly iden-
tical to the condition where race is not specified.
The crowd-based accuracy is 66.5%, slightly lower than the condition where race is not specified, but not significantly different (P = 0.66, paired t test). The crowd-based AUC-ROC is 0.71 ± 0.03 and the d′/β is 0.83/1.03, similar to the previous no-race condition [Table 1, columns (A) and (B)].
With respect to fairness, participant accuracy is not significantly different for black defendants (66.2%) compared with white defendants (67.6%; P = 0.65, unpaired t test). The false-positive rate for black defendants is 40.0% compared with 26.2% for white defendants. An unpaired t test reveals a significant difference across race (P = 0.001). The false-negative rate for black defendants is 30.1% compared with 42.1% for white defendants, which, again, is significantly different (P = 0.030, unpaired t test). See Table 1 [column (B)] for a summary of these results.
In conclusion, there is insufficient evidence to suggest that including
race has a significant impact on overall accuracy or fairness.
The exclusion of race does not necessarily lead to the elimination of
racial disparities in human recidivism prediction.
Participant demographics
Our participants ranged in age from 18 to 74 (with one participant
over the age of 75) and in education level from “less than high
school degree” to “professional degree.” Neither age, gender, nor
level of education had a significant effect on participant accuracy.
There were not enough nonwhite participants to reliably measure
any differences across participant race.
Table 1. Human versus COMPAS algorithmic predictions from 1000 defendants. Overall accuracy is specified as percent correct, AUC-ROC, and criterion sensitivity (d′) and bias (β). See also Fig. 1.

                          (A) Human     (B) Human     (C) COMPAS
                          (no race)     (race)
Accuracy (overall)        67.0%         66.5%         65.2%
AUC-ROC (overall)         0.71          0.71          0.70
d′/β (overall)            0.86/1.02     0.83/1.03     0.77/1.08
Accuracy (black)          68.2%         66.2%         64.9%
Accuracy (white)          67.6%         67.6%         65.7%
False positive (black)    37.1%         40.0%         40.4%
False positive (white)    27.2%         26.2%         25.4%
False negative (black)    29.2%         30.1%         30.9%
False negative (white)    40.3%         42.1%         47.9%
Algorithmic assessment
Because nonexperts are as accurate as the COMPAS software, we wondered about the sophistication of the COMPAS predictive algorithm. Northpointe’s COMPAS software incorporates 137 distinct features to predict recidivism. With an overall accuracy of around 65%, these predictions are not as accurate as we might want, particularly from the point of view of a defendant whose future lies in the balance.
Northpointe does not reveal the details of the inner workings of COMPAS, understandably so, given their commercial interests. We have, however, found that a simple linear predictor, logistic regression (LR) (see Materials and Methods), provided with the same seven features as our participants (in the no-race condition), yields similar prediction accuracy as COMPAS. As compared to COMPAS’s overall accuracy of 65.4%, the LR classifier yields an overall testing accuracy of 66.6%. This predictor yields similar results to COMPAS in terms of predictive fairness [Table 2, columns (A) and (D)].
Despite using only 7 features as input, a standard linear predictor
yields similar results to COMPAS’s predictor with 137 features. We can
reasonably conclude that COMPAS is using nothing more sophisticated
than a linear predictor or its equivalent.
To test whether performance was limited by the classifier or by the nature of the data, we trained a more powerful nonlinear support vector machine (NL-SVM) on the same data. Somewhat surprisingly, the NL-SVM yields nearly identical results to the linear classifier [Table 2, column (C)]. If the relatively low accuracy of the linear classifier was because the data are not linearly separable, then we would have expected the NL-SVM to perform better. The failure to do so suggests that the data are simply not separable, linearly or otherwise.
Lastly, we wondered whether using an even smaller subset of the 7 features would be as accurate as using COMPAS’s 137 features. We trained and tested an LR classifier on all possible subsets of the seven features. A classifier based on only two features (age and total number of previous convictions) performs as well as COMPAS; see Table 2 [column (B)]. The importance of these two criteria is consistent with the conclusions of two meta-analysis studies that set out to determine, in part, which criteria are most predictive of recidivism (8, 9).
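The exhaustive search over feature subsets described above is small enough to enumerate directly: seven features give 2^7 - 1 = 127 non-empty subsets. The sketch below only generates the subsets; the feature names are shorthand chosen here for the seven features listed in Materials and Methods, and the training and testing of each subset is left out.

```python
from itertools import combinations

# Shorthand names (chosen here) for the seven features
FEATURES = [
    "age", "sex", "juvenile_misdemeanors", "juvenile_felonies",
    "prior_crimes", "crime_degree", "crime_charge",
]

# Every non-empty subset of the seven features, smallest subsets first
subsets = [
    subset
    for size in range(1, len(FEATURES) + 1)
    for subset in combinations(FEATURES, size)
]
# Each subset would then be handed to a train/test routine (not shown)
# and the subsets compared by average testing accuracy.
```

The two-feature subset highlighted in the text corresponds to `("age", "prior_crimes")` in this naming.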
We have shown that commercial software that is widely used to
predict recidivism is no more accurate or fair than the predictions
of people with little to no criminal justice expertise who responded
to an online survey. Given that our participants, our classifiers, and
COMPAS all seemed to reach a performance ceiling of around 65%
accuracy, it is important to consider whether any improvement is
possible. We should note that our participants were each presented
with the same data for each defendant and were not instructed on
how to use these data in making a prediction. It remains to be seen
whether their prediction accuracy would improve with the addition
of guidelines that specify how much weight individual features should
be given. For example, a large-scale meta-analysis of approaches to
predicting recidivism of sexual offenders (10) found that actuarial
measures, in which explicit data and explicit combination rules are
used to combine the data into a single score, provide more accurate
predictions than unstructured measures in which neither explicit data
nor explicit combination rules are specified. It also remains to be seen
whether the addition of dynamic risk factors (for example, pro-offending
attitudes and socio-affective problems) would improve prediction
accuracy as previously suggested (11,12) (we note, however, that
COMPAS does use some dynamic risk factors that do not appear
to improve overall accuracy). Lastly, because pooling responses from
multiple participants yields higher accuracy than individual responses,
it remains to be seen whether a larger pool of participants will yield
even higher accuracy, or whether participants with criminal justice ex-
pertise would outperform those without.
Although Northpointe does not reveal the details of their COMPAS software, we have shown that their prediction algorithm is equivalent to a simple linear classifier. In addition, despite the impressive-sounding use of 137 features, it would appear that a linear classifier based on only 2 features (age and total number of previous convictions) is all that is required to yield the same prediction accuracy as COMPAS.
The question of accurate prediction of recidivism is not limited to
COMPAS. A review of nine different algorithmic approaches to pre-
dicting recidivism found that eight of the nine approaches failed to
make accurate predictions (including COMPAS) (13). In addition, a
meta-analysis of nine algorithmic approaches found only moderate
levels of predictive accuracy across all approaches and concluded that
these techniques should not be solely used for criminal justice decision-
making, particularly in decisions of preventative detention (14).
Recidivism in this study, and for the purpose of evaluating
COMPAS, is operationalized with rearrest that, of course, is not a di-
rect measure of reoffending. As a result, differences in the arrest rate of
black and white defendants complicate the direct comparison of false-
positive and false-negative rates across race (black people, for example,
When considering using software such as COMPAS in making
decisions that will significantly affect the lives and well-being of
criminal defendants, it is valuable to ask whether we would put these
decisions in the hands of random people who respond to an online
survey because, in the end, the results from these two approaches
appear to be indistinguishable.
Fig. 1. Human (no-race condition) versus COMPAS algorithmic predictions (see also Table 1).
Our analysis was based on a database of 2013–2014 pretrial defendants from Broward County, Florida (2). This database of 7214 defendants contains individual demographic information, criminal history, the COMPAS recidivism risk score, and each defendant’s arrest record within a 2-year period following the COMPAS scoring. COMPAS scores, ranging from 1 to 10, classify the risk of recidivism as low-risk (1 to 4), medium-risk (5 to 7), or high-risk (8 to 10).
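The score bands above translate directly into a lookup. A minimal sketch, with a hypothetical function name:

```python
def risk_category(score):
    """Map a COMPAS score (1 to 10) to its risk label."""
    if not 1 <= score <= 10:
        raise ValueError("COMPAS scores range from 1 to 10")
    if score <= 4:
        return "low-risk"
    if score <= 7:
        return "medium-risk"
    return "high-risk"


labels = [risk_category(s) for s in range(1, 11)]
```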
Our algorithmic assessment was based on this full set of 7214
defendants. Our human assessment was based on a random subset
of 1000 defendants, which was held fixed throughout all conditions.
This subset yielded similar overall COMPAS accuracy, false-positive
rate, and false-negative rate as the complete database (a positive pre-
diction is one in which a defendant is predicted to recidivate; a neg-
ative prediction is one in which they are predicted to not recidivate).
The COMPAS accuracy for this subset of 1000 defendants was 65.2%.
The average COMPAS accuracy on 10,000 random subsets of size
1000 each was 65.4% with a 95% confidence interval of (62.6, 68.1).
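The check that the fixed subset is representative can be illustrated by resampling. The sketch below draws random subsets of 1000 from a synthetic stand-in for the database and reports the mean accuracy with a 95% percentile interval. The synthetic flags, the percentile method, and the reduced number of subsets (2000 rather than 10,000, to keep the run short) are all assumptions; the paper does not say how its interval was computed.

```python
import random


def subset_accuracy_interval(correct, subset_size, n_subsets, seed=0):
    """Mean accuracy and a 95% percentile interval over random subsets.

    correct: list of 0/1 flags, one per defendant (1 = prediction correct).
    """
    rng = random.Random(seed)
    accuracies = sorted(
        sum(rng.sample(correct, subset_size)) / subset_size
        for _ in range(n_subsets)
    )
    mean = sum(accuracies) / n_subsets
    lower = accuracies[int(0.025 * n_subsets)]
    upper = accuracies[int(0.975 * n_subsets) - 1]
    return mean, lower, upper


# Synthetic stand-in for the database: 7214 flags, about 65.4% correct
flags = [1] * 4718 + [0] * (7214 - 4718)
mean, lower, upper = subset_accuracy_interval(flags, 1000, 2000)
```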
Human assessment
A descriptive paragraph for each of 1000 defendants was generated:
The defendant is a [SEX] aged [AGE]. They have been charged with: [CRIME CHARGE]. This crime is classified as a [CRIMINAL DEGREE]. They have been convicted of [NON-JUVENILE PRIOR COUNT] prior crimes. They have [JUVENILE-FELONY COUNT] juvenile felony charges and [JUVENILE-MISDEMEANOR COUNT] juvenile misdemeanor charges on their record.
In a follow-up condition, the defendant’s race was included so that the first line of the above paragraph read, “The defendant is a [RACE] [SEX] aged [AGE].”
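Generating these paragraphs amounts to filling a fixed template from each record. A minimal sketch follows; the dictionary keys and the example record are hypothetical and not taken from the Broward County data.

```python
TEMPLATE = (
    "The defendant is a {race}{sex} aged {age}. They have been charged "
    "with: {charge}. This crime is classified as a {degree}. They have "
    "been convicted of {priors} prior crimes. They have {juv_felonies} "
    "juvenile felony charges and {juv_misdemeanors} juvenile misdemeanor "
    "charges on their record."
)


def describe(defendant, include_race=False):
    """Fill the template; race appears only in the follow-up condition."""
    race = defendant["race"] + " " if include_race else ""
    fields = {k: v for k, v in defendant.items() if k != "race"}
    return TEMPLATE.format(race=race, **fields)


# Hypothetical record for illustration only
example = {
    "race": "white", "sex": "male", "age": 30, "charge": "grand theft",
    "degree": "felony", "priors": 2, "juv_felonies": 0,
    "juv_misdemeanors": 1,
}
no_race_text = describe(example)
race_text = describe(example, include_race=True)
```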
There were a total of 63 unique criminal charges including armed
robbery, burglary, grand theft, prostitution, robbery, and sexual assault.
The crime degree is either “misdemeanor” or “felony.” To ensure that
our participants understood the nature of each crime, the above
paragraph was followed by a short description of each criminal charge:
After reading the defendant description, participants were then asked to respond either “yes” or “no” to the question “Do you think this person will commit another crime within 2 years?” The participants were required to answer each question and could not change their response once it was made. After each answer, the participants were given two forms of feedback: whether their response was correct and their average accuracy.
The 1000 defendants were randomly divided into 20 subsets of
50 each. Each participant was randomly assigned to see one of these
20 subsets. The participants saw the 50 defendants, one at a time, in
random order. The participants were only allowed to complete a single
subset of 50 defendants.
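The assignment procedure above can be sketched with a shuffled partition. The seeds and function name are hypothetical; the point is only that each defendant lands in exactly one subset and that each participant sees one subset in a random order.

```python
import random


def assign_subsets(n_defendants=1000, n_subsets=20, seed=0):
    """Randomly partition defendant indices into equal-sized subsets."""
    rng = random.Random(seed)
    indices = list(range(n_defendants))
    rng.shuffle(indices)
    size = n_defendants // n_subsets
    return [indices[i * size:(i + 1) * size] for i in range(n_subsets)]


subsets = assign_subsets()

# One participant: pick a subset at random, then randomize presentation order
participant_rng = random.Random(1)
subset_id = participant_rng.randrange(len(subsets))
presentation = subsets[subset_id][:]
participant_rng.shuffle(presentation)
```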
The participants were recruited through Amazon’s Mechanical Turk, an online crowdsourcing marketplace where people are paid to perform a wide variety of tasks (Institutional Review Board guidelines were followed for all participants). Our task was titled “Predicting Crime” with the description “Read a few sentences about an actual person and predict if they will commit a crime in the future.” The keywords for the task were “survey, research, and criminal justice.” The participants were paid $1.00 for completing the task and a $5.00 bonus if their overall accuracy on the task was greater than 65%. This bonus was intended to provide an incentive for participants to pay close attention to the task. To filter out participants who were not paying close attention, three catch trials were randomly added to the subset of 50 questions. These questions were formatted to look like all other questions but had easily identifiable correct answers. A participant’s response was eliminated from our analysis if any of these questions were answered incorrectly. The catch trial questions were (i) The state of California was the 31st state to join the Union. California’s nickname is: The Golden State. The state capital is Sacramento. California is bordered by three other states. Los Angeles is California’s most populous city, which is the country’s second largest city after New York City. Does the state of California have a nickname?; (ii) The first spaceflight that landed humans on the Moon was Apollo 11. These humans were: Neil Armstrong and Buzz Aldrin. Armstrong was the first person to step onto the lunar surface. This landing occurred in 1969. They collected 47.5 pounds (21.59 kg) of lunar material to bring back to Earth. Did the first spaceflight that landed humans on the Moon carry Buzz Aldrin?; and (iii) The Earth is the third planet from the Sun. The shape of Earth is approximately largest of the four terrestrial planets. During one orbit around the Sun, Earth rotates about its axis over 365 times. Earth is home to over 7.4 billion humans. Is Earth the fifth planet from the Sun?
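The exclusion and payment rules described above reduce to two small checks. In this sketch the question keys and function names are hypothetical; the expected answers follow from the three catch-trial texts.

```python
# Expected answers to the three catch trials (keys are hypothetical labels)
CATCH_ANSWERS = {
    "california_nickname": "yes",
    "apollo_aldrin": "yes",
    "earth_fifth_planet": "no",
}


def passes_catch_trials(responses):
    """True only if every catch trial was answered correctly."""
    return all(responses.get(q) == a for q, a in CATCH_ANSWERS.items())


def payment(accuracy):
    """$1.00 base pay plus a $5.00 bonus for accuracy above 65%."""
    return 1.00 + (5.00 if accuracy > 0.65 else 0.0)


attentive = {"california_nickname": "yes", "apollo_aldrin": "yes",
             "earth_fifth_planet": "no"}
inattentive = dict(attentive, earth_fifth_planet="yes")
```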
Table 2. Algorithmic predictions from 7214 defendants. Logistic regression with 7 features (A) (LR7), logistic regression with 2 features (B) (LR2), a nonlinear SVM with 7 features (C) (NL-SVM), and the commercial COMPAS software with 137 features (D) (COMPAS). The results in columns (A), (B), and (C) correspond to the average testing accuracy over 1000 random 80%/20% training/testing splits. The values in the square brackets correspond to the 95% bootstrapped [columns (A), (B), and (C)] and binomial [column (D)] confidence intervals.

                          (A) LR7               (B) LR2               (C) NL-SVM            (D) COMPAS
Accuracy (overall)        66.6% [64.4, 68.9]    66.8% [64.3, 69.2]    65.2% [63.0, 67.2]    65.4% [64.3, 66.5]
Accuracy (black)          66.7% [63.6, 69.6]    66.7% [63.5, 69.2]    64.3% [61.1, 67.7]    63.8% [62.2, 65.4]
Accuracy (white)          66.0% [62.6, 69.6]    66.4% [62.6, 70.1]    65.3% [61.4, 69.0]    67.0% [65.1, 68.9]
False positive (black)    42.9% [37.7, 48.0]    45.6% [39.9, 51.1]    31.6% [26.4, 36.7]    44.8% [42.7, 46.9]
False positive (white)    25.3% [20.1, 30.2]    25.3% [20.6, 30.5]    20.5% [16.1, 25.0]    23.5% [20.7, 26.5]
False negative (black)    24.2% [20.1, 28.2]    21.6% [17.5, 25.9]    39.6% [34.2, 45.0]    28.0% [25.7, 30.3]
False negative (white)    47.3% [40.8, 54.0]    46.1% [40.0, 52.7]    56.6% [50.3, 63.5]    47.7% [45.2, 50.2]
Responses for the first (no-race) condition were collected from 462 participants, 62 of whom were removed because of an incorrect response on a catch trial. Responses for the second (race) condition were collected from 449 participants, 49 of whom were removed because of an incorrect response on a catch trial. In each condition, this yielded 20 participant responses for each of 20 subsets of 50 questions. Because of the random pairing of participants to a subset of 50 questions, we occasionally oversampled the required number of 20 participants. In these cases, we selected a random 20 participants and discarded any excess responses. Throughout, we used both paired and unpaired t tests (with 19 degrees of freedom) to analyze the performance of our participants and COMPAS.
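As an illustration of the paired case, the sketch below computes a paired t statistic and its degrees of freedom; with 20 subsets the paired test has 20 - 1 = 19 degrees of freedom. The function name and the toy numbers are hypothetical, and turning the statistic into a P value requires the t distribution, which is not in Python's standard library and is left out here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for equal-length samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))  # stdev is the sample SD
    return t, n - 1


# Toy per-subset accuracies (hypothetical), e.g. crowd versus algorithm
crowd = [1.0, 2.0, 3.0, 4.0]
algorithm = [0.0, 0.0, 1.0, 1.0]
t_stat, dof = paired_t(crowd, algorithm)
```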
Algorithmic assessment
Our algorithmic analysis used the same seven features as described in
the previous section extracted from the records in the Broward County
database. Unlike the human assessment that analyzed a subset of these
defendants, the following algorithmic assessment was performed over
the entire database.
We used two different classifiers: logistic regression (15) (a linear
classifier) and a nonlinear SVM (16). The input to each classifier was
seven features from 7214 defendants: age, sex, number of juvenile mis-
demeanors, number of juvenile felonies, number of prior (nonjuvenile)
crimes, crime degree, and crime charge (see previous section). Each clas-
sifier was trained to predict recidivism from these seven features. Each
classifier was trained 1000 times on a random 80% training and 20%
testing split; we report the average testing accuracy and bootstrapped
95% confidence intervals for these classifiers.
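The evaluation protocol (repeated random 80%/20% splits, average testing accuracy, bootstrapped interval) can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the tiny gradient-descent logistic regression stands in for whatever implementation the authors used, the data are random, the number of splits is reduced to 20, and the paper does not spell out its resampling scheme, so the bootstrap of the mean shown here is only one plausible reading.

```python
import math
import random


def train_lr(X, y, epochs=100, lr=0.1):
    """Fit logistic regression by batch gradient descent; returns (w, b)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1 / (1 + math.exp(-z))
            for j in range(d):
                gw[j] += (p - yi) * xi[j]
            gb += p - yi
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of cases where the sign of the linear score matches y."""
    preds = [sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 for xi in X]
    return sum(p == bool(yi) for p, yi in zip(preds, y)) / len(y)


def repeated_splits(X, y, n_splits, rng):
    """Testing accuracy over repeated random 80%/20% splits."""
    idx, cut, accs = list(range(len(y))), int(0.8 * len(y)), []
    for _ in range(n_splits):
        rng.shuffle(idx)
        train, test = idx[:cut], idx[cut:]
        w, b = train_lr([X[i] for i in train], [y[i] for i in train])
        accs.append(accuracy(w, b, [X[i] for i in test], [y[i] for i in test]))
    return accs


def bootstrap_ci(values, rng, n_boot=2000):
    """95% percentile bootstrap interval for the mean of `values`."""
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]


# Synthetic data: two standardized features and a noisy linear rule
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [int(rng.random() < 1 / (1 + math.exp(-(x[1] - x[0])))) for x in X]

accs = repeated_splits(X, y, n_splits=20, rng=rng)
mean_acc = sum(accs) / len(accs)
ci_low, ci_high = bootstrap_ci(accs, rng)
```

Because the labels are generated with noise, even a correctly specified linear model stays well below perfect accuracy on this toy data, which mirrors the paper's point that a ceiling can come from the data rather than from the classifier.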
Logistic regression is a linear classifier that, in a two-class classification (as in our case), computes a separating hyperplane to distinguish between recidivists and nonrecidivists. A nonlinear SVM uses a kernel function (in our case, a radial basis kernel) to project the initial seven-dimensional feature space to a higher-dimensional space in which a linear hyperplane is used to distinguish between recidivists and nonrecidivists. The use of a kernel function amounts to computing a nonlinear separating surface in the original seven-dimensional feature space, allowing the classifier to capture more complex patterns between recidivists and nonrecidivists than is possible with linear classifiers.
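For concreteness, the radial basis kernel in its common form is k(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it for two feature vectors; the value of `gamma` is an assumption, since the paper does not report the kernel parameters.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points
apart = rbf_kernel([0.0, 0.0], [1.0, 0.0])  # points one unit apart
```

The kernel equals 1 for identical points and decays toward 0 as points move apart, so the SVM's decision surface is built from localized similarity to training examples rather than from a single hyperplane in the original feature space.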
1. W. L. Perry, B. McInnis, C. C. Price, S. C. Smith, J. S. Hollywood, Predictive Policing: The Role of Crime Forecasting in Law Enforcement Operations (Rand Corporation, 2013).
2. J. Angwin, J. Larson, S. Mattu, L. Kirchner, “Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks,” ProPublica, 23 May 2016;
3. W. Dieterich, C. Mendoza, T. Brennan, “COMPAS risk scales: Demonstrating accuracy equity and predictive parity” (Technical Report, Northpointe Inc., 2016).
4. A. W. Flores, K. Bechtel, C. T. Lowenkamp, False positives, false negatives, and false analyses: A rejoinder to “Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.” Fed. Prob. 80, 38
5. J. Kleinberg, S. Mullainathan, M. Raghavan, Inherent trade-offs in the fair determination of risk scores; (2016).
6. S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, “A computer program used for bail and sentencing decisions was labeled biased against blacks. It’s actually not that clear,” Washington Post, 17 October 2016;
7. R. Hastie, T. Kameda, The robust beauty of majority rules in group decisions. Psychol. Rev. 112, 494–508 (2005).
8. P. Gendreau, T. Little, C. Goggin, A meta-analysis of the predictors of adult offender recidivism: What works! Criminology 34, 575–608 (1996).
9. R. K. Hanson, M. T. Bussière, Predicting relapse: A meta-analysis of sexual offender recidivism studies. J. Consult. Clin. Psychol. 66, 348–362 (1998).
10. R. K. Hanson, K. E. Morton-Bourgon, The accuracy of recidivism risk assessments for sexual offenders: A meta-analysis of 118 prediction studies. Psychol. Assess. 21, 1–21 (2009).
11. R. K. Hanson, A. J. Harris, Where should we intervene? Dynamic predictors of sexual offense recidivism. Crim. Justice Behav. 27, 6–35 (2000).
12. A. Beech, C. Friendship, M. Erikson, R. K. Hanson, The relationship between static and dynamic risk factors and reconviction in a sample of U.K. child abusers. Sex. Abuse 14, 155–167 (2002).
13. K. A. Geraghty, J. Woodhams, The predictive validity of risk assessment tools for female offenders: A systematic review. Aggress. Violent Behav. 21, 25 (2015).
14. M. Yang, S. C. Wong, J. Coid, The efficacy of violence prediction: A meta-analytic comparison of nine risk assessment tools. Psychol. Bull. 136, 740–767 (2010).
15. D. R. Cox, The regression analysis of binary sequences. J. R. Stat. Soc. B 20, 215–242 (1958).
16. C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20, 273–297 (1995).
Acknowledgments: We wish to thank M. Banks, M. Bravo, E. Cooper, L. Lax, and G. Wolford for
helpful discussions. Funding: The authors acknowledge that they received no funding in
support of this research. Author contributions: The authors contributed equally to all aspects
of this work and manuscript preparation. Competing interests: The authors declare that they
have no competing interests. Data and materials availability: All data associated with this
research may be found at
scienceadvances17. Additional data related to this paper may be requested from the authors.
Submitted 2 August 2017
Accepted 11 December 2017
Published 17 January 2018
Citation: J. Dressel, H. Farid, The accuracy, fairness, and limits of predicting recidivism. Sci. Adv.
4, eaao5580 (2018).