Perspective

Reliance on metrics is a fundamental challenge for AI

Rachel L. Thomas1 and David Uminsky2,*
1 Queensland University of Technology, Brisbane, QLD, Australia
2 University of Chicago, Chicago, IL, USA
*Correspondence: uminsky@uchicago.edu
https://doi.org/10.1016/j.patter.2022.100476
SUMMARY
Through a series of case studies, we review how the unthinking pursuit of metric optimization can lead to real-world harms, including recommendation systems promoting radicalization, well-loved teachers fired by an algorithm, and essay grading software that rewards sophisticated garbage. The metrics used are often proxies for underlying, unmeasurable quantities (e.g., "watch time" of a video as a proxy for "user satisfaction"). We propose an evidence-based framework to mitigate such harms by (1) using a slate of metrics to get a fuller and more nuanced picture; (2) conducting external algorithmic audits; (3) combining metrics with qualitative accounts; and (4) involving a range of stakeholders, including those who will be most impacted.
INTRODUCTION
Metrics can play a central role in decision making across data-driven organizations, and their advantages and disadvantages have been widely studied.[1,2] Metrics play an even more central role in artificial intelligence (AI) algorithms, and as such their risks and disadvantages are heightened. Some of the most alarming instances of AI algorithms run amok, such as recommendation algorithms contributing to radicalization,[3] teachers described as "creative and motivating" being fired by an algorithm,[4] or essay grading software that rewards sophisticated garbage,[5] all result from overemphasizing metrics. We have to understand this dynamic in order to understand the urgent risks we are facing as a result of misuse of AI.
At their heart, what most current AI approaches do is optimize metrics. The practice of optimizing metrics is not new or unique to AI, yet AI can be particularly efficient (even too efficient) at doing so. This unreasonable effectiveness at optimizing metrics results in one of the grand challenges in AI design and ethics: the metric optimization central to AI often leads to manipulation, gaming, a focus on short-term quantities (at the expense of longer-term concerns), and other undesirable consequences, particularly when done in an environment designed to exploit people's impulses and weaknesses. Moreover, this challenge also yields, in parallel, an equally grand contradiction in AI development: optimizing metrics results in far from optimal outcomes.
Some of the issues with metrics are captured by Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."[6,7] We examine in this paper, through a review of a series of real-world case studies, how Goodhart's law, as well as additional consequences of AI's reliance on optimizing metrics, is already having an impact in society. We find from this review the following well-supported principles:

- Any metric is just a proxy for what you really care about.
- Metrics can, and will, be gamed.
- Metrics tend to overemphasize short-term concerns.
- Many online metrics are gathered in highly addictive environments.
Given the harms of overemphasizing metrics, it is important to work to mitigate these issues, which will remain in most use cases of AI. We conclude by proposing a framework for the healthier use of metrics that includes:

- Use a slate of metrics to get a fuller picture.
- Set up mechanisms for independent, external algorithmic audits.
- Combine metrics with qualitative accounts.
- Involve a range of stakeholders, including those who will be most impacted.

THE BIGGER PICTURE
The success of current artificial intelligence (AI) approaches such as deep learning centers on their unreasonable effectiveness at metric optimization, yet overemphasizing metrics leads to a variety of real-world harms, including manipulation, gaming, and a myopic focus on short-term qualities and inadequate proxies. This principle is classically captured in Goodhart's law: when a measure becomes the target, it ceases to be an effective measure. Current AI approaches have weaponized Goodhart's law by centering on optimizing a particular measure as a target. This poses a grand contradiction within AI design and ethics: optimizing metrics results in far from optimal outcomes. It is crucial to understand this dynamic in order to mitigate the risks and harms we are facing as a result of misuse of AI.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
ON METRICS, MACHINE LEARNING, AND AI
The formal mathematical definition of a metric[8] is a rule that, for each pair of elements p and q in a set E, associates to them a real number d(p,q) with the following properties:

    d(p,q) ≥ 0
    d(p,q) = 0 if and only if p = q
    d(p,q) = d(q,p)
    d(p,r) ≤ d(p,q) + d(q,r)        (Equation 1)

A metric can be used to measure the magnitude of any element in the set for which it is defined by evaluating the distance between that element and zero.
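As a minimal numerical illustration (a Python sketch; the sample points are arbitrary and not from the article), the four properties in Equation 1 can be checked directly for the familiar Euclidean distance:

import itertools

import numpy as np

def euclidean(p, q):
    """d(p, q) for points p, q in R^n."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

points = [(0.0, 0.0), (3.0, 4.0), (-1.0, 2.0)]

# Non-negativity, identity of indiscernibles, and symmetry.
for p, q in itertools.product(points, repeat=2):
    d = euclidean(p, q)
    assert d >= 0
    assert (d == 0) == (p == q)
    assert abs(d - euclidean(q, p)) < 1e-12

# Triangle inequality: d(p, r) <= d(p, q) + d(q, r).
for p, q, r in itertools.product(points, repeat=3):
    assert euclidean(p, r) <= euclidean(p, q) + euclidean(q, r) + 1e-12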
This concept of a metric for an organization is often used interchangeably with key performance indicators (KPIs), which can be defined as "the quantifiable measures an organization uses to determine how well it meets its declared operational and strategic goals."[9] In practice, metrics and KPIs are also used more generally to refer to measurements made in the realm of software products or business, such as page views/impressions, click-through rates, time spent watching, quarterly earnings, and more. Here, "metric" is used to refer to something that can be measured and quantified as a numeric value. Although they are not identical, the mathematical definition of metric (which is relevant for formal machine learning [ML] work that involves optimization of a cost or loss function) is very much linked to the more informal business usage of the term metric.
ML has been defined as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."[10,11] Here, the performance measure P must be a well-defined quantitative measure, such as accuracy or error rate. Defining and choosing this measure can involve a number of choices, such as whether to give partial credit for some answers, how to weigh false positives relative to false negatives, the penalties for frequent medium mistakes relative to rare large mistakes, and more.
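A brief sketch (in Python with scikit-learn; the toy labels and predictions are invented for illustration) shows how much these choices matter: the same predictions receive very different scores depending on whether accuracy, precision, recall, or a weighted F-beta measure is chosen as P, i.e., on how false positives are weighed against false negatives.

from sklearn.metrics import accuracy_score, fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # 2 TP, 2 FN, 1 FP, 5 TN

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.70
print("precision:", precision_score(y_true, y_pred))  # 2/3: penalizes false positives
print("recall   :", recall_score(y_true, y_pred))     # 0.50: penalizes false negatives
# beta > 1 weights false negatives more heavily; beta < 1 weights false positives more.
print("F2       :", fbeta_score(y_true, y_pred, beta=2.0))
print("F0.5     :", fbeta_score(y_true, y_pred, beta=0.5))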
In the context of deep learning (DL), training a model is the process of:[11]

    Finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set, as well as additional regularization terms.

That is, in the case of DL or ML more broadly, model training is explicitly defined around the process of optimizing metrics (in this case, minimizing cost). When one unpacks the common cost functions optimized in the training of ML/DL models, including mean squared error (MSE), mean absolute error (MAE), etc., they are constructed using classic metrics that satisfy the mathematical definition (Equation 1). Thus, by design, it is this singular pursuit of minimizing the "error" of our algorithms that explicitly links metrics to the use of AI.
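A small numerical sketch (illustrative only; the two error patterns are invented) makes this concrete: MSE and MAE are both built from metric-style distances, but they rank the same kinds of mistakes differently, so the choice between them already encodes a judgment about penalties for frequent medium mistakes versus rare large ones.

import numpy as np

y_true = np.zeros(10)
frequent_medium = np.full(10, 2.0)  # every prediction off by 2
rare_large = np.zeros(10)
rare_large[0] = 10.0                # a single prediction off by 10

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

print(mse(y_true, frequent_medium), mse(y_true, rare_large))  # 4.0 vs 10.0: MSE penalizes the rare large error more
print(mae(y_true, frequent_medium), mae(y_true, rare_large))  # 2.0 vs 1.0: MAE penalizes the frequent medium errors more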
Goodhart’s law and AI
Goodhart's law is often given as one of the limitations of metrics: a measure (or metric) that becomes a target ceases to be a good measure. This is a fundamental challenge with metrics and a relevant lens for seeing the shortcomings of AI's reliance on them. The law has appeared in various forms since economist Charles Goodhart first proposed it in 1975 in response to how monetary metrics broke down after central banks adopted them as targets. This arose out of attempts in the early 1970s to control inflation by choosing metrics with a stable relationship to inflation as targets. Central banks in a number of countries did this, and in most cases the relationship between inflation and the chosen metric broke down once that metric became a target. In his entry on Goodhart's law in The Encyclopedia of Central Banking,[6] Goodhart refers to the popular phrasing of his namesake law, "When a measure becomes a target, it ceases to be a good measure."[7] Even when participants recognize the limitations or absurdity of the metrics in use, they may still end up acting on and internalizing those metrics, as analyzed in the case study of how U.S. News & World Report rankings have completely pervaded law schools.[12]
As noted in the previous section, the target of training an ML/DL model is defined as improving or optimizing our cost function J(θ). That is, by definition, DL is a process in which a measure is the target. Thus, the risks of optimizing metrics are more likely to be heightened by AI,[13] and Goodhart's law grows increasingly relevant. The core act of optimizing in the process of developing an AI tool places Goodhart's law at the center of concern in the use of AI. Simple examples abound here. Consider a computer vision binary classification task with severely imbalanced data (e.g., a dataset that contains 998 images of a dog and 2 images of a cat). Selecting the metric that defines a "successful" algorithm is littered with Goodhart's law potholes. If we define success with recall,

    R = True Positives / (True Positives + False Negatives),

then we can simply game the classification task by replacing a computer vision algorithm with an algorithm that ignores the input and always gives the answer "dog," which achieves a recall of R = 100% for the dog class (and an accuracy of 99.8%). Although the downside risk seems minimal in such a toy example, the implications are clearer if one considers similarly imbalanced medical imaging tasks associated with cancer diagnosis. For example, convolutional neural networks used to identify pneumonia in chest X-rays have hospital-system-specific biases and are susceptible to worse performance on X-rays from hospitals outside their training set.[14]
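The dog/cat example can be made concrete with a short sketch (in Python with scikit-learn; illustrative code, not drawn from any deployed system): the constant "dog" classifier earns near-perfect headline numbers, while a broader slate of metrics, such as per-class recall and balanced accuracy, exposes the failure immediately.

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

y_true = np.array([1] * 998 + [0] * 2)  # 1 = dog, 0 = cat
y_pred = np.ones_like(y_true)           # always answer "dog", ignoring the input

print("recall (dog) :", recall_score(y_true, y_pred))               # 1.000
print("accuracy     :", accuracy_score(y_true, y_pred))             # 0.998
print("recall (cat) :", recall_score(y_true, y_pred, pos_label=0))  # 0.000
print("balanced acc :", balanced_accuracy_score(y_true, y_pred))    # 0.500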
This toy example is clearly myopic in its focus on recall alone, rather than a broader slate of considerations. However, this myopia is often found even in more complex algorithmic systems, with a focus on a limited metric (or small set of metrics) often eclipsing other relevant concerns.
PREVIOUS RELATED WORK
As part of a much broader survey of the field of AI ethics, Green[15] refers to concerns of "Goodharting" the explanation function; for instance, if an algorithm learns to produce answers that humans find appealing as opposed to accurate answers.
Work by Manheim and Garrabrant[16] develops a taxonomy of four distinct failure modes that are all contained in Goodhart's law. Here Goodhart's law is framed in terms of a metric being chosen as a proxy for a goal, and the collapse that occurs with that proxy as a target: (1) regressional Goodhart, in which the difference between the proxy and the goal is important; (2) extremal Goodhart, in which the relationship between the proxy and the goal is different when taken to an extreme; (3) causal Goodhart, in which there is a non-causal relationship between the proxy and goal, and intervening on one may fail to impact the other; and (4) adversarial Goodhart, in which adversaries attempt to game the chosen proxy. The authors then provide subcategories within causal Goodhart, based on different underlying relationships between the proxy and the goal, and different subcategories of adversarial Goodhart, based on the behavior of the regulator and the agent. Mathematical formulations for the failure modes are presented.
A related issue is "reward hacking," in which an algorithm games its reward function, learning a "clever" way of optimizing the function that subverts the designer's or programmer's intent. This is most often discussed in the field of reinforcement learning. For example, in a reimplementation of AlphaGo, the computer player learned to pass forever if passing was an allowed move. Victoria Krakovna[17] maintains a spreadsheet with dozens of such examples. Although this is a subset of the types of cases we will consider, reward hacking typically refers to the algorithm's internal behavior, whereas here we are primarily interested in systems that involve a mix of human and algorithmic behavior.
Work by Jacobs and Wallach[18] draws on the social sciences in framing how many harms caused by computational systems are the result of a mismatch between an unobservable theoretical construct (e.g., "creditworthiness," "teacher quality," "risk to society") and how it is operationalized. They offer a taxonomy of many forms of validity; a measurement model may fail if it is invalid across any of these dimensions. This work illustrates that it is not just that we overemphasize metrics, but rather that those metrics are often an invalid way of capturing the quality we care about.
The issue of over-optimizing single metrics is so worrisome that recent work[19] has suggested replacing direct optimization with "q-quantilizers," which in practice replace the deterministic highest-utility maximizer with random alternatives weighted by their utility. The idea purports to mitigate Goodharting, but it comes at the cost of lower utility and is sensitive to how well understood a "safe" base distribution of actions is. Slee[13] goes further, stating that in many cases AI's reliance on pure incentive maximization is incompatible with actually achieving optimal outcomes for the intended use cases.
Our goal here is to elucidate different characteristics of how Goodhart's law manifests in a series of real-world case studies, to examine how our online environment and current business practices exacerbate these failures, and to propose a framework for mitigating the harms caused by overemphasis of metrics within AI.
WE CANNOT MEASURE THE THINGS THAT MATTER MOST

Metrics are typically just a proxy for what we really care about. Mullainathan and Obermeyer[20] cover an interesting example: the researchers investigate which factors in someone's electronic medical record are most predictive of a future stroke, and they find that several of the most predictive factors (such as accidental injury, a benign breast lump, or colonoscopy) do not make sense as risk factors for stroke. It turned out that the model was simply identifying people who utilize health care a lot. The researchers did not actually have data on who had a stroke (a physiological event in which regions of the brain are denied oxygen); they had data on who had access to medical care, chose to go to a doctor, were given the needed tests, and had this billing code added to their chart. A number of factors influence this process: who has health insurance or can afford their co-pay, who can take time off of work or find childcare, gender and racial biases that affect who gets accurate diagnoses, cultural factors, and more. As a result, the model was largely picking out people who utilized health care versus those who did not.
This is an example of the common phenomenon of having to use proxies: you want to know what content users like, so you measure what they click on. You want to know which teachers are most effective, so you measure their students' test scores. You want to know about crime, so you measure arrests. These things are not the same. Many things we do care about cannot be measured. Metrics can be helpful, but we cannot forget that they are just proxies. As another example, Google used hours spent watching YouTube as a proxy for how happy users were with the content, writing on the Google blog that "If viewers are watching more YouTube, it signals to us that they're happier with the content they've found."[21] Guillaume Chaslot,[22] founder of the independent watchdog group AlgoTransparency and an AI engineer who formerly worked at Google/YouTube, shares how this had the side effect of incentivizing conspiracy theories, because convincing users that the rest of the media is lying kept them watching more YouTube.
METRICS CAN, AND WILL, BE GAMED
It is almost inevitable that metrics will be gamed, particularly when they are given too much power. A detailed case study of "a system of governance of public services that combined targets with an element of terror," developed for England's health care system in the 2000s, covers many ways that gaming can occur.[23] The authors detail many specific examples of how gaming manifested in the English health care system. For example, targets around emergency department wait times led some hospitals to cancel scheduled operations in order to draft extra staff to the emergency department, to require patients to wait in queues of ambulances, and to turn stretchers into "beds" by putting them in hallways. There were also significant discrepancies between numbers reported by hospitals and those reported by patients; for instance, around wait times, according to official numbers, 90% of patients were seen in less than 4 h, but only 69% of patients said they were seen in less than 4 h when surveyed.[23]
As education policy in the United States began overemphasizing student test scores as the primary way to evaluate teachers, widespread scandals emerged of teachers and principals cheating by altering students' scores in Georgia, Indiana, Massachusetts, Nevada, Virginia, Texas, and elsewhere.[24] One consequence is that teachers who do not cheat may be penalized or even fired when student test scores appear to drop to more average levels under their instruction.[4] When metrics are given undue importance, attempts to game those metrics become common.
Jacobs and Wallach[18] formalize this notion, drawing on concepts from the social sciences, to illustrate how formal metrics are attempts to operationalize what are often unobservable, theoretical constructs. One of the case studies included in their paper is the "value-added" model of measuring teacher effectiveness, which relies primarily on changes in student test scores from year to year. They highlight a number of ways in which this measurement model is invalid. The measure lacks test-retest reliability, with teachers' ratings often fluctuating wildly from year to year. It also fails convergent validity because the measurements do not correlate with other measures of teacher effectiveness, such as assessments by principals, colleagues, and parents of students. Research[19] showed that the value-added model gives lower scores to teachers at schools with higher proportions of students who are not white or are of low socioeconomic status, suggesting that this method is unfairly biased and will exacerbate inequality when used in decision making. This is a strong indication that the measure lacks consequential validity, in that it fails to account for these downstream consequences.
A modern AI case study can be drawn from recommendation systems, which are widely used across many platforms to rank and promote content for users. Platforms are rife with attempts to game their algorithms in order to show up higher in search results or recommended content, through fake clicks, fake reviews, fake followers, and more.[25] There are entire marketplaces for purchasing fake reviews, fake followers, and the like.
During one week in April 2019, Chaslot collected 84,695 videos from YouTube and analyzed the number of views and the number of channels from which they were recommended.[26] The state-owned media outlet Russia Today, abbreviated RT, was an extreme outlier in how often YouTube's algorithm had selected it to be recommended by a wide variety of other YouTube channels. According to Harwell and Timberg:[26]

    Chaslot said in an interview that while the RT video ultimately did not get massive viewership—only about 55,000 views—the numbers of recommendations suggest that Russians have grown adept at manipulating YouTube's algorithm, which uses machine-learning software to surface videos it expects viewers will want to see. The result, Chaslot said, could be a gradual, subtle elevation of Russian views online because such videos result in more recommendations and, ultimately, more views that can generate more advertising revenue and reach.
Automatic essay grading software, currently used in at least 22 US states, focuses primarily on metrics such as sentence length, vocabulary, spelling, and subject-verb agreement, but is unable to evaluate aspects of writing that are hard to quantify, such as creativity. As a result, gibberish essays randomly generated by computer programs to contain lots of sophisticated words score well. Essays from students in mainland China, which do well on essay length and sophisticated word choice, received higher scores from the algorithms than from expert human graders, suggesting that these students may be using chunks of pre-memorized text.[5]
METRICS OVEREMPHASIZE SHORT-TERM CONCERNS
It is much easier to measure short-term quantities: click-through rates, month-over-month churn, and quarterly earnings. Many long-term trends involve a complex mix of factors and are tougher to quantify. While short-term incentives have led YouTube's algorithm to promote pedophilia,[27] white supremacy,[3] and flat-earth theories,[28] the long-term impact on user trust will not be positive. Similarly, Facebook has been the subject of years' worth of privacy scandals, political manipulation, and facilitation of genocide,[29] which is now having a longer-term negative impact on Facebook's ability to recruit new engineers.[30]
Simply measuring what users click on is a short-term concern and does not take into account factors like the potential long-term impact of a long-form investigative article that may have taken months to research and that could help shape a reader's understanding of a complex issue and even lead to significant societal changes.
The Wells Fargo account fraud scandal provides a case study of how letting metrics replace strategy can harm a business.[31] After identifying cross-selling as a measure of long-term customer relationships, Wells Fargo went overboard emphasizing the cross-selling metric: intense pressure on employees combined with an unethical sales culture led to 3.5 million fraudulent deposit and credit card accounts being opened without customers' consent. The metric of cross-selling is a much more short-term concern than the loftier goal of nurturing long-term customer relationships. Overemphasizing metrics shifts our focus away from long-term concerns such as our values, trust, and reputation and our impact on society and the environment, and fixes it myopically on the short term.
MANY METRICS GATHER DATA ON WHAT WE DO IN HIGHLY ADDICTIVE ENVIRONMENTS

It matters which metrics we gather and in what environment we do so. Metrics such as what users click on, how much time they spend on sites, and "engagement" are heavily relied on by tech companies as proxies for user preference and are used to drive important business decisions. Unfortunately, these metrics are gathered in environments engineered to be highly addictive, laden with dark patterns, and where financial and design decisions have already greatly circumscribed the range of options. Although this is not a characteristic inherent to metrics, it is a current reality of many of the metrics used by tech companies today. A large-scale study analyzing approximately 11,000 shopping websites found 1,818 dark patterns present on 1,254 websites, 11.1% of the total sites.[32] These dark patterns included obstruction, misdirection, and misrepresenting user actions. The study found that more popular websites were more likely to feature these dark patterns.
Zeynep Tufekci (as quoted by Lewis) compares recommendation algorithms (such as YouTube choosing which videos to auto-play for you and Facebook deciding what to put at the top of your newsfeed) to a cafeteria shoving junk food into children's faces:[33]

    This is a bit like an autopilot cafeteria in a school that has figured out children have sweet-teeth, and also like fatty and salty foods. So you make a line offering such food, automatically loading the next plate as soon as the bag of chips or candy in front of the young person has been consumed.

As those selections get normalized, the output becomes ever more extreme: "So the food gets higher and higher in sugar, fat and salt – natural human cravings – while the videos recommended and auto-played by YouTube get more and more bizarre or hateful."[33]
Too many of our online environments are like this, with metrics capturing that we love sugar, fat, and salt, not taking into account that we are in the digital equivalent of a food desert and that companies have not been required to put nutrition labels on what they are offering. Such metrics are not indicative of what we would prefer in a healthier or more empowering environment.
A FRAMEWORK FOR A HEALTHIER USE OF METRICS
All this is not to say that we should throw metrics out altogether. Data can be valuable in helping us understand the world, test hypotheses, and move beyond gut instincts or hunches. Metrics can be useful when they are in their proper context and place. We propose a few mechanisms for addressing these issues:

- Use a slate of metrics to get a fuller picture.
- Conduct external algorithmic audits.
- Combine metrics with qualitative accounts.
- Involve a range of stakeholders, including those who will be most impacted.
Use a slate of metrics to get a fuller picture and reduce gaming

One way to keep metrics in their place is to consider a slate of many metrics for a fuller picture (and resist the temptation to try to boil these down to a single score). For instance, knowing the rates at which tech companies hire people from under-indexed groups is a very limited data point. For evaluating diversity and inclusion at tech companies, we need to know comparative promotion rates, cap table ownership, retention rates (many tech companies are revolving doors, driving people from under-indexed groups away with their toxic cultures), the number of harassment victims silenced by nondisclosure agreements (NDAs), rates of under-leveling, and more. Even then, all these data should still be combined with listening to first-person experiences of those working at these companies.
Likierman[1] indicates that using a diverse slate of metrics is one strategy to avoid gaming:

    It helps to diversify your metrics, because it's a lot harder to game several of them at once. [International law firm] Clifford Chance replaced its single metric of billable hours with seven criteria on which to base bonuses: respect and mentoring, quality of work, excellence in client service, integrity, contribution to the community, commitment to diversity, and contribution to the firm as an institution.

Likierman also cites the example of the Japanese telecommunications company SoftBank, which uses performance metrics defined for three distinct time horizons to make them harder to game.
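A minimal sketch of this practice (in Python with scikit-learn; the helper name metric_slate and the toy data are hypothetical illustrations, not from the cited works) reports a model against several complementary metrics rather than a single score, so that no one number can be optimized or gamed in isolation.

from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             precision_score, recall_score)

def metric_slate(y_true, y_pred):
    """Return several complementary views of the same predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

# Applied to the constant "dog" classifier from the earlier toy example,
# accuracy alone looks excellent while the rest of the slate reveals the failure.
y_true = [1] * 998 + [0] * 2
y_pred = [1] * 1000
print(metric_slate(y_true, y_pred))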
External algorithmic audits
To address the risk that revealing details of an algorithm's inner workings could lead to increased gaming, some have proposed external audits as a mechanism to provide accountability and independent verification while maintaining privacy.[34–37] External audits conducted without companies' cooperation have been used effectively in investigative journalism, with many powerful examples from ProPublica[34] and The Markup,[38] and in academia, including the Gender Shades study,[39] which led to real-world legislation. There is a growing group of independent firms that can be hired by companies for voluntary audits, but no consistent standards exist around what such audits involve. Currently, companies are not incentivized to find or correct their issues, and the audit firms they pay are not well incentivized to provide negative findings.[40] Just as tax audits that evaluate a company's adherence to a compliance process (and not just submitted financial documents) lead to better outcomes, effective algorithmic audits will need to consider the company's processes and not just final performance on a benchmark.[41] Toothless and inconsistent audits run the risk of legitimizing harmful tech.[42] Meaningful regulation, consistency, and consequences for harmful or discriminatory algorithms could make audits more impactful.[37,40] In the context of discussing YouTube's and Facebook's recommendation systems, Zuckerman[43] suggests external audits as an approach that could provide accountability while avoiding the gaming that open sourcing these algorithms would invite, drawing parallels with the financial industry:

    Much in the same way that an accounting firm can come in and look at privileged information. It has a fiduciary responsibility for the company, so it's not going to share it around, but it also has professional responsibility to be bound by certain standards of accounting and to raise their hand when those things are being violated ... To me, in many ways, it may be more realistic than trying to open source these algorithms, which gets a little hard because it makes them highly gameable.
Combine with qualitative accounts
Columbia professor and New York Times Chief Data Scientist Chris Wiggins[44] wrote that quantitative measures should always be combined with qualitative information: "Since we cannot know in advance every phenomenon users will experience, we cannot know in advance what metrics will quantify these phenomena. To that end, data scientists and ML engineers must partner with or learn the skills of user experience research, giving users a voice."
Proposals such as Model Cards for Model Reporting[45] and Datasheets for Datasets[46] can be viewed in line with this thinking. These works acknowledge that the metrics typically accompanying models, such as performance on a particular dataset, are insufficient to cover the many complex interactions that can occur in real-world use, as the model is applied to different populations and in different use cases, and as the use potentially veers away from the initial intent. Mitchell et al.[45] and Gebru et al.[46] propose documenting much richer and more comprehensive details about a given model or dataset, including more qualitative aspects, such as intended use cases, ethical considerations, underlying assumptions, caveats, and more.
Some of the most important qualitative information will come from those most impacted by an algorithmic system: key stakeholders who are often overlooked.
Involve a range of stakeholders, including those who will be most impacted

Another key to keeping metrics in their proper place is to keep domain experts and those who will be most impacted closely involved in their development and use. Empowering a diverse group of stakeholders to understand the implications and underlying assumptions of AI models is one of the goals of Model Cards for Model Reporting.[45] We suggest going even further and including these stakeholders in the development of these metrics in the first place.

Tool three in the Markkula Center's Ethical Toolkit for Engineering/Design Practice[47] is to "expand the ethical circle" to include the input of all stakeholders. They suggest a number of questions to ask about this topic, including:

- Whose interests, desires, skills, experiences, and values have we simply assumed, rather than actually consulted? Why have we done this, and with what justification?
- Who are all the stakeholders who will be directly affected by our product? How have their interests been protected? How do we know what their interests really are—have we asked?
- Who/which groups and individuals will be indirectly affected in significant ways? How have their interests been protected? How do we know what their interests really are—have we asked?
Although its focus is on tech policy, the Diverse Voices paper[48] provides a detailed methodology for eliciting the expertise and feedback of underrepresented populations that would be useful in improving the design and implementation of metrics.
Recently there has been a focus within ML on drawing on concepts from participatory design to reframe how we build ML systems. At the International Conference on Machine Learning (ICML) 2020, a Participatory Approaches to Machine Learning (PAML) workshop was held for the first time. The workshop organizers highlighted how the fields of algorithmic fairness and human-centered ML often focus on centralized solutions, which increases the power of system designers, and called instead for "more democratic, cooperative, and participatory systems." Research at the workshop covered the importance of actionable recourse[49] (not just explaining why a decision was made, but giving those affected actions that can change the outcome), contestability, and moving beyond fairness to shifting power.[50] However, there were also warnings[42] that "participation is not a design fix for machine learning" and about the limitations of preference elicitation for participatory algorithm design.[51]
An additional argument for the importance of including a range of stakeholders comes from the fragile and elastic nature of DL algorithms, combined with the incompatible incentives the owners of these algorithms have when addressing this fragility.[13] Specifically:

    If subjects follow their incentives then the algorithm ceases to function as designed. To sustain their accuracy, algorithms need external rules to limit permissible responses. These rules form a set of guardrails which implement value judgments, keeping algorithms functioning by constraining the actions of subjects.

Slee proposes that external guardrails on how users are allowed to engage with algorithms are a necessary corrective measure for the gaming and abuses cited above, but that the algorithm owners themselves are incentivized to create guardrails that do not align with their algorithms, an act of regulatory arbitrage.[13] Given that guardrails are a restriction on the user, transparent and just corrective measures will depend on how effectively the ethical circle has been expanded in the design and creation of the guardrails. Slee[13] gives the example of the incompatible incentives Facebook faces in addressing ethical issues with its News Feed algorithm, and suggests that journalists, who are not tempted by the financial incentives driving Facebook, would be better equipped to address them.
Although it is not feasible to simply oppose metrics, the harms caused when metrics are overemphasized include manipulation, gaming, a focus on short-term outcomes to the detriment of longer-term values, and other harmful consequences, particularly in environments designed to exploit people's impulses and weaknesses, such as most of our online ecosystem. The unreasonable effectiveness of metric optimization in current AI approaches is a fundamental challenge to the field and yields an inherent contradiction: solely optimizing metrics leads to far from optimal outcomes. However, we provide evidence in this paper that healthier use of metrics can be achieved by (1) using a slate of metrics to get a fuller and more nuanced picture; (2) externally auditing algorithms as a means of accountability; (3) combining metrics with qualitative accounts; and (4) involving a range of stakeholders, including those who will be most impacted. This framework may help address the core paradox of metric optimization within AI, and not relying solely on metric optimization may lead to a more optimal use of AI.
DECLARATION OF INTERESTS
The authors declare no competing interests.
REFERENCES
1. Likierman, A. (2009). The five traps of performance measurement. Harv.
Bus. Rev. 87, 96–101.
2. Kaplan, R., and Norton, D. (1992). The balanced scorecard: measures that drive performance. Harv. Bus. Rev. 70, 71–79.
3. Ribeiro, M.H., Ottoni, R., West, R., Almeida, V.A.F., and Meira, W., Jr.
(2019). Auditing radicalization pathways on YouTube. Preprint at arXiv.
1–18, abs/1908.08313.
4. Turque, B. (2012). 'Creative ... motivating' and fired. Wash. Post. March 6, 2012.
5. Ramineni, C., and Williamson, D. (2018). Understanding mean score differences between the e-rater automated scoring engine and humans for demographically based groups in the GRE general test. ETS Res. Rep. Ser. 2018, 1–31.
6. Goodhart, C. (2015). Goodhart’s law. In The Encyclopedia of Central
Banking, L. Rochon and S. Rossi, eds. (Edward Elgar Publishing),
pp. 227–228.
7. Strathern, M. (1997). ‘Improving ratings’: audit in the British University sys-
tem. Eur. Rev. 5, 305–321.
8. Rosenlicht, M. (1968). Introduction to Analysis (Dover).
9. Schrage, M., and Kiron, D. (2018). Leading with next-generation key per-
formance indicators. MIT Sloan Manag. Rev. 16, 1–2.
10. Mitchell, T.M. (1997). Machine Learning (McGraw-Hill).
11. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning Book
(MIT Press).
12. Espeland, W.N., and Sauder, M. (2007). Rankings and reactivity: how pub-
lic measures recreate social worlds. Am. J. Soc. 113, 1–40.
13. Slee, T. (2019). The incompatible incentives of private sector AI. In Oxford
Handbook of Ethics of Artificial Intelligence, M. Dubber, F. Pasquale, and
S. Das, eds. (Oxford University Press).
14. Zech, J., Badgeley, M., Liu, M., Costa, A., Titano, J., and Oermann, E.
(2018). Variable generalization performance of a deep learning model to
detect pneumonia in chest radiographs: a cross-sectional study. PLoS
Med. 15, e1002683.
15. Green, B. (2018). Ethical reflections on artificial intelligence. Sci. Fides
6, 9–31.
16. Manheim, D., and Garrabrant, S. (2019). Categorizing variants of Good-
hart’s law. https://arxiv.org/abs/1803.04585.
17. Krakovna, V. (2019). Specification Gaming Examples in AI - master list: sheet 1. https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml.
18. Jacobs, A., and Wallach, H. (2019). Measurement and fairness. In Pro-
ceedings of the Conference on Fairness, Accountability, and Transpar-
ency (FAT* ’21) (ACM), pp. 375–385.
19. Taylor, J. (2016). Quantilizers: a safer alternative to maximizers for limited
optimization. In The Workshops of the Thirtieth AAAI Conference on Artifi-
cial Intelligence AI, Ethics, and Society, Technical Report WS-16-02.
20. Mullainathan, S., and Obermeyer, Z. (2017). Does machine learning auto-
mate moral hazard and error? Am. Econ. Rev. 107, 476–480.
21. Meyerson, E. (2012). YouTube now: why we focus on watch time. YouTube Creator Blog. August 10, 2012. https://youtube-creators.googleblog.com/2012/08/youtube-now-why-we-focus-on-watch-time.html.
22. Chaslot, G. (2018). How algorithms can learn to discredit the media. Medium. Feb 1, 2018. https://medium.com/@guillaumechaslot/how-algorithms-can-learn-to-discredit-the-media-d1360157c4fa.
23. Bevan, G., and Hood, C. (2006). What’s measured is what matters: targets
and gaming in the English public health care system. Public Adm. 84,
517–538.
24. Gabriel, T. (2010). Under pressure, teachers tamper with tests. New York
Times. June 10, 2010.
25. Tufekci, Z. (2019). The Imperfect Truth about Finding Facts in a World of
Fakes (Wired), March 2019.
26. Harwell, D., and Timberg, C. (2019). YouTube recommended a Russian
media site thousands of times for analysis of Mueller’s report, a watchdog
group says. Wash. Post, April 26, 2019.
27. Fisher, M., and Taub, A. (2019). On YouTube’s digital playground, an open
gate for pedophiles. New York Times. June 3, 2019.
28. Landrum, A. (2018). Believing in a flat Earth. In 2018 AAAS Annual Meeting (AAAS), February 2018.
29. Vaidhyanathan, S. (2018). Antisocial Media: How Facebook Disconnects
Us and Undermines Democracy (Oxford University Press).
30. Bowles, N. (2018). ‘I don’t really want to work for Facebook.’ so say some
computer science students. New York Times. Nov. 15, 2018.
31. Harris, M., and Tayler, B. (2019). Don’t let metrics undermine your busi-
ness. Harv. Bus. Rev.
32. Mathur, A., Acar, G., Friedman, M., Lucherini, E., Mayer, J., Chetty, M., and Narayanan, A. (2019). Dark patterns at scale: findings from a crawl of 11K shopping websites. Proc. ACM Hum. Comput. Interact. 3, 32. https://doi.org/10.1145/3359183.
33. Lewis, P. (2018). ‘Fiction is outperforming reality’: how YouTube’s algo-
rithm distorts truth. Guardian, Feb 2, 2018.
34. Angwin, J. (2016). Making algorithms accountable (ProPublica), Aug
1, 2016.
35. Zuckerman, E. (2020). The Case for Digital Public Infrastructure (Knight
First Amendment Institute at Columbia University), January 17, 2020.
36. Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., Khlaaf, H., Yang, J., Toner, H., Fong, R., et al. (2020). Towards Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. https://arxiv.org/abs/2004.07213.
37. Engler, A. (2021). Auditing employment algorithms for discrimination.
Brookings Inst. Rep.
38. Angwin, J. (2021). Our first year: auditing algorithms. Markup Hello World.
Feb 27, 2021.
39. Buolamwini, J., and Gebru, T. (2018). Gender Shades: intersectional accu-
racy disparities in commercial gender classification. In Proceedings of the
1st Conference on Fairness, Accountability and Transparency (PMLR),
pp. 77–91.
40. Ng, A. (2021). Can auditing eliminate bias from algorithms? Markup,
February 23, 2021.
41. Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., and Denton, E. (2020). Saving face: investigating the ethical concerns of facial recognition auditing. In AAAI/ACM AI Ethics and Society Conference.
42. Sloane, M., Moss, E., Awomolo, O., and Forlano, L. (2020). Participation is not a design fix for machine learning. In Participatory Approaches to ML Workshop at International Conference on Machine Learning 2020.
43. Zuckerman, E. (2020). Interview with Kevin Roose of the New York Times
(Reimagining the Internet (5). The Initiative for Digital Public Infrastructure
at UMass Amherst).
44. Wiggins, C. (2018). Ethical Principles, OKRs, and KPIs: What YouTube and Facebook Could Learn from Tukey (Columbia University Data Science Institute Blog). https://datascience.columbia.edu/ethical-principles-okrs-and-kpis-what-youtube-and-facebook-could-learn-tukey.
45. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., and Gebru, T. (2019). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19) (ACM), pp. 220–229.
46. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J., Wallach, H., Daumé, H., III, and Crawford, K. (2021). Datasheets for datasets. Communications of the ACM 64, 86–92. https://doi.org/10.1145/3458723.
47. Vallor, S., Green, B., and Raicu, I. (2018). Ethics in Technology Practice.
(The Markkula Center for Applied Ethics at Santa Clara University).
https://www.scu.edu/ethics/.
48. Young, M., Magassa, L., and Friedman, B. (2019). Toward inclusive tech policy design: a method for underrepresented voices to strengthen tech policy documents. Ethics Inf. Technol. https://doi.org/10.1007/s10676-019-09497-z.
49. Ustun, B., Spangher, A., and Liu, Y. (2018). Actionable recourse in linear
classification. In ACM Conference on Fairness, Accountability and Trans-
parency (FAT2019).
50. Watson-Daniels, J. (2020). Beyond fairness and ethics: towards agency and shifting power. In Participatory Approaches to ML Workshop at International Conference on Machine Learning.
51. Robertson, S., and Salehi, N. (2020). What if I don't like any of the choices? The limits of preference elicitation for participatory algorithm design. In Participatory Approaches to ML Workshop at International Conference on Machine Learning.
About the authors

Rachel L. Thomas, PhD, is a professor of practice at Queensland University of Technology and co-founder of fast.ai, which created one of the world's most popular deep learning courses. Previously, she was founding director of the Center for Applied Data Ethics at the University of San Francisco. Rachel earned her PhD in mathematics at Duke University and was selected by Forbes as one of 20 Incredible Women in AI. Follow her on Twitter at @math_rachel.

David Uminsky, PhD, is a Senior Research Associate in Computer Science and Executive Director of the University of Chicago's Data Science Institute. He was previously an Associate Professor of Mathematics and founding Executive Director of the Data Institute at the University of San Francisco (USF). David's research interests are in the area of data science and applied mathematics. He was selected by the National Academy of Sciences as a Kavli Frontiers of Science Fellow. Prior to joining USF, David was a combined National Science Foundation (NSF) and University of California (UC) President's Fellow at UCLA, where he was awarded the Chancellor's Award for outstanding postdoctoral research.
... This is often framed in terms of Goodhart's Law, which in one formulation holds that, "When a measure becomes a target, it ceases to be a good measure"(Strathern 1997, p. 308). For discussion of AI and Goodhart's Law, see Manheim and Garrabrant 2019;Thomas and Uminsky 2022. Content courtesy of Springer Nature, terms of use apply. ...
Article
Full-text available
A catastrophic AI threat model is a rigorous exploration of some particular mechanisms by which AI could potentially lead to catastrophic outcomes. In this article, I explore a polycrisis threat model. According to this model, AI will lead to a series of harms like disinformation and increased concentration of wealth and power. Interactions between these different harms will make things worse than they would have been had each harm operated in isolation. And the interacting harms will ultimately cause or constitute a catastrophe. My aim in this paper is not to defend the inevitability of such a polycrisis occurring. Instead, I aspire merely to establish that polycrisis-driven catastrophe is sufficiently plausible that it calls for further exploration. In doing so, I hope to emphasise that alongside worries about AI takeover, those concerned about catastrophic risk from AI should also take seriously worries about extreme power concentration and systemic disempowerment of humanity.
... We found that algorithmic benchmarks and metrics are still the primary methods for evaluating GPT's output quality in PCG, particularly for non-text content. In the broader AI field, over-reliance on quantitative metrics has been criticized for potentially obscuring genuine emotional responses and interactive experiences [54]. Users may also exhibit biases toward AI-generated content [102], [208], being either overly critical or excessively lenient compared to human-made content. ...
Article
Due to GPT's impressive generative capabilities, its applications in games are expanding rapidly. To offer researchers a comprehensive understanding of the current applications and identify both emerging trends and unexplored areas, this paper introduces an updated scoping review of 177 articles, 122 of which were published in 2024, to explore GPT's potential for games. By coding and synthesizing the papers, we identify five prominent applications of GPT in current game research: procedural content generation, mixed-initiative game design, mixed-initiative gameplay, playing games, and game user research. Drawing on insights from these application areas and emerging research, we propose future studies should focus on expanding the technical boundaries of the GPT models and exploring the complex interaction dynamics between them and users. This review aims to illustrate the state of the art in innovative GPT applications in games, offering a foundation to enrich game development and enhance player experiences through cutting-edge AI innovations.
... 142 It is important to note that defining and weighting diverse and context-specific model criteria is not strictly a scientific question but a matter of balancing the priorities of multiple stakeholders. 143 Optimizing agricultural yields, for example, can lead to biodiversity loss, degraded soils, and increased pollution. 144,145 Therefore, deciding on criteria by which models are evaluated should take place in an open conversation between ML researchers, crop modelers, stakeholders from multiple communities, and experts from other scientific disciplines such as soil scientists, climatologists, agronomists, and ecologists. ...
... The data preprocessing, augmentation, and EDA phase is crucial to the success of an ML project, as it is often considered the central reason for doubting the overall quality of the work by many data scientists [60]. While ML practitioners use various metrics to assess the goodness of fit of the model to the data, there are currently no reliable metrics to assess the phenomenological fidelity of the data to the underlying context, which is whether the data is a good representation of the problem [61,62]. Furthermore, measures of goodness of data are still under research. ...
Article
Full-text available
To maximize business value from artificial intelligence and machine learning (ML) systems, understanding what leads to the effective development and deployment of ML systems is crucial. While prior research primarily focused on technical aspects, important issues related to improving decision‐making across ML workflows have been overlooked. This paper introduces a “normative‐descriptive‐prescriptive” decision framework to address this gap. Normative guidelines outline best practices, descriptive dimensions describe actual decision‐making, and prescriptive elements provide recommendations to bridge gaps. The three‐step framework analyzes decision‐making in key ML pipeline phases, identifying gaps and offering prescriptions for improved model building. Key descriptive findings include rushed problem‐solving with convenient data, use of inaccurate success metrics, underestimation of downstream impacts, limited roles of subject matter experts, use of non‐representative data samples, prioritization of prediction over explanation, lack of formal verification processes, and challenges in monitoring production models. The paper highlights biases, incentive issues, and systematic disconnects in decision‐making across the ML pipeline as contributors to descriptive shortcomings. Practitioners can use the framework to pinpoint gaps, develop prescriptive interventions, and build higher quality, ethical, and legally compliant ML systems.
... In addition to producing accurate forecasts, models must be reliable in real-world settings for adoption by 60 stakeholders (van der Velde and Nisini, 2019). The evaluation metrics should closely represent the needs of stakeholders and allow a more granular breakdown of model performance (Thomas and Uminsky, 2022;Burnell et al., 2023) -for example, the model's ability to capture yield variability in years with climate extremes (Watson, 2022). To avoid overestimation of model skill, the evaluation procedure must take into account the specific challenges arising from the use of spatio-temporal data that does not satisfy independent and identically distributed assumptions (Meyer and Pebesma, 2022;Sweet et al., 2023;Kapoor and ...
Preprint
Full-text available
In-season, pre-harvest crop yield forecasts are essential for enhancing transparency in commodity markets and improving food security. They play a key role in increasing resilience to climate change and extreme events and thus contribute to the United Nations’ Sustainable Development Goal 2 of zero hunger. Pre-harvest crop yield forecasting is a complex task, as several interacting factors contribute to yield formation, including in-season weather variability, extreme events, long-term climate change, soil, pests, diseases and farm management decisions. Several modeling approaches have been employed to capture complex interactions among such predictors and crop yields. Prior research for in-season, pre-harvest crop yield forecasting has primarily been case-study based, which makes it difficult to compare modeling approaches and measure progress systematically. To address this gap, we introduce CY-Bench (Crop Yield Benchmark), a comprehensive dataset and benchmark to forecast maize and wheat yields at a global scale. CY-Bench was conceptualized and developed within the Machine Learning team of the Agricultural Model Intercomparison and Improvement Project (AgML) in collaboration with agronomists, climate scientists, and machine learning researchers. It features publicly available sub-national yield statistics and relevant predictors—such as weather data, soil characteristics, and remote sensing indicators—that have been pre-processed, standardized, and harmonized across spatio-temporal scales. With CY-Bench, we aim to: (i) establish a standardized framework for developing and evaluating data-driven models across diverse farming systems in more than 25 countries across six continents; (ii) enable robust and reproducible model comparisons that address real-world operational challenges; (iii) provide an openly accessible dataset to the earth system science and machine learning communities, facilitating research on time series forecasting, domain adaptation, and online learning. The dataset (https://doi.org/10.5281/zenodo.11502142, (Paudel et al., 2025a)) and accompanying code (https://github.com/WUR-AI/AgML-CY-Bench, (Paudel et al., 2025b))) are openly available to support the continuous development of advanced data driven models for crop yield forecasting to enhance decision-making on food security.
... This problem is an instance of Goodhart's law, the adage that "when a measure becomes a target, it ceases to be a good measure". Initially presented in the context of economics, it also applies in the context of evaluating LLMs [7,8]. Unfortunately, its importance in this context often goes unappreciated. ...
Preprint
Full-text available
Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive capabilities have likewise rapidly improved, with the implication that such models are becoming progressively more capable on various real-world tasks. Here I summarise theoretical and empirical considerations to challenge this narrative. I argue that inherent limitations with the benchmarking paradigm, along with specific limitations of existing benchmarks, render benchmark performance highly unsuitable as a metric for generalisable competence over cognitive tasks. I also contend that alternative methods for assessing LLM capabilities, including adversarial stimuli and interpretability techniques, have shown that LLMs do not have robust competence in many language and reasoning tasks, and often fail to learn representations which facilitate generalisable inferences. I conclude that benchmark performance should not be used as a reliable indicator of general LLM cognitive capabilities.
Preprint
Full-text available
Understanding the decisions made and actions taken by increasingly complex AI system remains a key challenge. This has led to an expanding field of research in explainable artificial intelligence (XAI), highlighting the potential of explanations to enhance trust, support adoption, and meet regulatory standards. However, the question of what constitutes a "good" explanation is dependent on the goals, stakeholders, and context. At a high level, psychological insights such as the concept of mental model alignment can offer guidance, but success in practice is challenging due to social and technical factors. As a result of this ill-defined nature of the problem, explanations can be of poor quality (e.g. unfaithful, irrelevant, or incoherent), potentially leading to substantial risks. Instead of fostering trust and safety, poorly designed explanations can actually cause harm, including wrong decisions, privacy violations, manipulation, and even reduced AI adoption. Therefore, we caution stakeholders to beware of explanations of AI: while they can be vital, they are not automatically a remedy for transparency or responsible AI adoption, and their misuse or limitations can exacerbate harm. Attention to these caveats can help guide future research to improve the quality and impact of AI explanations.
Article
We outline a four-step process for ML/AI developers to align development choices with multiple values, by adapting a widely utilized framework from bioethics: (1) identify the values that matter, (2) specify identified values, (3) find solution spaces that allow for maximal alignment with identified values, and (4) make hard choices if there are unresolvable trade-offs between the identified values. Key to this approach is identifying resolvable trade-offs between values (Step 3). We survey ML/AI methods that could be used to this end, identifying approaches at each stage of the development process. All steps should be guided by community engagement. The framework outlines what it means to build a value-aligned ML/AI system, providing development teams with practical guidance to maximize the chances their work has desirable impacts.
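Purely as a toy illustration of Step 3, and not the authors' framework, the sketch below screens hypothetical candidate models against two value metrics and keeps only the Pareto-optimal ones, so that hard choices (Step 4) are reserved for genuinely unresolvable trade-offs:

```python
# Toy screening of candidate models for "resolvable" trade-offs between two
# values. The candidates, metrics, and scores are made-up assumptions.
candidates = {
    "model_a": {"accuracy": 0.91, "fairness": 0.62},
    "model_b": {"accuracy": 0.88, "fairness": 0.71},
    "model_c": {"accuracy": 0.87, "fairness": 0.58},  # dominated by model_b
}

def dominates(x, y):
    """True if x is at least as good as y on every value and strictly better on one."""
    return all(x[k] >= y[k] for k in x) and any(x[k] > y[k] for k in x)

# Keep only non-dominated candidates (the Pareto front over the value metrics).
pareto = {
    name: scores
    for name, scores in candidates.items()
    if not any(dominates(other, scores)
               for other_name, other in candidates.items()
               if other_name != name)
}
print(pareto)
```

Dominated options can be discarded without any value judgment; deliberation and community engagement are then focused on the remaining, genuinely conflicting candidates.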
Article
Information retrieval (IR) technologies and research are undergoing transformative changes. It is our perspective that the community should accept this opportunity to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as social and political sciences, and should be co-developed with cross-disciplinary scholars, legal and policy experts, civil rights and social justice activists, and artists, among others. In this perspective paper, we motivate why the community must consider this radical shift in how we do research and what we work on, and sketch a path forward towards this transformation.
Chapter
This chapter explores the intricate relationship between artificial intelligence (AI) and sustainable growth in the financial sector. It examines several uses of AI, such as risk assessment, portfolio optimisation, and climate risk analysis, and highlights the transformative role of AI in fostering a future where economic prosperity harmonises with environmental sustainability and societal growth. The utilisation of AI is driving the financial sector towards a future characterised by responsibility and sustainability, as seen in the implementation of personalised, sustainable investment portfolios and the establishment of transparent supply chains. The chapter emphasises the significant revolutionary capabilities of AI and the importance of stakeholders embracing this paradigm shift and actively contributing to a global landscape where financial prosperity fosters overall societal welfare.
Technical Report
A serious look at the future purpose, role, and influence of KPIs in a machine learning era. The report identifies a fundamental disruption, called ‘the big flip’, in which KPIs shift from being outputs for human decisions to inputs for machine learning models.
Article
To be successful, policy must anticipate a broad range of constituents. Yet, all too often, technology policy is written with primarily mainstream populations in mind. In this article, drawing on Value Sensitive Design and discount evaluation methods, we introduce a new method—Diverse Voices—for strengthening pre-publication technology policy documents from the perspective of underrepresented groups. Cost effective and high impact, the Diverse Voices method intervenes by soliciting input from “experiential” expert panels (i.e., members of a particular stakeholder group and/or those serving that group). We first describe the method. Then we report on two case studies demonstrating its use: one with a white paper on augmented reality technology with expert panels on people with disabilities, people who were formerly or currently incarcerated, and women; and the other with a strategy document on automated driving vehicle technologies with expert panels on youth, non-car drivers, and extremely low-income people. In both case studies, panels identified significant shortcomings in the pre-publication documents which, if addressed, would mitigate some of the disparate impact of the proposed policy recommendations on these particular stakeholder groups. Our discussion includes reflection on the method, evidence for its success, its limitations, and future directions.
Article
Dark patterns are user interface design choices that benefit an online service by coercing, steering, or deceiving users into making unintended and potentially harmful decisions. We present automated techniques that enable experts to identify dark patterns on a large set of websites. Using these techniques, we study shopping websites, which often use dark patterns to influence users into making more purchases or disclosing more information than they would otherwise. Analyzing ~53K product pages from ~11K shopping websites, we discover 1,818 dark pattern instances, together representing 15 types and 7 broader categories. We examine these dark patterns for deceptive practices, and find 183 websites that engage in such practices. We also uncover 22 third-party entities that offer dark patterns as a turnkey solution. Finally, we develop a taxonomy of dark pattern characteristics that describes the underlying influence of the dark patterns and their potential harm on user decision-making. Based on our findings, we make recommendations for stakeholders including researchers and regulators to study, mitigate, and minimize the use of these patterns.
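The paper's crawl-and-cluster pipeline is far more sophisticated; purely as a hedged sketch, with a made-up phrase list rather than the authors' taxonomy or detector, the snippet below flags one familiar category (urgency and scarcity cues) in a product page's text:

```python
import re

# Toy heuristic for one dark-pattern category (false urgency/scarcity).
# The phrase list below is an illustrative assumption, not the paper's taxonomy.
URGENCY_PATTERNS = [
    r"only \d+ left in stock",
    r"\d+ (?:people|others) are looking at this",
    r"offer ends in \d+ (?:minutes|hours)",
    r"hurry[,!]? (?:sale|deal) ends soon",
]

def flag_urgency(page_text: str) -> list[str]:
    """Return the urgency/scarcity patterns matched in a product page's text."""
    text = page_text.lower()
    return [p for p in URGENCY_PATTERNS if re.search(p, text)]

sample = "Hurry, sale ends soon! Only 3 left in stock. 14 people are looking at this."
print(flag_urgency(sample))  # three of the four patterns match this sample
```

Real detectors must also handle dynamically rendered pages and the many other pattern categories the paper documents, which is why the authors pair crawling with segment clustering rather than fixed phrase lists.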
Conference Paper
Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. We propose model cards as a step towards the responsible democratization of machine learning and related artificial intelligence technology, increasing transparency into how well artificial intelligence technology works. We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed evaluation numbers and other relevant documentation.
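Purely as a minimal sketch, with illustrative section names, toy data, and groups rather than the paper's exact template, disaggregated evaluation of this kind can be assembled programmatically:

```python
from sklearn.metrics import accuracy_score

# Toy predictions for a binary classifier, with an illustrative group attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Disaggregated evaluation: the same metric reported per group.
accuracy_by_group = {
    g: accuracy_score(
        [t for t, gg in zip(y_true, groups) if gg == g],
        [p for p, gg in zip(y_pred, groups) if gg == g],
    )
    for g in sorted(set(groups))
}

# A model-card-style summary; the section names are illustrative assumptions,
# not the official structure defined in the paper.
model_card = {
    "model_details": {"name": "toy-classifier", "version": "0.1"},
    "intended_use": "Illustration only; not for deployment.",
    "evaluation": {
        "overall_accuracy": accuracy_score(y_true, y_pred),
        "accuracy_by_group": accuracy_by_group,
    },
    "caveats": "Tiny toy sample; real model cards also report intersectional results.",
}
print(model_card)
```

Reporting per-group numbers alongside the aggregate, together with the intended-use and caveats fields, is what makes such a card useful for judging whether a model fits a specific deployment context.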