Perspective
Reliance on metrics is a fundamental challenge for AI
Rachel L. Thomas¹ and David Uminsky²,*
¹Queensland University of Technology, Brisbane, QLD, Australia
²University of Chicago, Chicago, IL, USA
*Correspondence: uminsky@uchicago.edu
https://doi.org/10.1016/j.patter.2022.100476
SUMMARY
Through a series of case studies, we review how the unthinking pursuit of metric optimization can lead to real-world harms, including recommendation systems promoting radicalization, well-loved teachers fired by an algorithm, and essay grading software that rewards sophisticated garbage. The metrics used are often proxies for underlying, unmeasurable quantities (e.g., "watch time" of a video as a proxy for "user satisfaction"). We propose an evidence-based framework to mitigate such harms by (1) using a slate of metrics to get a fuller and more nuanced picture; (2) conducting external algorithmic audits; (3) combining metrics with qualitative accounts; and (4) involving a range of stakeholders, including those who will be most impacted.
INTRODUCTION
Metrics can play a central role in decision making across data-driven organizations, and their advantages and disadvantages have been widely studied.1,2 Metrics play an even more central role in artificial intelligence (AI) algorithms, and as such their risks and disadvantages are heightened. Some of the most alarming instances of AI algorithms run amok, such as recommendation algorithms contributing to radicalization,3 teachers described as "creative and motivating" being fired by an algorithm,4 or essay grading software that rewards sophisticated garbage,5 all result from overemphasizing metrics. We have to understand this dynamic in order to understand the urgent risks we are facing as a result of misuse of AI.
At their heart, what most current AI approaches do is optimize metrics. The practice of optimizing metrics is not new or unique to AI, yet AI can be particularly efficient (even too efficient) at doing so. This unreasonable effectiveness at optimizing metrics results in one of the grand challenges in AI design and ethics: the metric optimization central to AI often leads to manipulation, gaming, a focus on short-term quantities (at the expense of longer-term concerns), and other undesirable consequences, particularly when done in an environment designed to exploit people's impulses and weaknesses. Moreover, this challenge also yields, in parallel, an equally grand contradiction in AI development: optimizing metrics results in far from optimal outcomes.
Some of the issues with metrics are captured by Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."6,7 We examine in this paper, through a review of a series of real-world case studies, how Goodhart's law, as well as additional consequences of AI's reliance on optimizing metrics, is already having an impact on society. We find from this review the following well-supported principles:

- Any metric is just a proxy for what you really care about.
- Metrics can, and will, be gamed.
- Metrics tend to overemphasize short-term concerns.
- Many online metrics are gathered in highly addictive environments.
Given the harms of overemphasizing metrics, it is important to work to mitigate these issues, which will remain in most use cases of AI. We conclude by proposing a framework for the healthier use of metrics that includes:

- Use a slate of metrics to get a fuller picture.
- Set up mechanisms for independent, external algorithmic audits.
- Combine metrics with qualitative accounts.
- Involve a range of stakeholders, including those who will be most impacted.

THE BIGGER PICTURE
The success of current artificial intelligence (AI) approaches such as deep learning centers on their unreasonable effectiveness at metric optimization, yet overemphasizing metrics leads to a variety of real-world harms, including manipulation, gaming, and a myopic focus on short-term qualities and inadequate proxies. This principle is classically captured in Goodhart's law: when a measure becomes the target, it ceases to be an effective measure. Current AI approaches have weaponized Goodhart's law by centering on optimizing a particular measure as a target. This poses a grand contradiction within AI design and ethics: optimizing metrics results in far from optimal outcomes. It is crucial to understand this dynamic in order to mitigate the risks and harms we are facing as a result of misuse of AI.
ON METRICS, MACHINE LEARNING, AND AI
The formal mathematical definition of a metric8 is a rule that, for each pair of elements p and q in a set E, associates to them a real number d(p, q) with the following properties:

d(p, q) ≥ 0
d(p, q) = 0 if and only if p = q
d(p, q) = d(q, p)
d(p, r) ≤ d(p, q) + d(q, r)     (Equation 1)

A metric can be used to measure the magnitude of any element in the set for which it is defined by evaluating the distance between that element and zero.
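As a minimal illustration (ours, not part of the original text), the following Python sketch numerically checks these four properties for the familiar Euclidean distance on a handful of arbitrarily chosen points:

```python
import itertools
import math

def euclidean(p, q):
    """Euclidean distance between two points given as tuples of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A handful of arbitrary sample points in the plane.
points = [(0.0, 0.0), (1.0, 2.0), (-3.0, 4.0), (2.5, -1.5)]

for p, q in itertools.product(points, repeat=2):
    d = euclidean(p, q)
    assert d >= 0                              # non-negativity
    assert (d == 0) == (p == q)                # zero distance only for identical points
    assert math.isclose(d, euclidean(q, p))    # symmetry

for p, q, r in itertools.product(points, repeat=3):
    # triangle inequality: d(p, r) <= d(p, q) + d(q, r)
    assert euclidean(p, r) <= euclidean(p, q) + euclidean(q, r) + 1e-12

print("Euclidean distance satisfies the metric axioms on these samples.")
```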
This concept of a metric for an organization is often used interchangeably with key performance indicators (KPIs), which can be defined as "the quantifiable measures an organization uses to determine how well it meets its declared operational and strategic goals."9 In practice, metrics and KPIs are also used more generally to refer to measurements made in the realm of software products or business, such as page views/impressions, click-through rates, time spent watching, quarterly earnings, and more. Here, "metric" is used to refer to something that can be measured and quantified as a numeric value. Although they are not identical, the mathematical definition of metric (which is relevant for formal machine learning [ML] work that involves optimization of a cost or loss function) is very much linked to the more informal business usage of the term.
ML has been defined as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."10,11 Here, the performance measure P must be a well-defined quantitative measure, such as accuracy or error rate. Defining and choosing this measure can involve a number of choices, such as whether to give partial credit for some answers, how to weigh false positives relative to false negatives, the penalties for frequent medium mistakes relative to rare large mistakes, and more.
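To make concrete how these choices change what counts as good performance, here is a small hypothetical sketch (the labels, predictions, and cost weights are invented for illustration) that contrasts plain accuracy with a cost-weighted error in which a false negative is penalized five times as heavily as a false positive:

```python
# Hypothetical ground-truth labels and model predictions (1 = positive class).
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)

# One possible choice of P: penalize a missed positive (FN) five times as
# much as a false alarm (FP). The weights encode a value judgment.
FN_COST, FP_COST = 5.0, 1.0
weighted_error = (FN_COST * fn + FP_COST * fp) / len(y_true)

print(f"accuracy = {accuracy:.2f}")
print(f"cost-weighted error (FN x{FN_COST:.0f}) = {weighted_error:.2f}")
```

The same predictions can look acceptable under one measure and poor under another; the choice of P is itself a design decision.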
In the context of deep learning (DL), training a model is the process of:11

Finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set, as well as additional regularization terms.
That is, in the case of DL, or ML more broadly, model training is explicitly defined around the process of optimizing metrics (in this case, minimizing cost). When one unpacks the common cost functions optimized in the training of ML/DL models, including mean squared error (MSE), mean absolute error (MAE), etc., they are constructed using classic metrics that satisfy the mathematical definition (Equation 1). Thus, by design, it is this singular pursuit of minimizing the "error" of our algorithms that explicitly links metrics to the use of AI.
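The structure described above can be written out directly. The sketch below is illustrative only (the synthetic data, learning rate, and regularization strength are arbitrary choices): it defines a cost J(θ) for a linear model as the MSE over a training set plus an L2 regularization term and minimizes it by gradient descent, which is exactly the sense in which training makes a metric the target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic training set: y is roughly a linear function of x plus noise.
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

lam = 0.01  # regularization strength (arbitrary choice)

def cost(theta):
    """J(theta): mean squared error on the training set plus L2 regularization."""
    residuals = X @ theta - y
    mse = np.mean(residuals ** 2)      # the performance measure
    reg = lam * np.sum(theta ** 2)     # the additional regularization term
    return mse + reg

def grad(theta):
    """Gradient of J with respect to theta."""
    residuals = X @ theta - y
    return 2 * X.T @ residuals / len(y) + 2 * lam * theta

# Training is, by definition, iterative minimization of J.
theta = np.zeros(3)
for _ in range(500):
    theta -= 0.1 * grad(theta)

print("estimated parameters:", np.round(theta, 2))
print("final cost J(theta):", round(cost(theta), 4))
```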
Goodhart’s law and AI
Goodhart's law is often given as one of the limitations of metrics: a measure (or metric) that becomes a target ceases to be a good measure. This is a fundamental challenge with metrics and a relevant lens through which to see the shortcomings of AI's reliance on metrics. This law has appeared in various forms since economist Charles Goodhart first proposed it in 1975 in response to how monetary metrics broke down after central banks adopted them as targets. This arose out of attempts in the early 1970s to control inflation by choosing metrics with a stable relationship to inflation as targets. Central banks from a number of countries did this, and in most cases the relationship between inflation and the chosen metric broke down once that metric became a target. In his entry on Goodhart's law in The Encyclopedia of Central Banking,6 Goodhart refers to the popular phrasing of his namesake law, "When a measure becomes a target, it ceases to be a good measure."7 Even when participants can recognize the limitations or absurdity of the metrics in use, they may still end up acting on and internalizing such metrics, as analyzed in the case study of how U.S. News & World Report rankings have completely pervaded law schools.12
As noted in the previous section, the target of training an ML/DL model is defined as improving or optimizing our cost function J(θ). That is, by definition, DL is a process in which a measure is the target. Thus, there is an increased likelihood that any risks of optimizing metrics are heightened by AI,13 and Goodhart's law grows increasingly relevant. The core act of optimizing in the process of developing an AI tool places Goodhart's law at the center of concern in the use of AI. Simple examples abound here. Consider a computer vision binary classification task with severely imbalanced data (e.g., a dataset that contains 998 images of dogs and 2 images of cats). The selection of a metric that defines a "successful algorithm" is littered with Goodhart's law potholes. If we define success with recall,

R = True Positives / (True Positives + False Negatives),

then we can simply game our classification task by replacing a computer vision algorithm with an algorithm that ignores the input and always gives the answer "dog," which will achieve a recall of R = 100% for the "dog" class (and an overall accuracy of 99.8%) while never identifying a single cat. Although the downside risk seems minimal in such a toy example, the implications are clearer if one considers similarly imbalanced medical imaging tasks associated with cancer diagnosis. For example, convolutional neural networks used to identify pneumonia in chest X-rays have hospital-system-specific biases and are susceptible to worse performance on X-rays from hospitals outside their training set.14
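The gaming described above is easy to reproduce. The short sketch below (using the toy numbers from the text) scores the constant classifier on the 998-dog/2-cat dataset: recall for the majority "dog" class is perfect and accuracy is 99.8%, while recall for the "cat" class is zero.

```python
def recall(labels, predictions, positive):
    """Recall = TP / (TP + FN) for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(labels, predictions))
    fn = sum(t == positive and p != positive for t, p in zip(labels, predictions))
    return tp / (tp + fn)

labels = ["dog"] * 998 + ["cat"] * 2      # severely imbalanced toy data
predictions = ["dog"] * len(labels)       # "classifier" that ignores its input

accuracy = sum(t == p for t, p in zip(labels, predictions)) / len(labels)

print(f"accuracy           : {accuracy:.1%}")                                # 99.8%
print(f"recall, 'dog' class: {recall(labels, predictions, 'dog'):.1%}")      # 100.0%
print(f"recall, 'cat' class: {recall(labels, predictions, 'cat'):.1%}")      # 0.0%
```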
This toy example is clearly myopic in its focus on recall alone,
rather than a broader slate of considerations. However, this
myopia is often found even in more complex algorithmic
systems, with a focus on a limited metric (or small set of metrics)
often eclipsing other relevant concerns.
PREVIOUS RELATED WORK
As part of a much broader survey of the field of AI ethics, Green15 refers to concerns about "Goodharting" the explanation function; for instance, an algorithm may learn to produce answers that humans find appealing as opposed to accurate answers.
Work by Manheim and Garrabrant16 develops a taxonomy of four distinct failure modes that are all contained in Goodhart's law. Here Goodhart's law is framed in terms of a metric being chosen as a proxy for a goal, and the collapse that occurs when that proxy becomes a target: (1) regressional Goodhart, in which the difference between the proxy and the goal is important; (2) extremal Goodhart, in which the relationship between the proxy and the goal is different when taken to an extreme; (3) causal Goodhart, in which there is a non-causal relationship between the proxy and the goal, and intervening on one may fail to impact the other; and (4) adversarial Goodhart, in which adversaries attempt to game the chosen proxy. The authors then provide subcategories of causal Goodhart, based on different underlying relationships between the proxy and the goal, and subcategories of adversarial Goodhart, based on the behavior of the regulator and the agent. Mathematical formulations for the failure modes are also presented.
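As a rough numerical illustration of the first failure mode (our example, not drawn from the cited paper), the sketch below models the proxy as the goal plus independent noise and shows that the items selected for the highest proxy values systematically overstate the goal:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 100_000
goal = rng.normal(size=n)            # the quantity we actually care about
proxy = goal + rng.normal(size=n)    # a noisy proxy of the goal

# Select the items that look best according to the proxy (the top 1%).
top = np.argsort(proxy)[-n // 100:]

print("mean proxy among selected:", round(proxy[top].mean(), 2))
print("mean goal among selected :", round(goal[top].mean(), 2))
# The goal falls well short of the proxy for the selected items: optimizing
# the proxy hard overstates how much of the goal was actually obtained.
```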
A related issue is "reward hacking," in which an algorithm games its reward function, learning a "clever" way of optimizing the function that subverts the designer/programmer's intent. This is most often discussed in the field of reinforcement learning. For example, in a reimplementation of AlphaGo, the computer player learned to pass forever if passing was an allowed move. Victoria Krakovna17 maintains a spreadsheet with dozens of such examples. Although this is a subset of the types of cases we will consider, reward hacking typically refers to the algorithm's internal behavior, whereas here we are primarily interested in systems that involve a mix of human and algorithmic behavior.
Work by Jacobs and Wallach18 draws on the social sciences in framing how many harms caused by computational systems are the result of a mismatch between an unobservable theoretical construct (e.g., "creditworthiness," "teacher quality," "risk to society") and how it is operationalized. They offer a taxonomy of many forms of validity; a measurement model may fail if it is invalid along any of these dimensions. This work illustrates that it is not just that we overemphasize metrics, but that those metrics are often an invalid way of capturing the quality we care about.
The issue of over-optimizing single metrics is so worrisome that recent work19 has suggested replacing direct optimization with "q-quantilizers," which replace the deterministic highest-utility maximizer with random alternatives weighted by their utility. The idea purports to mitigate Goodharting, but it comes at the cost of lower utility and is sensitive to how well understood a "safe" base distribution of actions is. Slee13 goes further to state that in many cases, AI's reliance on pure incentive maximization is incompatible with actually achieving optimal outcomes for the intended use cases.
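A minimal sketch of the quantilizer idea (our own illustration under simplifying assumptions, not the cited paper's implementation): rather than deterministically choosing the action with the highest estimated utility, sample from the top q fraction of actions under a base distribution.

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Sample uniformly from the top-q fraction of actions by estimated utility,
    instead of returning the single utility-maximizing action."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])

# Hypothetical action set and a made-up proxy utility (possibly misspecified).
actions = list(range(100))
utility = lambda a: -(a - 70) ** 2  # proxy utility peaks at action 70

print("maximizer picks      :", max(actions, key=utility))
print("0.1-quantilizer picks:", quantilize(actions, utility, q=0.1))
```

The maximizer always lands exactly on the peak of the proxy, whereas the quantilizer spreads its choices across many near-optimal actions, which blunts extreme exploitation of the proxy at some cost in utility.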
Our goal here is to elucidate, through a series of real-world case studies, different characteristics of how Goodhart's law manifests, to examine how our online environment and current business practices exacerbate these failures, and to propose a framework for mitigating the harms caused by the overemphasis of metrics within AI.
WE CANNOT MEASURE THE THINGS THAT MATTER MOST
Metrics are typically just a proxy for what we really care about. Mullainathan and Obermeyer20 cover an interesting example: the researchers investigate which factors in someone's electronic medical record are most predictive of a future stroke, and they find that several of the most predictive factors (such as accidental injury, a benign breast lump, or colonoscopy) do not make sense as risk factors for stroke. It turned out that the model was simply identifying people who utilize health care a lot. The researchers did not actually have data on who had a stroke (a physiological event in which regions of the brain are denied oxygen); they had data on who had access to medical care, chose to go to a doctor, was given the needed tests, and had this billing code added to their chart. A number of factors influence this process: who has health insurance or can afford their co-pay, who can take time off work or find childcare, gender and racial biases that affect who gets accurate diagnoses, cultural factors, and more. As a result, the model was largely picking out people who utilized health care versus those who did not.
This is an example of the common phenomenon of having to use proxies: you want to know what content users like, so you measure what they click on. You want to know which teachers are most effective, so you measure their students' test scores. You want to know about crime, so you measure arrests. These things are not the same. Many things we do care about cannot be measured. Metrics can be helpful, but we cannot forget that they are just proxies. As another example, Google used hours spent watching YouTube as a proxy for how happy users were with the content, writing on the Google blog that "If viewers are watching more YouTube, it signals to us that they're happier with the content they've found."21 Guillaume Chaslot,22 founder of the independent watchdog group AlgoTransparency and an AI engineer who formerly worked at Google/YouTube, shares how this had the side effect of incentivizing conspiracy theories, because convincing users that the rest of the media is lying kept them watching more YouTube.
METRICS CAN, AND WILL, BE GAMED
It is almost inevitable that metrics will be gamed, particularly when they are given too much power. A detailed case study of "a system of governance of public services that combined targets with an element of terror," developed for England's health care system in the 2000s, covers many ways that gaming can occur.23 The authors detail many specific examples of how gaming manifested in the English health care system. For example, targets around emergency department wait times led some hospitals to cancel scheduled operations in order to draft extra staff to the emergency department, to require patients to wait in queues of ambulances, and to turn stretchers into "beds" by putting them in hallways. There were also significant discrepancies between the numbers reported by hospitals and those reported by patients; for instance, according to official numbers, 90% of patients were seen in less than 4 h, but only 69% of patients said they were seen in less than 4 h when surveyed.23
As education policy in the United States began overemphasizing student test scores as the primary way to evaluate teachers, there have been widespread scandals of teachers and principals cheating by altering students' scores in Georgia, Indiana, Massachusetts, Nevada, Virginia, Texas, and elsewhere.24 One consequence of this is that teachers who do not cheat may be penalized or even fired when it appears student test scores have dropped to more average levels under their instruction.4 When metrics are given undue importance, attempts to game those metrics become common.
Jacobs and Wallach18 formalize this notion, drawing on concepts from the social sciences, to illustrate how formal metrics are attempts to operationalize what are often unobservable theoretical constructs. One of the case studies included in their paper is the "value-added" model of measuring teacher effectiveness, which relies primarily on changes in student test scores from year to year. They highlight a number of ways in which this measurement model is invalid. The measure lacks test-retest reliability, with teachers' ratings often fluctuating wildly from year to year. It also fails convergent validity, because the measurements do not correlate with other measures of teacher effectiveness, such as assessments by principals, colleagues, and parents of students. Research19 showed that the value-added model gives lower scores to teachers at schools with higher proportions of students who are not white or are of low socioeconomic status, suggesting that this method is unfairly biased and will exacerbate inequality when used in decision making. This is a strong indication that the measure lacks consequential validity, in that it fails to identify these downstream consequences.
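As an illustration of what a test-retest reliability check looks like in code (with entirely made-up ratings, not the study's data), the sketch below correlates hypothetical value-added scores for the same teachers across two consecutive years; a low correlation is the kind of year-to-year instability described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical value-added scores for 200 teachers in two consecutive years,
# generated so that only a small share of the variance is a stable "true" signal.
true_effect = rng.normal(size=200)
year_1 = 0.4 * true_effect + rng.normal(size=200)
year_2 = 0.4 * true_effect + rng.normal(size=200)

r = np.corrcoef(year_1, year_2)[0, 1]
print(f"year-to-year correlation of ratings: {r:.2f}")  # low: ratings fluctuate
```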
A modern AI case study can be drawn from recommendation systems, which are widely used across many platforms to rank and promote content for users. Platforms are rife with attempts to game their algorithms in order to show up higher in search results or recommended content, through fake clicks, fake reviews, fake followers, and more.25 There are entire marketplaces for purchasing fake reviews, fake followers, and the like.
During one week in April 2019, Chaslot collected 84,695 videos from YouTube and analyzed the number of views and the number of channels from which they were recommended.26 The state-owned media outlet Russia Today (RT) was an extreme outlier in how often YouTube's algorithm had selected it to be recommended by a wide variety of other YouTube channels. According to Harwell and Timberg:26

Chaslot said in an interview that while the RT video ultimately did not get massive viewership—only about 55,000 views—the numbers of recommendations suggest that Russians have grown adept at manipulating YouTube's algorithm, which uses machine-learning software to surface videos it expects viewers will want to see. The result, Chaslot said, could be a gradual, subtle elevation of Russian views online because such videos result in more recommendations and, ultimately, more views that can generate more advertising revenue and reach.
Automatic essay grading software, currently used in at least 22 US states, focuses primarily on metrics such as sentence length, vocabulary, spelling, and subject-verb agreement, but it is unable to evaluate aspects of writing that are hard to quantify, such as creativity. As a result, gibberish essays randomly generated by computer programs to contain lots of sophisticated words score well. Essays from students in mainland China, which do well on essay length and sophisticated word choice, received higher scores from the algorithms than from expert human graders, suggesting that these students may be using chunks of pre-memorized text.5
METRICS OVEREMPHASIZE SHORT-TERM CONCERNS
It is much easier to measure short-term quantities: click-through rates, month-over-month churn, and quarterly earnings. Many long-term trends have a complex mix of factors and are tougher to quantify. While short-term incentives have led YouTube's algorithm to promote pedophilia,27 white supremacy,3 and flat-earth theories,28 the long-term impact on user trust will not be positive. Similarly, Facebook has been the subject of years' worth of privacy scandals, political manipulation, and facilitation of genocide,29 which is now having a longer-term negative impact on Facebook's ability to recruit new engineers.30
Simply measuring what users click on is a short-term concern
and does not take into account factors like the potential long-
term impact of a long-form investigative article that may have
taken months to research and that could help shape a reader’s
understanding of a complex issue and even lead to significant
societal changes.
The Wells Fargo account fraud scandal provides a case study of how letting metrics replace strategy can harm a business.31 After identifying cross-selling as a measure of long-term customer relationships, Wells Fargo went overboard emphasizing the cross-selling metric: intense pressure on employees, combined with an unethical sales culture, led to 3.5 million fraudulent deposit and credit card accounts being opened without customers' consent. The metric of cross-selling is a much more short-term concern than the loftier goal of nurturing long-term customer relationships. Overemphasizing metrics shifts our focus away from long-term concerns such as our values, trust, and reputation and our impact on society and the environment, and toward a myopic fixation on the short term.
MANY METRICS GATHER DATA ON WHAT WE DO IN HIGHLY ADDICTIVE ENVIRONMENTS
It matters which metrics we gather and in what environment we do so. Metrics such as what users click on, how much time they spend on sites, and "engagement" are heavily relied on by tech companies as proxies for user preference and are used to drive important business decisions. Unfortunately, these metrics are gathered in environments engineered to be highly addictive, laden with dark patterns, and in which financial and design decisions have already greatly circumscribed the range of options.
Although this is not a characteristic inherent to metrics, it is a current reality of many of the metrics used by tech companies today. A large-scale study analyzing approximately 11,000 shopping websites found 1,818 dark patterns present on 1,254 websites, 11.1% of the total.32 These dark patterns included obstruction, misdirection, and misrepresentation of user actions. The study found that more popular websites were more likely to feature these dark patterns.
Zeynep Tufekci (as quoted by Lewis) compares recommendation algorithms (such as YouTube choosing which videos to auto-play for you and Facebook deciding what to put at the top of your newsfeed) to a cafeteria shoving junk food into children's faces:33

This is a bit like an autopilot cafeteria in a school that has figured out children have sweet-teeth, and also like fatty and salty foods. So you make a line offering such food, automatically loading the next plate as soon as the bag of chips or candy in front of the young person has been consumed.

As those selections get normalized, the output becomes ever more extreme: "So the food gets higher and higher in sugar, fat and salt – natural human cravings – while the videos recommended and auto-played by YouTube get more and more bizarre or hateful."33
Too many of our online environments are like this, with metrics capturing that we love sugar, fat, and salt, not taking into account that we are in the digital equivalent of a food desert and that companies have not been required to put nutrition labels on what they are offering. Such metrics are not indicative of what we would prefer in a healthier or more empowering environment.
A FRAMEWORK FOR A HEALTHIER USE OF METRICS
All this is not to say that we should throw metrics out altogether. Data can be valuable in helping us understand the world, test hypotheses, and move beyond gut instincts or hunches. Metrics can be useful when they are kept in their proper context and place. We propose a few mechanisms for addressing these issues:

- Use a slate of metrics to get a fuller picture.
- Conduct external algorithmic audits.
- Combine metrics with qualitative accounts.
- Involve a range of stakeholders, including those who will be most impacted.
Use a slate of metrics to get a fuller picture and reduce gaming
One way to keep metrics in their place is to consider a slate of many metrics for a fuller picture (and to resist the temptation to boil these down to a single score). For instance, knowing the rates at which tech companies hire people from under-indexed groups is a very limited data point. To evaluate diversity and inclusion at tech companies, we also need to know comparative promotion rates, cap table ownership, retention rates (many tech companies are revolving doors, driving people from under-indexed groups away with their toxic cultures), the number of harassment victims silenced by nondisclosure agreements (NDAs), rates of under-leveling, and more. Even then, all these data should still be combined with listening to the first-person experiences of those working at these companies.
Likierman1 notes that using a diverse slate of metrics is one strategy for avoiding gaming:

It helps to diversify your metrics, because it's a lot harder to game several of them at once. [International law firm] Clifford Chance replaced its single metric of billable hours with seven criteria on which to base bonuses: respect and mentoring, quality of work, excellence in client service, integrity, contribution to the community, commitment to diversity, and contribution to the firm as an institution.

Likierman also cites the example of the Japanese telecommunications company SoftBank, which uses performance metrics defined over three distinct time horizons to make them harder to game.
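Returning to the earlier toy classifier, here is a short illustrative sketch (ours) of why a slate is harder to game: the constant "dog" classifier that looked perfect on majority-class recall collapses once per-class recall, minority-class precision, and balanced accuracy are reported alongside it.

```python
def confusion_counts(labels, predictions, positive):
    tp = sum(t == positive and p == positive for t, p in zip(labels, predictions))
    fp = sum(t != positive and p == positive for t, p in zip(labels, predictions))
    fn = sum(t == positive and p != positive for t, p in zip(labels, predictions))
    return tp, fp, fn

def slate(labels, predictions, classes):
    """Report several metrics at once instead of a single score."""
    report, recalls = {}, []
    for c in classes:
        tp, fp, fn = confusion_counts(labels, predictions, c)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        report[f"recall[{c}]"] = recall
        report[f"precision[{c}]"] = precision
        recalls.append(recall)
    report["balanced_accuracy"] = sum(recalls) / len(recalls)
    return report

labels = ["dog"] * 998 + ["cat"] * 2
always_dog = ["dog"] * len(labels)     # the gamed, constant classifier

for name, value in slate(labels, always_dog, ["dog", "cat"]).items():
    print(f"{name:20s} {value:.3f}")   # balanced accuracy of 0.5 exposes the gaming
```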
External algorithmic audits
To address the risk that revealing details of an algorithm's inner workings could lead to increased gaming, some have proposed external audits as a mechanism to provide accountability and independent verification while maintaining privacy.34–37 External audits conducted without companies' cooperation have been used effectively in investigative journalism, with many powerful examples from ProPublica34 and The Markup,38 and in academia, including the Gender Shades study,39 which led to real-world legislation. There is a growing group of independent firms that can be hired by companies for voluntary audits, but there are no consistent standards for what such audits involve. Currently, companies are not incentivized to find or correct their issues, and the audit firms they pay are not well incentivized to deliver negative findings.40 Just as tax audits that evaluate a company's adherence to a compliance process (and not just submitted financial documents) lead to better outcomes, effective algorithmic audits will need to consider the company's processes and not just final performance on a benchmark.41 Toothless and inconsistent audits run the risk of legitimizing harmful tech.42 Meaningful regulation, consistency, and consequences for harmful or discriminatory algorithms could make audits more impactful.37,40
In the context of discussing YouTube's and Facebook's recommendation systems, Zuckerman43 suggests external audits as an approach that could provide accountability while avoiding the gaming that open sourcing these algorithms would invite, drawing parallels with the financial industry:

Much in the same way that an accounting firm can come in and look at privileged information. It has a fiduciary responsibility for the company, so it's not going to share it around, but it also has professional responsibility to be bound by certain standards of accounting and to raise their hand when those things are being violated ... To me, in many ways, it may be more realistic than trying to open source these algorithms, which gets a little hard because it makes them highly gameable.
Combine with qualitative accounts
Columbia professor and New York Times Chief Data Scientist Chris Wiggins44 wrote that quantitative measures should always be combined with qualitative information: "Since we cannot know in advance every phenomenon users will experience, we cannot know in advance what metrics will quantify these phenomena. To that end, data scientists and ML engineers must partner with or learn the skills of user experience research, giving users a voice."
Proposals such as Model Cards for Model Reporting45 and Datasheets for Datasets46 can be viewed in line with this thinking. These works acknowledge that the metrics typically accompanying models, such as performance on a particular dataset, are insufficient to cover the many complex interactions that can occur in real-world use, as the model is applied to different populations and in different use cases, and as the use potentially veers away from the initial intent. Mitchell et al.45 and Gebru et al.46 propose documenting much richer and more comprehensive details about a given model or dataset, including more qualitative aspects, such as intended use cases, ethical considerations, underlying assumptions, caveats, and more.
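In that spirit, here is a purely illustrative sketch of what lightweight, machine-readable model documentation might look like; the field names and values are our own placeholders and do not follow the official Model Cards schema.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """An illustrative container for qualitative model documentation (hypothetical fields)."""
    model_name: str
    intended_use: str
    out_of_scope_use: str
    evaluation_metrics: dict   # quantitative results, ideally reported per subgroup
    ethical_considerations: str
    caveats: list = field(default_factory=list)

# All values below are invented placeholders for illustration only.
card = ModelCard(
    model_name="pneumonia-cxr-classifier (hypothetical)",
    intended_use="Triage support for chest X-rays at the hospitals whose data was used in training.",
    out_of_scope_use="Unsupervised diagnosis; deployment at hospitals absent from the training data.",
    evaluation_metrics={"recall": 0.91, "precision": 0.83, "recall_external_site": 0.74},
    ethical_considerations="Performance degrades on X-rays from hospital systems outside the training set.",
    caveats=["Metrics alone do not capture shifts in population or imaging hardware."],
)

print(card.model_name)
print(card.intended_use)
```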
Some of the most important qualitative information will come from those most impacted by an algorithmic system, including key stakeholders who are often overlooked.
Involve a range of stakeholders, including those who will be most impacted
Another key to keeping metrics in their proper place is to keep domain experts and those who will be most impacted closely involved in their development and use. Empowering a diverse group of stakeholders to understand the implications and underlying assumptions of AI models is one of the goals of Model Cards for Model Reporting.45 We suggest going even further and including these stakeholders in the initial development of these metrics in the first place.
Tool three in the Markkula Center's Ethical Toolkit for Engineering/Design Practice47 is to "expand the ethical circle" to include the input of all stakeholders. They suggest a number of questions to ask, including:

- Whose interests, desires, skills, experiences, and values have we simply assumed, rather than actually consulted? Why have we done this, and with what justification?
- Who are all the stakeholders who will be directly affected by our product? How have their interests been protected? How do we know what their interests really are—have we asked?
- Which groups and individuals will be indirectly affected in significant ways? How have their interests been protected? How do we know what their interests really are—have we asked?
Although its focus is on tech policy, the Diverse Voices paper48 provides a detailed methodology for eliciting the expertise and feedback of underrepresented populations, which would be useful in improving the design and implementation of metrics.
Recently there has been a push within ML to draw on concepts from participatory design to reframe how we build ML systems. At the International Conference on Machine Learning (ICML) 2020, a Participatory Approaches to Machine Learning (PAML) workshop was held for the first time. The workshop organizers highlighted how the fields of algorithmic fairness and human-centered ML often focus on centralized solutions, which increases the power of system designers, and called instead for "more democratic, cooperative, and participatory systems." Research at the workshop covered the importance of actionable recourse49 (not just explaining why a decision was made, but giving those affected actions that can change the outcome), contestability, and moving beyond fairness to shifting power.50 However, there were warnings42 that "participation is not a design fix for machine learning" and about the limitations of preference elicitation for participatory algorithm design.51
An additional argument for the importance of including a range of stakeholders comes from the fragile and elastic nature of DL algorithms, combined with the incompatible incentives the owners of these algorithms have to address this fragility.13 Specifically:

If subjects follow their incentives then the algorithm ceases to function as designed. To sustain their accuracy, algorithms need external rules to limit permissible responses. These rules form a set of guardrails which implement value judgments, keeping algorithms functioning by constraining the actions of subjects.

Slee proposes that external guardrails on how users are allowed to engage with the algorithms are a necessary corrective measure for the gaming and abuses cited above, but that the algorithm owners themselves are incentivized to create guardrails that do not align with their algorithms, an act of regulatory arbitrage.13 Given that guardrails are a restriction on the user, transparent and just corrective measures will depend on how effectively the ethical circle has been expanded in the design and creation of the guardrails. Slee13 gives the example of the incompatible incentives Facebook faces in addressing ethical issues with its News Feed algorithm, and suggests that journalists, who are not tempted by the financial incentives driving Facebook, would be better equipped to address them.
Although we cannot simply do away with metrics, the harms caused when metrics are overemphasized include manipulation, gaming, a focus on short-term outcomes to the detriment of longer-term values, and other harmful consequences, particularly in environments designed to exploit people's impulses and weaknesses, such as much of our online ecosystem. The unreasonable effectiveness of metric optimization in current AI approaches is a fundamental challenge to the field and yields an inherent contradiction: solely optimizing metrics leads to far from optimal outcomes. However, we provide evidence in this paper that a healthier use of metrics can be achieved by (1) using a slate of metrics to get a fuller and more nuanced picture; (2) externally auditing algorithms as a means of accountability; (3) combining metrics with qualitative accounts; and (4) involving a range of stakeholders, including those who will be most impacted. This framework may help address the core paradox of metric optimization within AI: not relying solely on metric optimization may lead to a more optimal use of AI.
DECLARATION OF INTERESTS
The authors declare no competing interests.
REFERENCES
1. Likierman, A. (2009). The five traps of performance measurement. Harv.
Bus. Rev. 87, 96–101.
2. Kaplan, R., and Norton, D. (1992). The balanced scorecard: measures that drive performance. Harv. Bus. Rev. 70, 71–79.
3. Ribeiro, M.H., Ottoni, R., West, R., Almeida, V.A.F., and Meira, W., Jr.
(2019). Auditing radicalization pathways on YouTube. Preprint at arXiv.
1–18, abs/1908.08313.
4. Turque, B. (2012). 'Creative ... motivating' and fired. Wash. Post. March 6, 2012.
5. Ramineni, C., and Williamson, D. (2018). Understanding mean score differences between the e-rater automated scoring engine and humans for demographically based groups in the GRE general test. ETS Res. Rep. Ser. 2018, 1–31.
6. Goodhart, C. (2015). Goodhart’s law. In The Encyclopedia of Central
Banking, L. Rochon and S. Rossi, eds. (Edward Elgar Publishing),
pp. 227–228.
7. Strathern, M. (1997). ‘Improving ratings’: audit in the British University sys-
tem. Eur. Rev. 5, 305–321.
8. Rosenlicht, M. (1968). Introduction to Analysis (Dover).
9. Schrage, M., and Kiron, D. (2018). Leading with next-generation key per-
formance indicators. MIT Sloan Manag. Rev. 16, 1–2.
10. Mitchell, T.M. (1997). Machine Learning (McGraw-Hill).
11. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning Book
(MIT Press).
12. Espeland, W.N., and Sauder, M. (2007). Rankings and reactivity: how pub-
lic measures recreate social worlds. Am. J. Soc. 113, 1–40.
13. Slee, T. (2019). The incompatible incentives of private sector AI. In Oxford
Handbook of Ethics of Artificial Intelligence, M. Dubber, F. Pasquale, and
S. Das, eds. (Oxford University Press).
14. Zech, J., Badgeley, M., Liu, M., Costa, A., Titano, J., and Oermann, E.
(2018). Variable generalization performance of a deep learning model to
detect pneumonia in chest radiographs: a cross-sectional study. PLoS
Med. 15, e1002683.
15. Green, B. (2018). Ethical reflections on artificial intelligence. Sci. Fides
6, 9–31.
16. Manheim, D., and Garrabrant, S. (2019). Categorizing variants of Good-
hart’s law. https://arxiv.org/abs/1803.04585.
17. Krakovna, V. (2019). Specification Gaming Examples in AI - master list: sheet 1. https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml.
18. Jacobs, A., and Wallach, H. (2019). Measurement and fairness. In Pro-
ceedings of the Conference on Fairness, Accountability, and Transpar-
ency (FAT* ’21) (ACM), pp. 375–385.
19. Taylor, J. (2016). Quantilizers: a safer alternative to maximizers for limited
optimization. In The Workshops of the Thirtieth AAAI Conference on Artifi-
cial Intelligence AI, Ethics, and Society, Technical Report WS-16-02.
20. Mullainathan, S., and Obermeyer, Z. (2017). Does machine learning auto-
mate moral hazard and error? Am. Econ. Rev. 107, 476–480.
21. Meyerson, E. (2012). YouTube now: why we focus on watch time. YouTube Creator Blog. August 10, 2012. https://youtube-creators.googleblog.com/2012/08/youtube-now-why-we-focus-on-watch-time.html.
22. Chaslot, G. (2018). How algorithms can learn to discredit the media. Medium. Feb 1, 2018. https://medium.com/@guillaumechaslot/how-algorithms-can-learn-to-discredit-the-media-d1360157c4fa.
23. Bevan, G., and Hood, C. (2006). What’s measured is what matters: targets
and gaming in the English public health care system. Public Adm. 84,
517–538.
24. Gabriel, T. (2010). Under pressure, teachers tamper with tests. New York
Times. June 10, 2010.
25. Tufekci, Z. (2019). The Imperfect Truth about Finding Facts in a World of
Fakes (Wired), March 2019.
26. Harwell, D., and Timberg, C. (2019). YouTube recommended a Russian
media site thousands of times for analysis of Mueller’s report, a watchdog
group says. Wash. Post, April 26, 2019.
27. Fisher, M., and Taub, A. (2019). On YouTube’s digital playground, an open
gate for pedophiles. New York Times. June 3, 2019.
28. Landrum, A. (2018). Believing in A Flat earth. In 2018 AAAS Annual Meeting
(AAAS), February 2018.
29. Vaidhyanathan, S. (2018). Antisocial Media: How Facebook Disconnects
Us and Undermines Democracy (Oxford University Press).
30. Bowles, N. (2018). ‘I don’t really want to work for Facebook.’ so say some
computer science students. New York Times. Nov. 15, 2018.
31. Harris, M., and Tayler, B. (2019). Don’t let metrics undermine your busi-
ness. Harv. Bus. Rev.
32. Mathur, A., Acar, G., Friedman, M., Lucherini, E., Mayer, J., Chetty, M.,
and Narayanan, A. (2019). Dark patterns at scale: findings from a Crawl
of 11K shopping websites. Proc. ACM Hum. Comput. Interact. 3, 32.
https://doi.org/10.1145/3359183.
33. Lewis, P. (2018). ‘Fiction is outperforming reality’: how YouTube’s algo-
rithm distorts truth. Guardian, Feb 2, 2018.
34. Angwin, J. (2016). Making algorithms accountable (ProPublica), Aug
1, 2016.
35. Zuckerman, E. (2020). The Case for Digital Public Infrastructure (Knight
First Amendment Institute at Columbia University), January 17, 2020.
36. Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., Khlaaf, H., Yang, J., Toner, H., Fong, R., et al. (2020). Towards Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. https://arxiv.org/abs/2004.07213.
37. Engler, A. (2021). Auditing employment algorithms for discrimination.
Brookings Inst. Rep.
38. Angwin, J. (2021). Our first year: auditing algorithms. Markup Hello World.
Feb 27, 2021.
39. Buolamwini, J., and Gebru, T. (2018). Gender Shades: intersectional accu-
racy disparities in commercial gender classification. In Proceedings of the
1st Conference on Fairness, Accountability and Transparency (PMLR),
pp. 77–91.
40. Ng, A. (2021). Can auditing eliminate bias from algorithms? Markup,
February 23, 2021.
41. Raji, I.D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., and Denton, E. (2020). Saving face: investigating the ethical concerns of facial recognition auditing. In AAAI/ACM AI Ethics and Society Conference.
42. Sloane, M., Moss, E., Awomolo, O., and Forlano, L. (2020). Participation is
not a design fix for machine learning. In Participatory Approaches to ML
Workshop at International Congress for Machine Learning 2020.
43. Zuckerman, E. (2020). Interview with Kevin Roose of the New York Times
(Reimagining the Internet (5). The Initiative for Digital Public Infrastructure
at UMass Amherst).
44. Wiggins, C. (2018). Ethical Principles, OKRs, and KPIs: What YouTube and Facebook Could Learn from Tukey (Columbia University Data Science Institute Blog). https://datascience.columbia.edu/ethical-principles-okrs-and-kpis-what-youtube-and-facebook-could-learn-tukey.
45. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., and Gebru, T. (2019). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19) (ACM), pp. 220–229.
46. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J., Wallach, H., Daumé, H., III, and Crawford, K. (2021). Datasheets for datasets. Communications of the ACM 64, 86–92. https://doi.org/10.1145/3458723.
47. Vallor, S., Green, B., and Raicu, I. (2018). Ethics in Technology Practice.
(The Markkula Center for Applied Ethics at Santa Clara University).
https://www.scu.edu/ethics/.
48. Young, M., Magassa, L., and Friedman, B. (2019). Toward inclusive tech policy design: a method for underrepresented voices to strengthen tech policy documents. Ethics Inf. Technol. https://doi.org/10.1007/s10676-019-09497-z.
49. Ustun, B., Spangher, A., and Liu, Y. (2018). Actionable recourse in linear
classification. In ACM Conference on Fairness, Accountability and Trans-
parency (FAT2019).
50. Watson-Daniels, J. (2020). Beyond fairness and ethics: towards agency
and shifting power. In Participatory Approaches to ML Workshop at Inter-
national Congress for Machine Learning.
51. Robertson, S., and Salehi, N. (2020). What if I don’t like any of the choices?
In The Limits of Preference Elicitation for Participatory Algorithm Design.
Participatory Approaches to ML Workshop at International Congress for
Machine Learning.
About the authors
Rachel L. Thomas, PhD, is a professor of practice at Queensland University of Technology and co-founder of fast.ai, which created one of the world's most popular deep learning courses. Previously, she was founding director of the Center for Applied Data Ethics at the University of San Francisco. Rachel earned her PhD in mathematics at Duke University and was selected by Forbes as one of 20 Incredible Women in AI. Follow her on Twitter at @math_rachel.
David Uminsky, PhD, is a Senior Research Associate in Computer Science and Executive Director of the University of Chicago's Data Science Institute. He was previously an Associate Professor of Mathematics and founding Executive Director of the Data Institute at the University of San Francisco (USF). David's research interests are in the areas of data science and applied mathematics. He was selected by the National Academy of Sciences as a Kavli Frontiers of Science Fellow. Prior to joining USF, David was a combined National Science Foundation (NSF) and University of California (UC) President's Fellow at UCLA, where he was awarded the Chancellor's Award for outstanding postdoctoral research.