Original Manuscript
Social Science Computer Review
2023, Vol. 0(0) 1–17
© The Author(s) 2023
Article reuse guidelines:
sagepub.com/journals-permissions
DOI: 10.1177/08944393231164329
journals.sagepub.com/home/ssc
Performative Quantification: Design Choices Impact the Lessons of Empirical Surveys About the Ethics of Autonomous Vehicles

Hubert Etienne (1,2,3) and Florian Cova (4)
Abstract
In recent years, researchers have emphasized the relevance of data about commonsense moral judgments for ethical decision-making, notably in the context of debates about autonomous vehicles (AVs). As such, the results of empirical studies such as the Moral Machine Experiment have been influential in debates about the ethics of AVs, and some researchers have even put forward methods to automatize ethical decision-making on the basis of such data. In this paper, we argue that data collection is not a neutral process, and that differences in study design can change participants’ answers and the ethical conclusions that can be drawn from them. After showing that participants’ individual answers are stable, in the sense that providing them with a second occasion to reflect on their answers does not change them (Study 1), we show that different conclusions regarding participants’ moral preferences can be reached when participants are given a third option allowing AVs to behave randomly (Study 2), and that preference for this third option can be increased in the context of a collective discussion (Study 3). We conclude that design choices influence the lessons that can be drawn from surveys about participants’ moral judgments about AVs and that these choices are not morally neutral.
Keywords
AI ethics, Autonomous vehicles, Empirical ethics, Moral judgements, Measurement, Surveys
1 Department of Philosophy, École Normale Supérieure, Paris, France
2 Laboratory of Computer Sciences (LIP6), Sorbonne Université, Paris, France
3 Facebook AI Research, Paris, France
4 Department of Philosophy, Université de Genève, Geneva, Switzerland
Corresponding Author:
Hubert Etienne, Department of Philosophy, École Normale Supérieure, 45 rue d’Ulm, Paris 75230, France.
Email: hae@meta.com
Introduction
Online platforms have become a common way to collect and quantify political and moral opinions to inform decision-makers, conduct research, or even automate decision-making. However, critical sociology has highlighted the non-neutrality of the quantification methods used for statistical analysis, resulting in numerous controversies that statisticians have debated (Desrosières, 2008). A famous example of this critique is Bourdieu’s claim that “public opinion does not exist” because opinion surveys are associated with political constructions strongly determined by their methodology, the questions’ framing, and their goals (Bourdieu, 1972). However, this call for prudence, which characterized the social sciences’ adoption of statistical methods, was unheeded by those who developed computational approaches to social choice based on machine-learning models. Indeed, despite relying on copious amounts of data about people’s moral and social preferences, these approaches have been particularly blind to the extent to which design choices and methodological limitations in surveys may shape researchers’ conclusions. Because such research typically aims to play a crucial role in shaping public debate and policy-making, our goal in this paper is to draw attention to the way design choices can impact the conclusions of computational approaches to social decision-making.
More precisely, we focus on recent attempts at automatizing social and moral decision-making around autonomous vehicles (AVs). Through three studies, we provide experimental evidence that subtle choices in survey design do impact participants’ replies, and hence the conclusions machine-learning algorithms would draw while trying to deduce a general picture of social preferences from their answers. Our results allow us to refute the conclusions of foundational experimental works about people’s moral opinions on AV dilemmas, demonstrating that more data does not necessarily lead to more accurate results, as attempts to aggregate moral opinions can fall into critical pitfalls. In doing so, we also suggest better designs for online survey makers seeking to collect accurate moral opinions.
The recent development of AVs has led academics to engage in numerous debates about the ethical questions raised by this technology. However, most of these debates have focused on how AVs should behave when they have to choose between courses of action that would all lead to inflicting harm on others. For example, should AVs decide to harm an animal rather than a human being? One person rather than two? A young person rather than an old person?
One reason these questions have drawn so much attention is that such dilemmas have been at the center of ambitious empirical studies. The Moral Machine Experiment (MME, Awad et al., 2018) presented participants with a series of dilemmas involving AVs, in which they had to choose whether an AV should go ahead and harm certain people, or swerve and injure other people. The authors collected c. 40 million decisions from people in 233 countries and territories, which probably makes it the largest available dataset on people’s intuitions about moral dilemmas. Based on this unprecedented wealth of data, Awad and colleagues found several patterns in people’s preferences, including preferences to save human beings rather than pets, several people rather than one person, young people rather than old people, and people who are healthy rather than people who are not.
As the authors of the MME have stressed in several places (e.g., Bonnefon, 2019), these conclusions are supposed to be descriptive, not prescriptive: they describe how people prefer AVs to be programmed, not how we should program them in the end. However, they still argue that these data are relevant to normative debates about the way AVs should be programmed. A “modest” argument in favor of this relevance relies on pragmatic considerations: if policymakers want people to adopt AVs, then they should make sure that the behavior of AVs does not conflict with people’s sense of morality. For example, Awad and colleagues (2018) write that “we need to have a global conversation to express our preferences to the companies that will design moral algorithms, and to the policymakers that will regulate them,” and that “we can embrace the challenges of machine ethics as a unique opportunity to decide, as a community, what we believe to be right or wrong; and to make sure that machines, unlike humans, unerringly follow these moral preferences.” Thus, the results of the MME are morally relevant because they allow people to express their preferences in the context of a collective ethical decision about the way AVs should be programmed.
A more “ambitious” version of this approach proposes to automatize ethical decision-making by collecting people’s judgments about such ethical issues and using aggregation methods to reach “credible” ethical decisions. Thus, Noothigattu et al.’s (2018) Voting-based system (VBS) aims to automatize moral decision-making in the context of dilemmas involving AVs: rather than providing AVs with general ethical principles people have agreed upon, we should simply provide them with people’s opinions on ethical dilemmas and have the AV learn from these individual choices to make its decisions. This more ambitious proposal has been resisted on several grounds. For example, it has been criticized for being based on morally fallacious methodological axioms—for example, endorsing Conitzer and colleagues’ (2017) assumption that aggregating moral agents’ judgments “may result in a morally better system than that of any individual human, for example, because idiosyncratic moral mistakes made by individual humans are washed out in the aggregate” (Etienne, 2021; Greene et al., 2016).
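To make the aggregation logic concrete, the following is a minimal sketch in R using simulated data; the respondent count and proportions are hypothetical. Note that it reduces the idea to a raw majority vote over a single dilemma, whereas Noothigattu et al.’s actual VBS learns per-voter preference models, so this should be read as an illustration of the general principle rather than their implementation.

```r
# Minimal sketch (simulated data): deriving a "collective" decision from
# individual answers by majority vote. This simplifies the VBS idea; the
# actual system of Noothigattu et al. (2018) learns per-voter utility models.
set.seed(1)

# 1,000 hypothetical respondents choosing "continue" (0) or "swerve" (1)
votes <- rbinom(n = 1000, size = 1, prob = 0.55)

# The aggregate decision is whichever option wins the majority of votes
decision <- if (mean(votes) > 0.5) "swerve" else "continue"
decision      # "swerve"
mean(votes)   # ~0.55: the narrowness of the margin plays no role in the output
```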
Despite their differences, both approaches rest on the assumption that the data collected in the context of the MME and similar studies accurately reflect participants’ attitudes—and, more importantly, the type of attitudes that might be relevant to the public reception of AVs. However, the behavioral economics literature on nudges reminds us of the great sensitivity of respondents’ replies to survey design (Thaler & Sunstein, 2008), emphasizing the critical importance of collecting responses that accurately reflect people’s opinions. As such, there are at least five dimensions on which the kind of data collected by the type of approach illustrated by the MME might fall short of being relevant to ethical decision-making.
A first dimension is perspective (PT). Indeed, past research has suggested that moral intuitions can be shaped by the point of view from which we approach moral issues. For example, research on moral dilemmas suggests that participants’ intuitions might depend on whether they approach a problem from a first- or third-person perspective (Nadelhoffer & Feltz, 2008; Tobia et al., 2013; but see Cova et al., 2021 for a failure to replicate). Similarly, research on moral and political reasoning suggests that people’s judgments can be modified by asking them to take certain specific perspectives on moral and political issues—such as the perspective of an “impartial spectator” (see Allard & Cova, forthcoming, for a review of this research). In the context of AVs, Bonnefon and colleagues (2016) observed that, though participants were supportive of AVs that might sacrifice passengers to save others, they were reluctant to ride in vehicles designed to follow this moral principle. Moreover, Frank and colleagues (2019) found that participants were less likely to answer that AVs should sacrifice their passengers when asked to adopt the perspective of an AV passenger (compared to the perspective of a pedestrian or observer). This suggests that what seems acceptable might depend on the perspective participants adopt when reflecting on AVs. However, we do not know which perspective participants naturally endorse when participating in such studies, and whether this perspective is relevant to public discussion at large.
A second dimension is deliberation time (DT). Past research on moral intuitions has emphasized that our responses to moral dilemmas sometimes pit against each other a quick, intuitive answer and a slower, reflective one (Greene, 2014). In line with this “dual-process” approach to moral judgment, it has been shown that people’s responses to moral dilemmas can change when they are asked to take time to reflect before answering (Capraro et al., 2019; Suter & Hertwig, 2011). Accordingly, recent research suggests that people’s responses to moral dilemmas involving AVs can change depending on whether they are asked to answer quickly or not: in a dilemma asking participants whether the AV should sacrifice pedestrians or its passengers, participants who had to answer quickly (within 5s) were more likely to answer that it should sacrifice its passengers (Frank et al., 2019). However, quick answers are not necessarily those relevant to public debates, in which people are supposed to take the time to reflect and ponder different factors. Additionally, most research on people’s judgments about AVs probes participants’ intuitions when they are first exposed to a certain problem. But people engaged in public deliberation are more likely to be repeatedly exposed to the questions they are asked to answer, giving them more time to deliberate. Thus, it might be that answers collected by the MME only reflect quick, intuitive answers based on a single exposure, and not the slower, more reflective answers based on repeated exposure that are more relevant to public debate.
A third dimension is whether people reflect on abstract principles or concrete cases (AvC). While public debates about AVs are likely to be framed in terms of abstract principles (e.g., “should we take age into consideration?”), most studies have focused on participants’ responses to particular cases. However, past research in moral psychology and experimental philosophy has emphasized the fact that people can give radically different answers depending on whether questions are about abstract principles or concrete cases (Freiman & Nichols, 2011; Nichols & Knobe, 2007; Sinnott-Armstrong, 2008; Struchiner et al., 2020).
A fourth dimension is the number of options, and more precisely the presence of a third option (3O). Past research has shown that introducing a third option in a moral dilemma can dramatically switch participants’ moral preferences (Wiegmann et al., 2020). In the context of moral dilemmas involving AVs, Bigman and Gray (2020) have shown that introducing a third option allowing the AV to make a decision at random drastically reduced the relevance of certain factors such as age, gender, or social status. Awad and colleagues (2020) argued that participants’ answers were biased by the formulation of Bigman and Gray’s third option, which made it more ethically attractive (“Treat the lives of X and Y equally”). However, Bigman and Gray also ran a study in which the third option was formulated in a more neutral way (“To decide who to kill and who to save without considering whether it is X or Y”), and participants still showed a wide preference for this option. Awad and colleagues (2020) also argued that, when their participants were given the opportunity to indicate their preference between two options using a slider, very few chose to put the slider in the middle, which would indicate indifference between the two options. However, this counterargument rests on a confusion: preferring that the choice between A and B be random is not the same as being indifferent between A and B if one is forced to choose between them. This confusion is based on the assumption that a preference for a choice at random must be grounded in indifference, while it is more likely to be grounded in a moral preference for impartiality. Finally, recent studies have shown that participants are much more likely to be outraged by AVs that make decisions based on criteria such as age, gender, or social status than by AVs that make decisions at random (De Freitas & Cikara, 2021). This suggests that allowing participants to have AVs programmed to choose at random might lead to a very different picture of public consensus compared to the two-option method used by studies such as the MME.
A fifth and last dimension is the presence of a series of objections and a collective discussion (DISC). Most people arguing for the relevance of empirical approaches to ethical debates on AVs argue that such methods allow non-experts to take part in the collective discussion about the ethics of AVs. However, most of the time, the data collected do not reflect the outcome of a public discussion, but the aggregation of individual opinions formed in isolation. Yet psychological studies show that the outcome of collective discussions is not similar to the mere aggregation of individual answers, and that public discussion generally leads to better collective outcomes (Balliet, 2010). This is why some have argued that discussion might lead people to converge on “better moral judgments” (Mercier, 2011). Thus, if one’s goal is to identify the moral principles on which people would converge, it might be more interesting to collect judgments that have been formed as the result of a discussion and challenged by the review of objections, rather than in isolation.
For all these reasons, public opinion as measured through studies such as the MME might not be the most relevant to ethical decision-making about AVs, and changing some of the aforementioned parameters might lead people to converge on very different options. If this is the case, then programming AVs so that they learn to make decisions based on the answers people gave to the MME might not lead AVs to make “credible” decisions, but rather to go against the ethical principles that are likely to be endorsed as the result of public discussion. To investigate this possibility, we conducted three exploratory studies. In the first study, we explored the impact of PT and DT, and had people answer dilemmas about AVs quickly and slowly. In the second study, we explored the joint impact of AvC, 3O, and DISC by comparing participants’ answers when collected according to the methods of the MME and when collected at the conclusion of a collective discussion on abstract principles including the possibility of the third option. In the third study, we explored how robust participants’ preference for this third option is, and whether it is favored by collective discussion.
Study 1: The Effect of Time Constraint on Participants’Judgments
First, we wanted to determine to what extent answers collected following the MME methods were robust, or changed (i) depending on the perspective participants were asked to adopt, and (ii) when participants were given more time to reflect on them.
Materials and Methods
We used an experimental paradigm similar to the one used by the MME. Participants were presented with 16 (+1 control) moral dilemmas in which an AV experiences a brake failure, preventing it from stopping safely in time. Respondents then had to choose whether the AV should keep straight or swerve into the other lane, resulting in various consequences involving at least one individual’s death (see Figure 1). Each scenario required respondents to arbitrate between different categories of victims according to nine criteria: gender (male vs. female), age (young vs. adult vs. aged), body size (fat vs. fit), social status (executive vs. homeless, doctor vs. criminal), nature (humans vs. pets), number (1 vs. 2 vs. 3 vs. 5), role in the dilemma (pedestrian vs. AV passengers), and lawfulness (jaywalkers vs. lawful pedestrians) (see Table 1 for the full list of combinations). Participants had to indicate their answer by choosing between two options: Continue (Option 1) or Swerve (Option 2).
Figure 1. Example of scenario in Studies 1 and 2
Participants were presented with the full set of scenarios twice. For the first presentation (Set 1), participants were instructed to answer questions “as quickly as possible.” For the second presentation (Set 2), they were instructed to “take time to think about their answer.” To force them to take time to think, there was an invisible time counter at the bottom of the vignette, and participants could not submit their answer before the counter reached zero. To investigate whether the effect of a second exposure on participants’ answers would depend on how much time they were asked to think, the time they had to wait before answering varied across participants (10s, 20s, 30s, or 40s).
The way questions were framed varied across participants. One fourth of participants were
asked to answer as if they were designers of AVs, another fourth as if they were citizens answering
a national public consultation on AVs, and another fourth as if they were policymakers preparing
the regulation for the self-driving industry. The last fourth did not receive specific instructions.
Results
Participants were US and UK residents recruited through Prolific Academic. After excluding 46 participants who failed the attention check, we were left with 608 participants (278 women, 324 men, 6 “others”; M_age = 30.69, SD_age = 11.22).
Perspective-Taking. We first assessed the effect of perspective-taking on participants’ answers. For each vignette, we compared the distribution of participants’ answers across all four conditions using chi-square tests. Results are presented in Table S1 in Supplementary Materials. As can be seen, we found no significant effect of condition.
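For readers who wish to reproduce this kind of analysis, the following is a minimal sketch in R of one per-vignette test, using simulated data; the variable names and proportions are ours, not those of the actual dataset.

```r
# Sketch (simulated data) of one per-vignette test: a chi-square test of the
# Continue/Swerve distribution across the four framing conditions (df = 3).
set.seed(42)
answers <- data.frame(
  condition = sample(c("none", "designer", "citizen", "policymaker"),
                     608, replace = TRUE),
  choice    = sample(c("continue", "swerve"), 608, replace = TRUE)
)
chisq.test(table(answers$condition, answers$choice))
```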
Table 1. Percentage of participants who chose “Option 2” at the first presentation of each question, for each group and each case (Study 1). The rightmost column indicates the result of a chi-square test (df = 3) comparing the distribution of answers between groups. Case 5 does not appear in the table, as it was an attention check. Cases 14 and 15 do not appear because, due to a programming error, their consequences varied across presentations.

| Scenario | No instruction | AV designer | Citizen | Policymaker | Chi-square |
|---|---|---|---|---|---|
| 1 | 17.0% | 18.1% | 11.7% | 10.8% | p = .17 |
| 2 | 57.4% | 57.2% | 55.0% | 56.3% | p = .97 |
| 3 | 19.9% | 23.2% | 23.4% | 19.0% | p = .70 |
| 4 | 7.8% | 5.8% | 7.0% | 7.0% | p = .93 |
| 6 | 26.2% | 33.3% | 23.4% | 22.8% | p = .15 |
| 7 | 8.5% | 7.2% | 4.1% | 5.7% | p = .41 |
| 8 | 92.2% | 93.5% | 86.5% | 87.3% | p = .12 |
| 9 | 62.4% | 69.6% | 58.5% | 60.1% | p = .21 |
| 10 | 95.0% | 96.4% | 90.1% | 91.1% | p = .10 |
| 11 | 27.7% | 24.6% | 28.1% | 25.3% | p = .88 |
| 12 | 68.1% | 67.4% | 59.1% | 62.0% | p = .29 |
| 13 | 66.0% | 61.6% | 57.9% | 56.3% | p = .33 |
| 16 | 26.2% | 31.9% | 26.9% | 27.8% | p = .72 |
| 17 | 39.7% | 43.5% | 43.9% | 39.2% | p = .77 |
| N | 141 | 138 | 171 | 158 | — |

First vs. Second Exposure. Participants’ average response time was 10.7 seconds (SD = 5.52) for the first set of vignettes and 30.1 seconds (SD = 16.03) for the second set. Whether participants had to wait 10, 20, 30, or 40 seconds before answering the second set of vignettes did not significantly impact the percentage of participants who changed their mind between the first and second set (see Table S2). We thus compared participants’ answers between the first and second exposure without considering the time participants were given to answer the second set. The results of these comparisons can be found in Table 2. As one can see, participants’ answers were quite stable: out of 14 scenarios, only two showed statistically significant differences in participants’ answers between the first and second exposure. Moreover, changes in participants’ answers were quite small (around 3%). Finally, in both cases, reflection tended to simply reinforce the tendency that was already observable in Set 1 rather than leading participants in another direction.
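The paired comparison behind Table 2 can be sketched as follows in R, again with simulated data; the change rate is an assumption chosen for illustration.

```r
# Sketch (simulated data) of the paired Set 1 vs. Set 2 comparison in Table 2:
# McNemar's chi-square test (df = 1) on each participant's two answers.
set.seed(7)
set1 <- rbinom(608, 1, 0.56)              # 1 = "Swerve" at first exposure
flip <- rbinom(608, 1, 0.05)              # assume ~5% of answers change
set2 <- ifelse(flip == 1, 1 - set1, set1) # answers at second exposure
mcnemar.test(table(Set1 = set1, Set2 = set2))
```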
Table 2. Percentage of participants who chose Option 2 (“Swerve”) at the first and second presentation of each scenario (Study 1). For each scenario, we indicate the criteria that vary across the two options and the consequence of each choice. The rightmost column indicates the result of a McNemar chi-square test (df = 1) comparing the distribution of answers between the two sets. Case 5 does not appear in the table, as it was an attention check. Cases 14 and 15 do not appear because, due to a programming error, their consequences varied across presentations.

| Scenario | Criterion | Option 1 (Continue) | Option 2 (Swerve) | Set 1 (%) | Set 2 (%) | McNemar test |
|---|---|---|---|---|---|---|
| 1 | Gender | 2 men | 2 women | 14.1 | 14.1 | p = 1.00 |
| 2 | Body size | 1 athletic man, 1 athletic woman | 1 obese man, 1 obese woman | 56.4 | 59.2 | p = .04* |
| 3 | Status | 1 homeless man, 1 homeless woman | 1 executive man, 1 executive woman | 21.4 | 17.8 | p < .001* |
| 4 | Status | 1 criminal man, 1 criminal woman | 1 male doctor, 1 female doctor | 6.9 | 5.4 | p = .08 |
| 6 | Gender, Age | 1 man, 1 old woman | 1 woman, 1 old man | 26.2 | 24.7 | p = .38 |
| 7 | Gender, Age | 1 old woman, 1 man | 1 woman, 1 young boy | 6.3 | 7.1 | p = .35 |
| 8 | Number | 2 women | 1 woman | 89.6 | 90.8 | p = .27 |
| 9 | Role | 1 woman, 1 man | 1 female passenger, 1 male passenger | 62.3 | 65.1 | p = .07 |
| 10 | Number | 3 men, 2 women | 1 man, 1 woman | 92.9 | 93.4 | p = .30 |
| 11 | Role | 1 female passenger, 1 male passenger | 1 woman, 1 man | 26.5 | 26.6 | p = .90 |
| 12 | Role, Number | 1 woman, 1 man | 1 male passenger | 63.8 | 65.8 | p = .15 |
| 13 | Species, Role, Norm | 1 jaywalking man | 1 passenger pet | 60.2 | 61.3 | p = .47 |
| 16 | Gender, Role | 1 woman | 2 women passengers | 28.1 | 30.1 | p = .23 |
| 17 | Age, Role | 1 old woman, 1 old man | 1 female passenger, 1 male passenger | 41.6 | 41.8 | p = .92 |

Discussion

In this first study, we investigated whether people’s answers to AV dilemmas were robust against two potential sources of variation: (i) perspective-taking, and (ii) repeated exposure providing participants with more time to reflect on their answers. Overall, we did not observe a significant effect of perspective-taking. However, our results are not at odds with the previous literature, which focused on the passenger vs. pedestrian perspective. Here, we were more interested in perspectives that were directly relevant to the public debate. The fact that participants in the “no instruction” condition did not significantly differ from those in the “citizen” or “policymaker” conditions suggests that the perspective participants naturally endorse does not impact their answers in a way that makes them irrelevant to public deliberation.
Presenting participants a second time with the AV dilemmas and forcing them to take time to reflect on their answer did not substantially alter their responses either. This absence of effect might seem at odds with previous results (Capraro et al., 2019; Suter & Hertwig, 2011). However, it should be noted that these studies were not concerned with AV dilemmas (but with dilemmas involving human agents) and that they used a different method: they compared participants who were exposed a single time to dilemmas (and manipulated the time of this first exposure). The only study that focused on AV dilemmas was the one by Frank and colleagues (2019), but that study contrasted very short response times (<5s) with more “usual” ones (<30s). Because we were interested in contrasting participants’ “normal” answers to MME-style experiments with longer public deliberation, which typically involves several exposures, we asked participants to think longer than usual, whereas Frank and colleagues asked them to think faster than usual. Our results are thus compatible. Still, our results suggest that giving participants an occasion to reflect more on their answer by presenting them a second time with a given AV dilemma did not make a substantial difference, and thus that their answer to the first presentation is already robust.
Study 2: Participants’ Judgments After Collective Discussions About the Moral Relevance of Different Factors
In Study 2, our goal was to study participants’ judgments about the ethics of AVs in a context that would be closer to a public deliberation about the principles of AV ethics. That is, participants were asked to (i) have a collective discussion (DISC) about (ii) general abstract principles (AvC). Because our means were limited, we did not systematically manipulate these factors but rather introduced them all at once, to see whether the conclusions of this study would differ from those yielded by Study 1 and MME-style experiments at large.
Materials and Methods
At the beginning of the study, participants were presented with the same 16 + 1 scenarios as in Study 1 and asked to answer them “as quickly as possible.” This was done to acquaint participants with the kinds of dilemmas that are considered relevant in debates about the ethics of AVs.

After that, participants were immediately invited to join a video call to engage in a collective online discussion with seven to fifteen other participants (discussions lasted around 15 minutes). Participants were asked to discuss the relevance of nine criteria: gender (male vs. female), age (young vs. adult vs. aged), body size (fat vs. fit), social status (executive vs. homeless, doctor vs. criminal), nature (humans vs. pets), number (1 vs. 2 vs. 3 vs. 5), role in the dilemma (pedestrian vs. AV passengers), and lawfulness (jaywalkers vs. lawful pedestrians) (see Box 1). For each question, respondents were asked whether they thought the criterion was “morally relevant to make life arbitrations in such situations” and, if so, which category should be sacrificed to spare the other (e.g., sacrifice men to save women). After the collective discussion, each participant was asked to indicate, for each criterion, whether the criterion was morally relevant or whether it was morally irrelevant and AVs should be programmed to choose at random, without taking this criterion into account (see Box S1 in Supplementary Materials for the exact wording). Half of the participants were also asked how confident they were about their reply (0 = Not confident at all, 1 = Not much confident, 2 = Quite confident, 3 = Very confident), and how much they understood that someone might feel differently about this criterion (0 = Not at all, 1 = Not much, 2 = Quite, 3 = Very much).
Box 1. Instructions for Collective Discussion (Study 2)

Some of you answered that, everything else being equal, [X] should be saved over [Y], and others replied the opposite. Please use the next 90 seconds to discuss whether you think [Z] is a morally relevant criterion to make such arbitrations, or not. And if so, should [X] or [Y] be spared, and why?

1. X = “men,” Y = “women,” Z = “gender”
2. X = “younger people,” Y = “older people,” Z = “age”
3. X = “fit,” Y = “larger,” Z = “body size”
4. X = “people with higher social status,” Y = “people with lower social status,” Z = “social status”
5. Some of you answered that humans should always be saved over pets, while others replied it may depend on the situation. Do you think some circumstances may allow for exceptions or not?
6. Some of you answered that, everything else being equal, lawful drivers and pedestrians should be saved over jaywalkers, and others replied it should not make a difference. Do you think abidance by the law is a morally relevant criterion to make such arbitrations? And if so, would you allow for exceptions to this?
7. X = “pedestrians,” Y = “AV passengers,” Z = “that this”
8. Some of you answered that, everything else being equal, the AV’s action of swerving versus keeping straight is morally relevant, while others replied that it does not matter. Would you change any of your replies if reaching the same outcome implied the AV swerving instead of keeping straight, and why?
9. Some of you answered that the AV should always be operated in a way to save the greater number of people, while others disagree, arguing that it depends on the situation, which should be assessed based on the criteria previously discussed. What do you think about it?
Results
Participants were US and UK residents recruited through Prolific Academic. After excluding 10 participants who failed the attention check, we were left with N = 190 participants (96 women, 94 men; M_age = 31.4, SD_age = 10.8).

Participants’ judgments about the relevance of our nine criteria are presented in Table 3.

Table 3. Percentage of participants who rated each criterion as morally relevant, along with participants’ average confidence in their answer and ratings of how much they understand that someone can think differently (Study 2). Standard deviations in parentheses.

| Criterion | Relevant (%) | Confidence | Understanding |
|---|---|---|---|
| Gender | 24.2 | 2.57 (.60) | 2.03 (.85) |
| Age | 65.3 | 2.28 (.58) | 2.21 (.71) |
| Body size | 19.5 | 2.48 (.72) | 1.89 (.96) |
| Social status | 26.3 | 2.39 (.67) | 1.99 (.94) |
| Species | 86.8 | 2.66 (.58) | 1.73 (1.05) |
| Conformity to law | 59.5 | 2.27 (.75) | 2.02 (.87) |
| Role (pedestrian vs. passenger) | 65.2 | 2.20 (.73) | 2.06 (.74) |
| Going straight vs. swerving | 55.3 | 2.16 (.72) | 1.91 (.75) |
| Number | 92.6 | 2.54 (.54) | 1.52 (.93) |
Discussion
Our results suggest that there was a strong consensus among our participants on the relevance of two criteria: number of persons saved (saving the most people) and species (saving humans rather than pets). Two-thirds of participants also considered age to be a relevant criterion. This is in line with the results of the MME, which suggested that these criteria had the most weight. However, there was also a strong consensus on certain criteria being morally irrelevant: gender, body size, and social status. For these criteria (which are the ones Etienne, 2021, identified as the least morally relevant), most participants considered it best to leave the AV’s decision to chance.

However, training AVs to make decisions based on the data collected by the MME would lead AVs to take such criteria into account, leading them to go against the perceived consensus. This is because the methodology of the MME only allows participants to signal indifference between two outcomes, and not to express their commitment to impartiality and their preference for random choices. Thus, an AV trained on the kind of data we collected would behave in a substantially different way from an AV trained on the data collected by the MME. This means that design choices can substantially influence the outcome of data-driven, automated decision-making.

One question raised by our study is what drives people’s judgment that criteria such as gender, body size, and social status are irrelevant. Is it only the fact of offering participants a third option that allows them to express their preference for random choice? Or did the fact that we presented participants with abstract principles (rather than concrete cases), and that we offered them the possibility to discuss with each other, play a role? In Study 3, we investigated the impact of introducing a third option (random choice) in concrete cases, rather than in abstract ones. Moreover, we collected participants’ answers before and after collective discussions, to determine to what extent participating in a collective discussion led participants to favor this third option.
Study 3: The Impact of Collective Discussion on Participants’
Preference for Random Choice
In Study 3, we still offered participants a third option (random choice) but presented this option in the
context of concrete scenarios rather than abstract principles. We then had participants read arguments
against the relevance of several criteria and engage in a collective discussion with other participants.
Materials and Methods
Participants were US and UK residents recruited through Prolific Academic. We asked 331 participants, of whom 324 passed the attention check, to address the same set of 11 randomized dilemmas (+1 control question) three different times (Sets 1, 2, and 3). Between Set 1 and Set 2, participants were presented with seven objections to the main arguments that were brought up in the group discussions of Study 2, based on Etienne’s (2022) counterarguments. For example, the objection against the relevance of gender to AVs’ decisions was:

You may think that gender is a morally relevant criterion here. If so, and to be consistent with your answer, you should be ready to either state that white people should be spared versus black people or the contrary, that Muslims should be spared versus Catholics or the contrary, that homosexuals should be spared versus heterosexuals or the contrary, or to explain what makes gender different from skin colour, religious belief and sexual orientation so that the former one is morally relevant here whereas the others are not.

(All seven objections can be found in Supplementary Materials.) After each objection, participants were asked to rate the objection’s strength on a 5-point scale. Between Set 2 and Set 3, they participated in a group discussion to express and justify their replies (as in Study 2).
Contrary to the abstract principles we used in Study 2, the concrete cases we used in Studies 1 and 2 present one disadvantage: when participants choose the option “Keep straight,” we do not know whether this choice reveals a preference for sparing the people in the other lane, or a mere preference for inaction. As we saw in Study 2, 44.7% of participants answered that they considered this a relevant criterion. To correct for this shortcoming, we used concrete cases in which participants had to choose between “turning left” and “turning right” (or “choosing at random”). An example is presented in Figure 2.
Figure 2. Example of scenario in Study 3
Finally, in Study 2, we saw that the third option (“choosing randomly”) was the option most often selected for certain criteria. However, one could object that participants might be drawn towards this answer because they feel it does not need to be justified (contrary to other answers). To test for this, a third of participants were asked to provide a justification for their answers in all three sets (JUST). Another third received no particular instruction (CONTROL), and the last third were asked to communicate a degree of confidence (DOC) in their replies (“how confident do you feel about your reply?”) as well as a score of perceived consensus (“how much do you think that others would agree with you?”) for each of the three sets.
Results

Frequency of “Random” Choice. The percentage of participants choosing the “random” option for each vignette and each presentation (Set 1, Set 2, or Set 3) can be found in Table 4. As can be seen, we found a pattern of answers similar to the one we observed in Study 2: participants tended to see Species, Number, Role, and Conformity to law (Norms) as relevant factors, but rated Gender, Body size, and Social status as irrelevant factors. The main difference was that participants tended to rate Age as an irrelevant factor after discussion (Set 3), while they tended to rate it as relevant in Study 2. Overall, this suggests that the pattern of answers we observed in Study 2 (and that challenged the conclusions of the MME) cannot be explained only by the fact that we presented choices in an abstract way rather than in a concrete way—though we cannot exclude the possibility that presenting choices in an abstract or concrete way might affect participants’ choices (for a direct comparison of the results of Studies 2 and 3, see Table S2 in Supplementary Materials).
Table 4. Percentage of participants selecting the “Random” option in each set (1–3) and vignette. * indicates the result of a McNemar test comparing the percentage of participants selecting the “Random” option between Set 1 and Set 3.

| N° | Criterion | Side 1 | Side 2 | Set 1 (%) | Set 2 (%) | Set 3 (%) |
|---|---|---|---|---|---|---|
| Question 1 | Gender | Man | Woman | 75.0 | 81.8 | 85.2*** |
| Question 2 | Age | Young girl | Old woman | 45.4 | 56.8 | 56.5*** |
| Question 3 | Body size | Obese woman | Athletic woman | 63.3 | 70.4 | 80.2*** |
| Question 4 | Status | 2 homeless men | 2 executive men | 74.7 | 79.6 | 85.5*** |
| Question 5 | Role | 1 pedestrian | 1 passenger | 30.6 | 33.6 | 38.3** |
| Question 6 | Number | 2 young girls | 1 young girl | 21.3 | 31.2 | 25.0 |
| Question 7 | Role, Species | 1 passenger | 1 pet | 8.6 | 11.4 | 10.8 |
| Question 8 | Role, Norm | 1 jaywalking man | 1 passenger | 19.1 | 21.6 | 21.9 |
| Question 9 | Role, Number | 2 pedestrians | 1 passenger | 13.6 | 14.5 | 14.5 |
| Question 11 | Norm, Age | 1 jaywalking young girl | 1 woman | 37.3 | 36.4 | 43.8* |
| Question 12 | Norm, Species, Role | 1 jaywalking man | 1 passenger pet | 8.6 | 10.2 | 8.3 |

Effect of Condition. We used 11 chi-square tests to investigate the impact of condition (CONTROL, DOC, and JUST) on the distribution of participants’ answers to Set 1. Out of 11 tests, only the one for vignette 3 (obese woman vs. athletic woman) came out significant. However, this was not because the percentage of “random” answers significantly varied across conditions (p = .15), but because participants asked to justify their answer were more likely to choose to kill the athletic woman (see Table S3 in Supplementary Materials). Thus, participants’ tendency to choose the “random” option was robust and remained even when participants were asked to justify their answer.
Effect of Objection and Discussion. We compared participants’ answers across the three sets (Set 1: initial answers; Set 2: after objections; Set 3: after discussion). We found that, for all 11 vignettes, the variance in participants’ answers was lower in Set 3 than in Set 1: t(10) = 5.31, p = .0003. This means that the procedure increased consensus across participants (see Table S4 in Supplementary Materials).
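A minimal R sketch of this consensus analysis follows, under the simplifying assumption that each vignette’s answers can be coded numerically so that a variance can be computed; the data are simulated.

```r
# Sketch (simulated data): per-vignette variance of answers in Set 1 vs. Set 3,
# compared with a paired t-test across the 11 vignettes (hence df = 10).
set.seed(3)
var_set1 <- replicate(11, var(rbinom(324, 1, runif(1, 0.2, 0.8))))
var_set3 <- var_set1 * runif(11, 0.5, 0.9)  # assume lower variance after discussion
t.test(var_set1, var_set3, paired = TRUE)   # lower variance = more consensus
```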
As can be seen in Table 4, the procedure (objections + collective discussion) produced significant changes in participants’ judgments. Overall, 30% of answers were modified at least once across the three sets (see Section 3.4 in Supplementary Materials). For the four most controversial criteria (age, gender, body size, and status), the procedure led more participants to endorse the third option, and thus to treat these criteria as irrelevant. However, for role and norm-compliance, it led more participants to consider these criteria as relevant, showing that the procedure did not always favor the “random” option.
Confidence and Consensus Perception. Interestingly, participants were more confident in their answers at the end of the procedure (Set 3) than at the beginning (Set 1): t(111) = 11.03, p < .001. Their perception of consensus also significantly increased between Set 1 and Set 3: t(111) = 7.84, p < .001, though this increase was mostly due to the discussion, and not to their being faced with objections (see Table 5).

Table 5. Participants’ degree of confidence and perceived consensus for each set (Study 3). Standard deviations in parentheses.

| | Set 1 | Set 2 | Set 3 |
|---|---|---|---|
| Confidence | 3.52 (.99) | 3.78 (1.01) | 4.22 (.69) |
| Consensus | 3.29 (.75) | 3.24 (.83) | 3.84 (.77) |
Discussion

In Study 2, we found a pattern of answers suggesting that criteria singled out as relevant in the MME were deemed mostly irrelevant once a third option was introduced. However, it was not possible to determine whether this was due only to the introduction of a third option, or whether it was mostly due to presenting choices in an abstract way and/or having participants engage in a collective discussion. In Study 3, we found a similar pattern of answers in a concrete setting, suggesting that this pattern was not only due to the abstract presentation of Study 2 (though we cannot exclude that presentation style might have effects; see Table S2 in Supplementary Materials). Moreover, we found that engaging in a collective discussion tended to increase the choice of the “random” option for the criteria judged most irrelevant. Still, the pattern of answers we spotted in Study 2 was already visible before the collective discussion (in Set 1).

Participants’ choice of the “random” option was not affected by their having to justify their answers or indicate their degree of confidence. Moreover, participating in the collective discussion raised participants’ confidence in their answers. Overall, this suggests, against Awad and colleagues’ (2020) suggestion, that participants’ choice of the “random” option is not the result of a mere bias that could be overcome by more reflection.
Conclusion
In this paper, our goal was to show that attempts at automatizing ethical decision-making by aggregating participants’ answers to moral dilemmas face a serious methodological difficulty: data collection is anything but a neutral process. Indeed, participants’ replies can vary with experimental designs, so that the way experiments are framed can lead machine-learning algorithms to reach very different conclusions about what is the most “plausible” ethical answer to a dilemma.

More precisely, our goal was to explore whether collecting data using an experimental design that more faithfully mirrors the context of a public discussion about AVs might change the conclusions one could draw from such experiments. Indeed, most empirical studies on people’s moral judgments about AVs focus on quick intuitions, generated in isolation by a single exposure to each case, and with a limited range of options. However, these conditions differ widely from those in which citizens engaging in a public debate would form their opinion about the ethics of AVs. As the results of these experiments are increasingly used to bear on social decisions, with the claim that they represent public opinion, this is problematic.
Thus, we examined whether changing the conditions in which participants’ judgments are generated to more closely mirror the conditions of public deliberation resulted in different conclusions regarding the “social consensus” about which factors should be relevant to AVs’ behavior. In Study 1, we had participants take more time to think about their answers by exposing them a second time to each vignette and forcing them to wait before answering, and by asking them to endorse different perspectives beyond the mere perspective of driver or passenger (such as policymaker or citizen). In this case, these changes did not make a difference.
However, in Studies 2 and 3, we showed that giving participants the possibility to mark certain criteria as “morally irrelevant” and to express their preference for AVs to make random choices led to a pattern of answers (and to a picture of “social consensus”) that differed from the one observed by previous studies offering only two options: we observed a strong consensus on the moral irrelevance of certain criteria such as gender, body size, and social status. For gender, our results paint a picture in which a minority of participants express a preference for saving women and a majority (around 75% in Study 2 and 75–85% in Study 3) expresses a preference for random choice. This is very different from a situation in which 60% prefer to save a woman over a man and 40% prefer to save a man over a woman—a situation in which there is no clear consensus. However, both situations will be treated in a similar way by algorithms trained on data that do not allow participants to express a preference for random choices: in both cases, such algorithms will compute a small preference, at the scale of the population, for saving women, while there actually seems to be a strong consensus for not taking gender into account and letting AVs make a random decision. Such an algorithm will thus go against the moral consensus by allowing AVs to favor women, even in a slight way.
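This point can be made concrete with a small worked example in R; the numbers are hypothetical and chosen only to mirror the two situations described above.

```r
# Hypothetical response distributions for the gender criterion.
two_opt   <- c(woman = 0.60, man = 0.40)                 # forced binary choice
three_opt <- c(random = 0.75, woman = 0.15, man = 0.10)  # with a random option

# An aggregator restricted to {woman, man} reads both datasets identically:
names(which.max(two_opt))                       # "woman"
names(which.max(three_opt[c("woman", "man")]))  # "woman"

# Only the three-option data reveals the actual consensus: choose at random.
names(which.max(three_opt))                     # "random"
```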
The dismissal of a third, “random” option in previous studies about AV dilemmas might be due to a tendency to understand moral deliberation on the model of rational decision theory. From the standpoint of rational decision theory, a choice at random between two options A and B can only express one thing: indifference between A and B. However, in ethical decision-making, choosing at random is not necessarily an expression of indifference—rather, it can express a strong endorsement of moral values such as impartiality, or the commitment to treat all human beings as having the same moral status, independently of their individual differences. Thus, rejecting the need for a third, random option under the pretext that the same information can be obtained from two-option surveys (because it will manifest itself as indifference at the level of the population) already commits oneself to a particular view of ethical decision-making, according to which ethical decisions are similar to economic decisions.
Additionally, the assumption that a preference for random choices (and thus impartiality) will manifest itself as indifference in two-option surveys is unwarranted. When forced to choose between two options, participants who favor impartiality on moral grounds (and would select the “random” choice in a three-option survey) might choose to rely on non-moral preferences. After all, participants in the MME are simply asked what the AV should do—without specifying that this “should” is a moral one (see Cova et al., 2019 on this particular methodological issue). Thus, the two-option design might force participants to rely on personal preferences that they would not themselves consider morally appropriate, thereby increasing dissensus.
Moreover, decisions about AVs are not likely to be made in isolation: rather, as most people in this literature emphasize, such decisions need to be the outcome of a public discussion. Thus, in Study 3, we used a design that, in addition to providing participants with a random option, tried to imitate the context of a public discussion: participants were provided with simplified versions of arguments offered by ethicists against the relevance of certain criteria, and were asked to discuss their answers with each other. We found that this procedure led participants to be more confident in their individual answers, while reducing the variance in participants’ answers. Thus, the deliberative process participants were invited to engage in allowed them to reach more stable and confident replies that they may better relate to and, thus, feel more responsible for.
Crucially, this procedure also led participants to significantly change their minds about the relevance of certain criteria. For example, it led them to reach an even stronger consensus on the moral irrelevance of gender, body size, and social status. It also led to a significant difference in their assessment of the relevance of age: while more than half of participants endorsed age as a morally relevant criterion at the beginning of the study, fewer than half did so at the end. Together, these results show that the consensus people are likely to reach through a collective discussion cannot be reduced to the aggregation of individual answers made in isolation, no matter how many individual answers have been collected.
Overall, the results of our studies suggest that we should be wary of using the empirical results of studies such as the Moral Machine Experiment as a guide for ethical decision-making. On the one hand, the design of these studies rests on unquestioned assumptions about the nature of ethical decision-making, which might lead them to ignore a clear popular consensus on “impartial” options. On the other hand, the data are collected in a setting that cannot be considered equivalent to the setting of an informed, collective discussion, in which people might come to reject as unwarranted and morally irrelevant the various biases identified by these studies. As we observed, introducing a third option severely challenges Awad and colleagues’ conclusions, showing that survey design can either bring out dissensus that does not accurately capture people’s opinions, or reveal consensus that better reflects them.
Finally, because survey design influences participants’ replies, our own studies also fall under this limitation. Forcing people to take more time before submitting their responses does not necessarily lead them to question these responses further, which could explain why we do not observe a significant effect of reflection time where other works do. The convergence effect of the collective discussion could also partly result from pressure to conform to the majority opinion, rather than from a genuine revision of one’s opinion. Moreover, distributing participants across discussion groups based on their previous replies may either reinforce their opinions, if groups are formed to be like-minded, or encourage them to revise those opinions, if groups are built to represent diverse opinions and participants are explicitly asked to defend their previous replies. Overall, these limitations only support our claim that survey design influences the way participants form an opinion about a topic, and therefore the replies they provide. Further studies should investigate how survey design could support respondents’ critical thinking and help them develop more robust and meaningful opinions. Such an effort is crucial nowadays, as surveys are increasingly used by a wide range of actors to represent a so-called public opinion; such biased representations of people’s opinions both influence their actual opinions (e.g., through pressure to conform) and are used by decision-makers to justify societal choices.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or
publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or
publication of this article: This work was supported by Facebook Inc.
Supplemental Material
Supplemental material for this article is available online.
References
Allard, A., & Cova, F. (forthcoming). What experiments can teach us about justice and impartiality: Vindicating experimental political philosophy. In H. Viciana, F. Aguiar, & A. Gaitán (Eds.), Issues in experimental moral philosophy. Routledge.

Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J. F., & Rahwan, I. (2018). The Moral Machine experiment. Nature, 563(7729), 59–64. https://doi.org/10.1038/s41586-018-0637-6

Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J. F., & Rahwan, I. (2020). Reply to: Life and death decisions of autonomous vehicles. Nature, 579(7797), E3–E5. https://doi.org/10.1038/s41586-020-1988-3

Balliet, D. (2010). Communication and cooperation in social dilemmas: A meta-analytic review. Journal of Conflict Resolution, 54(1), 39–57. https://doi.org/10.1177/0022002709352443

Bigman, Y. E., & Gray, K. (2020). Life and death decisions of autonomous vehicles. Nature, 579(7797), E1–E2. https://doi.org/10.1038/s41586-020-1987-4

Bonnefon, J. F. (2019). La voiture qui en savait trop. Humensciences.

Bonnefon, J. F., Shariff, A., & Rahwan, I. (2016). The social dilemma of autonomous vehicles. Science, 352(6293), 1573–1576. https://doi.org/10.1126/science.aaf2654

Bourdieu, P. (1972). L’opinion publique n’existe pas: Quelques remarques critiques sur les sondages d’opinion. Les Temps Modernes, 318 (1973).

Capraro, V., Everett, J. A., & Earp, B. D. (2019). Priming intuition disfavors instrumental harm but not impartial beneficence. Journal of Experimental Social Psychology, 83(1), 142–149. https://doi.org/10.1016/j.jesp.2019.04.006

Conitzer, V., Sinnott-Armstrong, W., Borg, J. S., Deng, Y., & Kramer, M. (2017). Moral decision making frameworks for artificial intelligence. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (pp. 4831–4835). AAAI.

Cova, F., Boudesseul, J., & Lantian, A. (2019). ‘Sounds fine, but no thanks!’: On distinguishing judgments about action and acceptability in attitudes toward cognitive enhancement. AJOB Neuroscience, 10(1), 57–59. https://doi.org/10.1080/21507740.2019.1595777

Cova, F., Strickland, B., Abatista, A. G. F., Allard, A., Andow, J., Attie, M., Beebe, J. R., Berniūnas, R., Boudesseul, J., Colombo, M., Cushman, F., Díaz, R., van Dongen, N., Dranseika, V., Earp, B. D., Torres, A. G., Hannikainen, I. R., Hernández-Conde, J. V., Hu, W., & Zhou, X. (2021). Estimating the reproducibility of experimental philosophy. Review of Philosophy and Psychology, 12(1), 9–44. https://doi.org/10.1007/s13164-018-0400-9

De Freitas, J., & Cikara, M. (2021). Deliberately prejudiced self-driving vehicles elicit the most outrage. Cognition, 208, 104555. https://doi.org/10.1016/j.cognition.2020.104555

Desrosières, A. (2008). Pour une sociologie historique de la quantification: L’argument statistique I. Paris: Presses de l’École des Mines.

Etienne, H. (2021). The dark side of the ‘Moral Machine’ and the fallacy of computational ethical decision-making for autonomous vehicles. Law, Innovation and Technology, 13(1), 85–107. https://doi.org/10.1080/17579961.2021.1898310

Etienne, H. (2022). A practical role-based approach for autonomous vehicle moral dilemmas. Big Data & Society, 9(2). https://doi.org/10.1177/20539517221123305

Frank, D. A., Chrysochou, P., Mitkidis, P., & Ariely, D. (2019). Human decision-making biases in the moral dilemmas of autonomous vehicles. Scientific Reports, 9(1), 13080. https://doi.org/10.1038/s41598-019-49411-7

Freiman, C., & Nichols, S. (2011). Is desert in the details? Philosophy and Phenomenological Research, 82(1), 121–133. https://doi.org/10.1111/j.1933-1592.2010.00387.x

Greene, J., Rossi, F., Tasioulas, J., Venable, K. B., & Williams, B. (2016). Embedding ethical principles in collective decision support systems. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (pp. 4147–4151). AAAI.

Greene, J. D. (2014). Beyond point-and-shoot morality: Why cognitive (neuro)science matters for ethics. Ethics, 124(4), 695–726. https://doi.org/10.1086/675875

Mercier, H. (2011). What good is moral reasoning? Mind & Society, 10(2), 131–148. https://doi.org/10.1007/s11299-011-0085-6

Nadelhoffer, T., & Feltz, A. (2008). The actor–observer bias and moral intuitions: Adding fuel to Sinnott-Armstrong’s fire. Neuroethics, 1(2), 133–144. https://doi.org/10.1007/s12152-008-9015-7

Nichols, S., & Knobe, J. (2007). Moral responsibility and determinism: The cognitive science of folk intuitions. Noûs, 41(4), 663–685. https://doi.org/10.1111/j.1468-0068.2007.00666.x

Noothigattu, R., Gaikwad, S., Awad, E., Dsouza, S., Rahwan, I., Ravikumar, P., & Procaccia, A. (2018). A voting-based system for ethical decision making. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 1587–1594. https://doi.org/10.1609/aaai.v32i1.11512

Sinnott-Armstrong, W. (2008). Abstract + concrete = paradox. In J. Knobe & S. Nichols (Eds.), Experimental philosophy (pp. 209–230). Oxford University Press.

Struchiner, N., Almeida, G. F. C. F., & Hannikainen, I. R. (2020). Legal decision-making and the abstract/concrete paradox. Cognition, 205, 104421. https://doi.org/10.1016/j.cognition.2020.104421

Suter, R. S., & Hertwig, R. (2011). Time and moral judgment. Cognition, 119(3), 454–458. https://doi.org/10.1016/j.cognition.2011.01.018

Thaler, R., & Sunstein, C. (2008). Nudge: Improving decisions about health, wealth and happiness. Yale University Press.

Tobia, K., Buckwalter, W., & Stich, S. (2013). Moral intuitions: Are philosophers experts? Philosophical Psychology, 26(5), 629–638. https://doi.org/10.1080/09515089.2012.696327

Wiegmann, A., Horvath, J., & Meyer, K. (2020). Intuitive expertise and irrelevant options. In T. Lombrozo, J. Knobe, & S. Nichols (Eds.), Oxford studies in experimental philosophy (Vol. 3). Oxford University Press.
Author Biographies
Hubert Etienne is a researcher in AI ethics in Meta’s Responsible AI team in New York.
Florian Cova is a postdoctoral researcher at the Centre Interfacultaire en Sciences Affectives,
University of Geneva.
Supplementary Materials
All experiments were conducted between October 2021 and March 2022 and hosted on Qualtrics.
Respondents were recruited on Prolific Academic, and the collected data were analysed in RStudio.
1. Supplementary Materials for Study 1
1.1. Effect of second exposure's duration
We looked at the effect of the time constraint (10s vs. 20s vs. 30s vs. 40s) on participants' answers the second time they were asked each question. We computed the number of times each participant changed their mind between Set 1 and Set 2 and compared these counts across conditions using an ANOVA. We found no significant effect of time constraint on the number of times people changed their mind: F(1, 606) = 0.136, p = .712. This suggests that being forced to think longer did not make participants more likely to change their minds. Results are presented in Table S1.
Case | 10s | 20s | 30s | 40s | Chi-square
1 | ->1: 3.1%, ->2: 4.7% | ->1: 4.6%, ->2: 6.7% | ->1: 6.4%, ->2: 8.3% | ->1: 8.3%, ->2: 3.2% | p = .20
2 | ->1: 2.6%, ->2: 5.8% | ->1: 6.0%, ->2: 6.6% | ->1: 6.4%, ->2: 10.1% | ->1: 3.8%, ->2: 7.6% | p = .45
3 | ->1: 5.8%, ->2: 1.6% | ->1: 7.3%, ->2: 2.6% | ->1: 5.5%, ->2: 2.8% | ->1: 5.1%, ->2: 2.5% | p = .96
4 | ->1: 2.6%, ->2: 3.1% | ->1: 3.3%, ->2: 0.7% | ->1: 3.7%, ->2: 0.0% | ->1: 2.5%, ->2: 1.3% | p = .38
6 | ->1: 11.5%, ->2: 9.4% | ->1: 7.3%, ->2: 6.6% | ->1: 7.3%, ->2: 7.3% | ->1: 10.2%, ->2: 7.6% | p = .71
7 | ->1: 1.6%, ->2: 3.1% | ->1: 2.0%, ->2: 2.6% | ->1: 0.9%, ->2: 1.8% | ->1: 3.2%, ->2: 3.2% | p = .86
8 | ->1: 2.6%, ->2: 4.2% | ->1: 2.6%, ->2: 2.6% | ->1: 3.7%, ->2: 4.6% | ->1: 2.5%, ->2: 4.5% | p = .97
9 | ->1: 7.3%, ->2: 7.3% | ->1: 6.6%, ->2: 9.9% | ->1: 3.7%, ->2: 9.2% | ->1: 3.8%, ->2: 7.6% | p = .67
10 | ->1: 1.6%, ->2: 2.1% | ->1: 3.3%, ->2: 2.0% | ->1: 0.9%, ->2: 2.8% | ->1: 0.0%, ->2: 2.5% | p = .39
11 | ->1: 5.8%, ->2: 5.2% | ->1: 6.6%, ->2: 7.3% | ->1: 4.6%, ->2: 2.3% | ->1: 7.6%, ->2: 9.6% | p = .31
12 | ->1: 4.2%, ->2: 5.8% | ->1: 5.3%, ->2: 5.3% | ->1: 3.7%, ->2: 9.2% | ->1: 5.7%, ->2: 7.7% | p = .83
13 | ->1: 7.3%, ->2: 9.4% | ->1: 6.6%, ->2: 7.9% | ->1: 7.3%, ->2: 10.1% | ->1: 7.0%, ->2: 5.7% | p = .89
14 | ->1: 7.9%, ->2: 10.5% | ->1: 7.3%, ->2: 6.6% | ->1: 13.8%, ->2: 11.9% | ->1: 6.4%, ->2: 12.1% | p = .19
15 | ->1: 6.8%, ->2: 8.4% | ->1: 11.3%, ->2: 6.0% | ->1: 9.2%, ->2: 9.2% | ->1: 8.3%, ->2: 7.0% | p = .78
16 | ->1: 5.8%, ->2: 9.4% | ->1: 6.6%, ->2: 10.6% | ->1: 8.3%, ->2: 7.3% | ->1: 8.9%, ->2: 8.9% | p = .89
17 | ->1: 8.4%, ->2: 8.9% | ->1: 8.6%, ->2: 7.3% | ->1: 8.3%, ->2: 9.2% | ->1: 7.6%, ->2: 8.3% | p = .99
N | 191 | 151 | 109 | 157 | -
Table S1. Percentage of participants who changed their answers to Option 1 (->1) and to Option 2 (->2) between
the first and second set of questions for each condition (10s, 20s, 30s, 40s) and each case (Study 1). Rightmost
column indicates the results of a Chi-square test (df = 6) comparing the distribution of participants’ behaviors
(changed to 1, changed to 2, did not change) between each condition. Case 5 does not appear in the table, as it
was an attention check.
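A minimal sketch of these analyses in R (the software used for all analyses), assuming a data frame d with one row per participant, a condition factor (10s/20s/30s/40s), a changes count of answer switches between Set 1 and Set 2, and one behaviour factor per case (changed to Option 1, changed to Option 2, did not change); these variable names are ours for illustration, not those of the original script:
# ANOVA on the number of answer changes, with the time constraint
# entered as a numeric predictor (in seconds), which yields the
# F(1, 606) statistic reported above
d$seconds <- as.numeric(sub("s", "", as.character(d$condition)))
summary(aov(changes ~ seconds, data = d))
# Per-case chi-square test (df = 6): 3 behaviours by 4 conditions
chisq.test(table(d$behaviour_case1, d$condition))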
2. Supplementary Materials for Study 2
2.1. Wording for questions about general criteria
Box S1. Wording for question about general criteria (Study 2)
Considering the previous scenarios, which of the following claims do you agree most with?
I consider that [X] is a morally relevant criterion to make life arbitrations in such situations and would (1) rather spare
[Y] versus [Z] whenever it is possible (2) rather spare [Z] versus [Y] whenever it is possible. (3) I do not consider that
[X] is a morally relevant criterion to make life arbitrations in such situations and would allow the autonomous vehicle
to select a random answer whenever possible.
Set 2-1: X = “gender,” Y= “a man,” Z = “a woman”
Set 2-2: X = “age,” Y= “the younger,” Z = “the elderly”
Set 2-3: X = “body size,” Y= “an athletic person,” Z = “a fat person”
Set 2-4: X = “social status,” Y= “an executive person,” Z = “a homeless person”
Set 2-5: X = “human/pets”
(1) would always spare humans versus pets whenever possible
(2) would not always spare humans over pets, considering that pets could be spared versus humans in some
situations.
(3) would allow the autonomous vehicle to select a random answer whenever it is possible
Set 2-6: X = “abidance by the law,” Y = “lawful drivers and pedestrians,” Z = “jaywalkers”
Set 2-7: X = “pedestrians/passengers,” Y = “pedestrians,” Z = “AV passengers”
Set 2-8: X = “going straight/swerving”
(1) would change some of my answers if reaching the desired outcomes implied allowing the car to swerve
instead of going straight.
(2) would not change my answers if reaching the desired outcomes implied allowing the car to swerve instead
of going straight.
Set 2-9: I consider that the amount of harm is the most important criterion to make life arbitrations in such
situations…
(1) and I would always choose the option that minimises the total amount of harm, regardless of other criteria.
(2) but I would not necessarily consider it as the most important one. I may choose the option that minimizes
the total amount of harm whenever possible but not systematically at the expense of other criteria.
(3) I do not consider that the amount of harm is necessarily morally relevant to make life arbitrations in such
situations and would allow the autonomous vehicle to select a random answer whenever possible
2.2. Replies Set 1 vs Set 2
The following tables show how participants answered the second set of questions (abstract principles) as a function of their answers to the first set of questions (concrete cases).
Gender (Set 1.1 vs Set 2.1)
 | Save men | Irrelevant | Save women
Continue (kill male) | 3 | 122 | 39
Swerve (kill female) | 0 | 22 | 4

Body size (Set 1.2 vs Set 2.3)
 | Save fit person | Irrelevant | Save fat person
Continue (kill athletic) | 16 | 79 | 1
Swerve (kill fat) | 18 | 75 | 1

Social status (Set 1.3 vs Set 2.4)
 | Save homeless | Irrelevant | Save executives
Continue (kill homeless) | 5 | 110 | 35
Swerve (kill executive) | 3 | 30 | 7

Pedestrians vs. passengers (Set 1.9 vs Set 2.7)
 | Save pedestrians | Irrelevant | Save passengers
Continue (kill pedestrians) | 19 | 34 | 18
Swerve (kill passengers) | 76 | 32 | 11

Pedestrians vs. passengers (Set 1.11 vs Set 2.7)
 | Save pedestrians | Irrelevant | Save passengers
Continue (kill passengers) | 84 | 48 | 10
Swerve (kill pedestrians) | 11 | 18 | 19

Animals vs. humans (Set 1.13 vs Set 2.5)
 | May save pet | Irrelevant | Always save humans
Continue (kill human) | 12 | 14 | 43
Swerve (kill pet) | 11 | 11 | 99

Minimising the number of victims (Set 1.8 vs Set 2.9)
 | Most important | Quite important | Irrelevant
Continue (kill 2) | 41 | 21 | 7
Swerve (kill 1) | 91 | 23 | 7

Minimising the number of victims (Set 1.10 vs Set 2.9)
 | Most important | Quite important | Irrelevant
Continue (kill 5) | 6 | 5 | 6
Swerve (kill 2) | 126 | 39 | 8
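Each of these cross-tabulations can be obtained with a single call in R; below is a minimal sketch for the gender scenario, assuming a data frame d2 with one answer factor per set (d2, set1_gender, and set2_gender are illustrative names, not those of the original script):
# Concrete-case answers (Set 1.1) crossed with abstract-principle
# answers (Set 2.1); table() builds the contingency table
table(Set1 = d2$set1_gender, Set2 = d2$set2_gender)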
3. Supplementary Materials for Study 3
3.1. Wording for objections
Box S2. Wording for objections (Study 3)
OBJ-1: Men vs women
“You may think that gender is a morally relevant criterion here. If so, and to be consistent with your answer, you
should be ready to either state that white people should be spared versus black people or the contrary, that Muslims
should be spared versus Catholics or the contrary, that homosexuals should be spared versus heterosexuals or the
contrary, or to explain what makes gender different from skin colour, religious belief and sexual orientation so that the
former one is morally relevant here whereas the others are not.”
OBJ-2: Athletic vs fat people
“You may think that body size is a morally relevant criterion here. If so, what else could a society that allows
arbitrations potentially involving people’s deaths based on beauty or body image also end up allowing? What if it is
decided that beauty is represented by blond-haired blue-eyed people?”
OBJ-3: Homeless people vs executives
“You may think that social status is a morally relevant criterion here. If so, who should be in charge of defining the
social status scale and deciding which activity is socially valuable and which one is not? What else could a society
that allows arbitrations potentially involving people’s death based on social status do next? What if we had a social
credit score that could rank citizens from the most useful to the least one?”
OBJ-4: Younger vs elder people
“You may think that age is a morally relevant criterion here. If so, you may think so because of the following
argument: ‘young people should be spared because they had less time to enjoy life and more to lose in terms of
expected lifetime’. However
- there is a great uncertainty surrounding expected lifetime, as a young boy can die tomorrow from a disease and
a 70-year-old grandmother can live another 20 years. Furthermore, on average, women tend to live longer than
men in many countries; would you then agree to systematically spare them over men for such a reason?
- it is not possible to measure and compare each individual’s value of life together with their capacity to enjoy it,
as it is far too subjective. Would you systematically sacrifice someone with an expected extra 20 years of pure
bliss to allow someone else to suffer another 30 years of a hard life full of pain and humiliation?
- to be consistent with your claim to prioritize people with the higher remaining expected lifetime, you would also
have to accept sacrificing people suffering from severe incurable diseases associated with a very low lifetime
such as Huntington’s diseases or progeria.
Finally, do you think that self-driving cars could recognise pedestrians’ gender, age, body size or social status in
practice? While these criteria might be morally relevant, they could also be impossible to implement in practice.”
OBJ-5: Passengers vs pedestrians
“You may think that passengers should be spared versus pedestrians. If so, why would they have a higher right not to
be endangered than pedestrians crossing legally, while the issue comes from the vehicle’s brakes which are not
working, resulting in the vehicle itself being the origin of the harm here?”
OBJ-6: More vs fewer people
“You may think that the vehicle should be operated in such a way to hit the lower number of people.
If so, is your objective to reduce the total number of deaths or the total amount of harm? In other words, would you
accept 10 people ending up in wheelchairs to save one person’s life?
If you focus on reducing the number of deaths rather than the amount of harm, you may actually end up sparing elder
people versus youngsters as they tend to have greater chances of surviving. Is this consistent with your previous
reply?
How would you calculate and compare the probabilities for different types of consequences? Or better said, should
the vehicle run over 3 people with a 50% chance of breaking the first one’s legs, 80% chance of killing the second
and 50% chance of plunging the third one into a coma or 3 people with a 90% probability of making the first one
quadriplegic, 40% probability of killing the second and 70% probability of making the fourth one blind?
Finally, would you agree to hit a person legally engaged in the pedestrian pathway to spare two jaywalkers aware
that they are acting unlawfully and that this may be dangerous?”
OBJ-7: Humans vs pets
“You may think that the vehicle should be operated in such a way to always sacrifice pets in the car to spare humans,
even when they are jaywalking. Let us agree on the idea that a human life’s value is always greater than an animal’s
but look at the question from a different angle.
Legally in Europe, pets are considered “property,” so that if one murders my pet, they can be charged for damaging
my property. Let us now introduce Green Monkey, who was an American racehorse sold for 16 million dollars in
2006.
Do you think it would be fair for Green Monkey’s owner to sacrifice its 16-million-dollars-value asset conveyed in his
vehicle to save the life of a jaywalker who intentionally broke the law, thus putting everyone at risk?”
3.2. Comparison of participants' choices for Studies 2 and 3
Does presenting choices in an abstract rather than a concrete way make any difference to participants' choices? To find out, we used a chi-square test to compare, for each factor, the percentage of participants who chose the 'random' option between the two studies. For Study 3, we used participants' answers to Questions 1 to 8 in the third set (after discussion). Results are presented in Table S2. As can be seen, we found a significant difference for three factors out of eight (Age, Norm compliance, and Number). In two cases (Age and Number), the Concrete presentation raised the proportion of irrelevant/random answers, but in one case (Norm compliance), it lowered this proportion. Thus, there was no overall consistent pattern. Note that, due to methodological differences between Studies 2 and 3, these differences are not necessarily due to the abstract/concrete difference.
Factor | Study 2 (Abstract) | Study 3 (Concrete) | Chi-square test
Gender | 75.8% | 85.2% | p = .43
Age | 34.7% | 56.5% | p = .005**
Body size | 80.5% | 80.2% | p = .99
Social status | 73.7% | 85.5% | p = .31
Species | 13.2% | 10.8% | p = .57
Norm | 40.5% | 21.9% | p = .001**
Role | 34.7% | 38.3% | p = .65
Number | 7.4% | 25.0% | p < .001***
Table S2. Chi-square tests assessing the impact of presentation (Abstract vs. Concrete) on participants' choice of the 'random' option. Each cell presents the percentage of participants who chose the 'random'/'irrelevant' option for the corresponding factor. Rightmost column indicates the result of a chi-square test comparing the distribution of answers between the two studies.
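As a sketch, the per-factor comparison can be run in R on a 2 x 2 contingency table; r2/r3 (counts of 'random' answers) and n2/n3 (sample sizes) in Studies 2 and 3 are placeholder names, not variables from the original script:
# 2 x 2 table: answer (random vs. other) by study (abstract vs. concrete)
m <- matrix(c(r2, n2 - r2,
              r3, n3 - r3),
            nrow = 2, byrow = TRUE,
            dimnames = list(Study = c("Study 2", "Study 3"),
                            Answer = c("random", "other")))
chisq.test(m)  # df = 1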
3.3. Effect of condition on participants’ judgments to Set 1
To test the impact of condition (asking for justifications vs. asking for degree of confidence vs. a control condition), we performed a chi-square test comparing the distribution of all three answers (Side 1 / Side 2 / Random) between conditions for Set 1. Results are presented in Table S3.
Question | CONTROL | DOC | JUST | Chi-square
1 | 20.3% v. 5.8% | 19.6% v. 4.5% | 18.3% v. 6.4% | p = .97
2 | 6.8% v. 50.5% | 8.0% v. 48.2% | 10.1% v. 40.4% | p = .62
3 | 18.4% v. 13.6% | 13.4% v. 20.5% | 11.0% v. 33.0% | p = .01*
4 | 16.5% v. 7.8% | 15.2% v. 8.0% | 19.3% v. 9.2% | p = .92
5 | 13.6% v. 54.4% | 14.3% v. 56.3% | 14.7% v. 55.0% | p = .99
6 | 5.8% v. 77.7% | 3.6% v. 70.5% | 3.7% v. 75.2% | p = .49
7 | 81.6% v. 10.7% | 86.6% v. 6.3% | 80.7% v. 8.3% | p = .62
8 | 32.0% v. 48.5% | 36.6% v. 42.9% | 42.2% v. 40.4% | p = .62
9 | 12.6% v. 70.9% | 8.9% v. 75.0% | 12.8% v. 78.9% | p = .31
11 | 51.5% v. 3.9% | 62.5% v. 4.5% | 58.7% v. 6.4% | p = .39
12 | 12.6% v. 78.6% | 18.8% v. 74.1% | 20.2% v. 69.7% | p = .54
Table S3. Chi-square test assessing the impact of condition (CONTROL, DOC, JUST) on participants’ choice of
answers to Set 1. In each cell, we present the percentage of participants who chose Side 1 vs. the percentage of
participants who chose Side 2. Rightmost column indicates the result of a Chi-Square test comparing the
distribution of answers across all three conditions.
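A minimal sketch of this test in R, assuming a data frame d3 with a three-level answer factor per question and a condition factor (the names d3, answer_q1, and condition are illustrative):
# 3 x 3 table: answer (Side 1 / Side 2 / Random) by condition
# (CONTROL / DOC / JUST); chi-square test with df = 4
chisq.test(table(d3$answer_q1, d3$condition))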
3.4. Effect of objections and discussion on participants’ answers
To analyze the effect of argumentation (Set 1 vs. Set 2) and of discussion (Set 2 vs. Set 3), we scored
participants’ answers in the following way: -1 for “Side 1,” 0 for “Random” and 1 for “Side 2.” The
dispersion among replies appears between parentheses in Table S4.
Question | Set 1 | Set 2 | Set 3 | Set 1 vs. Set 2 | Set 2 vs. Set 3 | Set 1 vs. Set 3
1 | -0.14 (0.48) | -0.13 (0.41) | -0.12 (0.37) | +.01 | +.01 | +.02
2 | 0.38 (0.64) | 0.30 (0.58) | 0.34 (0.56) | -.08* | +.04 | -.04
3 | 0.09 (0.60) | 0.07 (0.54) | -0.01 (0.44) | -.02 | -.08** | -.09**
4 | -0.09 (0.50) | -0.02 (0.45) | -0.03 (0.38) | +.06* | -.00 | +.06*
5 | 0.41 (0.73) | 0.44 (0.69) | 0.47 (0.63) | +.03 | +.03 | +.06
6 | 0.70 (0.54) | 0.59 (0.58) | 0.69 (0.53) | -.11*** | +.10** | -.01
7 | -0.75 (0.60) | -0.72 (0.61) | -0.79 (0.51) | +.03 | -.07* | -.05
8 | 0.06 (0.90) | -0.03 (0.89) | -0.01 (0.89) | -.09 | +.02 | -.07
9 | 0.64 (0.68) | 0.67 (0.64) | 0.70 (0.61) | +.03 | +.02 | +.06
11 | -0.53 (0.59) | -0.56 (0.57) | -0.48 (0.57) | -.03 | +.08** | +.05
12 | 0.57 (0.77) | 0.46 (0.83) | 0.60 (0.74) | -.11* | +.15*** | +.03
Table S4. Means and standard deviations for participants' answers to vignettes (coded). The impact of objections and group discussion on respondents' answers is shown in the three rightmost columns. Asterisks indicate the results of paired t-tests.
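The coding and the paired t-tests can be sketched in R as follows (d3 and the q1_set* variables are illustrative names, not those of the original script):
# Recode answers as -1 (Side 1), 0 (Random), +1 (Side 2)
code <- function(x) unname(c("Side 1" = -1, "Random" = 0, "Side 2" = 1)[x])
s1 <- code(d3$q1_set1); s2 <- code(d3$q1_set2); s3 <- code(d3$q1_set3)
t.test(s2, s1, paired = TRUE)  # effect of objections (Set 1 vs. Set 2)
t.test(s3, s2, paired = TRUE)  # effect of discussion (Set 2 vs. Set 3)
t.test(s3, s1, paired = TRUE)  # overall change (Set 1 vs. Set 3)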
3.5. Participants’ change in answers
70% of replies did not change from Set 1 to Set 3. Among the 30% that did, 70% operated a one-way
shift (e.g., Side 2, Side 1, Side 1) and 23% reverted to their initial response (e.g., Side 2, Random,
Side 2). Of the replies that changed definitively, 53% did so after OBJ and 47% after DISC (see
Annex 2.5). Fig. 5 presents how OBJ and DISC impacted respondents’ confidence and perception of
consensus for each category. We aggregated replies according to their evolution through the
experiment: no change (e.g., Side 1, Side 1, Side 1), one-way shift DISC (e.g., Side 1, Side 1, Side 2),
one-way shift OBJ (e.g., Side 1, Side 2, Side 2), comeback (e.g., Side 1, Side 2, Side 1) and lost (e.g.,
Side 1, Side 2, Random).
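A small R helper, written by us purely for illustration, makes this classification explicit:
# Classify a reply's trajectory across the three sets
classify <- function(a1, a2, a3) {
  if (a1 == a2 && a2 == a3) return("no change")
  if (a1 == a2)             return("one-way shift DISC")  # changed only after discussion
  if (a2 == a3)             return("one-way shift OBJ")   # changed only after objections
  if (a1 == a3)             return("comeback")
  "lost"
}
classify("Side 1", "Side 2", "Side 2")  # returns "one-way shift OBJ"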
Figure S5. Effect of objections and discussion on respondents' confidence and perceived consensus per response type.
Respondents who do not change their replies are also those who report the highest degree of confidence at every single step, while those who change their replies twice have the lowest degree of confidence at every step. In addition, confidence and perception of consensus are impacted in similar ways by OBJ and DISC for all types of replies except the one-way shift OBJ. This type corresponds to respondents who changed their minds after reviewing the objections; they show the highest increase in confidence (+1.0pt) and are not affected by the decrease in the perception of consensus. Finally, whereas the one-way shift is readily interpreted as coherent with the continuous increase in confidence, the comeback trajectory seems more paradoxical. As we understand it, OBJ convinces several respondents to change their minds before DISC brings them back to their initial reply, once they are reassured of it by social confirmation. Thus, the comeback path illustrates the deliberative process participants engage in as they make increasingly informed and robust moral judgements.
Such changes could be seen as irrelevant at the macro level, either because some people come back to their initial reply or because the switches from one answer to another in a given scenario compensate for the switches in other scenarios. However, the changes express something meaningful for moral judgements, which is not captured at the aggregate level: how individuals relate to their decisions and may feel responsible for them when having to justify themselves. More than helping a group converge towards consensus, the deliberative process proposed here, combining OBJ and DISC, is about building meaning. It supports participants in making decisions they can relate to, that is, more robust judgements they will stick to, feel confident in, feel responsible for, and be able to justify to others.
Figure S6. Distribution of types of changes in responses per scenario.
3.6. Participants’ ratings of objections’ strength
 | Grade (/5) | Criterion
Arg1 | 2.44 (1.73) | Sex
Arg2 | 2.42 (1.67) | Body size
Arg3 | 2.62 (1.72) | Social status (homeless)
Arg4 | 2.99 (1.42) | Age
Arg5 | 2.97 (1.51) | Pedestrians/passengers
Arg6 | 2.71 (1.35) | Number
Arg7 | 2.22 (1.56) | Pets vs. humans
Table S7. Objections' strength as graded by participants from 0 to 5 (standard deviations in parentheses).
Supplementary materials
As shown below, all text in black was displayed to participants. Text in blue consists of comments we add to explain the procedure.
Experiment 1
The first experiment was composed of three sets of questions: an Introduction set, a Main set, and a Conclusion set.
All participants completed them in the same order:
- Step 1: Introduction set
- Step 2: Main Set
- Step 3: Main Set
- Step 4: Conclusion set
In the introduction set, [Instruction*] was replaced by:
“Imagine that you are the designer of the self-driving vehicle. The vehicle is ready to be commercialised and
the last thing you need to do before this is to set up its ethical settings by selecting options 1 or 2 for each
one of the following scenarios.” in condition A
“Imagine that the autonomous driving industry has made giant progress and that self-driving cars are ready
to be commercialised. The government leads a national public consultation to determine which ethical
settings should be selected in case of unavoidable fatalities. As a citizen, you take part of this public survey,
selecting options 1 or 2 for each one of the following scenarios.” in condition B
“Imagine that you are a policy-maker preparing the regulation for the self-driving industry. Autonomous
vehicles are ready to be commercialised and you need to set up their ethical settings by selecting options 1
or 2 for each one of the following scenarios.” in condition C
In the Main set, [Instruction**] was replaced by:
“Please answer the following questions as quickly as possible, giving your own opinion.” for all participants in step 2, and by one of the following, depending on participants' group, in step 3:
“We will now ask you to answer another round of similar questions, but this time we want you to take more time to think about your answer. You will only be able to submit it after 10 seconds.”
“We will now ask you to answer another round of similar questions, but this time we want you to take more time to think about your answer. You will only be able to submit it after 20 seconds.”
“We will now ask you to answer another round of similar questions, but this time we want you to take more time to think about your answer. You will only be able to submit it after 30 seconds.”
“We will now ask you to answer another round of similar questions, but this time we want you to take more time to think about your answer. You will only be able to submit it after 40 seconds.”
In both steps 2 and 3, and for each scenario, the question “What should the self-driving car do?” was preceded by “as the car designer” in condition A, “as part of the national consultation” in condition B, and “as a policy-maker” in condition C.