Female by Default? Exploring the Eect of Voice Assistant
Gender and Pitch on Trait and Trust Aribution
Suzanne Tolmeijer (tolmeijer@ifi.uzh.ch), Department of Informatics, University of Zurich, Zurich, Switzerland
Naim Zierau (naim.zierau@unisg.ch), IWI-HSG, University of St. Gallen, St. Gallen, Switzerland
Andreas Janson (andreas.janson@unisg.ch), IWI-HSG, University of St. Gallen, St. Gallen, Switzerland
Jalil Wahdatehagh (jalil@1001.digital), 1001.digital, Valley, Germany
Jan Marco Leimeister (janmarco.leimeister@unisg.ch), IWI-HSG, University of St. Gallen, St. Gallen, Switzerland
Abraham Bernstein (bernstein@ifi.uzh.ch), Department of Informatics, University of Zurich, Zurich, Switzerland
ABSTRACT
Gendered voice based on pitch is a prevalent design element in many contemporary Voice Assistants (VAs) but has been shown to strengthen harmful stereotypes. Interestingly, there is a dearth of research that systematically analyses user perceptions of different voice genders in VAs. This study investigates gender-stereotyping across two different tasks by analyzing the influence of pitch (low, high) and gender (women, men) on stereotypical trait ascription and trust formation in an exploratory online experiment with 234 participants. Additionally, we deploy a gender-ambiguous voice to compare against gendered voices. Our findings indicate that implicit stereotyping occurs for VAs. Moreover, we show that there are no significant differences in trust formed towards a gender-ambiguous voice versus gendered voices, which highlights the potential of gender-ambiguous voices for commercial usage.
CCS CONCEPTS
• Human-centered computing → User interface design; Empirical studies in HCI; Sound-based input / output; Interaction design theory, concepts and paradigms; • Social and professional topics → Gender.
KEYWORDS
Voice Assistants, Gender Stereotypes, Voice Design, Trust, Gender-
Ambiguous Voice
ACM Reference Format:
Suzanne Tolmeijer, Naim Zierau, Andreas Janson, Jalil Wahdatehagh, Jan Marco Leimeister, and Abraham Bernstein. 2021. Female by Default? Exploring the Effect of Voice Assistant Gender and Pitch on Trait and Trust Attribution. In CHI Conference on Human Factors in Computing Systems Extended Abstracts (CHI '21 Extended Abstracts), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3411763.3451623
1 INTRODUCTION
Voice Assistants (VAs), such as Google Assistant or Amazon Alexa, promise to change the ways people perform tasks, use services, and interact with organizations. The interactions of many users with these agents, however, have yielded mixed results, indicating high failure rates [8]. Hence, there has been a growing interest in voice-based interactions in both research and practice [25]. Besides the content of the interaction itself (i.e., 'what is said?'), an element that is central to the interaction design of VAs is the voice (i.e., 'how is it said?') [27]. In this regard, a prevalent trend is the use of female over male voices, as companies cite anecdotal evidence suggesting that female voices are favored by most users. Thus, most leading VAs are exclusively female or female by default [18]. In fact, according to a recent study, 77% of all virtual assistants manifested gender-specific cues that can be classified as feminine [7]. However, a recent report by UNESCO stresses that the gendered design of most VAs could solidify harmful gender stereotypes [29]. For instance, since people become used to interacting with those agents in a commanding tone, humans might also (subconsciously) mirror this behavior in their everyday conversations with women [4]. One potential solution to this issue may lie in the use of gender-ambiguous¹ voices [29]. Gender-ambiguous voice assistants may not only help to combat hurtful gender stereotypes, but also provide more inclusive design tools to represent voices outside the binary gender identities. However, while studies on interaction design with VAs are growing (e.g., [15, 22, 23, 32]), there is a lack of empirical insights on the perceptual effects of gendered (and gender-ambiguous) voices based on para-lingual cues such as pitch. In particular, to the best of our knowledge, no study has empirically tested user perceptions and the technical feasibility of deploying gender-ambiguous voices for VA design.

¹In accordance with Sutton [27], we use the term 'gender-ambiguous' throughout this paper rather than calling a voice 'genderless': many cues in the sound and content of VA speech can elicit gender ascription, even when the pitch is gender-neutral.
To address this shortcoming, we conducted an exploratory study to empirically analyze the effects of (ambiguously) gendered voices
on trait and trust attribution across different task contexts. Specifically, we comparatively analyze user perceptions with regard to pitch (low, high) and gender (female, male), as well as a gender-ambiguous voice we constructed. According to the literature, the pitch of a voice is one of the most important factors in the attribution of gender [19]. To that end, we developed a voice interface for online experiments. On this basis, we implemented two task scenarios: one where users were asked to book a flight with a VA (assistance scenario) and one where users were surveyed by a VA on their financial situation (compliance scenario). We conducted a 5x2 online experiment with 234 participants on Prolific: five voices (male-low, male-high, gender-ambiguous, female-low, and female-high) were set against two task settings (assistance and compliance). Our results show implicit stereotype activation with regard to the (lack of) trait attribution towards the different VA voices. Task context and the gender of the participant both have an effect on perceived traits and reported trust. Finally, our study gives a first indication that a gender-ambiguous voice for VAs could be a viable alternative to gendered voices and warrants further investigation.
2 RELATED WORK
Our research is motivated by sociophonetics and social response theory [17]. Every person has a unique voice based on a complex interplay of anatomical and psychological traits and emotional states that together determine how people express themselves verbally and, in turn, how they are perceived by others [10, 28]. Sociophonetics explores how different speech patterns vary across social categories and the associated socio-cultural assumptions they carry. It is well established that people make inferences about others based on the sound of their voice [19]. Voice carries para-lingual cues that allow people to make assumptions about a person's background and, based on this, to apply social stereotypes. Speakers use subtle para-lingual cues, mostly unconsciously, to evoke certain images in listeners [28]. Those cues can be seen as a flexible resource that people (and VAs) can use to signal different social traits and attitudes [28]. Sometimes, voice informs stereotypes about how specific groups of people speak. One obvious group is the gender of the speaker. The most prominent gender-dependent feature of voice is its pitch: the longer and thicker vocal cords of men produce a lower pitch than those of women, a distinction that is easily perceived by listeners [19].
Based on the Computers As Social Actors (CASA) paradigm [17], initial research suggests that, when applied to technology, gender-specific voice characteristics may evoke stereotypical trait inferences [13]. While this does not always happen consciously, it has been shown to occur on a subconscious level [14]. For instance, Pak et al. [21] showed that users apply gender stereotypes when ascribing trustworthiness to a virtual agent (i.e., the authors found that users trusted a male virtual doctor more than a female one). For VAs, we find similar results. Initial findings suggest that people find it easier to process stereotypical voices, i.e., a warm, gentle female voice and an assertive, forceful male voice [26]. Specifically, it was shown that a machine's synthetic voice pitch can activate gender stereotyping in users. For instance, Nass et al. demonstrated that participants not only attributed gender to computers that communicated in a low- versus a high-pitched synthetic voice, but also applied gender-schematic judgments to the "male" versus the "female" computer [16]. More recently, Yu et al. found that participants were more likely to disclose personal information to a (lower-pitched) male voice than to a (higher-pitched) female voice of a virtual assistant [31]. However, research that systematically analyzes trait and trust attribution based on different voice genders and pitches for VAs is scarce, despite its paramount role in VA design.
Another limitation to our understanding of voice pitch perception is that only male and female VA voices have been explored, despite calls to research gender-ambiguous voices [28, 29]. There is very little literature available on a gender-neutral voice pitch, except for references to 'Q', a voice that was recently created to be used in VAs to circumvent stereotyping [24]. The creators of 'Q' mention that the fundamental frequency should be between 145 and 175 Hz for the voice to sound gender neutral. However, they indicate that gender is more than just pitch: tone and harmonics (e.g., the sound of vowels) also influence gender perception [5]. As the term gender-ambiguous indicates, voice cannot be regarded as binary [1, 27]. A brain activity study by Junger et al. found that people show an increased brain response to gender-ambiguous voices, and that opposite-gender voices cause stronger activation in the fronto-temporal neural network [11]. While this difference in neural perception has been shown, the difference in user perception for VAs has not been investigated. The use of gender-ambiguous voices, if proven not to have a negative impact on user trust and experience, can be a viable alternative to gendered voices and create a more inclusive environment for non-binary voices.
3 METHOD
In order to investigate the perceptual effects of voice pitch, we conducted a 5x2 between-subjects experiment that manipulated i) the VA's voice gender based on pitch and ii) the task context the VA was deployed in. Dependent variable measures included trait ascription and reported trust in the VA. Specifically, in this exploratory study, we investigated the following research questions:
RQ1: How does voice gender based on pitch affect trait ascription?
RQ2: How does voice gender based on pitch affect user trust?
RQ3: How does the task context influence the way voice gender based on pitch affects trust and trait ascription?
3.1 Experimental Platform and Voice Design
The experiment was executed in a custom-developed online voice assistant interface. By keeping the interface constant and as clean as possible, the focus remains on the voice of the voice assistant (see Figure 1), which allows us to investigate trait attribution based on voice characteristics. The interaction with the VA follows a simple turn-taking mechanism, where the VA guides the unfolding conversation with the user. After each utterance of the VA, a 'record' button appears that sends the user's response to the server. To control for diversity in the conversation, VA responses were prerecorded and the conversation path was delimited to focus on the task at hand.
Figure 1: Online Voice Assistant Interface

The prerecorded answers of the VA use the state of the art in text-to-speech generation to produce our voice responses: Google WaveNet [20]. To account for both gender and pitch differences, five American English voices were selected: a high- and a low-pitched female voice (based on voice en-US-Wavenet-F), a high- and a low-pitched male voice (based on voice en-US-Wavenet-B), and a gender-ambiguous voice (based on voice en-US-Wavenet-E). While gendered voice generators are readily available, there is not yet a gender-ambiguous text-to-speech generator: Google's text-to-speech generator lists it as an option that is not yet supported.² The only available gender-ambiguous generated voice is a carefully crafted voice clip called 'Q', created to fight gender stereotypes in voice assistants [24], but 'Q' offers no text-to-speech generation. In order to create a voice as close to gender-ambiguous as possible, we pretested male voices with their pitch shifted up and female voices with their pitch shifted down, to identify a voice that classifies as gender-ambiguous. In this regard, gender-ambiguous refers to a voice that falls into both spectrums, meaning that different people would assign different genders to it based on prior mental models. Research on third-gender associations has shown that people typically assign a gender to a voice, even when they cannot do so intuitively [27]. To account for this tendency, we included a survey measure asking respondents to identify the gender of the voice assistant through three categories: (1) female, (2) male, and (3) unsure. We used this as a control measure in our models. Eleven manipulated voices based on different Google WaveNet voices were pretested by 52 participants on Prolific (47% female, average age 45, ranging from 27 to 74). The voice receiving the most even split between assigned genders (58% male and 42% female) was voice en-US-Wavenet-F shifted down by three semitones. The selected voices can be found in Table 1.

²When last checked by the authors on December 11th, 2020: https://cloud.google.com/text-to-speech/docs/reference/rest/v1/SsmlVoiceGender
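For readers who want to reproduce this kind of voice manipulation, the sketch below shows how pitch-shifted WaveNet voices can be requested from the Google Cloud Text-to-Speech API, which exposes a semitone-based pitch parameter. It is a minimal illustration based on the voice names and shifts in Table 1, not the authors' actual pipeline; the sample utterance and output file names are our own placeholders.

```python
# pip install google-cloud-texttospeech
from google.cloud import texttospeech


def synthesize(text: str, voice_name: str, pitch_semitones: float, out_path: str) -> None:
    """Synthesize `text` with the given WaveNet voice, pitch-shifted by semitones."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US", name=voice_name),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,
            pitch=pitch_semitones,  # shift in semitones (API accepts roughly -20.0 to +20.0)
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)


# The four TTS-shifted voices from Table 1; the gender-ambiguous voice (A) was
# produced with an external pitch shifter instead, so it is not listed here.
voices = {
    "FH": ("en-US-Wavenet-F", +2.0),
    "FL": ("en-US-Wavenet-F", -6.0),
    "MH": ("en-US-Wavenet-B", +2.0),
    "ML": ("en-US-Wavenet-B", -6.0),
}
for label, (name, shift) in voices.items():
    # Hypothetical prompt, for illustration only.
    synthesize("Which flight would you like to book?", name, shift, f"{label}.wav")
```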
3.2 Experiment Procedure
The experiment consisted of three phases: 1) randomization, 2) experimental task, and 3) post-test. Randomization and post-test were constant for all groups. Two different experimental task types were used: an assistance task and a compliance task. These tasks are inspired by classical gender stereotypes: women are considered to be better in an assistant role, while men are more likely to be seen as leaders [9, 12]. Additionally, they are realistic VA tasks, as both customer surveys [30] and assistance tasks [18] are currently used in VAs. The assistance task involves booking a flight. The participant is given details for a specific flight they want to book, and the VA asks them questions to find and book the right flight for them. The compliance task focuses on personal questions asked in the context of a customer survey. Participants are asked to answer the questions, but are told it is possible to skip a question if they prefer not to answer. An example of task interactions can be found in Figure 2.
3.3 Participants
Participants were recruited on the crowdsourcing platform Prolific.³ While complying with academic and Prolific's standards on data collection, we set the following preconditions: 1) US nationality, 2) a 75%+ approval rate, 3) 10+ previous submissions, and 4) not having been in the pretest sample. Requirement 1) was implemented to control for a language/culture barrier, as the selected voices speak US English. Requirements 2) and 3) were applied to ensure some quality control in our sample. Requirement 4) excluded priming or bias stemming from the pretest. Initially, 345 people participated. We excluded participants who did not complete the entire task or who failed the attention test. After data cleaning, we were left with 234 participants (96 male). The average age was 33 years, ranging from 19 to 74.

³https://www.prolific.co/
3.4 Measurements and Analysis
The assignment of traits was measured by asking participants about the presence of 24 traits in the VA, based on male and female stereotypes [2, 6]. Each trait was assessed using a 5-point Likert scale, ranging from positive trait ascription (i.e., 5 indicates 'strongly agree' that the VA had this trait) to negative trait ascription (i.e., 1 indicates 'strongly disagree' that the VA had this trait). Female traits were averaged to indicate female stereotype activation (α = 0.91); the mean of the male traits was used to indicate male stereotype activation (α = 0.87). Perceived trust was measured using a validated questionnaire about the perceived competence, benevolence, and integrity of the VA [3] (α = 0.93).
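To make the aggregation step concrete, the following sketch averages the stereotype items per participant and computes Cronbach's alpha from its textbook variance formula. The column names are hypothetical placeholders; the paper does not specify the tooling used.

```python
import pandas as pd


def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)


# Hypothetical wide-format data: one row per participant, one 1-5 Likert column per trait.
df = pd.read_csv("responses.csv")
female_items = df[["delicate", "sensitive", "affectionate"]]   # placeholder item subsets
male_items = df[["dominant", "assertive", "authoritative"]]

df["female_stereotype"] = female_items.mean(axis=1)  # cf. reported alpha = 0.91
df["male_stereotype"] = male_items.mean(axis=1)      # cf. reported alpha = 0.87
print(cronbach_alpha(female_items), cronbach_alpha(male_items))
```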
4 RESULTS
Trait ascription scores were not normally distributed: a Shapiro-Wilk test resulted in p < 0.001 for all twenty-four traits. This is possibly due to the nature of the Likert scale for trait ascription: '1' indicates 'strongly disagree' that the VA has this trait, '3' indicates the participant 'neither agrees nor disagrees', while '5' reflects 'strong agreement' that the VA has this trait. To test whether a trait was significantly assigned in a positive way (i.e., significantly higher than the neutral answer of '3') or a negative way (i.e., significantly lower than the neutral answer of '3'), we used the non-parametric Wilcoxon signed-rank test to compare our sample against the neutral value '3'.
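This comparison against the scale midpoint can be expressed as a one-sample signed-rank test on the differences from '3'; a minimal sketch with made-up ratings, assuming SciPy:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical 1-5 Likert ratings of one trait for one voice condition.
ratings = np.array([2, 3, 2, 1, 3, 2, 4, 2, 3, 2])

# Wilcoxon signed-rank test against the neutral midpoint '3'; exact '3' answers
# produce zero differences and are discarded by the default zero_method.
stat, p = wilcoxon(ratings - 3)
print(f"W = {stat}, p = {p:.3f}")  # significant with a mean below 3 => negative trait ascription
```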
Trust scores were not normally distributed either when comparing the different voices: a Shapiro-Wilk test showed that the female high voice data was not normally distributed (W = 0.95, p = 0.04). As such, a Kruskal-Wallis test was used rather than an ANOVA. In the case of two-group comparisons, Mann-Whitney U tests were executed. The results of all tests can be found in the remainder of this section.
Table 1: Original Google English US WaveNet voices are shifted by a number of semitones, either using Google's text-to-speech API (TTS) or the online generator https://onlinetonegenerator.com/ (Gen).

Voice type                        | Female high (FH) | Female low (FL) | Gender-ambiguous (A) | Male high (MH) | Male low (ML)
Original English US WaveNet voice | F                | F               | E                    | B              | B
Method                            | TTS              | TTS             | Gen                  | TTS            | TTS
Semitones pitch shift             | +2               | -6              | -3                   | +2             | -6
Average pitch                     | 235 Hz           | 150 Hz          | 141 Hz               | 162 Hz         | 106 Hz
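As a reading aid for Table 1 (a standard equal-temperament relation, not a claim from the paper): a shift of s semitones scales a voice's fundamental frequency f_0 multiplicatively,

    f(s) = f_0 · 2^(s/12)

Moving from the +2 to the -6 variant of the same base voice thus spans 8 semitones, a factor of 2^(-8/12) ≈ 0.63; applied to the 235 Hz female-high voice this predicts about 148 Hz, close to the 150 Hz reported for the female-low voice.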
Figure 2: Example excerpt of the compliance task
4.1 Trait Ascription
While we found no significant activation of combined average male and female traits, results did show significant negative trait ascription. Specifically, over both task types, participants indicated that, on average, some VAs did not have male and female traits. When taking the average over all stereotypically male traits, only the male low voice was not negatively marked as stereotypically male (Z = 341, p = 0.175). All other voices were significantly negatively associated with a male stereotype (MH: Z = 268, p = 0.006; FL: Z = 461, p = 0.029; FH: Z = 198, p < 0.001; A: Z = 186, p = 0.006). Female traits were only negatively assigned to low voices: the male low voice (Z = 198, p = 0.006) and female low voice (Z = 532, p = 0.034) were not considered to have stereotypically female traits. Other voices did not have negative stereotype ascription (MH: Z = 456, p = 0.176; A: Z = 378, p = 0.543; FH: Z = 743, p = 0.903). The gender of the participant did not influence negative stereotype assignment. The only voice that came near to activating a perceived stereotype was the female high voice: it was almost significant for activating a female stereotype (Z = 743, p = 0.096).
Additionally, we tested for group differences with regard to the individual perceived traits of the VA voices. Again, we added participant gender as a covariate to control for gender-specific differences in individual trait attribution. All voices were experienced as organised, confident, cooperative, and polite. While the low voices were overall considered to be determined (ML: Z = 257, p = 0.039; FL: Z = 759, p = 0.034), only the low male voice was not experienced as friendly (Z = 232, p = 0.250). Curiously, a participant gender difference occurred in trait ascription to the gender-ambiguous voice: while all participants thought the voice was friendly and polite, women rated the ambiguous voice as significantly more friendly (Mdn. women: 5, Mdn. men: 4, U = 84.5, p = 0.037) and more polite (Mdn. women: 5, Mdn. men: 4, U = 65.5, p = 0.006) than male participants did. Two traits had no significant assignment of any kind for any voice pitch: assertive and affable. Interestingly, for many traits the trait assignment was negative: people responded significantly in the (strongly) disagree category. No voice was considered to be aggressive, hard-hearted, tough, affectionate, sentimental, or romantic. However, implicit stereotype activation can be found in the lack of negative trait ascription. For example, the low male voice was the only voice that was not considered not to be authoritative (Z = 230, p = 0.356) or dominant (Z = 136, p = 0.057). The female high voice, together with the gender-ambiguous voice, were the only voices not negatively assigned typically female traits such as delicate, family-oriented, or sensitive. The difference between participant genders is clearer in negative trait ascription, as women assign lower values than men in many cases.

A summary of trait ascription can be found in Table 2, which shows trait ascription scores for all traits that were not uniformly assigned across all voices.
4.2 Trust
Our results reveal no significant differences between the conditions when comparing reported trust in the VAs (χ²(4, 234) = 1.9958, p = 0.736). Average trust scores were comparable at 4.525 (ML), 4.480 (MH), 4.710 (FL), 4.636 (FH), and 4.824 (A). However, this does show that the gender-ambiguous voice is not trusted less than gendered voices.
Table 2: Selected average trait ascription scores per voice. '1' implies strong disagreement that the VA has this trait, '3' indicates the trait is not assigned, '5' shows strong agreement that the VA has this trait. Significant differences from lack of trait ascription, tested by Wilcoxon signed-rank test, are shown as follows: *p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001. The five individual voices are male-low (ML), male-high (MH), gender-ambiguous (A), female-low (FL), and female-high (FH).

Trait             | ML      | MH      | A       | FL      | FH
Authoritative     | 2.95    | 2.60*   | 2.18*** | 2.77**  | 2.51**
Speaks their mind | 2.76    | 2.49**  | 2.74    | 2.46**  | 2.43**
Determined        | 3.21*   | 3.23    | 3.31    | 3.37*   | 3.25
Cold              | 2.98    | 2.77    | 2.49*   | 2.72*   | 2.27***
Empathetic        | 2.19*** | 2.74    | 2.72    | 2.44*** | 2.90
Delicate          | 2.29*** | 2.47*** | 2.87    | 2.61**  | 3.12
Friendly          | 3.15    | 3.64*** | 3.59*   | 3.68*** | 3.98***
Sincere           | 2.98    | 3.13    | 3.49*   | 3.14    | 3.47*
Dominant          | 2.73    | 2.51**  | 2.13*** | 2.32*** | 2.16***
Leadership skills | 2.51**  | 2.74    | 2.69    | 2.53**  | 2.69*
Family-oriented   | 2.54*   | 2.66*   | 2.87    | 2.63**  | 2.76
Sensitive         | 2.49**  | 2.57**  | 2.79    | 2.46*** | 2.84
Moreover, there is a significant difference in trust scores reported by male and female participants: female participants trust the gender-ambiguous voice more than male participants do (Mdn. women: 5.45, Mdn. men: 4.595, U = 84.5, p = 0.048).
4.3 The Role of Task Context
In order to answer RQ3, we added task context as a variable in our analysis. For average male and female traits, context dependence is only seen for the average male traits: the male low (Mdn. assistance task (AT): 2.665, Mdn. compliance task (CT): 3.08, U = 119, p = 0.028), male high (Mdn. AT: 2.42, Mdn. CT: 3.0, U = 177, p = 0.020), and gender-ambiguous voices (Mdn. AT: 2.08, Mdn. CT: 2.83, U = 113, p = 0.021) score higher on average male traits in the compliance task than in the assistance task. Additionally, reported trust was stable over both tasks for the male voices, while it was task dependent for the female low (Mdn. AT: 5.225, Mdn. CT: 4.18, U = 250, p = 0.007), female high (Mdn. AT: 5.18, Mdn. CT: 4.045, U = 177, p = 0.003), and gender-ambiguous voices (Mdn. AT: 5.045, Mdn. CT: 4.18, U = 122, p = 0.038). In fact, all voices scored higher on trust in the assistance task compared to the compliance task.
5 DISCUSSION
This study found evidence for the influence of voice gender and pitch on (stereotypical) trait attribution. While no positive stereotype activation was found, negative stereotypical trait ascription, and the lack thereof, showed implicit activation of gender stereotyping. For example, while the low male voice was not explicitly considered to be stereotypically male, it was the only voice that was not perceived not to be typically male, and only the low voices (both male and female) were refuted to be stereotypically female. As for trust attribution, we did not identify direct effects of voice pitch. However, a trend showed higher trust in the gender-ambiguous voice among female participants. Finally, task context influences both stereotype activation (for male traits) and trust (for female and gender-ambiguous voices).

With regard to trait attribution, our findings show mixed results with respect to the CASA paradigm. Negative trait ascription was prevalent, which can be due either to a lack of a perceived trait or to a lack of viewing the voice as a social actor altogether. While active stereotype activation was missing, the absence of stereotype negation seems to indicate an implicit gender bias. The fact that male trait ascription and trust in female (and gender-ambiguous) voices were context dependent indicates that voice pitch and voice gender do subtly influence perception. With regard to trust formation, our results do not seem in line with prior research on the effect of pitch in inter-personal interactions, which indicates that people generally trust people with high-pitched voices more [19]. This may indicate that some of those mechanisms are weaker in human-computer interaction. However, it has to be noted that the means, especially for the particularly low- and high-pitched voices, reveal a trend towards higher-pitched voices being trusted more. As our sample size is comparatively small, those results may become significant with a larger sample.

Task context did have an effect on perception and stereotype activation. Male voices were perceived as more stereotypically male in the more 'male' context of the compliance task. Female voices, on the other hand, were significantly more trusted in the assistance task than in the compliance task. Curiously, both of these effects were present for the gender-ambiguous voice: perceived male traits and trust were context dependent. While it is a positive indication that the gender-ambiguous voice was not assigned one specific gender, it also shows a risk: because the voice does not fit one gender stereotype, it also does not fit one stereotypical response, making it subject to multiple possible responses.
Nevertheless, the gender-ambiguous voice showed no significant trust differences when compared to the gendered voices. This is a promising first result, as there is very little research on the impact of gender-ambiguous voices. The fear that a lack of a mental model and added cognitive load, due to the unrecognizable sex of the voice, negatively influence trust does not seem to be confirmed by our research. More research is needed into different contexts and different pitches to confirm that the gender-ambiguous voice does not have a negative impact on trust compared to gendered voices. Overall, the gender-ambiguous voice was found to be organized, confident, cooperative, and polite, just like the gendered voices. This seems to be an encouraging initial result for the use of gender-ambiguous voices in VAs. The fact that women have higher trust in the gender-ambiguous voice than men warrants further research.

Our study had some limitations that should be pointed out. First, for a quantitative study, our sample size is comparatively small due to the study's exploratory character. Second, the participants were asked to imagine the scenarios to be real-life, which may threaten the external validity of our results. Although our study included actual voice interaction, future work may reexamine the results in a field setting. Third, it should be noted that the results only capture first impressions of the VAs. A longitudinal perspective on trait ascription and trust formation should be included in future work. Furthermore, three different Google WaveNet voices were used to create the voices used in our experiment. We did not control for other voice characteristics such as timbre and tone, which could have influenced our results.

Additionally, a limitation lies in the created gender-ambiguous voice. As indicated, voice gender does not come only from pitch, but also from language usage and intonation. We controlled for this as much as possible by testing different pitch shifts for different voices, but all voices were originally gendered. The gender-neutral voice clip called 'Q' [24] was recorded using the voices of people who ascribe to neither the male nor the female gender. The lack of gender-ambiguous voice generators or text-to-speech tools hampers research into the possibilities of such a voice.
6 CONTRIBUTION AND FUTURE WORK
Our study makes several theoretical and practical contributions to prior work in HCI research on the use of VAs in commercial settings, the role of para-lingual cues in trait attribution, and the effective design of a VA's 'personality'. To the best of our knowledge, this is the first line of systematic research demonstrating how variations in voice pitch induce gender-specific trait attribution towards the agent, how such attributions affect important perceptual downstream consequences such as trust, and how such effects are impacted by the task context. Moreover, we develop and comparatively evaluate a gender-ambiguous voice, with promising first results.

Our findings show that stereotype activation is not as clear-cut as one might expect, but appears as a lack of stereotype negation. This, combined with the influence of the participants' gender and the task context, calls for a more in-depth examination of stereotype activation and perpetuation in VAs. Additionally, gender-ambiguous voices are a promising avenue of research for VA design, in striving for more inclusive design. However, there is currently a lack of tools providing gender-ambiguous voice generation. We call upon researchers and industry alike to focus on creating gender-ambiguous voice tools, to enable research into, and provision of, more inclusive and stereotype-avoiding voices for VAs.
ACKNOWLEDGMENTS
This work is supported by the Swiss National Science Foundation, Grant 192718. Andreas Janson acknowledges support from the basic research fund of the University of St. Gallen. We thank Prof. Leah Ruppanner for her valuable feedback.
REFERENCES
[1] Alice Baird, Stina Hasse Jørgensen, Emilia Parada-Cabaleiro, Nicholas Cummins, Simone Hantke, and Björn Schuller. 2018. The perception of vocal traits in synthesized voices: age, gender, and human likeness. Journal of the Audio Engineering Society 66, 4 (2018), 277–285.
[2] Sandra L. Bem. 1981. Bem Sex-Role Inventory: Professional Manual. Consulting Psychologists Press, Palo Alto, CA, USA.
[3] Izak Benbasat and Weiquan Wang. 2005. Trust in and adoption of online recommendation agents. Journal of the Association for Information Systems 6, 3 (2005), 4.
[4] Hyojin Chin, Lebogang Wame Molefi, and Mun Yong Yi. 2020. Empathy Is All You Need: How a Conversational Agent Should Respond to Verbal Abuse. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–13.
[5] Christy L. Dennison. 2006. The effect of gender stereotypes in language on attitudes toward speakers. Ph.D. Dissertation. University of Pittsburgh.
[6] Friederike Eyssel and Frank Hegel. 2012. (S)he's got the look: Gender stereotyping of robots. Journal of Applied Social Psychology 42, 9 (2012), 2213–2230.
[7] Jasper Feine, Ulrich Gnewuch, Stefan Morana, and Alexander Maedche. 2019. A taxonomy of social cues for conversational agents. International Journal of Human-Computer Studies 132 (2019), 138–161.
[8] Márcio Fuckner, Jean-Paul Barthès, and Edson Emílio Scalabrin. 2014. Using a personal assistant for exploiting service interfaces. In Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, Piscataway, NJ, USA, 89–94.
[9] Eva Gustavsson. 2005. Virtual servants: Stereotyping female front-office employees on the Internet. Gender, Work & Organization 12, 5 (2005), 400–419.
[10] Christian Hildebrand, Fotis Efthymiou, Francesc Busquet, William H. Hampton, Donna L. Hoffman, and Thomas P. Novak. 2020. Voice analytics in business research: Conceptual foundations, acoustic feature extraction, and applications. Journal of Business Research 121 (2020), 364–374.
[11] Jessica Junger, Katharina Pauly, Sabine Bröhr, Peter Birkholz, Christiane Neuschaefer-Rube, Christian Kohler, Frank Schneider, Birgit Derntl, and Ute Habel. 2013. Sex matters: Neural correlates of voice gender perception. NeuroImage 79 (2013), 275–287.
[12] Anne M. Koenig, Alice H. Eagly, Abigail A. Mitchell, and Tiina Ristikari. 2011. Are leader stereotypes masculine? A meta-analysis of three research paradigms. Psychological Bulletin 137, 4 (2011), 616.
[13] Eun Ju Lee, Clifford Nass, and Scott Brave. 2000. Can computer-generated speech have gender? An experimental test of gender stereotype. In CHI '00 Extended Abstracts on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 289–290.
[14] Wade J. Mitchell, Chin-Chang Ho, Himalaya Patel, and Karl F. MacDorman. 2011. Does social desirability bias favor humans? Explicit–implicit evaluations of synthesized speech support a new HCI model of impression management. Computers in Human Behavior 27, 1 (2011), 402–412.
[15] Chelsea Myers, Anushay Furqan, Jessica Nebolsky, Karina Caro, and Jichen Zhu. 2018. Patterns for how users overcome obstacles in voice user interfaces. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–7.
[16] Clifford Nass, Youngme Moon, and Nancy Green. 1997. Are machines gender neutral? Gender-stereotypic responses to computers with voices. Journal of Applied Social Psychology 27, 10 (1997), 864–876.
[17] Clifford Nass, Jonathan Steuer, and Ellen R. Tauber. 1994. Computers are social actors. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 72–78.
[18] Nora Ni Loideain and Rachel Adams. 2018. From Alexa to Siri and the GDPR: The gendering of virtual personal assistants and the role of EU data protection law.
[19] Jillian J. M. O'Connor and Pat Barclay. 2017. The influence of voice pitch on perceptions of trustworthiness across social contexts. Evolution and Human Behavior 38, 4 (2017), 506–512.
[20] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. 2018. Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning. PMLR, Stockholm, Sweden, 3918–3926.
[21] Richard Pak, Anne Collins McLaughlin, and Brock Bass. 2014. A multi-level analysis of the effects of age and gender stereotypes on trust in anthropomorphic technology by younger and older adults. Ergonomics 57, 9 (2014), 1277–1289.
[22] Emmi Parviainen and Marie Louise Juul Søndergaard. 2020. Experiential Qualities of Whispering with Voice Assistants. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–13.
[23] Amanda Purington, Jessie G. Taft, Shruti Sannon, Natalya N. Bazarova, and Samuel Hardman Taylor. 2017. "Alexa is my new BFF": Social Roles, User Satisfaction, and Personification of the Amazon Echo. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 2853–2859.
[24] Q. 2019. Meet Q: The First Genderless Voice (full speech).
[25] Giuseppe Riccardi. 2014. Towards healthcare personal agents. In Proceedings of the 2014 Workshop on Roadmapping the Future of Multimodal Interaction Research including Business Opportunities and Challenges. Association for Computing Machinery, New York, NY, USA, 53–56.
[26] Elizabeth A. Strand. 2000. Gender stereotype effects in speech processing. Ph.D. Dissertation. The Ohio State University.
[27] Selina Jeanne Sutton. 2020. Gender Ambiguous, not Genderless: Designing Gender in Voice User Interfaces (VUIs) with Sensitivity. In Proceedings of the 2nd Conference on Conversational User Interfaces. Association for Computing Machinery, New York, NY, USA, 1–8.
[28] Selina Jeanne Sutton, Paul Foulkes, David Kirk, and Shaun Lawson. 2019. Voice as a design material: Sociophonetic inspired design strategies in Human-Computer Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–14.
[29] UNESCO. 2019. I'd blush if I could: Closing gender divides in digital skills through education. Technical Report. UNESCO.
[30] Maria Vernuccio, Michela Patrizi, and Alberto Pastore. 2020. Brand Anthropomorphism and Brand Voice: The Role of the Name-Brand Voice Assistant. In Advances in Digital Marketing and eCommerce. Springer, Switzerland, 31–39.
[31] Qian Yu, Tonya Nguyen, Soravis Prakkamakul, and Niloufar Salehi. 2019. "I Almost Fell in Love with a Machine": Speaking with Computers Affects Self-disclosure. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–6.
[32] Naim Zierau, Edona Elshan, Camillo Visini, and Andreas Janson. 2020. A Review of the Empirical Literature on Conversational Agents and Future Research Directions. In ICIS 2020 Proceedings. Association for Information Systems, Atlanta, GA, USA, 1–17.
Conference Paper
With the popularity of AI-infused systems, conversational agents (CAs) are becoming essential in diverse areas, offering new functionality and convenience, but simultaneously, suffering misuse and verbal abuse. We examine whether conversational agents' response styles under varying abuse types influence those emotions found to mitigate peoples' aggressive behaviors, involving three verbal abuse types (Insult, Threat, Swearing) and three response styles (Avoidance, Empathy, Counterattacking). Ninety-eight participants were assigned to one of the abuse type conditions, interacted with the three spoken (voice-based) CAs in turn, and reported their feelings about guiltiness, anger, and shame after each session. The results show that the agent's response style has a significant effect on user emotions. Participants were less angry and more guilty with the empathy agent than the other two agents. Furthermore , we investigated the current status of commercial CAs' responses to verbal abuse. Our study findings have direct implications for the design of conversational agents.