Embodiment and gender interact in alignment to TTS voices
Michelle Cohn
Department of Linguistics, Phonetics Lab, UC Davis, 1 Shields Avenue, Davis, CA 95616 USA

Patrik Jonell
Division of Speech, Music, & Hearing, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden

Taylor Kim
Department of Linguistics, Phonetics Lab, UC Davis, 1 Shields Avenue, Davis, CA 95616 USA

Jonas Beskow
Division of Speech, Music, & Hearing, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden

Georgia Zellou
Department of Linguistics, Phonetics Lab, UC Davis, 1 Shields Avenue, Davis, CA 95616 USA
Abstract

The current study tests subjects’ vocal alignment toward female
and male text-to-speech (TTS) voices presented via three
systems: Amazon Echo, Nao, and Furhat. These systems vary in
their physical form, ranging from a cylindrical speaker (Echo), to
a small robot (Nao), to a human-like robot bust (Furhat). We test
whether this cline of personification (cylinder < mini robot <
human-like robot bust) predicts patterns of gender-mediated
vocal alignment. In addition to comparing multiple systems, this
study addresses a confound in many prior vocal alignment studies
by using identical voices across the systems. Results show
evidence for a cline of personification toward female TTS voices
by female shadowers (Echo < Nao < Furhat) and a more
categorical effect of device personification for male TTS voices
by male shadowers (Echo < Nao, Furhat). These findings are
discussed in terms of their implications for models of device-
human interaction and theories of computer personification.
Keywords: vocal alignment; embodiment; human-device
interaction; gender; text-to-speech
Introduction

Recent advancements in robotics and conversational AI have
led to the development of more human-like robotic systems,
such as those with expressive facial movements (e.g., Sophia
by Hanson Robotics) and speech synthesis systems that yield
hyper-naturalistic voices (e.g., Amazon Echo). The presence
of and variation across these different systems allows for an
empirical test of aspects of computer personification theories,
such as the Computers are Social Actors (CASA) framework
which proposes that humans apply the social behavior norms
from human-human interaction to their interactions with
technology when they detect a cue of humanity in a digital
system (e.g., Nass, Steuer, and Tauber, 1994). Aspects of
CASA have received support across many empirical studies,
such as showing that people apply politeness norms to
computer interlocutors (Nass et al., 1997). Yet, most studies
do not directly compare human-computer and human-human
interaction (e.g., Bell et al., 2003; Nass et al., 1999, 1994).
Furthermore, no prior studies, to our knowledge, have tested
the extent to which people’s application of human-based
social responses might be gradient. The current study was
designed to fill this gap in the literature by investigating
whether we see differences in application of social behavior
from human-human interaction across systems that vary
gradiently in apparent humanness.
Given that the main type of interaction with modern voice-
activated artificially intelligent (voice-AI) devices is through
speech, a relevant social behavior to examine is vocal
alignment: when speakers adjust their pronunciations of
words to more closely mirror their interlocutors’ speech
patterns. Greater degree of alignment has been argued to
signal social closeness between interlocutors; one theory of
human-human linguistic coordination is Communication
Accommodation Theory (CAT) (Giles et al., 1991; Shepard
et al., 2001): where speakers use degree of convergence to
convey their closeness to an interlocutor – or, conversely,
their divergence to signal greater social distance. For
example, people align more to interlocutors if they find them
attractive (Babel, 2012) and likeable (Chartrand & Bargh, 1999).

Some prior work has explored whether alignment patterns
differ for non-human interlocutors, comparing human-human
and human-computer interaction (for a review, see Branigan
et al., 2010). For example, Branigan and colleagues (2003)
found that participants aligned in syntactic structure (e.g.,
“give the dog a bone” vs. “give a bone to the dog”) to the
same extent in typed interactions between an apparent
‘computer’ and ‘human’ interlocutor.

©2020 The Author(s). This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY).

Yet, in spoken language interaction, differences by interlocutor appear to be
more pronounced: three recent studies found that people
vocally align to both human and voice-AI assistants (Apple’s
Siri, Amazon’s Alexa), but display less alignment to the
assistant voices (Cohn et al., 2019; Raveh et al., 2019; Snyder
et al., 2019). These findings suggest that our transfer of social
behaviors to AI systems in speech interactions is tempered by
their social category as not human. This differentiation of
speech behavior based on humanness is in line with the
theory of Audience Design (Bell, 1984; Clark & Murphy,
1982): whereby interlocutors strategically adapt their
productions for the communicative needs of their listener.
Combining aspects of Audience Design and CASA (Nass et
al., 1997, 1994), we hypothesize that people’s speech
behavior toward voice-AI will vary gradiently as a function
of their personification of the system. We predict that people
will treat more naturalistic systems more like they would a
real human, while less human-like systems will receive less
human-based socially-mediated behaviors.
The present study tests this hypothesis – gradient
application of social behaviors based on personification – by
varying the physical embodiment of voice-AI systems.
Current devices vary in how they embody humanness. For
example, cylindrical smart speakers are now common
household voice-AI systems (e.g., Amazon Echo; Google
Home). Other types of voice-AI systems take on more
human-like forms. For example, the Nao robot has a head,
face, and body, but with clear physical and mechanical
characteristics that make it distinct from a real human (see
Figure 1). Related work has suggested that the Nao could be
considered an intermediate type of robot along a cline of
human-likeness: in a study by Brink and colleagues (2019),
they found that participants found the Nao less uncanny than
a more human-like robot face. In the present study, we
consider a cline of personification from a smart speaker to a
Nao robot to a Furhat robot (Al Moubayed et al., 2012),
another type of robot that is more human-like (see Figure 1).
The Furhat resembles a human bust, with a 3D printed face,
and a back-projected video of a human face. These videos
increase its realism: the eyes blink and make micro-
movements, and the mouth shows appropriate articulation of
speech sounds to match the audio.
Figure 1: Systems used in the present study (L-R): Amazon
Echo, Nao, Furhat (male), Furhat (female).
These three devices – a cylindrical speaker, mini-robot, and
naturalistic bust – vary along a continuum of humanness in
terms of embodying a human form and displaying human-
like features. Our AI personification hypothesis is that simply
varying the humanness of the device will lead to changes in
vocal alignment toward the system. Identical stimuli
recordings will be presented across systems to avoid any
confound that might arise from using different voices. In this
case, we expect an increasing degree of alignment, signaling
greater application of this human-based, socially-mediated
behavior, as the personification of the device increases:
greatest alignment toward the Furhat device, less toward the
Nao, and least toward the Echo speaker.
An alternative hypothesis is that increasing personification
of AI may lead to less alignment – or even divergence – from
the speech produced by the most human-like system (e.g.,
Furhat) as a consequence of the Uncanny Valley effect (Mori
et al., 2012): as non-human, robotic entities display greater
human-like characteristics, there is a tendency for people to
assess them more positively. Yet, there is a point at which
there is a steep drop-off and likeability plunges, a function
known as the ‘uncanny valley’. An example of this is the
response to seeing a nearly human-like face in a non-human
device, triggering feelings of disgust or uneasiness. Speakers’
patterns of vocal alignment are one way to test the uncanny
valley; prior work has shown that speakers show more
convergence toward an interlocutor they feel socially close
with, while they show divergence from those they want to
distance themselves from socially. If a human-like voice is
paired with a hyper-naturalistic robotic entity, this might
trigger an uncanny valley-like effect, causing participants to
align less than they would for a real human.
Gender-mediated alignment toward AI?
In addition to social factors, such as likeability, human-
human vocal alignment has been shown to be mediated by
gender. For example, participants show stronger vocal
alignment toward human male voices than female voices
(Pardo, 2006). However, this gender effect is sometimes
mixed (Pardo et al., 2017), suggesting that idiosyncratic
properties of voices can influence the degree of alignment as
well. Nevertheless, there is some evidence that gender-
mediated alignment patterns may also transfer to human-
device interaction: humans display greater alignment toward
male, relative to female, voices for both human and Apple
Siri model talkers (Cohn et al., 2019; Snyder et al., 2019).
This supports the hypothesis that humans transfer gender-
mediated patterns of vocal alignment from human-human
conversations to their interactions with voice-AI systems,
supporting predictions made by CASA (Nass et al., 1994).
Yet, the properties of the voices themselves (e.g.,
idiosyncrasies of the human speakers, TTS synthesis) pose a
confound between apparent human-likeness and degree of
alignment seen in prior studies.
Based on our proposal that people’s vocal alignment
behavior toward AI will vary as a function of the
personification of the system, we can explore more specific
predictions by varying the apparent gender of the voice. Prior
work reports that male voices are imitated to a greater degree
than female voices, which is realized to a lesser extent for
TTS voices (Cohn et al., 2019; Snyder et al., 2019). Thus, we
predict that this gender-mediated pattern will vary gradiently
as a function of the personification of the AI. More
specifically, we predict gender-mediated patterns of
alignment to be realized to the largest extent for the AI system
with the most human-like physical features (Furhat) and the
least amount of gender-mediated alignment patterns for the
Echo speaker, with the Nao receiving alignment patterns in-
between the others.
Current Study
The present study examines 1) the influence of degree of
human-likeness on extent of vocal alignment toward voice-
AI interlocutors, and 2) how AI personification interacts with
apparent gender on vocal alignment patterns. We conducted
a shadowing experiment for identical sound files produced by
two TTS voices presented across three embodied robotic
systems: a Furhat (Al Moubayed et al., 2012), a Nao robot
(SoftBank Robotics), and an Amazon Echo. In doing so, we
address a limitation of many vocal alignment studies, where
comparisons are made across a small subset of different
model talkers, leading to mixed, and often conflicting
findings about the influence of gender on alignment in the
literature (cf. Pardo et al., 2017), allowing us to specifically
test for the role of system personification, while holding the
voice characteristics constant across model talkers.
Furthermore, using these three systems also serves as a
stronger cue that these are indeed separate interlocutors. The
current study consists of two experiments: the first is a single
word shadowing paradigm, where participants were first
asked to record baseline productions of words and then asked
to repeat (to shadow) words produced by the systems.
Experiment 2 is an AXB similarity rating task where a
separate group of listeners rate the speakers’ baseline and
shadowed productions from Experiment 1, providing a
holistic assessment of vocal alignment (cf. Cohn et al., 2019).
Experiment 1. Shadowing
Subjects. Subjects were 10 native English speakers (mean
age = 35.1 ± 8.5 years old; 5 female, 5 male). Six participants
reported prior use of one or more voice-AI systems (e.g.,
Amazon’s Alexa, Apple’s Siri, Google Assistant, etc.); four
reported no prior interaction with any voice-AI system.
Participants received a $15 Amazon gift card for their time.
Stimuli. Twelve target CVN words were selected for the
current study, taken from related studies of phonetic
alignment by talker gender and humanness (Cohn et al., 2019;
Snyder et al., 2019): bomb, chime, hem, pun, sewn, shone,
shun, tame, vine, wane, wren, yawn. The 12 target words
were generated with two Amazon Polly TTS voices (US-
English): a male voice (“Matthew”) and a female voice
(“Salli”). For the Furhat talkers, two face “textures” were
selected (male texture: “Marty”, female texture: “Fedora”).
These faces were selected as they were the most human-like
available (see Figure 1). For each of the 6 gender/system
pairings, we generated instructions where each model talker
introduced themselves with a different gender-matching
name (e.g., 6 different apparent speakers: “Rebecca”,
“Matthew”, “Mary”, “Michael”, etc.).
Procedure. Subjects completed the experiment in a semi-
soundproof room with a head-mounted microphone. First, a
pre-exposure production of the words was recorded from
each of the subjects in order to get their baseline speech
patterns prior to exposure to the model talkers. Participants
produced each of the 12 target words (repeated 2 times),
reading from a pseudo-randomized list.
Next, participants completed the word-shadowing portion
of the study with the Amazon Echo, Nao, and Furhat (order
counterbalanced across subjects). The same experiment was
designed on all three systems, using the Amazon Alexa Skills
Kit, Nao Choregraphe, and Furhat Blockly, respectively. For
each interlocutor (Echo, Nao, Furhat), subjects completed
two blocks: a male and female speaker (gender ordering was
counterbalanced across subjects). For each subject, the voice
gender ordering (e.g., M-F, M-F, M-F) was consistent across
the interlocutors; this was to avoid consecutively presenting
an identical voice for two different interlocutors (e.g., male
Furhat, male Echo). On a given trial, subjects heard the
system produce a target word (e.g., “wren”) followed by a
3000ms silence, providing the subject time to respond. Note
that the systems’ responses were not contingent on the
subjects’ productions to avoid ASR errors.
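The counterbalancing scheme described above (system order varied across subjects; voice-gender block order held constant within a subject so the same voice is never heard consecutively on two devices) can be sketched as follows. This is an illustrative reconstruction, not the authors' actual experiment code; the function name and the use of permutation cycling are assumptions:

```python
from itertools import permutations

SYSTEMS = ["Echo", "Nao", "Furhat"]

def subject_schedule(subject_id: int, gender_order=("M", "F")):
    """Build one subject's block schedule.

    System order is counterbalanced across subjects by cycling through
    the 6 permutations of the three systems; the voice-gender order is
    fixed across all three systems for a given subject, so an identical
    voice is never presented consecutively on two different devices.
    """
    orders = list(permutations(SYSTEMS))            # 6 possible system orders
    system_order = orders[subject_id % len(orders)]
    # Two blocks (one per voice gender) for each system, in a fixed order.
    return [(system, gender) for system in system_order for gender in gender_order]

# Example: subject 0 runs Echo-M, Echo-F, Nao-M, Nao-F, Furhat-M, Furhat-F.
schedule = subject_schedule(0)
```

Because the gender ordering repeats identically per system (e.g., M-F, M-F, M-F), the two blocks for adjacent systems never place the same voice back to back.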
Finally, subjects completed a short ratings survey about
each talker (randomly presented). For each, they saw the
“name” of the talker, a picture of the face/system, and an
example word recording. Using a sliding scale, they rated
each voice on four dimensions: age, friendliness (0=not
friendly, 100=extremely friendly), human-likeness
(0=machine-like, 100=human-like), and interactiveness
(0=inert, 100=extremely interactive).
Ratings Analysis & Results
We analyzed participants’ ratings of the talkers with separate
mixed effects linear regressions, with main effects of Model
Talker System (a 3-level predictor: Echo, Nao, Furhat) and
Model Talker Gender (a two-level predictor: Female, Male),
and their interaction, and by-Subject random intercepts.
Average ratings for the male and female TTS voices across
the systems are plotted in Figure 2.
First, we observe differences of Model Talker Gender for
age rating: female voices were rated as being younger than
male voices [β=-7.27, t=-9.13, p<0.001]. Additionally, there
was a top-down effect of Model Talker System: voices
presented through the Furhat were rated as being younger
[β=-4.03, t=-3.58, p<0.001]. Yet, this is driven by an
interaction between Model Talker Gender and System, where
the male voice was rated as being younger when presented
through the Furhat device, compared to when it was
presented through the other systems [β=-3.27, t=-2.90,
p<0.01]. For friendliness ratings of the voices, the model
showed only a main effect of Model Talker System: voices
presented through the Furhat were rated as being friendlier
[β=6.53, t=2.97, p<0.01]. For ratings of
human-likeness and interactiveness of the voices (bottom two
panels), there were no significant differences by the Model
Talker System or Model Talker Gender.
Figure 2: Mean ratings of age, friendliness, human-likeness
and interactiveness of the TTS voices when presented across
systems (Echo, Nao, Furhat). Error bars show standard error
of the mean.
Experiment 2. AXB Similarity
In this experiment, we assessed global similarity between the
participants’ baseline productions of the words (produced at
the beginning of the experiment, prior to exposure to the
model talkers) and their shadowed productions for each
model talker from Experiment 1 with an AXB similarity
ratings task (Cohn et al., 2019).
Subjects. 51 native English speakers participated in the AXB
study. Subjects were recruited through a university
Psychology subjects’ pool (37 females, 14 males; mean age
= 19.9 ± 1.7 years old). All subjects received course credit for
their participation.
Stimuli. The stimuli consisted of a baseline and shadowed
production by the 10 speakers who completed Experiment 1.
For each speaker, we selected one of their pre-exposure and
shadowed productions of each word for each of the six model
talkers (i.e., Furhat female, Furhat male, Echo female, etc.).
Due to speakers’ confusions about the TTS production of
‘yawn’, and speaker mispronunciations for several other
words, we only had a full set of pre-exposure and correct
shadowed productions from each model talker of 8 words for
the AXB study: bomb, chime, hem, pun, shun, tame, wane,
Procedure. Participants completed the AXB similarity
ratings experiment in a sound-attenuated booth, wearing
headphones (Sennheiser Pro) and sitting in front of a
computer screen and button box. On a given trial, raters heard
three words separated by a short silence (ISI =1s): a speaker’s
production of a word at baseline (e.g., “A”), the model
talker’s production of that same word (“X”), and the
speaker’s shadowed production of that word for that model
talker (e.g., “B”). Their task was to select the speaker’s token
that sounded most similar to “X” (i.e., the model talker).
Order of pre-exposure and shadowed token (i.e., “A” and
“B”) was balanced within each subject and counterbalanced
across both system and interlocutor gender. In total, raters
completed 480 AXB similarity ratings (10 speakers x 8 words
x 3 systems x 2 genders). Trials were presented in four blocks
of 120 trials; after each block, subjects could take a short
break. In total, the experiment lasted roughly 45 minutes.
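The trial structure above (10 speakers × 8 words × 3 systems × 2 voice genders = 480 trials, with the baseline/shadowed assignment to "A" vs. "B" balanced, presented in four blocks of 120) can be sketched as below. The function name, dictionary fields, and the simple alternation used for A/B balancing are illustrative assumptions, not the authors' actual trial-generation code:

```python
from itertools import product

def build_axb_trials(n_speakers=10, n_words=8,
                     systems=("Echo", "Nao", "Furhat"), genders=("F", "M")):
    """Assemble the AXB trial list: one trial per speaker x word x system x voice gender.

    Each trial plays A (one of the speaker's tokens), X (the model talker's
    token), then B (the speaker's other token); whether the baseline token
    serves as A or B alternates so the two orders are balanced.
    """
    trials = []
    for i, (spk, word, system, gender) in enumerate(
        product(range(n_speakers), range(n_words), systems, genders)
    ):
        baseline_first = (i % 2 == 0)  # balance which token (baseline/shadowed) is "A"
        trials.append({
            "speaker": spk, "word": word, "system": system,
            "model_gender": gender, "baseline_is_A": baseline_first,
        })
    return trials

trials = build_axb_trials()
# 10 speakers x 8 words x 3 systems x 2 genders = 480 trials,
# presented in four blocks of 120 with a break after each block.
blocks = [trials[i:i + 120] for i in range(0, len(trials), 120)]
```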
Analysis. We coded whether the raters selected the shadowed
token as more “similar” to the model talker (=1) or not (=0)
and analyzed their responses with a mixed effects logistic
regression (glmer). Fixed effects included the Model Talker
System (3 levels: Echo, Nao, Furhat), the Model Talker
Gender (2 levels: female, male), and the Shadower Gender (2
levels: female, male), and the interaction between them.
Random effects structure included by-Shadower random
intercepts and by-Shadower random slopes for Model Talker
System, and by-Rater and by-Word random intercepts.
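The response coding and the per-condition summaries plotted in Figure 3 can be sketched as below. Note that the inferential analysis reported here is a mixed effects logistic regression (glmer), which this descriptive aggregation does not reproduce; the toy responses and field names are invented for illustration:

```python
from collections import defaultdict
from math import sqrt

def summarize_alignment(responses):
    """Aggregate coded AXB responses (1 = shadowed token chosen as more
    similar to the model talker, 0 = baseline token chosen) into a mean
    proportion and a standard error per System x Model Talker Gender cell."""
    cells = defaultdict(list)
    for r in responses:
        cells[(r["system"], r["model_gender"])].append(r["chose_shadowed"])
    summary = {}
    for cell, vals in cells.items():
        n = len(vals)
        mean = sum(vals) / n
        # Standard error for a binary outcome (a simplification of the
        # figure's by-subject error bars).
        sem = sqrt(mean * (1 - mean) / n) if n > 1 else 0.0
        summary[cell] = {"mean": mean, "sem": sem, "n": n}
    return summary

# Invented toy responses, for illustration only.
toy = [
    {"system": "Furhat", "model_gender": "F", "chose_shadowed": 1},
    {"system": "Furhat", "model_gender": "F", "chose_shadowed": 1},
    {"system": "Furhat", "model_gender": "F", "chose_shadowed": 0},
    {"system": "Echo", "model_gender": "F", "chose_shadowed": 0},
    {"system": "Echo", "model_gender": "F", "chose_shadowed": 1},
]
stats = summarize_alignment(toy)
```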
The mean AXB similarity ratings for each of the three
systems and two TTS voices are displayed in Figure 3. Overall,
the model revealed several main effects and interactions.
First, there was a main effect of Model Talker Gender:
shadowers showed significantly less alignment to the female
TTS voice (in orange, Figure 3) than the male TTS voice (in
blue, Figure 3) [β=-0.01, t=-4.8, p<0.001].
While there was a trend toward significance for less
alignment toward the Echo (p=0.054), there was not a main
effect of Model Talker System. Similarly, there was not a
main effect of Shadower Gender. However, Model Talker
System and Model Talker Gender did participate in several
significant interactions. First, we observed greater alignment
toward Furhat by female shadowers [β=0.01, t=3.08,
p<0.001]. Yet, this effect was mediated by a three-way
interaction: female shadowers imitated the female Furhat
more [β=0.02, t=4.29, p<0.001] (see Figure 3, left panel).
Figure 3: Mean ratings of perceived degree of vocal
alignment in the AXB similarity ratings task for the three
systems (Echo, Nao, Furhat) and two TTS voices. Error bars
show standard error of the mean.
There was also a three-way interaction for the Echo: female
shadowers showed less alignment toward the Echo with the
female TTS voice than to the male TTS voice [β=-0.01, t=-
3.03, p<0.001]. The releveled model (ref = Echo) showed
only a two-way interaction for the Nao: both male and female
shadowers aligned to the male Nao more than the female Nao
[β=1.24e-02, t=2.85, p<0.01]. The releveled model for
gender (ref = female) showed that males aligned to the male
Furhat more [β=1.92e-02, t=4.39, p<0.001] and to the male
Echo less [β=-1.32e-02, t=-3.03, p<0.01] (in blue, Figure 3).
A post-hoc analysis on the data for male shadowers/male
model talkers confirmed no significant difference for the Nao
and Furhat, but both showed greater alignment than the Echo
[β=-2.61e-02, t=-2.96, p<0.01].
Post-hoc Analysis: Alignment and Ratings
We additionally conducted post-hoc analyses to test whether
participants’ ratings of the model talkers (e.g., age,
friendliness, human-likeness, interactiveness) mediated their
alignment patterns. The four ratings were included as
additional independent variables in separate logistic
regression models run on similarity ratings responses, with
identical fixed and random effects structure: Model Talker
System * Model Talker Gender * Shadower Gender * Rating +
(1 + Model Talker System | Shadower) + (1 | Rater) + (1 | Word).
None of the models revealed significant main fixed effects
for any of the ratings; yet, there were interactions between all
ratings and Model Talker System. For age, we found an
interaction with age ratings of the model talkers and degree
of alignment: participants showed more alignment toward the
female Echo when they rated the voice as being older
[β=2.45e-03, t=2.16, p<0.05]. Age ratings did not influence
alignment patterns toward the Furhat model talkers. The
releveled model showed that female shadowers displayed
less alignment toward the female Nao voice as ratings of its
age increased [β=-1.75e-03 t=-2.04, p<0.05].
Friendliness also interacted with alignment patterns toward
particular model talkers: participants showed less alignment
toward the Furhat talkers when they were rated as being
friendlier [β=-9.8e-04, t=-2.8, p<0.05] and no effect for the
Echo talkers. The releveled model showed greater alignment
toward the Nao voices as they were rated as being friendlier
[β=6.8e-04, t=2.5, p<0.05].
For ratings of human-likeness of the system, several
interactions were significant: first, there was less
alignment toward the female Furhat if it was rated as being
more human-like [β=-3.7e-04, t=-2.1, p<0.05]. Additionally,
the model revealed there was less alignment for female
shadowers toward the female Echo if it was rated as being
more human-like [β=-6.0e-04, t=-2.6, p<0.001]. The model,
releveled in order to unpack the comparison with the Nao
system and the other devices, also revealed that shadowers
displayed alignment patterns toward the Nao system as a
function of their human-likeness ratings: there was more
alignment toward the female Nao if it was rated as being more
human-like [β=8.0e-04, t=3.4, p<0.0001].
The model with how interactive-inert the talker was
revealed just one interaction: shadowers displayed even less
alignment toward the female Echo with less interactive
ratings [β=-7.2e-04, t=-3.0, p<0.001]. There was no effect for
Furhat talkers. The releveled model showed a different
pattern toward the Nao: shadowers displayed greater
alignment toward the female Nao with increasing
interactiveness ratings [β=9.0e-04, t=4.1, p<0.001].
Discussion

This study was designed to test whether patterns of vocal
alignment toward male and female TTS voices are realized
gradiently, on the basis of the physical form of the device
producing the speech, varying from very non-human-like (a
cylindrical speaker) to more human-like (a human-shaped
bust). In general, participants aligned toward the male TTS
voices to a greater extent, in line with gender-mediated
patterns observed in prior work on human and Siri voices
(e.g., Cohn et al., 2019; Pardo, 2006). That we see
applications of this gender-mediated social ‘rule’ from
human-human interaction to human-AI interaction supports
predictions made by the CASA framework (Nass et al., 1997,
1994): participants applied gender-mediated patterns to their
alignment during interactions with AI systems.
Additionally, we observed gender asymmetries based on
both shadower and model talker gender. Female participants,
in general, displayed greater alignment to the male TTS voice
across systems. These findings parallel those in the human-
human literature. For example, in a shadowing experiment
with disembodied voices (and no images), female shadowers
aligned more to the male talkers, while male shadowers
aligned equally toward both male and female voices (Namy
et al., 2002). We see these patterns borne out in the present
study for the Amazon Echo responses (females aligning to
the male TTS voice more; males aligning to both genders
equally). Yet, when more social cues are available (e.g., in a
more human-like form: Furhat), we observe that alignment
may vary based on same- versus mixed-gender
shadower/model talker pairs. This suggests that the amount
of social information available and characteristics of the
participants may shape the degree of alignment.

Furthermore, we found some evidence in support of our
proposal of AI personification gradience: degree of vocal
alignment increased as degree of personification of the device
increased (cylinder < mini-robot < human-like robot) for
female shadowers toward the female TTS voice. Male
shadowers also showed evidence that personification of the
system mediates alignment, but in a more categorical way:
Male shadowers aligned more toward the male TTS voice
presented in the Nao and the Furhat, the two more pseudo-
anthropomorphic systems, relative to the Echo. These results
support our hypothesis that the more a device embodies
a human-like form, the more people will apply the
norms of human-human communication to human-AI dyadic
interactions: in this case, alignment. This finding supports
CAT (Giles et al., 1991; Shepard et al., 2001), where speakers
strategically adapt their degree of convergence toward their
interlocutor based on their social relationship. Additionally,
our findings are broadly in line with Audience Design (Bell,
1984; Clark & Murphy, 1982): speakers adjust their speech
differently based on the apparent communicative needs of
their interlocutor (here, based on their physical form as,
possibly, a cue of more ‘human-like’ competence).
Our proposal of AI personification gradience also receives
support from our post-hoc analyses; all four ratings of the
interlocutors (age, friendliness, human-likeness, and
interactiveness) interacted with the degree of human
embodiment of the system to explain vocal alignment
patterns. For one, increasing human-likeness ratings of the
Nao system led to increased alignment; in contrast,
increasing ratings of human-likeness led to decreased degree
of alignment toward the Furhat device. The reversal of the
expected pattern of increasing alignment with increasing
human-likeness might be interpreted as an ‘uncanny valley’
effect (Mori et al., 2012), where increasing human-likeness
of a non-human entity leads to increasing positive feelings
toward the entity until a threshold where it elicits feelings of
discomfort and/or disgust. Some participants may have felt a
sense of eeriness in seeing a more human-like face realized
on a device. Age ratings were also linked to patterns of vocal
alignment: participants aligned more to the Echo if they rated
the voice as being from an older speaker, but displayed less
alignment to the Nao if it was rated as being older. This may
also be related to the uncanny valley effect, where cue
incongruency drives a sense of uneasiness: the Nao has an
infant-like form, which contrasts with the adult-sounding TTS
voices (apparent age ~30s for the female voice, ~40–50s for the
male voice). These observations lead us to refine our AI
personification hypothesis: people’s application of human-
based behavior norms during speech interaction with voice-
AI will increase as a function of the personification of the
device, until the AI anthropomorphism reaches realism levels
that trigger feelings of discomfort. The finding of uncanny
valley realized in patterns of vocal alignment is novel and
opens up new ways of exploring and investigating behavioral
responses to embodied AI.
There are several limitations of the current study that can
serve as avenues for future work. For one, differences
observed for the Furhat faces may have been driven by those
particular images displayed: future studies using additional
face textures and having participants rate the attractiveness of
the faces can tease apart the contribution of this visual social
information. Previous work has reported a link between
shadowers’ attractiveness ratings of faces and their degree of
alignment toward that voice (Babel, 2012), further suggesting
this may have played a role.
Additionally, while an advantage of an AXB similarity
rating is that we make no a priori assumptions as to which
acoustic-phonetic features may be imitated, our overall
number of shadowers was limited in order to allow for raters
to make similarity judgments on the full set of stimuli (all
shadowed tokens, across the three systems; 480 trials, taking
roughly 45 minutes). While some groups have split AXB
ratings into separate experiments by groups of speakers, the
results were less than clear (Pardo et al., 2017). One benefit
of the current approach is that the patterns are easily
identifiable and comparable for future work (cf. Cohn et al.
2019; Snyder et al., 2019).
Furthermore, one limitation, and an avenue for future study, is the number and variety of TTS voices. While using two Amazon Polly voices allowed us to address a confound in previous work (by using identical voices across the three systems during the shadowing experiment), it may have affected the ratings participants provided (age, friendliness, etc.) if they recognized that the same voice was used on each system. This was partly mitigated by never presenting the same voice on consecutive trials. Having speakers shadow a greater variety of TTS voices across the different systems could lessen this possible effect.
Finally, subtle differences in the loudspeaker hardware across the Echo, Nao, and Furhat may also have contributed to differences in perceived human-likeness. Future work using computer-mediated methods, such as presenting videos of the three interlocutors, can control more aspects of the interaction (e.g., intensity) and can be compared to the present study to assess the degree to which embodied versus computer-mediated interactions shape vocal alignment.
Overall, this study provides a first step in exploring the
nature of AI personification and its relationship with vocal
alignment, and sets the groundwork for future research.
References
Al Moubayed, S., Beskow, J., Skantze, G., & Granström, B. (2012). Furhat: A back-projected human-like robot head for multiparty human-machine interaction. In Cognitive Behavioural Systems (pp. 114–130). Springer.
Babel, M. (2012). Evidence for phonetic and social
selectivity in spontaneous phonetic imitation. Journal of
Phonetics, 40(1), 177–189.
Bell, A. (1984). Language style as audience design.
Language in Society, 13(2), 145–204.
Bell, L., Gustafson, J., & Heldner, M. (2003). Prosodic adaptation in human-computer interaction. Proceedings of ICPhS, 3, 833–836.
Branigan, H. P., Pickering, M. J., Pearson, J., & McLean, J.
F. (2010). Linguistic alignment between people and
computers. Journal of Pragmatics, 42(9), 2355–2368.
Branigan, H. P., Pickering, M. J., Pearson, J., McLean, J. F.,
& Nass, C. (2003). Syntactic alignment between computers
and people: The role of belief about mental states.
Proceedings of the 25th Annual Conference of the
Cognitive Science Society, 186–191.
Brink, K. A., Gray, K., & Wellman, H. M. (2019). Creepiness
creeps in: Uncanny valley feelings are acquired in
childhood. Child Development, 90(4), 1202–1214.
Chartrand, T. L., & Bargh, J. A. (1996). Automatic activation
of impression formation and memorization goals:
Nonconscious goal priming reproduces effects of explicit
task instructions. Journal of Personality and Social
Psychology, 71(3), 464.
Clark, H. H., & Murphy, G. L. (1982). Audience Design in
Meaning and Reference. In J.-F. Le Ny & W. Kintsch
(Eds.), Advances in Psychology (Vol. 9, pp. 287–299).
Cohn, M., Ferenc Segedin, B., & Zellou, G. (2019). Imitating Siri: Socially-mediated alignment to device and human voices. Proceedings of the International Congress of Phonetic Sciences, 1813–1817.
Giles, H., Coupland, J., & Coupland, N. (1991). Accommodation theory: Communication, context, and consequence. In Contexts of Accommodation: Developments in Applied Sociolinguistics (pp. 1–68). Cambridge University Press.
Mori, M., MacDorman, K. F., & Kageki, N. (2012). The
uncanny valley [from the field]. IEEE Robotics &
Automation Magazine, 19(2), 98–100.
Namy, L. L., Nygaard, L. C., & Sauerteig, D. (2002). Gender
differences in vocal accommodation: The role of
perception. Journal of Language and Social Psychology,
21(4), 422–432.
Nass, C., Moon, Y., & Carney, P. (1999). Are people polite to computers? Responses to computer-based interviewing systems. Journal of Applied Social Psychology, 29(5).
Nass, C., Moon, Y., Morkes, J., Kim, E.-Y., & Fogg, B. J.
(1997). Computers are social actors: A review of current
research. Human Values and the Design of Computer
Technology, 72, 137–162.
Nass, C., Steuer, J., & Tauber, E. R. (1994). Computers are
social actors. Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems, 72–78.
Pardo, J. (2013). Measuring phonetic convergence in speech
production. Frontiers in Psychology, 4.
Pardo, J. S. (2006). On phonetic convergence during
conversational interaction. The Journal of the Acoustical
Society of America, 119(4), 2382–2393.
Pardo, J. S., Urmanche, A., Wilman, S., & Wiener, J. (2017). Phonetic convergence across multiple measures and model talkers. Attention, Perception, & Psychophysics, 79(2).
Raveh, E., Siegert, I., Steiner, I., Gessinger, I., & Möbius, B. (2019). Three's a Crowd? Effects of a second human on vocal accommodation with a voice assistant. Proc. Interspeech 2019, 4005–4009.
Shepard, C. A., Giles, H., & Le Poire, B. A. (2001). Communication accommodation theory. In W. P. Robinson & H. Giles (Eds.), The New Handbook of Language and Social Psychology (pp. 33–56). John Wiley & Sons.
Snyder, C., Cohn, M., & Zellou, G. (2019). Individual
variation in cognitive processing style predicts differences
in phonetic imitation of device and human voices.
Proceedings of the Annual Conference of the International
Speech Communication Association, 116–120.