Social and functional pressures in vocal alignment:
Differences for human and voice-AI interlocutors
Georgia Zellou and Michelle Cohn
UC Davis, Phonetics Lab, Department of Linguistics, Davis, CA, USA
{gzellou, mdcohn}@ucdavis.edu
Abstract
Increasingly, people are having conversational interactions
with voice-AI systems, such as Amazon’s Alexa. Do the same
social and functional pressures that mediate alignment toward
human interlocutors also predict alignment patterns toward voice-
AI? We designed an interactive dialogue task to investigate
this question. Each trial consisted of scripted, interactive turns
between a participant and a model talker (pre-recorded from
either a natural production or voice-AI): First, participants
produced target words in a carrier phrase. Then, a model talker
responded with an utterance containing the target word. The
interlocutor responses varied by 1) communicative affect
(social) and 2) correctness (functional). Finally, participants
repeated the carrier phrase. Degree of phonetic alignment was
assessed acoustically between the target word in the model’s
response and participants’ response. Results indicate that
social and functional factors distinctly mediate alignment
toward AI and humans. Findings are discussed with reference
to theories of alignment and human-computer interaction.
Index Terms: vocal alignment, human-computer interaction,
social vs. functional pressures
1. Introduction
Interacting talkers systematically align toward the acoustic-
phonetic patterns of each other’s speech to sound more alike.
This phenomenon is known as vocal alignment (also phonetic
imitation) and has been well reported in the literature [1]–[3].
There is also evidence of vocal alignment toward speech
generated by non-human entities: people align toward the
speaking rate [4] and amplitude [5] of synthetic computer
voices, and even toward speech patterns of modern voice-
activated artificially intelligent (voice-AI) systems (e.g.,
Apple’s Siri, Amazon’s Alexa [6]–[10]). That we see an
application of this human behavior toward technology is in
line with the ‘Computers as social actors’ theory (CASA, [11],
[12]), which posits that people’s behavior during interactions
with technology mirrors their behavior toward humans.
Above and beyond documenting the presence of alignment
toward computers and voice-AI, a growing body of work has
suggested that the magnitude of speech alignment may differ
by interlocutor: individuals tend to show weaker vocal
alignment toward voice-AI, relative to human interlocutors
[6], [8], [9]. Conversely, others have found that people adopt
the lexical choices [13] and syntactic structures [14] of computers to a greater extent than of human interlocutors.
Why do we see these conflicting results? Aside from this
alignment occurring at different linguistic levels (e.g.,
phonetic vs. syntactic), one possibility is that modern voice-AI
has different functional and social pressures in communication
than computer systems or avatars. For one, the primary way
we communicate with voice-AI systems is using speech, a
uniquely human form of communication. Also, unlike
computer systems in the past, voice-AI systems have
improved text-to-speech (TTS) synthesis [15] and automatic
speech recognition (ASR) abilities; therefore, the functional
pressures may be lessened. This might explain mixed findings
of less alignment toward voice-AI, but greater alignment
toward computer systems. On the other hand, modern voice-
AI systems exhibit greater social cues, such as having names
(e.g., “Alexa”) and apparent genders. There is even evidence
that humans engage with voice-AI for purely social purposes, for example having short, non-utilitarian conversations with chatbots [16]. Therefore, another
possibility is that humans will respond to apparent sociality in voice-AI much as they would toward another human.
An alternative possibility, following the ‘Uncanny Valley of the Mind’ framework [17], [18], is that as cues to human-likeness approach ‘real’ human levels, they can trigger feelings of uneasiness or ‘uncanniness’ toward robots.
The current study investigates the nature of human-AI and
human-human interaction by examining whether degree of
vocal alignment is different based on social and functional
pressures within a verbal exchange. Moment-by-moment, the
dynamics of an interaction vary: speakers may become more
animated and emotionally expressive (social) or make errors
that require correction (functional). Parametrically
manipulating these factors across interlocutors (human vs. AI)
can pinpoint differences, and similarities, in the way humans
engage with voice-AI, relative to humans, and can shed light
more broadly on the nature of human-AI interaction.
Furthermore, this study can contribute to understanding the
mechanisms of vocal alignment, as driven by social dynamics
(§1.1.) and/or pressures to improve intelligibility (§1.2.).
1.1. Social factors
Synchrony between interlocutors is thought to serve as ‘social
glue’. ‘Communication Accommodation Theory’ (CAT [19])
proposes that alignment is used to foster social closeness: the
gender [1], regional affiliation [20], attractiveness [1], and
emotional state [7] of interlocutors all predict alignment. In the
present study, one social dimension is whether the interlocutor
is human or AI; here, as in prior work, we might predict less
alignment toward voice-AI, relative to humans [6], [8].
Further, the current study manipulates within-speaker
social dynamics by including an expressive interjection in
some model talker utterances. Interjections, or emotively used
words such as “Yipee!” or “Darn!”, are conventionalized
phrases. Interjections often provide no additional linguistic
meaning to an utterance; their function is purely social in that
they convey the speaker’s emotional or cognitive state [21],
[22]. There is also some evidence that individuals are sensitive
to these interjections, even when they are produced by voice-
AI: [7] found that participants aligned more to Amazon’s
Alexa voice when they shadowed interjections realized in
hyper-expressive prosody, relative to Alexa’s neutral prosody.
Therefore, in the current study, we utilize these expressive
interjections, not as target words, but as additions to responses
made by the model talker as a way to increase the socio-
communicative force of the utterance. If increased
expressiveness modulates vocal alignment, we predict that
participants will align toward the interlocutor to a greater
extent when the interlocutor’s response contains an expressive
interjection (e.g., “Super! I think I heard boot”), relative to
when they do not (e.g., “I think I heard boot”). Yet,
interjections add no semantic content to the utterance. If the
addition of these expressive-encoding items does not add to
the sociality of the interaction, we predict no difference in
alignment patterns when they are present or not.
With respect to social factors, we might also predict
different alignment patterns toward the human and voice-AI
model talker as a function of the socio-communicative
expressiveness of the interaction. For one, modern-day voice-
AI can play a social role for humans. As previously
mentioned, voice-AI systems are increasingly assuming more
human-like qualities (e.g., more realistic voices, better speech
recognition, etc.). Thus, one prediction is that alignment
toward AI will increase with interactions containing
interjections, paralleling what we expect for alignment toward
human interlocutors. This would support computer
personification frameworks, e.g., CASA [11]. Alternatively,
displays of socio-expressiveness might be negatively
perceived by participants. This would support observations of
an ‘uncanny valley’ in people’s behavioral responses toward technology that takes on near-human qualities [17]. For example, [18] observed that conflicting cues to human-likeness in a non-human entity violate people’s expectations of technology and lead to feelings of uneasiness or discomfort toward voice-AI. Therefore, we might predict that
when the AI model talker displays highly expressive socio-
communicative responses, participants will diverge, doing the
opposite of how we expect them to behave toward the human
interlocutor in more expressive interactions (converge).
1.2. Functional factors
Other accounts propose that alignment serves a functional,
intelligibility-driven role: to help a speaker be better
understood by their listener by matching their linguistic
representations for better mutual intelligibility [23], [24]. For
example, participants who actively imitate novel speech
patterns later display improved recognition of that speech
(unfamiliar accent [25]; disordered speech [26]). Furthermore,
[27] assessed alignment between dyads completing a map
task. They found that when participants were giving
information during the task, the ‘giver’ displayed greater vocal
alignment than the ‘receiver’. These results lead us to predict
that speakers will actively align toward the speech of another
talker when there is strong pressure to be more intelligible.
Functional pressures are also very much present in
interactions with technology: people adopt the lexical choices
[13] and syntactic structures [14] of a computer system during
interactions to a greater extent than toward a human under
similar conditions. These findings suggest that linguistic
behavior toward AI is distinct from that toward human
interlocutors, in some cases triggering less alignment toward
voice-AI (speech alignment) and in some cases more
alignment toward computers (lexico-syntactic alignment). One
possibility is that the social and functional pressures in the
different paradigms from prior studies led to greater alignment
toward the human or the technological interlocutor. A more
formal investigation into which of these factors leads to more
or less alignment toward voice-AI, relative to humans, can
inform our understanding of the mechanisms of vocal
alignment and the dynamics of human-computer interactions.
For one, we expect people to align in different ways
toward human and voice-AI model talkers as a function of
differing intelligibility pressures in the interaction. Voice-AI
systems are often used for practical and utilitarian tasks, with
the user usually in the ‘giver’ role when producing an
utterance (e.g., “set a timer”, “play a song”, “tell me the
weather”). Thus, we might expect that vocal alignment
patterns toward AI will be more intelligibility-motivated,
relative to those toward a human, reflecting the utilitarian
purpose of voice-AI as task-oriented interlocutors. This would
explain studies comparing alignment toward computers and humans that find greater alignment toward the computer [14] when the task was goal-oriented: the bias toward greater alignment toward technology could be explained by stronger functional pressures.
Thus, the current study examines whether a correct or
incorrect response from the interlocutor influences patterns of
phonetic imitation. If feedback about intelligibility influences
alignment, we predict that a trial containing an uncertain response from the interlocutor, indicating that they have not understood the target word with certainty, will elicit more robust phonetic imitation by participants than trials where the interlocutor responds correctly. For our predictors, we expect
to see an effect of Correctness on degree of alignment.
1.3. Present study
The present study examines identical communicative
interactions in a laboratory setting, varying social and
functional factors, across human-human and human-AI
interactions. To our knowledge, no prior work investigates the
relative weight of these social and functional factors within an
interaction with a single interlocutor; most vary the
interlocutor as a way of assessing social variables (e.g.,
gender, attractiveness, [1]). Doing so allows us to probe both
social and functional pressures that may differentially, or
similarly, affect human-human and human-AI interaction.
2. Methods
2.1. Participants, Stimuli, and Procedure
Participants (n=54; 27 F) were native English speakers (mean
age=20.2 years old, sd=2.4) recruited from the UC Davis
subject pool. None reported having any hearing impairments.
Target words were 16 low-frequency CVC items selected
from [1]: bat, boot, cheek, coat, dune, hoop, moat, pod, tap,
toot, tot, weave, cot, soap, deed, sock. Both model talkers
(human and AI) produced each target word in a ‘correct’ frame (“I think I heard sock”) and in an ‘incorrect’ frame (“I think I heard sock or sack”), where the distractor was a
minimal pair differing in vowel backness (order of the target
word and minimal pair were counterbalanced across trials and
differed across interlocutors). For the human model talker, a
female native English speaker recorded all utterances in a
sound attenuated booth with a head-mounted microphone. For
the voice-AI model talker, we generated recordings with the default female Alexa voice (US-En) using the Alexa Skills Kit. Both
talkers produced all sentences in their correct and incorrect
forms in a neutral speaking style, as well as all introductions,
voice-overs, and final responses.
Both model talkers also produced 16 interjections (e.g.,
“Yipee!”) in a hyper-expressive manner, balanced by valence
(8 positive, 8 negative). These were produced naturally by the
human talker; for the Alexa voice, emotionally expressive interjections recorded by the Alexa voice actor (‘speechcons’) were added to the TTS output using Speech Synthesis Markup Language (SSML) tags (a limitation of TTS is that emotion is otherwise difficult to synthesize).
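For concreteness, the following is a minimal sketch (in R, matching the analysis scripts below) of the kind of SSML used to add a speechcon to Alexa TTS output; the say-as interjection tag is Amazon’s documented speechcon mechanism, but the wrapper code and exact utterance are illustrative, not the study’s actual stimulus-generation code.

```r
# Sketch of the SSML for one incorrect-expressive response frame; the
# <say-as interpret-as="interjection"> tag renders the word as a speechcon.
ssml <- paste0(
  "<speak>",
  '<say-as interpret-as="interjection">darn!</say-as> ',
  "I'm not sure I understood. I think I heard sock or sack.",
  "</speak>"
)
cat(ssml)
```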
Participants sat in a sound-attenuated booth, wearing a
head-mounted microphone and headphones, facing a computer
screen. Subjects were first presented with a screen introducing them
to the two model talkers: an Amazon Echo device named
“Alexa” and a female human named “Melissa”, with images
of them. Then, participants produced baseline productions of
target words, reading a carrier sentence containing the target
word, “The word is __”. Subjects completed two baseline blocks, in which sentences were presented in random order.
Next, participants completed either the AI or Human
shadowing blocks (order counterbalanced between subjects).
Instructions were given at the beginning of each shadowing
block. The interlocutors (AI, Human) were introduced (“Hi!
I’m Melissa. I’m a research assistant in the Phonetics Lab.” /
“Hi! I’m Alexa. I’m a digital device through Amazon.”) and
went through a voice-over example. During each trial, the
target words occurred in pre-scripted dialogues between the
participant and the model talker. Each trial contained multiple
turns: 1. Initial turn: participants saw a phrase containing the
target word printed on the screen and read it aloud (e.g., “The
word is weave.”). 2. Interlocutor response turn: The
interlocutor provided a response, telling them what she
‘heard’. 3. Participant shadowing response turn: Participants
repeated the sentence a second time. 4. Interlocutor
concluding turn: The interlocutor provided final feedback
(e.g., “Great!”, “Thanks, got it!”, “Perfect!”, etc.; randomly
presented).
Subjects completed two interlocutor blocks, where trials
were manipulated to vary in the social and functional
properties of Interlocutor response turns. To manipulate
intelligibility pressures, there were two correctness
conditions: In 50% of trials, the interlocutor responded with
the correct target word (e.g., “I think I heard weave.”) while in
50% of trials the interlocutor did not correctly identify the
target word with certainty (e.g., “I’m not sure I understood. I
think I heard weave or wove.”). (Note that while incorrect
responses differed from correct responses in information
structure, the former eliciting ‘corrective focus’, our critical
prediction is that there are differences in alignment across
interlocutor types for a given response type.)
To manipulate socio-expressiveness, there were two
expressiveness conditions: In 50% of trials, the interlocutor
responses took the form described above; in the other 50%, the
interlocutor response contained an emotionally expressive
interjection that corresponded to the intended sentiment of the
response: either a positive interjection (e.g., “Yipee! I think I
heard sock.”) for correct turns, or a negative interjection (e.g.,
“Darn! I’m not sure I understood. I think I heard sock or
sack.”) for incorrect turns. Within each block, correctness and
expressiveness trials were intermixed and randomly presented.
There were 128 trials (16 items × 2 correctness conditions × 2 expressiveness conditions × 2 interlocutors); a sketch of this design follows.
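A minimal sketch of the full factorial design in R; the column names are illustrative, not the study’s actual variable names.

```r
# 16 items x 2 correctness x 2 expressiveness x 2 interlocutors = 128 trials
design <- expand.grid(
  item = c("bat", "boot", "cheek", "coat", "dune", "hoop", "moat", "pod",
           "tap", "toot", "tot", "weave", "cot", "soap", "deed", "sock"),
  correctness = c("correct", "incorrect"),
  expressiveness = c("regular", "expressive"),
  interlocutor = c("AI", "human")
)
nrow(design)  # 128
```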
Figure 1 presents a sample trial and the four response
conditions for turn 2, varying by expressiveness and
correctness.
Figure 1: Sample dialogue of a trial.
2.2. Acoustic Analysis
Interlocutor and participant recordings were force-aligned
with FAVE [28] and segment boundaries were hand-corrected
by phonetically-trained research assistants. Vowel duration
was measured for each target word vowel from the model
talker response turn (turn 2) and the participant’s response
(turn 3). We assessed degree of alignment with Difference in
Distance (DID) = |baseline-model|-|shadowed-model| [1]. This
relative difference score reflects overall alignment, taking into
account baseline similarity between participants and model
talkers. Positive DID values indicate alignment toward, while
negative values indicate divergence from, the model talker,
relative to participants’ baseline productions.
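A minimal R sketch of the DID computation over vowel durations follows; the numbers in the example are illustrative, not measured values from the study.

```r
# Difference-in-Distance (DID), following [1]:
#   DID = |baseline - model| - |shadowed - model|
# Positive values indicate convergence toward the model talker; negative
# values indicate divergence, relative to the participant's baseline.
did <- function(baseline, shadowed, model) {
  abs(baseline - model) - abs(shadowed - model)
}

# Illustrative vowel durations (ms): the shadowed token moves closer to
# the model talker, yielding a positive DID.
did(baseline = 180, shadowed = 165, model = 150)  # |30| - |15| = 15
```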
2.3. Statistical Analysis
We modeled DID duration values in a linear mixed effects
model. The model was run in R using the lmer() function in
the lme4 package [29]. Fixed effects included Model Talker (2
levels: AI, human), Correctness (2 levels: correct, incorrect),
Expressiveness (2 levels: regular, expressive). The model
included all two-way interactions, as well as the three-way
interaction, between predictors. Predictors were sum-coded.
By-participant and by-item random intercepts, as well as by-participant random slopes for each main effect and each two- and three-way interaction, were also included.
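The model structure described above corresponds roughly to the following lmer() call; this is a sketch, and the data frame d and its column names are assumptions rather than the study’s actual variable names.

```r
library(lme4)

# Sum-code the three binary predictors, as described above.
for (v in c("talker", "correct", "express")) {
  d[[v]] <- factor(d[[v]])
  contrasts(d[[v]]) <- contr.sum(2)
}

# DID ~ all main effects and interactions, with by-participant and
# by-item random intercepts and maximal by-participant random slopes.
m <- lmer(
  DID ~ talker * correct * express +
    (1 + talker * correct * express | participant) +
    (1 | word),
  data = d
)
summary(m)
```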
3. Results
There was a significant main effect of Correctness [β=-1.52,
t=-2.64, p<0.05]: overall, participants displayed greater
alignment when the model talker produced an uncertainty
response than for a correct response. Correct responses made
by the Human, and all responses made by the voice-AI,
triggered less alignment to the model talkers’ productions.
There was no effect of Expressiveness (p=.14).
Furthermore, the model revealed a significant three-way interaction between Model talker, Expressiveness, and
Correctness [β=-2.1, t=-4.3, p<0.001]. Figure 2 displays this
interaction. For one, speakers align more toward voice-AI in Incorrect trials that include the expressive interjection (e.g., “Darn! I’m not sure I understood. I heard beat or boot.”). Meanwhile, they also display less alignment toward the human in Correct trials without interjections (e.g., “I think I heard beat.”). No other interactions were observed.
Figure 2: Vowel duration DID scores (means and standard
errors of the mean) to AI and Human model talkers’ target
words, by Correctness and Expressiveness.
4. Discussion
The current study investigates how social and functional
factors mediate vocal alignment toward a human and voice-AI
interlocutor in an interactive shadowing task. Overall, we
observe that functional pressures do play a role in predicting
degree of alignment during this task: participants displayed
greater alignment when the model talker responded with
uncertainty about the correct target word than when the model
talker responded unequivocally with the correct target word.
This observation supports theories that functional pressures in
an interaction influence speech alignment [23], [24].
Table 1 summarizes the interaction between social and
functional factors and interlocutor type on alignment patterns.
Table 1: Summary of alignment patterns seen in this study.
                           Incorrect                  Correct
                           (+ functional pressure)    (- functional pressure)
Regular (less social)      Converge to human;         No alignment
                           Diverge from AI            (divergence)
Expressive (more social)   Converge to AI;            Converge to human;
                           No alignment to human      Diverge from AI
First, we observe convergence toward the human interlocutor under two conditions: when there is pressure to be intelligible but the response is non-expressive, and when the response is socially expressive but there is no pressure to be intelligible. These
are also the conditions where we observe the greatest
divergence from the voice-AI interlocutor. That we see these
simple factors leading to convergence toward the human, but
divergence from voice-AI, is contrary to prior reports of greater alignment toward computers than toward human interlocutors [13], [14]. Yet, it does align with recent work
reporting less alignment toward voice-AI systems, relative to
naturally produced human voices [6], [8], [9]. The observation
of the same factors leading to different patterns of alignment
toward humans and voice-AI does not support predictions
made by theories of computer personification, e.g., CASA [11], that people automatically behave toward technology as they would toward a person. Rather, we observe distinct patterns of speech
alignment that people apply toward voice-AI systems.
Additionally, and further in support of distinct vocal
behavior toward human and voice-AI interlocutors, we
observe that participants align toward voice-AI only when social and functional pressures combine. Thus, one interpretation is that, in order for functional pressures to trigger alignment toward voice-AI, the system needs to display even stronger social signals than a human interlocutor would.
This is not what we would expect from an ‘Uncanny Valley’ hypothesis for human-computer interaction [18]: that people
feel discomfort when technology displays near-humanlike
behavior. We expected highly emotive responses from the AI
system, containing expressive interjections, to be perceived as
off-putting, or uncanny, by participants. Yet, these interactions elicited the most alignment toward the voice-AI. This also aligns
with recent work showing greater alignment toward these
interjections during word shadowing [7] and improved user
ratings during conversational interactions with a chatbot that
produced these interjections compared to when it did not [30].
Thus, contra an ‘Uncanny Valley’ hypothesis, people respond
positively to socially-expressive utterances from voice-AI.
[23] argue that alignment during interactive dialogue is
automatic and facilitates comprehension by converging
interlocutors’ linguistic representations. Our findings do not
support such a strong stance for how functional pressures
modulate alignment. Both the type of interlocutor and the
socio-communicative force of the interaction mediate how
functional pressures influence alignment.
Future work could also examine how other linguistic
factors mediate alignment toward voice-AI. For example, the
current study used low-frequency words, which are shown to be more susceptible to imitation following exposure. Comparing imitation of high
versus low frequency items across interlocutor types could
reveal differences in representational factors at play during
these interactions, cf. [2]. Also, in the current study we
presented comprehension errors with minimal pairs differing
in vowel backness. Future work examining how varying types
of phonological confusions trigger different types of alignment
patterns across interlocutor types could also tease apart what
linguistic-communicative pressures influence speech
alignment. Other phonetic variables (e.g., formant frequencies,
pitch, and intensity) could also be explored in future work.
Investigations of vocal alignment toward voice-AI can
present novel tests and theoretical extensions to human-AI
interaction frameworks. As speech becomes a more dominant
mode of interfacing with technology, understanding how
voice-AI systems influence human language patterns will be
more important. Humans often interact with voice-AI in the
more functional, ‘giver’ role. Our findings suggest this could
lead to more alignment toward voice-AI, if it is expressive.
The findings from the current study also have implications for
voice user interface design. For one, based on a greater degree
of alignment, including expressive interjections in interactions
with AI appears to improve user responsiveness and
interactive behavior toward the voice-AI interlocutor.
Furthermore, this study suggests that the communicative
success of the interaction (e.g., presence of ASR errors) might dynamically interact with the apparent sociality of the system, and shape the extent to which users apply human-human speech rules to their interactions with voice-AI.
5. Acknowledgements
This material is based upon work supported by the National
Science Foundation SBE Postdoctoral Research Fellowship
under Grant No. 1911855 to MC.
6. References
[1] M. Babel, “Evidence for phonetic and social selectivity in spontaneous phonetic imitation,” J. Phon., vol. 40, no. 1, pp. 177–189, Jan. 2012, doi: 10.1016/j.wocn.2011.09.001.
[2] S. D. Goldinger, “Words and voices: episodic traces in spoken word identification and recognition memory,” J. Exp. Psychol. Learn. Mem. Cogn., vol. 22, no. 5, pp. 1166–1183, Sep. 1996, doi: 10.1037//0278-7393.22.5.1166.
[3] J. S. Pardo, “On phonetic convergence during conversational interaction,” J. Acoust. Soc. Am., vol. 119, no. 4, pp. 2382–2393, Apr. 2006, doi: 10.1121/1.2178720.
[4] L. Bell, “Linguistic Adaptations in Spoken Human-Computer Dialogues - Empirical Studies of User Behavior,” 2003. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3607. Accessed: Apr. 15, 2020.
[5] N. Suzuki and Y. Katagiri, “Prosodic alignment in human-computer interaction,” Connect. Sci., vol. 19, no. 2, pp. 131–141, Jun. 2007, doi: 10.1080/09540090701369125.
[6] E. Raveh, I. Siegert, I. Steiner, I. Gessinger, and B. Möbius, “Three’s a Crowd? Effects of a Second Human on Vocal Accommodation with a Voice Assistant,” in Proc. Interspeech 2019, 2019, pp. 4005–4009.
[7] M. Cohn and G. Zellou, “Expressiveness Influences Human Vocal Alignment Toward voice-AI,” in Proc. Interspeech 2019, Sep. 2019, pp. 41–45, doi: 10.21437/Interspeech.2019-1368.
[8] M. Cohn, B. Ferenc Segedin, and G. Zellou, “Imitating Siri: Socially-mediated vocal alignment to device and human voices,” in Proc. 19th Int. Congr. Phon. Sci., 2019, pp. 1813–1817.
[9] C. Snyder, M. Cohn, and G. Zellou, “Individual Variation in Cognitive Processing Style Predicts Differences in Phonetic Imitation of Device and Human Voices,” in Proc. Interspeech 2019, Sep. 2019, pp. 116–120, doi: 10.21437/Interspeech.2019-2669.
[10] K. Metcalf et al., “Mirroring to build trust in digital assistants,” arXiv preprint arXiv:1904.01664, 2019.
[11] C. Nass, J. Steuer, and E. R. Tauber, “Computers are social actors,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1994, pp. 72–78.
[12] C. I. Nass, Y. Moon, and J. Morkes, “Computers Are Social Actors: A Review of Current Research,” Hum. Values Des. Comput. Technol., no. 72, p. 137, 1997.
[13] S. E. Brennan, “Lexical entrainment in spontaneous dialog,” Proc. ISSD, vol. 96, pp. 41–44, 1996.
[14] H. P. Branigan, M. J. Pickering, J. Pearson, and J. F. McLean, “Linguistic alignment between people and computers,” J. Pragmat., vol. 42, no. 9, pp. 2355–2368, 2010.
[15] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[16] A. Ram et al., “Conversational AI: The science behind the Alexa Prize,” arXiv preprint arXiv:1801.03604, 2018.
[17] M. Mori, “The Uncanny Valley: The Original Essay by Masahiro Mori,” IEEE Spectrum, p. 6, 2017.
[18] R. K. Moore, “A Bayesian explanation of the ‘Uncanny Valley’ effect and related psychological phenomena,” Sci. Rep., vol. 2, p. 864, 2012.
[19] C. A. Shepard, “Communication accommodation theory,” in The New Handbook of Language and Social Psychology, 2001, pp. 33–56.
[20] R. Y. Bourhis and H. Giles, “The language of intergroup distinctiveness,” Lang. Ethn. Intergroup Relat., vol. 13, p. 119, 1977.
[21] F. Ameka, “Interjections: The universal yet neglected part of speech,” J. Pragmat., vol. 18, no. 2–3, pp. 101–118, 1992.
[22] E. Goffman, Forms of Talk. University of Pennsylvania Press, 1981.
[23] M. J. Pickering and S. Garrod, “Toward a mechanistic psychology of dialogue,” Behav. Brain Sci., vol. 27, no. 2, pp. 169–190, 2004.
[24] S. Garrod and G. Doherty, “Conversation, co-ordination and convention: An empirical investigation of how groups establish linguistic conventions,” Cognition, vol. 53, no. 3, pp. 181–215, 1994.
[25] P. Adank, P. Hagoort, and H. Bekkering, “Imitation improves language comprehension,” Psychol. Sci., vol. 21, no. 12, pp. 1903–1909, 2010.
[26] S. A. Borrie and M. C. Schäfer, “The role of somatosensory information in speech perception: Imitation improves recognition of disordered speech,” J. Speech Lang. Hear. Res., vol. 58, no. 6, pp. 1708–1716, 2015.
[27] J. S. Pardo, I. C. Jay, and R. M. Krauss, “Conversational role influences speech imitation,” Atten. Percept. Psychophys., vol. 72, no. 8, pp. 2254–2264, Nov. 2010, doi: 10.3758/bf03196699.
[28] I. Rosenfelder, J. Fruehwald, K. Evanini, and J. Yuan, “FAVE (Forced Alignment and Vowel Extraction) program suite,” 2011. [Online]. Available: http://fave.ling.upenn.edu.
[29] D. Bates, M. Mächler, B. Bolker, and S. Walker, “Fitting linear mixed-effects models using lme4,” arXiv preprint arXiv:1406.5823, 2014.
[30] M. Cohn, C.-Y. Chen, and Z. Yu, “A Large-Scale User Study of an Alexa Prize Chatbot: Effect of TTS Dynamism on Perceived Quality of Social Dialog,” in Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, 2019, pp. 293–306.