
Differences in Gradient Emotion Perception: Human vs. Alexa Voices


Michelle Cohn1, Eran Raveh2, Kristin Predeck1, Iona Gessinger2, Bernd Möbius2, Georgia Zellou1
1Phonetics Laboratory, Linguistics, UC Davis, Davis, California, USA
2Language Science and Technology, Saarland University, Saarbrücken, Germany
{mdcohn, kpredeck, gzellou}, {raveh, gessinger, moebius}
The present study compares how individuals perceive gradient
acoustic realizations of emotion produced by a human voice
versus an Amazon Alexa text-to-speech (TTS) voice. We ma-
nipulated semantically neutral sentences spoken by both talk-
ers with identical emotional synthesis methods, using three
levels of increasing ‘happiness’ (0 %, 33 %, 66 % ‘happier’).
On each trial, listeners (native speakers of American English,
n=99) rated a given sentence on two scales to assess dimen-
sions of emotion: valence (negative-positive) and arousal (calm-
excited). Participants also rated the Alexa voice on several pa-
rameters to assess anthropomorphism (e.g., naturalness, human-
likeness, etc.). Results showed that the emotion manipulations
led to increases in perceived positive valence and excitement.
Yet, the effect differed by interlocutor: increasing ‘happiness’
manipulations led to larger changes for the human voice than
the Alexa voice. Additionally, we observed individual differ-
ences in perceived valence/arousal based on participants’ an-
thropomorphism scores. Overall, this line of research can speak
to theories of computer personification and elucidate our chang-
ing relationship with voice-AI technology.
Index Terms: voice-activated personal assistants, emotion per-
ception, human-computer interaction
1. Introduction
While the primary function of speech is to communicate a mes-
sage to our interlocutor, the voice also carries other properties,
including social details (e.g., region, age, gender) and even our
emotional state. Whether we are happy, surprised, sad, or an-
gry might be conveyed on an utterance [1]. Emotional expres-
siveness has been targeted as a way to make synthetic, text-
to-speech (TTS) voices more engaging to human users [2, 3].
These efforts have concentrated primarily on synthesizing ba-
sic human emotions, including happiness, sadness, anger, fear,
disgust, and surprise [4]; yet perception of such qualities in the
synthesis is not always clear to listeners [4]. The
parameters adjusted in emotional synthesis may be contribut-
ing to this confusion. Another contributor may be the quality
of the synthetic voices; it may be the case that it is difficult for
listeners to extract emotion from more robotic-sounding voices.
Finally, a third contributor may be the degree to which listen-
ers attribute human-like emotion to the systems – which may
be due to the human-like characteristics of the system as well
as individual differences in personification of the systems. The
present paper examines these factors in how listeners perceive
emotion in a real human and an Amazon Alexa TTS voice.
1.1. Emotion and modern voice-AI systems
In the last decade, modern voice-activated, artificially intelli-
gent (voice-AI) systems, such as Apple’s Siri, Amazon’s Alexa,
and Google Assistant, have become a common household inter-
locutor for many human users, particularly in the United States
[5]. These systems engage users in a variety of functional and
social tasks. For example, users may ask Alexa to “turn off the
light”, “tell a joke”, or even have a conversation [6]. Prior work
suggests that humans apply social knowledge to their speech in-
teractions with voice-AI systems, such as gender-related asym-
metries [7]. There is some initial evidence that users may also
be perceptive to emotional expressiveness in voice-AI systems:
speakers vocally align to emotionally expressive productions
by the Alexa voice [8] and rate conversations with an Amazon
Alexa socialbot higher when the bot uses emotionally expres-
sive interjections [9], but neither of these studies employed a di-
rect human comparison. The present study addresses this gap,
examining whether listeners similarly perceive gradient emo-
tion conveyed in TTS and natural human productions.
1.2. Emotion and CASA
Comparing responses to emotion produced by human and
voice-AI interlocutors can speak to computer personification
theories, such as the Computers Are Social Actors (CASA) theo-
retical framework [10, 11], which holds that humans treat tech-
nology as a social actor in interactions and apply social rules
and norms from human-human interaction (HHI). This personi-
fication of technology is thought to be automatic, subconscious,
and driven by the fact that device interaction often involves sim-
ilar aspects as HHI. For example, participants assigned higher
trustworthiness and likeability ratings to a computer system
that displayed more empathetic emotion than one that did not
[12]. In another study, participants showed different negotia-
tion strategies when haggling with a ‘happy’ or an ‘angry’ com-
puter system, in line with emotion-based asymmetries observed
in human-human negotiation [13]. Meanwhile, other work has
found that negative reactions are triggered by computer behav-
ior in the same ways that a human’s actions might engender
anger: after a computer system had acted unfairly in a bargain-
ing game, participants in that interaction displayed anger and
spiteful behavior toward the device [14].
In line with CASA [11], there is some evidence for similar
perception of emotion produced by a human or computer. In
a study examining explicit emotion identification (e.g., happy,
sad, surprised, etc.) based on visual and prosodic differences,
participants displayed equal responses to the ‘human’ or ‘com-
puter’ guise [15]. In a study of facial expression, Noël et al.
[16] found that subjects’ accuracy identifying emotion for a
real human face and a digital avatar was equal when context
and emotional expressiveness were congruent. Following these
studies, one possibility is that individuals will interpret emo-
tional prosody similarly for human and voice-AI speech.
Copyright © 2020 ISCA
October 25–29, 2020, Shanghai, China
On the other hand, many studies exploring synthesized
emotion do not make a direct human versus device comparison.
It is possible that while there are similarities in the gross patterns
of social responses toward humans and computers/robots,
there may be more fine-grained nuances that are missed, par-
ticularly in using a between-subjects design [e.g., 8, 9, 12, 13,
14, 17]. For example, participants rated the emotion of syn-
thetic and natural speech similarly when the emotional expres-
sion was congruent with the content (e.g., happy prosody with
positive content) [17]. Yet, in the incongruent conditions (e.g.,
sad prosody with positive content), the authors observed differences
between the human and TTS voices in the weight listeners
gave to the prosody relative to the content: listeners rated synthetic
speech as ‘happier’ than human speech when it was produced
with sad prosody and happy content. Indeed, there is some evi-
dence that humans respond to voice-AI and human speech dif-
ferently: participants display less vocal alignment toward Siri
and Alexa TTS voices than human voices [18, 19]; this suggests
that voice-AI systems may be a different type of social actor than
another human. Therefore, in the present study, one prediction
is that participants may show weaker emotion perception for an
Alexa voice, relative to a human voice.
Additionally, many TTS emotion perception studies ask
participants to classify very distinct types of stimuli (e.g., ba-
sic emotions of happiness or sadness); one unexplored question,
to our knowledge, is whether the perceived magnitude of emo-
tional expression is similar for human and synthetic voices. In
the current study, we examine gradience in emotion perception
by adapting neutral speech produced by a human and voice-
AI talker (here, Amazon Alexa) at three happiness levels (see
Section 2.1.2). One possibility is that listeners will be more
sensitive to gradient emotional display by human voices, as human
voices are more socially meaningful. Alternatively,
listeners may display equal sensitivity
for human and voice-AI voices producing multiple levels of an
emotion, which would provide support for the CASA account.
1.3. Variation in personification
While the CASA account proposes an automatic mechanism of
personification, there is reason to believe that any such response
will vary considerably across individuals. For example, partic-
ipants displayed different patterns of vocal alignment toward
voice-AI (Apple Siri) voices based on their cognitive process-
ing style [20]. In another study, individuals interacting with the
same robot receptionist communicated differently depending on
their attitude towards the virtual interlocutor: as being more
‘human-social’ or a ‘computational-tool’ [21]. In the present
study, we assess each participant’s anthropomorphism of the
virtual assistant Alexa across several dimensions, viz. human-
ness, naturalness, etc., in a pre-experiment survey. We predict
that overall anthropomorphism scores will be related to voice-
AI emotion perception, i.e., individuals with higher anthropo-
morphism scores are expected to be more perceptive to emotion
by the Alexa voice.
1.4. Current study
In the present study, we examine whether technology personifi-
cation is gradiently realized in the perception of emotion. In this
experiment, we ask two principal questions: 1) Do listeners per-
ceive acoustic variations conveying different levels of emotional
state similarly for human and TTS voices?, and 2) Does an indi-
vidual’s gradient perception of TTS voice emotion vary accord-
ing to the degree to which they personify the system; that is,
are listeners better at perceiving emotion for interlocutors they
deem as being more ‘human-like’? While the general acous-
tic properties of a recorded human voice and an Alexa voice
differ, we used identical parameters for both voices in the emo-
tional synthesis system, DAVID [22]. We selected DAVID given
its prior validation: listeners perceive the intended emotions
(e.g., happiness, sadness, and fear) in manipulated productions
[22]. Additionally, DAVID allows for specification of gradient
change toward a given emotion (e.g., 66% ‘happier’).
Critically, we test whether the same gradient manipulations
of emotional prosody within a given voice yield similar or different
changes in emotion perception across the two speakers
(here, human vs. voice-AI). As our aim is to investigate the role
of emotional prosody, we conducted a norming study (see Sec-
tion 2.1.1) of sentences to generate our list of ‘emotionally neu-
tral’ sentences; accordingly, listeners would primarily respond
to the emotional properties conveyed through the voice.
2. Methods
2.1. Stimuli
2.1.1. Norming study: emotionally neutral sentences
We selected sentences that had previously been rated as emo-
tionally ‘neutral’ (14 from Russ et al. [23]; 10 from Ben-David
et al. [24]; and 2 from Mustafa et al. [25]) as well as 94 declar-
ative sentences from the Speech Perception In Noise (SPIN) test
[26], for a total of 120 sentences, to use in an online emotional valence norming
study. The inclusion of the SPIN sentences permits a greater
range of perceived valence. The 48 native English speakers
(mean age 19.7 ±2.1 years; recruited through the UC Davis
subject pool) rated the emotion in all 120 sentences, which were
randomly presented on the screen one at a time with no sound.
On a given trial, they saw a sentence and used a sliding scale to
indicate how negative, positive, or neutral it was; the beginning,
middle, and end of the spectrum were labeled with “0 = nega-
tive”, “50 = neutral”, and “100 = positive”, respectively. The
slider position reset to 50 at the beginning of each trial. The
data are available as supplemental material¹.
2.1.2. Synthesizing emotion in human and Alexa voices
We selected the 15 sentences with the ratings closest to 50
(range 48 to 51, mean 49.9) from the norming study (Sec-
tion 2.1.1), excluding imperatives and sentences with personal
pronouns (e.g., “My T.V. has a twelve-inch screen.”) that may
be incongruous if produced by a voice-AI system. We also excluded
two sentences with negative words (e.g., “garbage” and
“shipwrecked”). The remaining 15 sentences had 4 to 8 words
(mean 5.9 ±1.2). We recorded a native English female speaker
producing the 15 target sentences in citation format. We generated
the same 15 sentences with the default US-English female
Alexa voice using the Alexa Skills Kit. Recordings had a sampling
rate of 44.1 kHz and were amplitude normalized² based
on mean intensity measurements in Praat [27].
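The selection step described above (drop excluded sentences, then keep those whose mean norming rating lies closest to the neutral midpoint of 50) can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `select_neutral`, the toy sentences, and the toy ratings are all hypothetical.

```python
from statistics import mean


def select_neutral(ratings, excluded, k=15, midpoint=50.0):
    """Return the k non-excluded sentences whose mean rating is closest to midpoint.

    ratings:  dict mapping sentence -> list of slider ratings (0-100)
    excluded: set of sentences removed a priori (e.g., imperatives)
    """
    means = {s: mean(r) for s, r in ratings.items() if s not in excluded}
    # Sort by distance from the midpoint; ties are broken alphabetically.
    ranked = sorted(means, key=lambda s: (abs(means[s] - midpoint), s))
    return ranked[:k]


# Hypothetical toy data (0 = negative, 50 = neutral, 100 = positive).
toy_ratings = {
    "sentence A": [49, 51, 50],
    "sentence B": [80, 75, 90],
    "sentence C": [48, 50, 49],
    "sentence D": [20, 30, 25],
    "sentence E": [50, 50, 50],
}
selected = select_neutral(toy_ratings, excluded={"sentence E"}, k=2)
print(selected)
```

In the study itself, k would be 15 and the exclusion set would hold the imperatives, the sentences with personal pronouns, and the sentences with negative words.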
Next, we generated three ‘happiness’ levels (at 0 % (no
change), 33 %, and 66 % happier) with the DAVID emotional
synthesis platform [22] in the Max programming language [28].
We used the DAVID default values for ‘happiness’, including
a fundamental frequency (f0) increase of 30 cents³, and a high
shelf filter (8 kHz, gain 3 dB). We passed all sentences through
the DAVID re-synthesizer at 0 %, 33 %, and 66 % of the ‘happiness’
parameters (e.g., a 33 % shift in f0 toward 30 cents is an
increase of 9.9 cents). This resulted in a total of 90 stimuli⁴
(15 sentences × 3 happiness levels × 2 interlocutors).
² 65 dB for human, 64 dB for Alexa voices; as the Alexa samples
were generated in a systematically different manner than the human
recordings (i.e., not through air transmission), this normalization was
relative and adjusted (by ear) by the first author.
³ A cent is a logarithmic unit of pitch (1 octave = 1200 cents; 1 semitone = 100 cents).
[Figure 1 panels: (A) Interlocutor image; (B) Perceived valence: valence score (0-100) by happiness level (0, 33, 66); (C) Perceived arousal: arousal score (0-100) by happiness level (0, 33, 66).]
Figure 1: (A) Human and synthetic speakers’ silhouettes. The corresponding silhouette appeared on the screen for all trials within a
speaker block in the human (‘Amanda’) or device (‘Alexa’) condition. (B-C) Summary of valence (B) and arousal (C) results. The blue
dots and green triangles indicate the mean scores for Alexa and the human voices, respectively. Error bars show the standard error.
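To make the f0 scaling concrete, the sketch below computes the pitch shift at each happiness level and converts it to a frequency ratio, using the standard definition of 1200 cents per octave. It is an illustration only: the helper names and the 200 Hz starting f0 are hypothetical, and DAVID's internal implementation is not reproduced here.

```python
DEFAULT_HAPPY_CENTS = 30.0  # DAVID default f0 increase for 'happiness', as stated in the text


def scaled_cents(default_cents, level_percent):
    """Pitch shift in cents at a given percentage of the default parameter."""
    return default_cents * level_percent / 100.0


def cents_to_ratio(cents):
    """Convert a shift in cents to a frequency ratio (1200 cents = 1 octave = ratio 2)."""
    return 2.0 ** (cents / 1200.0)


def shift_f0(f0_hz, cents):
    """Apply a shift in cents to a fundamental frequency in Hz."""
    return f0_hz * cents_to_ratio(cents)


EXAMPLE_F0_HZ = 200.0  # hypothetical f0 value, for illustration only
for level in (0, 33, 66):
    cents = scaled_cents(DEFAULT_HAPPY_CENTS, level)
    print(level, round(cents, 1), round(shift_f0(EXAMPLE_F0_HZ, cents), 2))

# Stimulus count: sentences x happiness levels x interlocutors
n_stimuli = 15 * 3 * 2
print(n_stimuli)
```

Because cents are logarithmic, the same shift in cents corresponds to a larger change in Hz for a voice with a higher baseline f0.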
2.2. Participants
Participants (none of whom completed the norming task) con-
sisted of 99 native speakers of American English, recruited from
the UC Davis Psychology subject pool (70 females, 29 males;
mean age 20.2 ±2.2 years); 81 of them reported some experi-
ence using a voice-AI system.
2.3. Procedure
Subjects completed the experiment online, via Qualtrics. First,
they provided basic demographic information, as well as their
voice-AI usage. Next, participants completed an audio calibra-
tion step to ensure that the stimuli were audible and understand-
able via their computer’s audio device: they heard one sentence
(not used in the experimental trials) produced by each interlocu-
tor (human and Alexa) and were asked to select what they heard
out of a set of options; if their response was correct, they con-
tinued to the experimental trials; if not, they were taken to a
screen that indicated that they needed to increase the volume.
Participants could not continue to the experimental trials until
they answered correctly.
Then, they completed a voice-AI anthropomorphism survey,
adapted from Ho and MacDorman [29]. Participants heard a single
sentence produced by the Amazon Alexa voice (note that the sentence was
not manipulated in terms of emotion) and, using sliding scales (0-100),
rated to what degree they thought the voice was machine-like/human-like,
artificial/natural, eerie/comforting, and cold/warm.
In the experimental trials, participants were told that they
would hear sentences produced by either an Amazon Alexa or a
real person (‘Amanda’), rate the sentences, and answer a few
randomly presented listening comprehension questions. Par-
ticipants were told that they would only hear each sentence
once, and to respond as quickly and accurately as possible.
Speaker condition (voice-AI/human) was divided into blocks
(order counterbalanced across subjects). During all trials of
each block, participants saw the corresponding Alexa/human
silhouette on the screen (see Figure 1.A). On each trial, sub-
jects heard an emotionally neutral sentence in one of the three
happiness levels and rated it on two dimensions of emotion us-
ing a sliding scale: valence (0 = negative, 50 = neutral, 100 =
positive) and arousal (0 = calm, 50 = neutral, 100 = excited). At
the beginning of each trial, the slider position reset to 50. The
sentences were only presented aurally and were randomized by
happiness level. Each participant rated all 90 stimuli. Addition-
ally, listeners heard a listening comprehension question after the
experimental trials for each speaker: they heard a semantically
anomalous sentence produced by the speaker (either human or
Alexa) and identified the sentence from a multiple choice list.
Participants needed to answer correctly to receive credit for the
study. In total, the experiment took roughly 30 minutes.
2.4. Analysis
We analyzed participants’ valence and arousal scores for the
sentences with separate linear mixed models (LMMs), using
the lme4 R package [30]. In both models, the fixed effects included
HAPPINESS LEVEL (3 levels: 0 %, 33 %, and 66 % happier),
INTERLOCUTOR (2 levels: human, device), and all possible
interactions. Random effects included by-SUBJECT random
intercepts, with by-SUBJECT random slopes for INTERLOCUTOR.
The linear mixed models (LMMs) were fit by REML;
t-tests used Satterthwaite approximations to determine the
degrees of freedom. The p-values were derived from the output
of these fits with the lmerTest package [31].
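For reference, the model structure described above can be written out as below. This is a sketch under stated assumptions, not the authors' specification: it assumes treatment (dummy) coding with the device voice at the 0 % level as the reference, and the symbols are introduced here for illustration. In lme4 syntax, one formula consistent with this description would be score ~ happiness * interlocutor + (1 + interlocutor | subject), where the variable names are hypothetical.

```latex
% y_{ij}: rating (valence or arousal) on trial i for subject j
% H^{33}, H^{66}: indicators for the 33 % and 66 % happiness levels
% I: indicator for the human interlocutor (device = 0)
\begin{equation}
y_{ij} = \beta_0 + \beta_1 H^{33}_{ij} + \beta_2 H^{66}_{ij} + \beta_3 I_{ij}
       + \beta_4 H^{33}_{ij} I_{ij} + \beta_5 H^{66}_{ij} I_{ij}
       + u_{0j} + u_{1j} I_{ij} + \varepsilon_{ij}
\end{equation}
% u_{0j}: by-subject random intercept; u_{1j}: by-subject random slope for interlocutor
% (u_{0j}, u_{1j}) ~ N(0, \Sigma), \varepsilon_{ij} ~ N(0, \sigma^2)
```

Under this coding, \beta_4 and \beta_5 capture how much more (or less) the ratings change for the human voice than for the device voice at each manipulated happiness level.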
For the anthropomorphism analysis, we calculated a com-
posite anthropomorphism score, summing the totals for each
of the responses (human-like, natural, comforting, warm) for
the voice-AI; a higher score indicates greater personification.
On the subset of data for the Alexa talker, we modeled valence
and arousal scores in separate linear mixed models (LMMs).
Fixed effects included ANTHROPOMORPHISM SCORE (continuous)
and HAPPINESS LEVEL and their interaction; random effects included
by-SUBJECT random intercepts.
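The composite score described above can be sketched as a simple sum of the four slider responses. The field names below are hypothetical, and the sketch assumes that each scale is scored 0-100 with the higher end corresponding to the human-like, natural, comforting, and warm poles.

```python
# Hypothetical field names for the four survey scales (each rated 0-100).
SCALES = ("human_like", "natural", "comforting", "warm")


def composite_score(ratings):
    """Sum the four anthropomorphism ratings; a higher score means greater personification."""
    for name in SCALES:
        value = ratings[name]
        if not 0 <= value <= 100:
            raise ValueError(f"{name} rating out of range: {value}")
    return sum(ratings[name] for name in SCALES)


# Hypothetical participant, for illustration only.
example = {"human_like": 30, "natural": 45, "comforting": 60, "warm": 55}
score = composite_score(example)
print(score)
```

Under these assumptions the composite can range from 0 to 400.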
3. Results
Figure 1 shows the mean scores of valence and arousal for the
Alexa and human voices over the three levels of happiness. The
outcomes of the LMM fits (see Section 2.4) for valence and
arousal are summarized in Tables 1 and 2. Valence scores were
overall lower for the human speaker relative to Alexa. There
was also an interaction between HAPPINESS LEVEL and
INTERLOCUTOR: there was a larger increase in valence for the
human talker at the higher happiness levels (33 % and 66 %)
(see Figure 1.B). While the score difference between the human
speaker and Alexa is large for non-manipulated speech, this gap
is closed in the 33 % happiness level, and the scores are virtually
identical in the 66 % happiness level. Moreover, the scores
for Alexa are relatively stable, whereas the human scores rise
sharply in the 33 % happiness level. Arousal ratings (see
Figure 1.C) show a different pattern: while excitement for the two
voices is equal for the non-manipulated speech (0 % happiness
level), the scores for both speakers show an increase from 0 %
to 33 % and 66 % happiness levels. This increase is larger for
the human, relative to the Alexa, voice.
Table 1: Summary of fixed effects in valence scores.
               Coef     SE     df       t        p
(Intercept)   55.31   1.31    108   42.06   <0.001 ***
Happ.33        0.56   0.49   8708    1.14    0.250
Happ.66        0.18   0.49   8708    0.38    0.710
Int.Human     -3.27   1.12    129   -2.93    0.004 **
Happ.33:Int    3.00   0.70   8708    4.32   <0.001 ***
Happ.66:Int    2.75   0.69   8708    3.97   <0.001 ***
Table 2: Summary of fixed effects in arousal scores.
               Coef     SE     df       t        p
(Intercept)   33.33   1.67    105   19.89   <0.001 ***
Happ.33        2.42   0.55   8708    4.38   <0.001 ***
Happ.66        3.48   0.55   8708    6.31   <0.001 ***
Int.Human      0.16   1.39    122    0.12    0.740
Happ.33:Int    2.39   0.78   8708    3.05    0.002 **
Happ.66:Int    2.12   0.78   8708    2.71    0.007 **
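As a worked example of how the fixed effects combine, the sketch below reconstructs the model-predicted arousal means for each condition from the coefficients in Table 2. It assumes treatment coding with the Alexa voice at the 0 % level as the reference and takes the coefficient signs as printed; both are assumptions, since the coding scheme is not stated in the text. The function and key names are hypothetical.

```python
# Fixed-effect estimates copied from Table 2 (arousal model).
AROUSAL_COEFS = {
    "intercept": 33.33,
    "happ33": 2.42,
    "happ66": 3.48,
    "human": 0.16,
    "happ33:human": 2.39,
    "happ66:human": 2.12,
}


def predicted_mean(coefs, level, human):
    """Predicted rating for one cell, assuming treatment coding (Alexa, 0 % = reference)."""
    value = coefs["intercept"]
    if level in (33, 66):
        value += coefs[f"happ{level}"]
    if human:
        value += coefs["human"]
        if level in (33, 66):
            value += coefs[f"happ{level}:human"]
    return round(value, 2)


for human in (False, True):
    voice = "human" if human else "Alexa"
    for level in (0, 33, 66):
        print(voice, level, predicted_mean(AROUSAL_COEFS, level, human))
```

Under these assumptions, the predicted change from 0 % to 66 % is the Happ.66 coefficient alone for Alexa, and the Happ.66 coefficient plus the Happ.66:Int interaction for the human voice, which is the sense in which the increase is larger for the human voice.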
As for the anthropomorphism scores, we observed variation
across participants (mean 131.9 ± 71.2, range 0-300). In
both mixed effects models, there were interactions between
ANTHROPOMORPHISM SCORE and HAPPINESS LEVEL. Figure 2
illustrates the anthropomorphism scores in the different conditions.
In the valence model, participants with higher anthropomorphism
scores rated the Alexa voice as sounding more positive
at the baseline happiness level, 0 % [Coef = 0.02, SE = 5.2e-03,
t = 3.1, p < 0.01]. No other interactions were observed. In
the arousal model, a higher anthropomorphism score was associated
with less perceived excitement at the highest happiness
level, 66 % [Coef = -0.02, SE = 6.0e-03, t = -3.2, p < 0.01]; no
other interactions were significant for the arousal model.
4. Discussion and Conclusion
Overall, we found that listeners perceive emotion gradiently in
both human and voice-AI (here, Amazon’s Alexa) voices. How-
ever, this was limited to arousal ratings for the Alexa voice,
while both valence and arousal ratings for the human voice rose
with the increasing ‘happiness’ manipulations. This finding
is broadly in line with the CASA theoretical framework [11,
10], as the subjects were hearing different levels of ‘excite-
ment’ in both a human and an Alexa voice that were manip-
ulated identically. Yet, these findings also illuminate a possi-
bly limited aspect of technology personification: listeners heard
variation in valence in the human voice, but not for the Alexa
voice. This might indicate that users still do not expect, or
are not used to, TTS voices that show this dimension of emotion.
Another factor may be the nature of the task. Listening
tasks are somewhat passive compared to the typical use
of voice-AI personal assistants. It is possible that the range of
listeners’ emotion judgments would be wider in more naturalistic,
dyadic interactions. Future work exploring emotion perception
across different types of interactions (e.g., more functional,
more social) is needed to further explore this effect.
[Figure 2 panels: mean arousal/valence rating (0-100) by anthropomorphism score.]
Figure 2: Effect of anthropomorphism scores at each Happiness
Level (0, 33, 66 %) on valence (top panel) and arousal
(bottom panel) ratings of the Alexa voice (blue solid line).
Additionally, we found evidence that individual variation
in anthropomorphism of voice-AI mediates emotion perception
of the Alexa voice: participants who displayed greater person-
ification of the Alexa voice rated it as being more positive at
baseline, while also rating the voice as sounding less excited at
the 66 % happiness level. Our valence anthropomorphism findings
(i.e., participants who personify Alexa more tend to also
rate the voice as sounding happier) are in line with research suggesting
greater generalization of positive attitudes (here, more
human-like qualities) to other domains [32]. While the decrease
in arousal ratings was unexpected, the lack of correspondence
between the valence and arousal results is consistent with prior
work showing their separable effects, which are further affected
by patterns of individual variation (e.g., personality, cultural
background; cf. [33]). One limitation in this study is that the
participants were not balanced by gender, with far more female
than male raters. While we made no a priori hypotheses about
how individuals might respond differently according to their
gender, this may be a source of variation [34]. Future work
examining different types of emotion, as well as comparing in-
dividuals of different linguistic/cultural backgrounds, genders,
and even ages can further our understanding of sources of vari-
ation in the relationship between voice-AI/human emotion per-
ception and anthropomorphism.
Overall, our findings suggest that the way humans engage
with voice-AI systems is similar in some ways to how they engage with humans (in
perceiving increases in ‘arousal’), but perception of emotion
multidimensionality (i.e., both valence and arousal) appears to
be limited to natural human productions.
5. Acknowledgments
This material is based upon work supported by the National
Science Foundation SBE Postdoctoral Research Fellowship un-
der Grant No.1911855 to MC. Funded in part by the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation)
– Project-ID MO 597/6-2 and STE 2363/1-2.
6. References
[1] I. R. Murray and J. L. Arnott, “Toward the simulation of emo-
tion in synthetic speech: A review of the literature on human
vocal emotion”, The Journal of the Acoustical Society of Amer-
ica, vol. 93, no. 2, pp. 1097–1108, 1993.
[2] C. Creed and R. Beale, “Emotional intelligence: Giving com-
puters effective emotional skills to aid interaction”, in Compu-
tational Intelligence: A Compendium, Springer, 2008, pp. 185–
[3] A. R. F. Rebordao, M. A. M. Shaikh, K. Hirose, and N. Mine-
matsu, “How to improve TTS systems for emotional expressiv-
ity”, in Tenth Annual Conference of the International Speech
Communication Association, 2009.
[4] J. E. Cahn, “The generation of affect in synthesized speech”,
Journal of the American Voice I/O Society, vol. 8, no. 1, pp. 1–
1, 1990.
[5] F. Bentley, C. Luvogt, M. Silverman, R. Wirasinghe, B. White,
and D. Lottridge, “Understanding the long-term use of smart
speaker assistants”, Interactive, Mobile, Wearable and Ubiqui-
tous Technologies, vol. 2, no. 3, pp. 1–24, 2018.
[6] H. Fang, H. Cheng, E. Clark, A. Holtzman, M. Sap, M. Osten-
dorf, Y. Choi, and N. A. Smith, “Sounding board: University of
Washington’s Alexa Prize submission”, Alexa Prize, 2017.
[7] H. Bergen et al., “I’d blush if i could: Digital assistants, dis-
embodied cyborgs and the problem of gender”, Word and Text,
A Journal of Literary Studies and Linguistics, vol. 6, no. 01,
pp. 95–113, 2016.
[8] M. Cohn and G. Zellou, “Expressiveness influences human vo-
cal alignment toward voice-AI”, in Proc. Interspeech 2019, Sep.
2019, pp. 41–45. DOI: 10.21437/Interspeech.2019-1825.
[9] M. Cohn, C.-Y. Chen, and Z. Yu, “A large-scale user study of
an Alexa prize chatbot: Effect of TTS dynamism on perceived
quality of social dialog”, in SIGdial, 2019, pp. 293–306. DOI:
[10] C. Nass and Y. Moon, “Machines and mindlessness: Social re-
sponses to computers”, Journal of social issues, vol. 56, no. 1,
pp. 81–103, 2000. DOI: 10.1111/0022-4537.00153.
[11] C. Nass, Y. Moon, J. Morkes, E.-Y. Kim, and B. Fogg, “Com-
puters are social actors: A review of current research”, Human
values and the design of computer technology, vol. 72, pp. 137–
162, 1997.
[12] S. Brave, C. Nass, and K. Hutchinson, “Computers that care:
Investigating the effects of orientation of emotion exhibited by
an embodied computer agent”, International journal of human-
computer studies, vol. 62, no. 2, pp. 161–178, 2005. DOI: 10.
[13] C. M. de Melo, P. Carnevale, and J. Gratch, “The effect of
expression of anger and happiness in computer agents on ne-
gotiations with humans”, in International Conference on Au-
tonomous Agents and Multiagent Systems – Volume 3, Interna-
tional Foundation for Autonomous Agents and Multiagent Sys-
tems, 2011, pp. 937–944.
[14] R. E. Ferdig and P. Mishra, “Emotional responses to computers:
Experiences in unfairness, anger, and spite”, Journal of Educa-
tional Multimedia and Hypermedia, vol. 13, no. 2, pp. 143–161,
[15] C. Bartneck, “Affective expressions of machines”, in CHI’01
extended abstracts on Human factors in computing systems,
ACM, 2001, pp. 189–190.
[16] S. Noël, S. Dumoulin, and G. Lindgaard, “Interpreting human
and avatar facial expressions”, in IFIP Conference on Human-
Computer Interaction, Springer, 2009, pp. 98–110.
[17] C. Nass, U. Foehr, S. Brave, and M. Somoza, “The effects of
emotion of voice in synthesized and recorded speech”, in Pro-
ceedings of the AAAI symposium emotional and intelligent II:
The tangled knot of social cognition, AAAI North Falmouth,
MA, 2001.
[18] M. Cohn, B. F. Segedin, and G. Zellou, “Imitating Siri: Socially-
mediated vocal alignment to human and device voices”, in
ICPhS, Aug. 2019, pp. 1813–1817. DOI: 10.21437/Interspeech.
[19] E. Raveh, I. Siegert, I. Steiner, I. Gessinger, and B. Möbius,
“Three’s a crowd? Effects of a second human on vocal accom-
modation with a voice assistant”, Interspeech 2019, pp. 4005–
4009, 2019.
[20] C. Snyder, M. Cohn, and G. Zellou, “Individual variation in cog-
nitive processing style predicts differences in phonetic imitation
of device and human voices”, Proc. Interspeech 2019, pp. 116–
120, 2019.
[21] M. K. Lee, S. Kiesler, and J. Forlizzi, “Receptionist or informa-
tion kiosk: How do people talk with a robot?”, in Proceedings of
the 2010 ACM conference on computer supported cooperative
work, 2010, pp. 31–40.
[22] L. Rachman, M. Liuni, P. Arias, A. Lind, P. Johansson, L. Hall,
D. Richardson, K. Watanabe, S. Dubal, and J.-J. Aucouturier,
“David: An open-source platform for real-time transformation
of infra-segmental emotional cues in running speech”, Behavior
research methods, vol. 50, no. 1, pp. 323–343, 2018.
[23] J. B. Russ, R. C. Gur, and W. B. Bilker, “Validation of affec-
tive and neutral sentence content for prosodic testing”, Behav-
ior research methods, vol. 40, no. 4, pp. 935–939, 2008. DOI:
[24] B. M. Ben-David, P. H. van Lieshout, and T. Leszcz, “A re-
source of validated affective and neutral sentences to assess
identification of emotion in spoken language after a brain in-
jury”, Brain injury, vol. 25, no. 2, pp. 206–220, 2011. DOI:
[25] M. B. Mustafa, R. N. Ainon, and R. Zainuddin, “EM-HTS:
Real-time HMM-based Malay emotional speech synthesis”, in
ISCA Workshop on Speech Synthesis, 2010. DOI: 10.22452/
[26] L. L. Elliott, “Verbal auditory closure and the speech perception
in noise (spin) test”, Journal of Speech, Language, and Hearing
Research, vol. 38, no. 6, pp. 1363–1376, 1995. DOI: 10.1044/
[27] P. Boersma and D. Weenink, “Praat”, Doing phonetics by com-
puter (Version 5.1), 2005.
[28] M. Puckette, Max/msp (version 7): Cycling’74, 2014.
[29] C.-C. Ho and K. F. MacDorman, “Revisiting the Uncanny Val-
ley theory: Developing and validating an alternative to the God-
speed indices”, Computers in Human Behavior, vol. 26, no. 6,
pp. 1508–1518, 2010.
[30] D. Bates, M. Mächler, B. Bolker, and S. Walker, “Fitting linear
mixed-effects models using lme4”, Journal of Statistical Soft-
ware, vol. 67, no. 1, pp. 1–48, 2015. DOI: 10.18637/jss.v067.
[31] A. Kuznetsova, P. B. Brockhoff, and R. H. B. Christensen,
“lmerTest package: Tests in linear mixed effects models”, Jour-
nal of Statistical Software, vol. 82, no. 13, 2017.
[32] R. H. Fazio, E. S. Pietri, M. D. Rocklage, and N. J. Shook, “Pos-
itive versus negative valence: Asymmetries in attitude forma-
tion and generalization as fundamental individual differences”,
in Advances in experimental social psychology, vol. 51, Else-
vier, 2015, pp. 97–146.
[33] P. Kuppens, F. Tuerlinckx, M. Yik, P. Koval, J. Coosemans,
K. J. Zeng, and J. A. Russell, “The relation between valence
and arousal in subjective experience varies with personality and
culture”, Journal of personality, vol. 85, no. 4, pp. 530–542,
[34] A. Lausen and A. Schacht, “Gender differences in the recogni-
tion of vocal emotions”, Frontiers in Psychology, vol. 9, p. 882,
2018. DOI: 10.3389/fpsyg.2018.00882.