Expressive Synthetic Voices: Considerations for Human Robot Interaction

S. Hennig and R. Chellali, Member, IEEE

Shannon Hennig is currently a doctoral student in the Pattern Analysis & Computer Vision (PAVIS) group at the Istituto Italiano di Tecnologia in affiliation with Università degli Studi di Genova. Ryad Chellali is also with the Pattern Analysis & Computer Vision (PAVIS) group at the Istituto Italiano di Tecnologia, Genoa, Italy (phone: +39 010 71 78 1429; e-mail: ryad.chellali@iit.it).
Abstract— As speech synthesis technology develops more
advanced paralinguistic capabilities, open questions emerge
regarding how humans perceive the use of such vocal
capabilities by robots. Perceptions of spoken interaction are
complex and influenced by multiple factors including the
linguistic content of a message, social context, perceived
intelligence of the agent, and form factor of its embodiment.
This paper shares results from a study that controlled for the
above factors in order to investigate the effect on human
listeners of a male synthetic voice with an expressive range.
Participants were randomly assigned, counterbalanced for gender and language background, to three conditions that varied how paralinguistic cues were applied. As the
voice became more expressive and appropriate for the context,
observers were more likely to describe the communication as
effective, but were less likely to refer to the unseen agent as a
person. Possible effects of listener gender and cultural-linguistic background are examined. Implications for future
methodologies in this field are discussed.
I. INTRODUCTION
Synthetic speech is a key component of many human-robot interactions. These voices are known to shape attitudes and listener behavior in the contexts of assistive technology [1], [2], human-computer interaction [3], [4], and animated avatars [5], [6]; however, the complex interactions
between how a robot is perceived and its voice and
embodiment are not fully understood [7].
Much of the previous work on human perception of
synthetic voices has focused on questions of intelligibility by
examining how easily and accurately synthesized words can
be understood, transcribed or repeated by listeners [8], [9]
and computers [10]. With the advent of new suprasegmental
capabilities, it is increasingly possible to intelligibly
synthesize an utterance in multiple ways by varying the
message’s paralinguistic cues. In fact, the ability to render a
given message in different ways could be considered the
definition of “expressiveness.” This type of expressiveness
raises new questions not addressed in the intelligibility
literature. These questions extend beyond the scope of work on emotional expression, which, while highly related to expressiveness, is not identical. Like facial expressions, body postures, and gestures, expressive synthetic voices are thought to enrich, or detract from, the linguistic content by overlaying not only emotional information but also information about social relationships, attitudes, understanding, and other interactions between speaking entities.
Further, current technology does not yet fully
approximate the expressive performance of natural speech
[11]–[13], leading to mismatches between the paralinguistic cues and the linguistic message on which they are overlaid. The effects of such mismatches are not fully understood. Complicating matters, evidence from different fields suggests that there may
be individual variation in how people respond to synthetic
voices [13–15]. This variation appears to be related, in part,
to the listener’s gender [15–17] and personality traits, such as
how intuitive or analytical the observer is [18].
Such differences have been documented in human-robot interaction. Crowell and colleagues observed that males and females, as groups, perceive embodied and disembodied synthetic voices differently [7]. Crowell posits that these results may not be due to inherent gender differences but to differences in the groups' preconceptions of the technology, and emphasizes that this is a complex area of study that eludes simple summarization [7].
Further complicating matters, expressive vocal
characteristics are believed to interact in complex ways with
linguistic content (i.e., the words being synthesized), social
situation, perceived intelligence and social competence of the
agent, and the physical appearance of the speaking agent or
person. These interactions pose methodological challenges.
One approach is twofold. First, the properties of vocal expressiveness can be studied in the context of human-to-human interactions in order to avoid confounds regarding the effects of the robot's perceived artificial intelligence and embodiment. Second, scenarios, rather than live interactions, can be carefully designed to control for the linguistic content while providing a consistent social context. Omitting either
of these factors would be unwise given that listeners interpret
paralinguistic cues by referencing them to what is being said
(linguistic content) and where it is being said (social
situation).
Recently, we conducted such an experiment. Linguistic content, embodiment of the agent, and the social and communicative context were controlled so that the impact of an expressive set of vocal styles on listeners could be isolated. In this preliminary study the following questions were
addressed: do listeners accept the spoken messages as if they were issued by a human-like agent, or do they consider them purely synthetic? This is linked to the question of whether the
added social and paralinguistic cues enrich the functioning of
the synthetic voice. This first step should provide insights on
designing effective synthetic voices and inform future
methodologies for evaluating them.
This paper is organized as follows. First, we review a selection of related literature; then we describe our experimental setup and protocols; finally, we present our results and a discussion of their implications.
II. RELATED WORK
There has been a call for a methodological shift in the
study of human perception of robots [19]. Specifically,
Coeckelbergh argues that we should focus on people’s social
reactions to the appearance and function of this technology,
and further suggests that discussions of Mori’s uncanny
valley [20], [21] and results from media and perceptual
studies would be beneficial. Dautenhahn argues that the
interesting question is to examine when people do not
respond to robots socially [22], rather than simply document
the many situations in which people anthropomorphize or
otherwise treat robots socially. One such aspect of HRI that
could interfere with or enhance these social responses is a robot’s
speaking voice.
One proposed approach is to look at how people talk
about the voices robots use [23]. Analyzing word choice and
linguistic patterns for insights into human perception is not
new [24], [25] and has been applied to HRI research in the past [26]. While human-to-human communication is not analogous to robot communication [22], there are documented cases in which people react as quickly and as frequently to a human as to a robot interviewer [26], and some social reactions are similar regardless of whether the person believes the speech output is controlled by a person or by a computer [27].
With this in mind, we present our results from an open,
free response description task regarding a theoretical human
agent using a synthetic voice. Responses were analyzed to
address the following questions:
• Does the speaking agent appear more effective as a
communicator when the voice is more expressive and
contextually appropriate?
• Does the unseen agent seem more human as the voice
becomes more expressive and contextually appropriate?
• Do the answers to these questions depend on the gender
and/or language background of observers?
III. EXPERIMENTAL METHODS
A computer-based survey methodology was used in
which targeted spoken messages were synthesized using up
to three discrete expressive speech styles. These styles varied
in terms of both prosody and voice quality and represented
three discrete points along a calm-intense continuum. A
synthesized target message was embedded into two
theoretical human-human scenarios (one target message
repeated three times per dialogue) as illustrated in figure 1.
A. Demographics of Participants
This paper describes the responses of a diverse group of
participants (n=63), see table 1. 49% of the participants
were female (31 of 63). All participants were fluent in
English and 44% were native English speakers (28 of 63).
Other language backgrounds included Arabic (1), Dutch (3),
French (5), German (2), Greek (2), Hindi (1), Hungarian (3),
Italian (13), Japanese (1), Mandarin (1), Persian (1),
Portuguese (1), and Spanish (1).
TABLE I. BREAKDOWN OF PARTICIPANTS BY LANGUAGE AND SEX

                              Experimental Conditions
  Total = 63              Flat (n=21)      Reversed (n=22)     Matched (n=20)
                          Male   Female    Male   Female       Male   Female
  English (n=28)            5      6         4      5            4      4
  Non-native
  speakers (n=35)           5      5         7      6            7      5
B. Voice
A custom synthetic voice, described in detail in [28], [29], was used for this study. It was created by taking a segmented audiobook, extracting speech features, and using an unsupervised clustering method to identify three subcorpora from which three voice styles were modeled. This resulted in an HMM-based adult male voice with three distinct voice styles that vary with regard to both prosody and voice quality.
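Although the voice-building pipeline of [28], [29] is not reproduced here, the general idea (clustering utterance-level speech features into subcorpora and modeling one style per cluster) can be sketched in R. The feature names and the use of k-means below are illustrative assumptions only, not the glottal-source-based method of [28].

  # Minimal sketch: cluster utterance-level acoustic features from an
  # audiobook into three groups, then treat each group as a subcorpus
  # from which one voice style would be trained. The feature columns
  # and k-means are illustrative assumptions; see [28], [29] for the
  # actual glottal-source-based method.
  set.seed(42)

  # Hypothetical per-utterance features (one row per utterance)
  features <- data.frame(
    mean_f0     = rnorm(300, 120, 25),   # average pitch
    f0_range    = rnorm(300, 40, 15),    # pitch variability
    energy      = rnorm(300, 65, 8),     # intensity
    speech_rate = rnorm(300, 4.5, 0.8)   # syllables per second
  )

  # Unsupervised clustering into three candidate style subcorpora
  clusters <- kmeans(scale(features), centers = 3, nstart = 20)

  # Each cluster's utterances would then train one HMM voice style
  table(clusters$cluster)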
Previous work with the same participants confirmed that
listeners perceive these voice styles as distinctive and
spontaneously labeled them as ranging on a continuum from
calm (voice A) to expressive/emphatic (voice B) to very
expressive/emphatic (voice C) [30]. 81% of participants
agreed that the three voice styles could be described as
ranging from calm to very intense. At this stage of the
research, the choice was made to use a single gender in order to reduce the number of confounding variables. A male voice was selected over a female voice because the word “robot” has been noted to be ‘male’ in many gendered languages [7].
C. Stimuli
Two scenarios were created in which the use of
paralinguistic, expressive cues would be expected by the
general population. To allow for easier interpretation of results, and because the size of any possible effects was unknown a priori, two relatively extreme cases were selected with high social expectations of paralinguistic variation in the voice: the expression of pain and the expression of enjoyment on a social date. Further, we avoided neutral scenarios based on the assumption that current speech technology already handles such scenarios fairly well. The concern raised by
Crowell et al [7] regarding people’s different preconceptions
of the capabilities of current robot technology was avoided
by framing the speaker as a person for the purposes of this
study. A small number of scenarios were selected for this
preliminary study to ensure that the participants could do
both parts of the study in less than 20 minutes.
Two target sentences were synthesized three times using one or three of the synthetic voice styles for each of the dialogues. A target sentence was embedded three times as an audio file into each scenario, as shown in figure 1. This technique
made it possible to vary the paralinguistic cues while keeping
constant the linguistic content of the message (e.g., “that
hurts” or “I had a great time”), the linguistic context
surrounding the target message, and the overall social-
communicative contexts. Further, using three samples (instead of only one) was necessary so that participants could compare relative differences between the paralinguistic cues in each synthesized target sentence, while the social context provided background to inform their impressions.
D. Tasks and conditions
Participants were randomly assigned, after counterbalancing for gender and language background, to one of three conditions that varied which voice styles were used to synthesize the target sentence. The conditions will be referred to as flat, reversed (rev), and matched (mat), in which
the sentences were repeated three times within the two
dialogues, as illustrated in figure 1. Participants listened to
and subsequently judged each scenario in turn.
Figure 1. Scenario 2 with the phrase “that hurts” embedded three times.
Below the three experimental conditions are illustrated with horizontal lines
representing the situationally appropriate response and the speech bubbles representing the voice style used for that condition, with C being the most extreme and emphatic voice style.
In the matched condition, participants were exposed to all three voice styles, applied in the order previously shown to be consistent with situational expectations. Thus, in
the matched condition listeners heard a voice that was both
expressive and theoretically appropriate for the context.
In the reversed condition, participants were exposed to all
three voice styles, but in the opposite order of what was
expected for the context. For example, if the context suggested that a person would gradually become more animated, in the reversed condition the voice would become calmer and less extreme with each spoken utterance; in effect, the A voice and the C voice were switched.
In the flat condition, participants heard the middle,
somewhat expressive/emphatic voice style (voice B) all three
times. There was no expressive variation between utterances
because the same audio sample was embedded each time.
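For clarity, the three conditions can be summarized as orderings of the three voice styles across the three repetitions of the target sentence. The R sketch below assumes a scenario whose situationally appropriate progression runs from calm (A) to very expressive (C); the appropriate direction depends on the scenario.

  # Voice-style orderings for the three conditions, assuming a scenario
  # whose situationally appropriate progression runs from calm (A) to
  # very expressive (C); the direction is scenario-dependent.
  styles <- c("A", "B", "C")     # calm -> expressive -> very expressive

  conditions <- list(
    matched  = styles,           # follows the expected progression
    reversed = rev(styles),      # A and C swapped relative to matched
    flat     = rep("B", 3)       # same mid-range sample all three times
  )

  conditions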
E. Measures
After each dialogue, participants typed their observations following the deliberately open-ended writing prompt “How did he sound to you?” in order to elicit richer descriptions than a simple numerical rating could capture. The 126 responses (2 scenarios x 63 participants) were analyzed for two metrics: (a) whether people spontaneously referred to the speaker as a person or not and (b) whether responses indicated that the speaker appeared to be communicating his intentions effectively or not.
Social and communicative effectiveness was chosen
because successful communication is a primary goal of any
spoken interaction and goes beyond intelligibility into the realm of how the use of paralinguistic cues influences functional outcomes, such as listener judgment. For example, regardless of how intelligible a message is, sounding “honest” or “truthful” is more effective in most situations, but less so if the person intends to be sarcastic. In
some situations it is more effective to sound warm and caring (e.g., on a date or when offering condolences), whereas in others sounding indifferent is more effective (e.g., when playing poker).
Whether the speaker was referred to as a person was selected as a metric after it was unexpectedly observed that several comments did not refer to the agent as a person, even though all participants were explicitly told that the voice was a recording of a human using a speech synthesizer for communication. This observation seems noteworthy given literature suggesting that people are more likely to use personal pronouns when speaking about non-human agents that seem intelligent [31], and given questions related to how people anthropomorphize technology.
It was hypothesized that both the proportion of effective
ratings and use of person-based references would be higher in
the conditions in which the voice used an expressive range
and/or was more contextually appropriate. In other words,
both metrics were expected to increase from the flat to the
reversed to the matched conditions. Further, it was anticipated that if any language or gender effects were present, women and native English speakers would be more sensitive to vocal variations: women because as a group they are thought to attend more closely to social cues, particularly from a male voice, and native English speakers because of their assumed greater knowledge of how to appropriately use paralinguistic cues in that language.
IV. RESULTS
In the following sections, the two metrics described above will be reported in turn, first for overall trends in the entire sample and second by gender and language background. All statistical tests were computed using the R statistical software package [32]. For comparisons between conditions and groups, two-tailed Fisher exact tests were calculated to determine significance with a criterion of p < 0.05 unless otherwise noted.
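As a concrete illustration of the test used throughout this section, the R [32] sketch below runs a two-tailed Fisher exact test on a 2x2 table of coded responses for two conditions; the counts are hypothetical placeholders, not the study data.

  # Hypothetical 2x2 table: rows = conditions (flat vs. matched),
  # columns = responses coded effective vs. ineffective.
  # These counts are placeholders, not the study data.
  counts <- matrix(c(12, 25,    # flat:    effective, ineffective
                     30,  6),   # matched: effective, ineffective
                   nrow = 2, byrow = TRUE,
                   dimnames = list(condition = c("flat", "matched"),
                                   coding    = c("effective", "ineffective")))

  # Two-tailed Fisher exact test; compare the p-value to the 0.05 criterion
  fisher.test(counts, alternative = "two.sided")$p.value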
A. Inter-annotator agreement
All 126 responses were coded by two annotators. For the
first set of annotations, communicative effectiveness, two
native speakers of American English annotated all the written
evaluations and agreed on 118 items (inter-annotator agreement of 93.6%, unweighted Cohen's kappa of 0.88, with a 95% confidence interval of 0.79-0.96). For the second metric, the two authors annotated the presence of person-referential language in the written responses and achieved a 96% agreement rate (121 out of 126 responses, unweighted Cohen's kappa of 0.93, with a 95% confidence interval of 0.86-0.99).
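For reference, unweighted Cohen's kappa can be computed directly from a two-annotator agreement table, as in the R sketch below; the counts are an arbitrary split consistent with the reported 118 agreements out of 126 items and are not the actual annotation data.

  # Unweighted Cohen's kappa computed from a 2x2 agreement table.
  # The counts are hypothetical placeholders (118 agreements out of
  # 126 items, as reported in the text, split arbitrarily).
  tab <- matrix(c(70,  5,     # annotator 1 "effective":   agree, disagree
                   3, 48),    # annotator 1 "ineffective": disagree, agree
                nrow = 2, byrow = TRUE,
                dimnames = list(annotator1 = c("effective", "ineffective"),
                                annotator2 = c("effective", "ineffective")))

  po    <- sum(diag(tab)) / sum(tab)                      # observed agreement
  pe    <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
  kappa <- (po - pe) / (1 - pe)                           # Cohen's kappa
  kappa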
B. Metric 1: Communicative effectiveness of the speaker
Participant responses were coded in the following way:
• Effective = speaker’s communication was judged by the
survey participant as communicatively effective and
socially desirable for the presumed speaker intentions
and given context.
• Ineffective = speaker’s communication was judged by
the survey participant to be ineffective and not socially
desirable for the presumed speaker intentions.
• Unclassified = not enough information to code the response, or a response indicating that the communication was both effective and ineffective.
In the first dialogue, the speaker attempts to convince his social partner that he enjoyed her company with the phrase “I had a great time”. For this dialogue, examples of
responses coded effective included “sincere, keen”, “he
sounds genuine and is responding to the situation”, and
“certain that he had a great time”. Responses coded
ineffective for this dialogue included “he seems indifferent”
and “I can’t grasp the depth of his emotion, or whether he is
truthful”.
In the second dialogue, the speaker is attempting to
convey three levels of pain from acute and intense pain to
merely describing a past pain event. Responses coded
effective included “it sounds as [sic] he was really feeling
pain”, “involved, able to give appropriate intonation”, and
“very realistic, each answer had a different inflection and
emphasis”. Ineffective examples include “he sounded very
unemotional, as if it didn’t really hurt” and “he doesn’t feel
as bad as he pretends!”.
The proportion of responses coded as effective,
ineffective or unclassifiable are presented in Figure 2. A
small percentage of responses were determined to be
unclassifiable due to conflicting information within the
response or inadequate detail to make a determination.
Figure 2. Relative proportion of written responses coded as effective communication. Significant differences between conditions (p<0.001, Fisher exact test) are marked with horizontal bars below the graph. 1.0 represents 100% of participants assigned to that condition.
As predicted by the hypothesis, as a group, the number of
responses judged as indicating effective communication by
the unseen speaking agent increased as the voice became
more variable (flat to reversed condition, p<0.007) and more
contextually appropriate (flat to matched, p<0.00001;
reversed to matched, p<0.003). There was no effect of
gender or language background for this metric and
subgroups showed an upward trend similar to the group
trend shown in figure 2.
C. Metric 2: Linguistic reference of speaker as a person
The second metric regarded the spontaneous word
choices of participants in reference to the speaking agent.
Although participants had been primed to think of the speaker as a man in the listening test immediately preceding this study, many subjects did not linguistically refer to the speaker as a person. Specifically, there were 26 linguistic references to the speaker as a man and one photo with his face obscured. Additionally, the first dialogue had 8
references to the speaker as a man and the second dialogue
had 12. Further, the prompt for the open ended responses
also explicitly referred to the speaker as a man. Yet, many
participants did not refer to the speaking agent as a person. In
other words, they did not use personal pronouns (e.g., he,
him, his) nor phrases such as “the man”, “like someone”, “for
a person” when answering the question. Instead this subset of
participants referred to the speaker with impersonal language
such as “… like a computer”, “a little flat and robotic”, and
“it sounded ok, but not quite right”.
All responses were categorized based on whether the
response contained a clear reference to the speaker as a
person or not. Eight responses were considered to be
unclassifiable because they were sentence fragments that
omitted a grammatical subject (e.g., “emotionless, cold”, “a
bit repetitive”). Figure 3 illustrates the relative proportions of
participant responses that referred to the speaker as a person.
Figure 3. Proportion of written responses explicitly referring to the speaker as a person. Significant differences between conditions are marked with horizontal bars below the graph (p<0.05, Fisher exact test).
In contrast to the hypothesis, the proportion of people
referring to the agent as a human speaker did not increase as
the voice became more variable and more appropriate for the
situation (see figure 3). In fact, the use of such language was most common in the flat condition, which is arguably the condition least like how a person actually speaks, and then decreased for the conditions in which the use of the voice was theoretically better (flat versus reversed p<0.006; flat versus matched p<0.038) and was judged by these listeners as more effective (see results of metric 1 above).
Further, gender and language differences were observed
for this metric. As illustrated in Figure 4, three distinct
patterns emerged: (a) native English speaking men referred
to the speaker as a person more frequently as the voice
became more expressive and contextually appropriate, (b) the opposite trend emerged for women who learned English as a first language and men who had learned it as a second language, and (c) women who learned English as a second language were also less likely to refer to the agent as a person in the reversed condition, but this trend reversed direction for the matched condition, resulting in a dip between conditions rather than a trend in a single direction.
Figure 4. Relative proportion of written responses referring to the speaker
as a person, segmented by participant gender and language. Significant
differences between conditions (p<0.05, Fisher exact test) marked with
solid lines below graph and differences approaching significance (p<0.06)
with dashed lines.
Given that all four groups showed similar linguistic behaviors in at least one condition, it appears that all groups were capable of such linguistic behavior, so these effects are unlikely to be due to inherent differences in the linguistic capabilities of women or of people who speak English as a second language.
The flat condition is particularly interesting in that this is
most similar to currently employed technology in which a
voice does not vary with the social context. Native English speaking men were statistically less likely to refer to the agent as a person in this condition than people in all other groups (English women p<0.03, non-native men p<0.019, and non-native women p<0.028). Further, the drop in usage of such language between the flat condition and the reversed condition was significant or near significant for the three other groups (English women approaching significance at p<0.056, non-native men p<0.02, non-native women p<0.01).
These results suggest group differences in sensitivity to the expressive differences in the voice, in tolerance for mismatches between the voice and the situation, or in how these factors influence how listeners refer to the speaker.
V. CONCLUSION AND DISCUSSION
Sixty-three people were surveyed regarding their opinions
of how a male synthetic voice sounded when repeating a
target sentence embedded in two social vignettes. Strong conclusions are not warranted given the small sample size and the experimental design; however, with that caveat stated, our analysis reveals that the two metrics hypothesized to move in parallel (perceived communicative effectiveness and how likely one is to refer to the speaker as a person) in fact did not (see figure 5).
Figure 5. Comparison of the two metrics: communication perceived as effective on the bottom and communication perceived as human-like on top.
In answer to the questions asked in section II, the
speaking agent appeared more effective when the voice
varied paralinguistic cues in a manner that was more
expressive and contextually appropriate. In contrast, people were more likely to speak of the agent like a person when the voice did not use an expressive range; in other words, when the voice was less expressive, people spoke of the agent more frequently using the words he, him, and his. This supports the justification for future work to confirm these findings with additional scenarios and directly applicable experiments.
Group differences were observed, which could be due to differences in sensitivity to variation in the expressive range, differences in tolerance for mismatches between the voice and the situation, or differences in how such factors influence verbal descriptions of the situation. Further work is needed to address this question.
The results of this current preliminary study have the
following implications:
• The use of expressive voices within a given context has the potential to influence listener judgment: users judge the system more effective when voices are expressive (regardless of the context).
• How human, and perhaps how socially relatable, the voice appears was higher for the neutral speech synthesis that did not vary with the context.
Relating these results to the metaphor of the uncanny valley [20], human reactions to voices with an expressive range may present a similar ‘valley’. Specifically, as synthetic paralinguistic capabilities improve in their ability to approximate human speech, human listeners’ reactions to such speech technology may follow the curve of such a valley, but the location of any dip may differ for different groups. One interpretation of figure 4 is that some groups of people may be entering, leaving, or in the middle of an uncanny valley regarding the use of paralinguistic expressive speech cues for the social contexts included in our experiment.
Expressiveness appears to be a relevant factor, in addition to other known factors such as intelligibility, influencing functional outcomes, and one that can be studied with current speech synthesis technology. From a methodological point of view, these
results reinforce the importance of using carefully designed
sampling procedures, and avoiding samples of convenience,
given that different trends were observed that would have
been masked if only native English speaking men had been
included.
In summary, differences in how listeners react to the appropriate and inappropriate use of an expressive range of synthetic voice styles can be observed in theoretical
scenarios. These results add to the small but growing
knowledge base on the effects of the use of expressive
voices, which builds on the separate and relatively more
established and extensive knowledge of the impact of factors
such as speech intelligibility, a voice’s gender, and word
choice on human interaction with speaking agents, including
robots.
ACKNOWLEDGMENT
The authors are grateful to Éva Székely and her colleagues in the CNGL at University College Dublin for granting access to the synthetic voices used in this research and for her work on developing the original survey from which these results are derived.
REFERENCES
[1] J. McCarthy and J. Light, “Attitudes toward
individuals who use augmentative and alternative
communication: Research review,” Augmentative and
Alternative Communication, vol. 21, no. 1, pp. 41–55,
2005.
[2] R. Schlosser, “Roles of speech output in
augmentative and alternative communication:
narrative review,” Augmentative and Alternative
Communication, vol. 19, no. 1, pp. 5–27, 2003.
[3] M. Jonsson, “Social and emotional characteristics of
speech-based in-vehicle information systems: impact
on attitude and driving behaviour,” Department of
Computer and Information Science, Linköping
University, 2009.
[4] C. I. Nass and S. Brave, Wired for speech: how voice
activates and advances the human-computer
relationship. Cambridge, MA: MIT Press, 2005.
[5] X. Ma, B. H. Le, and Z. Deng, “Perceptual analysis of
talking avatar head movements: A quantitative
perspective,” in Proceedings of the 2011 annual
conference on Human Factors in computing systems,
Vancouver, BC, Canada, 2011, pp. 2699–2702.
[6] P. Persson, “ExMS: an animated and avatar-based
messaging system for expressive peer
communication,” in Proceedings of the 2003
international ACM SIGGROUP conference on
Supporting group work, Sanibel Island, Florida, USA,
2003, pp. 31–39.
[7] C. R. Crowell, M. Scheutz, P. Schermerhorn, and M.
Villano, “Gendered voice and robot entities:
Perceptions and reactions of male and female
subjects,” in IEEE/RSJ International Conference on
Intelligent Robots and Systems, IROS 2009, 2009, pp.
3735–3741.
[8] K. D. Drager, J. Reichle, and C. Pinkoski,
“Synthesized speech output and children: A scoping
review,” American Journal of Speech-Language
Pathology, vol. 19, no. 3, p. 259, 2010.
[9] P. Côté-Giroux, N. Trudeau, C. Valiquette, A. Sutton,
E. Chan, and C. Hébert, “Évaluation de neuf synthèses
vocales françaises basée sur l’intelligibilité et
l’appréciation [Assessment of nine French synthesized
voices based on intelligibility and quality],” Revue
canadienne d’orthophonie et d’audiologie, vol. 35,
no. 4, 2011.
[10] W. M. Liu, K. A. Jellyman, J. S. D. Mason, and N.
W. D. Evans, “Assessment of objective quality
measures for speech intelligibility estimation,” in
Acoustics, Speech and Signal Processing, 2006.
ICASSP 2006 Proceedings. 2006 IEEE International
Conference on, 2006, vol. 1, p. I–I.
[11] N. Campbell, “Getting to the heart of the matter:
Speech as the expression of affect; rather than just text
or language,” Language Res Eval, vol. 39, no. 1, pp.
109–118, Feb. 2005.
[12] M. Schröder, “Expressive speech synthesis: Past,
present, and possible futures,” in Affective Information
Processing, 1st ed., J. Tao and T. Tan, Eds. Springer,
2009, pp. 111–126.
[13] K. W. Lee, K. Liao, and S. Ryu, “Children’s
responses to computer-synthesized speech in
educational media: gender consistency and gender
similarity effects,” Human communication research,
vol. 33, no. 3, pp. 310–329, 2007.
[14] K. Fischer, “Interpersonal variation in understanding
robots as social actors,” in Proceedings of the 6th
international conference on Human-robot interaction,
Laussane, Switzerland, 2011, pp. 53–60.
[15] D. Gorenflo and C. Gorenflo, “Effects of synthetic
speech, gender, and perceived similarity on attitudes
toward the augmented communicator,” Augmentative
and Alternative Communication, vol. 13, no. 2, pp.
87–91, Jan. 1997.
[16] A. Beck and M. Dennis, “Attitudes of children toward
a similar-aged child who uses augmentative
communication,” Augmentative and Alternative
Communication, vol. 12, no. 2, pp. 78–87, Jan. 1996.
[17] M. Lilienfeld and E. Alant, “Attitudes of children
toward an unfamiliar peer using an AAC device with
and without voice output,” Augmentative and
Alternative Communication, vol. 18, no. 2, pp. 91–
101, 2002.
[18] E.-J. Lee, “The more humanlike, the better? how
speech type and users’ cognitive style affect social
responses to computers,” Computers in Human
Behavior, vol. 26, no. 4, pp. 665–672, Jul. 2010.
[19] M. Coeckelbergh, “Personal robots, appearance, and
human good: a methodological reflection on
roboethics,” International Journal of Social Robotics,
vol. 1, no. 3, pp. 217–221, 2009.
[20] M. Mori, “The uncanny valley,” translation by Karl
MacDorman and Takashi Minato, Energy, vol. 7, no.
4, pp. 33–35, 1970.
[21] C. Bartneck, T. Kanda, H. Ishiguro, and N. Hagita,
“Is the uncanny valley an uncanny cliff?,” in The 16th
IEEE International Symposium on Robot and Human
interactive Communication, Jeju, Korea, 2007, pp.
368–373.
[22] K. Dautenhahn, “Methodology and themes of human-
robot interaction: A growing research field,”
International Journal of Advanced Robotic Systems,
vol. 4, no. 1, pp. 103–108, 2007.
[23] M. Coeckelbergh, “You, robot: on the linguistic
construction of artificial others,” AI & SOCIETY, vol.
26, no. 1, pp. 61–69, Aug. 2010.
[24] A. Mulac, J. J. Bradac, and P. Gibbons, “Empirical
support for the gender-as-culture hypothesis,” Human
Communication Research, vol. 27, no. 1, pp. 121–152,
2001.
[25] J. W. Pennebaker, M. R. Mehl, and K. G.
Niederhoffer, “Psychological aspects of natural
language use: Our words, our selves,” Annual review
of psychology, vol. 54, no. 1, pp. 547–577, 2003.
[26] S. R. Fussell, S. Kiesler, L. D. Setlock, and V. Yew,
“How people anthropomorphize robots,” in
Proceedings of the 3rd ACM/IEEE international
conference on Human robot interaction, Amsterdam,
2008, pp. 145–152.
[27] C. Nass and Y. Moon, “Machines and mindlessness:
Social responses to computers,” Journal of Social
Issues, vol. 56, no. 1, pp. 81–103, 2000.
[28] É. Székely, J. P. Cabral, P. Cahill, and J. Carson-
Berndsen, “Clustering expressive speech styles in
audiobooks using glottal source parameters,” in
Twelfth Annual Conference of the International
Speech Communication Association, Interspeech,
Florence, Italy, 2011, pp. 2409–2412.
[29] É. Székely, J. P. Cabral, M. Abou-Zleikha, P. Cahill,
and J. Carson-Berndsen, “Evaluating expressive
speech synthesis from audiobooks in conversational
phrases,” in Proceedings of the Eighth International
Conference on Language Resources and Evaluation,
LREC '12, Istanbul, Turkey, 2012.
[30] S. Hennig, É. Székely, J. Carson-Berndsen, and R.
Chellali, “Listener evaluation of an expressiveness
scale in speech synthesis for conversational phrases:
implications for AAC,” in the International Society for
Augmentative and Alternative Communication 15th
Biennial Conference, Pittsburgh, PA, USA, 2012.
[31] K. E. Scheibe and M. Erwin, “The Computer as
Alter,” The Journal of Social Psychology, vol. 108,
no. 1, pp. 103–109, 1979.
[32] R Development Core Team, R: A Language and
Environment for Statistical Computing. Vienna,
Austria, 2008. http://www.R-project.org.