Does the Appearance of an Agent Affect How We Perceive
his/her Voice? Audio-visual Predictive Processes
in Human-robot Interaction
Busra Sarigul
Interdisciplinary Social Psychiatry Program, Ankara University, Turkey
Department of Psychology, Nuh Naci Yazgan University, Turkey
busra.srgl@gmail.com
Batuhan Hokelek
Department of Psychology
Bilkent University, Turkey
batuhan.hokelek@ug.bilkent.edu.tr
Imge Saltik
Interdisciplinary Neuroscience Program
Bilkent University, Turkey
imge.saltik@bilkent.edu.tr
Burcu A. Urgen
Department of Psychology & Interdisciplinary Neuroscience Program
National Magnetic Resonance Research Center (UMRAM)
Aysel Sabuncu Brain Research Center
Bilkent University, Turkey
burcu.urgen@bilkent.edu.tr
ABSTRACT
Robots are increasingly becoming part of our lives. How we perceive
and predict their behavior has been an important issue in HRI. To
address this issue, we adapted a well-established prediction
paradigm from cognitive science for HRI. Participants listened to a
greeting phrase that sounded either human-like or robotic. They
indicated whether the voice belonged to a human or a robot as fast
as possible with a key press. Each voice was preceded by an image of
a human or a robot (either a human-like robot or a mechanical robot)
to cue the participant about the upcoming voice. The image was
either congruent or incongruent with the sound stimulus. Our
findings show that people reacted faster to robotic sounds in
congruent trials than in incongruent trials, suggesting a role for
predictive processes in robot perception. In sum, our study
provides insights into how robots should be designed, and
suggests that designing robots that do not violate our expectations
may result in more efficient interaction between humans and
robots.
CCS CONCEPTS
• Human-centered computing → Human-computer interaction (HCI) → HCI design and evaluation methods → User studies
KEYWORDS
Humanoid robots, robot design, audio-visual mismatch, prediction,
robotic voice, human perception, cognitive sciences
ACM Reference format:
Busra Sarigul, Imge Saltik, Batuhan Hokelek, Burcu A. Urgen. 2020. Does
the Appearance of an Agent Affect how we Perceive his/her Voice? Audio-
visual Predictive Processes in Human-robot Interaction. In Proceedings of
ACM HRI conference (HRI’20), March 23-26, 2020, Cambridge, UK. ACM, NY,
NY, USA, 3 pages. https://doi.org/10.1145/3371382.3378302
1. INTRODUCTION
One fundamental cognitive mechanism humans possess is the ability
to make predictions about what will happen in their
environment [1]. This skill is essential for taking appropriate
action based on what we perceive. The literature on predictive
processing focuses heavily on visual perception.
However, our daily experience is multi-modal in nature [2]. For
instance, imagine that a friend of yours sees you across the street
and waves to you. Upon seeing his/her gesture, you probably
predict that he/she will say “Hello!” to you. Thus, based on what
you see, you can have an expectation about what you will hear. As
humanoid robots increasingly become part of environments such as
hospitals, airports, and schools, it is likely that we learn from our
multimodal experiences with them [3-5] and predict how they will
behave in our interactions. For instance, if a robot companion waves
to us, we may expect it to immediately say “Hello!”. Moreover, how a
robot appears can give us clues about how it will behave, and
accordingly affect how we perceive it [6-7]. Previous research shows
that when there is a
mismatch between what we expect and what we perceive while
interacting with artificial agents, we may find them unacceptable,
eerie, or uncanny [8-11]. The aim of the present study is to
investigate whether we make predictions about how robots behave
based on how they appear, and if so what kind of predictions we
make, and whether those predictions are similar to the ones we
make for humans.
Figure 1. Visual stimuli in the experiment
2. METHODS
Seventeen healthy adults from Bilkent University (9 females; mean
age = 23.47 years, SD = 2.70) participated in the experiment. All
participants had normal vision and no history of neurological
disorders.
Figure 2. Results of Condition 1 and Condition 2, and Comparison of Condition 1 and Condition 2.
Before the experiment, all subjects signed a consent form approved
by the Ethics Committee of Bilkent University.
Visual stimuli: The visual stimuli consist of static images of three
agents. We call them Human, Android, and Robot (Figure 1, and
also [9, 10, 12, 13]). Android and Robot are the same machine with
two different appearances; Android is the more human-like of the
two.
Auditory stimuli: The auditory stimuli consist of two sound files,
each lasting 2 seconds: the voice of a human saying ‘Good morning’
(human voice), and a modified version of it in which the voice
sounds robotic (robotic voice). To make the natural human voice
sound robotic, the frequency of the original sound file was
manipulated. To determine whether the manipulation worked and
people really found the sound robotic, we conducted a pilot study
in which we created 14 different sound files with the frequency
manipulated between -5 Hz and -20 Hz in steps of 2 Hz, using the
software Audacity 2.3.0. Based on the average ratings across
subjects, we identified the most robotic-sounding voice, and this
voice was used in the main experiment together with the original
human voice. We also added white noise to both sound files to make
the task harder.
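As an illustration, a manipulation of this kind could be approximated programmatically as in the minimal sketch below. The stimuli in the study were created in Audacity 2.3.0, which manipulates frequency directly; the sketch instead uses librosa, which expresses pitch shifts in semitones, and the file names, shift values, and signal-to-noise ratio are illustrative assumptions rather than the study's actual parameters.

# Illustrative sketch only: approximates the stimulus manipulation with a
# pitch shift plus added white noise. File names and values are assumed.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("good_morning_human.wav", sr=None)  # 2-sec human recording (assumed file)

# Generate candidate robotic voices by lowering the pitch in small steps
# (librosa shifts in semitones; the paper specified its manipulation in Hz).
for n_steps in np.arange(-0.5, -4.0, -0.5):
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=float(n_steps))
    sf.write(f"good_morning_shift_{abs(n_steps):.1f}.wav", y_shifted, sr)

def add_white_noise(signal, snr_db=10.0):
    """Mix Gaussian white noise into the signal at a given SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return np.clip(signal + noise, -1.0, 1.0)

# White noise is added to both the human voice and the chosen robotic voice
# to make the human/robot judgement harder.
y_noisy = add_white_noise(y, snr_db=10.0)
sf.write("good_morning_human_noisy.wav", y_noisy, sr)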
Procedure: The subjects participated in an experiment that included
two conditions. Both conditions consisted of 5 blocks, each of which
had 80 trials. Each trial started with a fixation cross (1 sec),
followed by a visual cue (1 sec): an image of Human or Robot in
Condition 1, and of Human or Android in Condition 2. The visual cue
was followed by a 2-sec auditory stimulus, a human voice or a
robotic voice. The subjects’ task was to indicate with a key press
whether the auditory stimulus was a human voice or a robotic voice.
The visual cue informed the subjects about the upcoming auditory
stimulus: in 80% of the trials the visual cue was congruent with the
auditory stimulus, and in 20% of the trials it was incongruent. The
order of the conditions was counter-balanced across subjects.
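For concreteness, the trial structure of one condition could be generated along the lines of the sketch below. This is a hypothetical illustration in Python, not the software used to run the experiment; names such as make_block and the per-trial timing fields are assumptions based on the description above.

# Illustrative sketch of one condition (e.g., Condition 1 with Human vs. Robot
# cues): 5 blocks of 80 trials, with 80% congruent cue-voice pairings.
import random

CUES = ["human", "robot"]       # visual cue categories in Condition 1
N_BLOCKS = 5
TRIALS_PER_BLOCK = 80
P_CONGRUENT = 0.8

def make_block(rng):
    trials = []
    for cue in CUES:
        n_cue = TRIALS_PER_BLOCK // len(CUES)        # 40 trials per cue
        n_congruent = round(P_CONGRUENT * n_cue)     # 32 congruent, 8 incongruent
        for i in range(n_cue):
            congruent = i < n_congruent
            # Congruent: robot cue -> robotic voice, human cue -> human voice.
            voice = cue if congruent else ("human" if cue == "robot" else "robot")
            trials.append({"cue": cue, "voice": voice, "congruent": congruent,
                           "fixation_s": 1.0, "cue_s": 1.0, "voice_s": 2.0})
    rng.shuffle(trials)
    return trials

rng = random.Random(0)
blocks = [make_block(rng) for _ in range(N_BLOCKS)]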
3. RESULTS
Condition 1: There was a main effect of congruency on reaction
times (F(1,16) = 13.47, p<0.05). Subjects responded to auditory
targets significantly faster in congruent trials (M = 1.21 sec, SD =
0.04) than in incongruent trials (M = 1.30 sec, SD = 0.03). We also
conducted pairwise t-tests between the congruent and incongruent
conditions for each visual cue (Human and Robot) separately.
When the visual cue was Robot, subjects were significantly
faster in congruent trials than in incongruent trials (t(16) = -8.02,
p<0.05). When the visual cue was Human, subjects performed
similarly in the congruent and incongruent conditions (t(16) = -0.82,
p=0.94) (See Figure 2, left).
Condition 2: There was a main effect of congruency on reaction
times. Subjects responded faster in congruent trials (M = 1.22 sec, SD =
0.04) than in incongruent trials (M = 1.28 sec, SD = 0.04) (F(1,16) = 16.16,
p<0.05). Furthermore, we performed pairwise t-tests between the
congruent and incongruent conditions for each visual cue (Human
and Android) separately. When the visual cue was Android,
subjects were significantly faster in congruent trials than in
incongruent trials (t(16) = -7.98, p<0.05). When the visual cue was
Human, subjects performed similarly in the congruent and
incongruent conditions (t(16) = 0.58, p=0.60) (See Figure 2).
Comparison of Condition 1 and Condition 2: We investigated
whether the human-likeness of the visual cue (Human, Android,
Robot) affects how people categorize the auditory target, and whether
it interacts with congruency. We therefore compared Condition 1 and
Condition 2. There was no significant difference between the
Human trials in Condition 1 and those in Condition 2, so
we included only one set of Human trials in the analysis. There
was a main effect of congruency on reaction times (F(1,16) = 31.22,
p<0.05) but there was no main effect of visual cue (F(1,16) = 0.01,
p=0.99). However, interestingly, there was an interaction between
congruency and visual cue (F(1,16) = 28.74, p<0.05). A closer look
at the pattern of results showed that the difference between the
congruent and incongruent conditions was largest for Robot,
followed by Android, followed by Human (See Figure 2, right).
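The structure of the analyses reported above (a congruency × visual cue repeated-measures ANOVA followed by pairwise t-tests per cue) can be illustrated with the sketch below. This is not the authors' analysis code: it assumes per-subject mean reaction times in a long-format pandas DataFrame with columns subject, cue, congruency, and rt, uses pingouin for the ANOVA, and fills the DataFrame with synthetic placeholder data so that the snippet runs on its own.

# Illustrative analysis sketch with synthetic placeholder data (not the study's data).
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)
rows = []
for s in range(17):                                  # 17 subjects, as in the study
    for cue in ["human", "robot"]:
        for congruency in ["congruent", "incongruent"]:
            rows.append({"subject": s, "cue": cue, "congruency": congruency,
                         "rt": 1.25 + rng.normal(0, 0.05)})
df = pd.DataFrame(rows)

# Two-way repeated-measures ANOVA: congruency x visual cue.
aov = pg.rm_anova(data=df, dv="rt", within=["congruency", "cue"], subject="subject")
print(aov)

# Pairwise congruent vs. incongruent comparisons, separately for each visual cue.
for cue, sub in df.groupby("cue"):
    wide = sub.pivot(index="subject", columns="congruency", values="rt")
    t, p = stats.ttest_rel(wide["congruent"], wide["incongruent"])
    print(f"{cue}: t({len(wide) - 1}) = {t:.2f}, p = {p:.3f}")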
4. DISCUSSION
We hypothesized that people would be faster at judging how an
agent sounded if the voice was preceded by a congruent visual cue
rather than an incongruent one. Our results show that if the visual
cue is a robot, people expect it to sound robotic, as demonstrated
by shorter reaction times in the congruent condition (robot cue and
robotic voice) than in the incongruent condition (robot cue and
human voice). This was true whether the robot had a more human-like
appearance (Android) or a less human-like appearance (Robot). An
unexpected finding in our study was that people responded to the
human voice similarly regardless of the visual cue that preceded it. One
possible explanation for these results is that the human voice
stimulus was not ambiguous enough. People are more likely to use
cues (or prior knowledge) when the task at hand is hard (e.g. the
stimulus is ambiguous) [14]. One way to resolve this issue is to
increase the white noise in the voice stimuli and make the task
harder. Nevertheless, the use of a well-established prediction
paradigm from cognitive sciences in the present study has opened
a new avenue of research in HRI. Appearance and voice are only two
of the many features for which we seek a match in agent
perception. Future work should investigate what features of
artificial agents make us form expectations, how we do that, and
under what conditions these expectations are violated.
5. REFERENCES
[1] A. Clark (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36 (3), 181-204.
[2] O. Doehrmann, M.J. Naumer (2008). Semantics and the multisensory brain: how meaning modulates processes of audio-visual integration. Brain Research, 1242, 136-150.
[3] C. Nass, S. Brave (2005). Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. MIT Press, Cambridge, MA.
[4] S.E. Stern, J.W. Mullennix, I. Yaroslavsky (2006). Persuasion and social perception of human vs. synthetic voice across person as source and computer as source conditions. International Journal of Human-Computer Studies, 64, 43-52.
[5] M.L. Walters, D.S. Syrdal, K.L. Koay, K. Dautenhahn, R. te Boekhorst (2008). Human approach distances to a mechanical-looking robot with different voice styles. Proceedings of RO-MAN, Munich, Germany.
[6] K. Zibrek, E. Kokkinara, R. McDonnell (2018). The effect of realistic appearance of virtual characters in immersive environments: does the character's personality play a role? IEEE Transactions on Visualization and Computer Graphics, 24 (4), 1681-1690.
[7] C. Mousas, D. Anastasiou, O. Spantidi (2018). The effects of appearance and motion of virtual characters on emotional reactivity. Computers in Human Behavior, 86, 99-108.
[8] M. Mori, K.F. MacDorman, N. Kageki (2012). The uncanny valley [from the field]. IEEE Robotics & Automation Magazine, 19 (2), 98-100.
[9] B.A. Urgen, M. Kutas, A.P. Saygin (2018). Uncanny valley as a window into predictive processing in the social brain. Neuropsychologia, 114, 181-185.
[10] W.J. Mitchell, K.A. Szerszen, A.S. Lu, P.W. Schermerhorn, M. Scheutz, K.F. MacDorman (2011). A mismatch in the human realism of face and voice produces an uncanny valley. i-Perception, 2 (1), 10-12.
[11] A.P. Saygin, T. Chaminade, H. Ishiguro, J. Driver, C. Frith (2012). The thing that should not be: predictive coding and the uncanny valley in perceiving human and humanoid robot actions. Social Cognitive and Affective Neuroscience, 7, 413-422.
[12] B.A. Urgen, M. Plank, H. Ishiguro, H. Poizner, A.P. Saygin (2013). EEG theta and mu oscillations during perception of human and robot actions. Frontiers in Neurorobotics, 7:19.
[13] B.A. Urgen, S. Pehlivan, A.P. Saygin (2019). Distinct representations in occipito-temporal, parietal, and premotor cortex during action perception revealed by fMRI and computational modeling. Neuropsychologia, 127, 35-47.
[14] F.P. de Lange, M. Heilbron, P. Kok (2018). How do expectations shape perception? Trends in Cognitive Sciences, 22 (9), 764-779.