International Journal of Social Robotics (2023) 15:855–865
https://doi.org/10.1007/s12369-023-00990-6
Audio–Visual Predictive Processing in the Perception of Humans
and Robots
Busra Sarigul¹ · Burcu A. Urgen²,³,⁴
Accepted: 3 March 2023 / Published online: 5 April 2023
© The Author(s) 2023

Correspondence: Busra Sarigul, b.sariguel@iwm-tuebingen.de · Burcu A. Urgen, burcu.urgen@bilkent.edu.tr

1 Leibniz-Institut für Wissensmedien, Tübingen, Germany
2 Department of Psychology, Bilkent University, Ankara, Turkey
3 Interdisciplinary Neuroscience Program, Bilkent University, Ankara, Turkey
4 Aysel Sabuncu Brain Research Center, National Magnetic Resonance Imaging Research Center (UMRAM), Ankara, Turkey
Abstract
Recent work in cognitive science suggests that our expectations affect visual perception. With the rise of artificial agents
in human life in the last few decades, one important question is whether our expectations about non-human agents such as
humanoid robots affect how we perceive them. In the present study, we addressed this question in an audio–visual context.
Participants reported whether a voice embedded in a noise belonged to a human or a robot. Prior to this judgment, they were
presented with a human or a robot image that served as a cue and allowed them to form an expectation about the category of
the voice that would follow. This cue was either congruent or incongruent with the category of the voice. Our results show
that participants were faster and more accurate when the auditory target was preceded by a congruent cue than an incongruent
cue. This was true regardless of the human-likeness of the robot. Overall, these results suggest that our expectations affect
how we perceive non-human agents and shed light on future work in robot design.
Keywords Prediction · Expectation violation · Human–robot interaction · Audio–visual mismatch
1 Introduction
Advances in artificial intelligence in the last few decades
have introduced us to humanoid robots that we encounter
everywhere ranging from classrooms to airports to shopping
malls to hospitals. While their presence in our daily lives
has brought a lot of excitement, how humans perceive and
interact with them has become an important research topic
in cognitive science. Do we perceive them differently from
the way we perceive other humans? How important is it that
they look or sound human or behave like humans? What are
our expectations from robots? To what extent do they fulfill
our expectations? These are some of the questions cognitive
scientists are interested in addressing not only to be able to
better understand human nature but also to be able to guide
the design of robots in the future.
In his classical work, The Design of Everyday Things,
Don Norman [1] provides important insights about how cog-
nitive sciences can help in the design of artifacts including
machines such as robots. According to Norman [1], designed
artifacts should be adapted to the minds of their users,
and this is why one needs to understand the human mind
first. This implies that a collaboration between human–robot
interaction and cognitive sciences is necessary. Indeed, the
use of robots in well-established cognitive psychology and
neuroscience paradigms in the last decade has proven use-
ful to understand how humans respond to non-human agents
as compared to their human counterparts, and what kind of
principles we should follow in humanoid robot design [2–4].
One of the cognitive psychology/neuroscience paradigms
that have been successfully applied in human–robot inter-
action is the expectation-violation paradigm [5,6]. These
paradigms have been developed to understand the nature of
information processing in a variety of perceptual and cogni-
tive tasks [7–12] and have been instrumental in the development of
recent theories of the human brain and cognition, such as pre-
dictive coding or computation [13–15]. According to these
theories, perception is not a purely bottom-up or stimulus-
driven process; rather, expectations and prior knowledge
play an important role in how we perceive our environ-
ment. A growing body of empirical work in psychology and
neuroscience is in line with these theories, showing that
participants respond faster and more accurately when they
perceive events that are expected compared to ones that
are unexpected [9–12]. These results suggest that humans
constantly predict what will come next and that this in turn
determines what they perceive [14].
Recent work at the intersection of cognitive science and
social robotics has shown that humans can extend their
prediction skills to the perception of robots and form expec-
tations about robots based on their prior experience [5,6,16,
17]. These studies manipulated expectations towards robots
by means of using stimuli that have mismatches in a variety
of visual dimensions including appearance (form), motion,
and interaction. In other words, these mismatch paradigms
aim to induce certain expectations based on a particular cue,
and at the same time present another cue that usually does
not match that cue, resulting in expectation violation. For
instance, Urgen et al. [6] show that the appearance of a
robot can elicit certain expectations in humans about how
the robot would move, and when the robot does not move in
an expected way, an N400 ERP effect is observed indicating
that the expectations are violated. Using a similar paradigm,
[5] showed differential activity in the parietal cortex for an
agent that moved in an unexpected way compared to others
that moved in an expected way, which they interpreted as a
prediction error within the framework of predictive coding
[13,14]. Furthermore, in a study that investigates sensori-
motor signaling in human–robot interaction, [17] shows that
people exhibit lower variability in their performance when a
human-like robot commits a human-like error compared to
a mechanical error and that the pattern is reversed when the
agent is non-human-like morphologically.
Other HRI studies explored mismatches in multisensory
contexts. While vision seems to be the dominant modality
in many HRI studies that investigate how humans perceive
robots, [18] highlights the critical role of voice in commu-
nication and interaction with artificial agents. Accordingly,
there is a growing body of research that examines the role of
voice in HRI in combination with visual features such as
the appearance or movement of artificial agents [19–30]. For
instance, several studies show that the mismatch between the
visual appearance and voice of an artificial agent induces the
uncanny valley effect [19], impairs emotion recognition, and
negatively impacts likability and believability [30]. In a simi-
lar vein, [22] shows that the inconsistency between the facial
proportions and vocal realism of an artificial agent reduces
its credibility and attractiveness. A study with children [20]
shows that the interaction between voice and other visual
features, such as appearance and movement, affects the per-
ceived lifelikeness and politeness of a robot. People also find
artificial agents with a human-like voice more expressive,
understandable, and likable [21], and attribute more human-
like features to them (e.g., facial features, as evidenced by
drawing tasks) [29], than agents with a synthetic voice.
One drawback of many studies that examine HRI in an
audio–visual context is that they usually use subjective mea-
sures in the form of self-reports such as fear and eeriness [19],
credibility or attractiveness [22], politeness and lifelikeness
[20], likability, expressiveness, and understandability [21],
drawings [29], or emotion labeling [30] to evaluate artificial
agents rather than more objective measures such as reaction
time or accuracy. Although self-reports can be instrumen-
tal in providing an initial assessment and uncovering social
behavior under a variety of tasks, they fall short for a number
of reasons. First, self-reports are susceptible to the awareness
and the expressiveness of the participants and may provide
an incomplete or biased picture of human behavior if partic-
ipants lack these skills [31]. Second, self-reports usually do
not provide a mechanistic understanding which would help
with both explaining and predicting human behavior [6,32].
Indeed, Greenwald and Banaji [33] recommend the use of
implicit measures to better understand human social cogni-
tion. To support this effort, many tasks have been developed
such as priming and implicit association tasks that usually
rely on reaction times [34], as well as eye-tracking [32]
and neurophysiological measures [6] recorded within well-established
cognitive psychology paradigms in human–robot interac-
tion. Some studies even directly compared the results of
explicit and implicit measurements. A common finding of
these studies is that explicit and implicit measures are modu-
lated differently by the experimental conditions that are under
investigation [32,35,36]. Therefore, given the limitations
of explicit measures, it is important to benefit from implicit
measures recorded under well-established paradigms to gain
a better understanding of human perception and cognition
in human–robot interaction, especially in multisensory con-
texts.
The aim of the present study is to investigate the perception
of human and synthetic voices in the presence of congru-
ent or incongruent visual cues about the agents that produce
those voices using a prediction paradigm. More specifically,
we aim to address whether we make predictions about how
robots sound based on how they look and whether those pre-
dictions are similar to the ones we make for humans. To this
end, we used an expectation-violation paradigm in which
human participants judged whether a greeting word sounded
human-like or synthetic (‘robotic’). The sound was preceded
by a picture of a human or a robot, which informed the partic-
ipants with a certain probability about the sound that would
follow (thus forming expectations). The hypothesis is that peo-
ple would discriminate the robot sounds faster when they are
preceded by a robot picture than by a human picture,
just as they would with human sounds that are preceded
by human pictures.
2 Method
2.1 Participants
Thirty healthy adults from the university community (16 females,
mean age = 25.2, SD = 0.65) participated in the exper-
iment. All participants had normal or corrected-to-normal
vision and hearing. The sample size of the study was deter-
mined by a power analysis prior to data collection. The
minimum required sample size was determined to be 30
using G*Power (alpha = 0.05, beta = 0.90, η² = 0.25).
The study was approved by the Human Research Ethics Com-
mittee of the university and all subjects signed a consent form
before the study.
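A comparable sample-size calculation can be sketched in Python. The snippet below is an illustration only: it assumes the reported beta = 0.90 refers to the intended statistical power and that a simple two-group ANOVA test family was used, neither of which is stated explicitly, so it is not a reconstruction of the authors' actual G*Power session.

```python
# Hedged sketch of an a priori sample-size estimate (assumptions noted above).
import numpy as np
from statsmodels.stats.power import FTestAnovaPower

eta_sq = 0.25                                   # effect size reported for G*Power
effect_size_f = np.sqrt(eta_sq / (1 - eta_sq))  # convert eta-squared to Cohen's f

n_total = FTestAnovaPower().solve_power(
    effect_size=effect_size_f, alpha=0.05, power=0.90, k_groups=2)
print(f"Required total sample size: {int(np.ceil(n_total))}")
```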
2.2 Stimuli
2.2.1 Visual Stimuli
The visual stimuli consisted of static images of three agents.
We call them Human, Android, and Robot (see Fig. 1).
Android and Robot are the same machine in different appear-
ances. Android has a more human-like appearance, and was
modeled from the Human agent, whereas Robot has a more
mechanical appearance as the clothing is removed. Android
is the robot Repliee Q2 which was developed at Osaka Uni-
versity. The images in Fig. 1 were captured from the videos of
the Saygin-Ishiguro database [5, 6], in which the agents were performing a
hand-waving gesture. The images were 240 × 240 pixels in size,
and all three were matched in terms of their low-level proper-
ties (luminance and spatial frequency) with SHINE Toolbox
[37].
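The SHINE Toolbox is a MATLAB package that equates luminance histograms and spatial-frequency spectra across images. As a rough, assumption-laden illustration of the simpler part of that idea, the Python sketch below equates only the mean luminance and contrast of the three agent images; the file names are hypothetical and this is not the procedure actually used.

```python
# Simplified illustration of low-level matching (not the SHINE Toolbox itself):
# bring all images to a common mean luminance and RMS contrast.
import numpy as np
from PIL import Image

paths = ["human.png", "android.png", "robot.png"]   # hypothetical file names
imgs = [np.asarray(Image.open(p).convert("L"), dtype=float) for p in paths]

target_mean = np.mean([im.mean() for im in imgs])   # common luminance target
target_std = np.mean([im.std() for im in imgs])     # common contrast target

for path, im in zip(paths, imgs):
    z = (im - im.mean()) / im.std()                  # standardize each image
    matched = np.clip(z * target_std + target_mean, 0, 255).astype(np.uint8)
    Image.fromarray(matched).save("matched_" + path)
```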
2.2.2 Auditory Stimuli
The auditory stimuli consisted of two sound files which lasted
2 s: the voice of a human saying ‘Good morning’ (Human
Voice), and a modified version of it in which the voice sounds
synthetic (we call it ‘Robotic Voice’ within the context of this
study). We explored several sound programs which create
synthetic voices usually associated with robots considering
our experience with science-fiction movies, smart devices,
voice assistants, and video games, and found that the
main manipulation applied to these sounds is to play with their echo
and frequency. To create a synthetic voice that would be
associated with a robot in a controlled manner, we modi-
fied the human voice by means of manipulating only these
two features and keeping everything else constant. We used
the audio library AudioLib in Python [38] for this modifica-
tion. The library has 5 different sound types: Ghost, Radio,
Robotic, Echo, and Darth Vader. We conducted a pilot study
in the lab with a small group of people to check which of
these filters actually worked; the Ghost filter was found to
produce the most synthetic-sounding voice that was associ-
ated with a robot. To compensate for the echo introduced by this
filter, echo (0.05) was also added to the human voice.
Human and synthetic audio files were otherwise matched in
terms of their amplitude (i.e., loudness) using Adobe Audi-
tion CC (13.0.6). We also added white noise to both sound
files to make the task harder, as previous research on predic-
tion shows that the effect of prediction is strongest when the
stimulus is ambiguous [11]. In order to decide on the task dif-
ficulty, we added different levels of white noise (soft: 1/20,
medium: 1/8, severe: 1/2) and tested them in a pilot study. The pilot
suggested that the visual cue (i.e., the prior) was used when the
fraction of white noise was 1/2.
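The sketch below illustrates the two manipulations described in this section, adding a short echo and mixing in white noise scaled to a fraction of the signal amplitude. It is not the AudioLib pipeline the authors used; the file names, the echo implementation, and the noise-scaling rule are assumptions made for illustration.

```python
# Illustrative sketch (assumed implementation, see lead-in) of echo and
# white-noise manipulation of a voice recording.
import numpy as np
import soundfile as sf

def add_echo(signal, sr, delay_s=0.05, decay=0.5):
    """Mix a delayed, attenuated copy of the signal back into itself."""
    delay = int(delay_s * sr)
    out = signal.copy()
    out[delay:] += decay * signal[:-delay]
    return out

def add_white_noise(signal, fraction=0.5):
    """Add Gaussian white noise scaled to a fraction of the signal's RMS."""
    rms = np.sqrt(np.mean(signal ** 2))
    noise = np.random.normal(0.0, rms * fraction, size=signal.shape)
    return signal + noise

voice, sr = sf.read("good_morning_human.wav")    # hypothetical file name
voice = add_echo(voice, sr, delay_s=0.05)        # echo compensation (0.05)
stimulus = add_white_noise(voice, fraction=0.5)  # "severe" noise level (1/2)
stimulus = np.clip(stimulus, -1.0, 1.0)          # avoid clipping on export
sf.write("stimulus_human_noise.wav", stimulus, sr)
```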
2.3 Procedure
Subjects participated in two experiments (Experiment 1 and
Experiment 2). The order of the experiments was counter-
balanced across subjects. In both experiments, the subjects
were seated 57 cm away from a computer screen. Their heads
were stabilized with a chinrest. Before each experiment, the
subjects were introduced to visual and auditory stimuli and
were given verbal instructions. When introducing the visual
stimuli, it was stated to the participants that the Android is
a type of robot. In addition, the human voice was told to
belong to the human in the Human image, and the synthetic
voice was told to belong to the agent in the Robot or Android
image depending on the experiment. They also did a prac-
tice session to make sure that they understood the task. The
experiment was programmed in Psychtoolbox-3 [39,40].
2.3.1 Experiment 1
Experiment 1 consisted of 5 blocks, each containing 80 trials.
Each trial started with a fixation cross on a gray background
(1 s), which was followed by a visual cue (1 s), an image of
a human or a mechanical robot (Human and Robot agents in
Fig. 1). Following the visual cue, a 2 s auditory stimulus was
presented, either a human or a synthetic (robotic) voice (See
Fig. 2). The task of the subjects was to indicate whether the
sound was human-like or robot-like by pressing a key.
The visual cue informed the subjects about the upcom-
ing auditory stimulus category. Following the previous work
that used prediction paradigms [10,11,41,42], in 80% of the
trials, the visual cue was congruent with the auditory stim-
ulus (e.g., human image and human voice or robot image
and robotic voice), whereas, in 20% of the trials, the visual
cue was incongruent with the auditory stimulus (e.g. human
image and robot-like voice or robot image and human voice,
see Fig. 3).

Fig. 1 Visual stimuli in the experiments consist of images of three agents with different degrees of human-likeness: a human (Human) and two robots, one with a more human-like appearance (Android) and one with a less human-like appearance (Robot)

Fig. 2 Each trial in Experiment 1 consists of a fixation screen, a visual cue (Human or Robot), and an auditory target (human or robotic voice), after which subjects need to respond with a key press
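A minimal sketch of how such a trial list (80 trials per block, 80% congruent) could be constructed is given below. The experiment itself was programmed in Psychtoolbox-3; the Python code and the assumption that each visual cue appears equally often within a block are ours, since the paper reports only the overall 80/20 congruency split.

```python
# Sketch of a block of 80 trials with an 80/20 congruent/incongruent split
# (assumed to be balanced across the two visual cues; see lead-in).
import random

CUES = ["human", "robot"]      # Experiment 1; "android" replaces "robot" in Experiment 2
TRIALS_PER_BLOCK = 80
CONGRUENT_RATIO = 0.8

def make_block():
    trials = []
    per_cue = TRIALS_PER_BLOCK // len(CUES)        # 40 trials per cue
    n_congruent = int(per_cue * CONGRUENT_RATIO)   # 32 congruent, 8 incongruent per cue
    for cue in CUES:
        congruent_voice = "human" if cue == "human" else "robotic"
        incongruent_voice = "robotic" if cue == "human" else "human"
        voices = ([congruent_voice] * n_congruent
                  + [incongruent_voice] * (per_cue - n_congruent))
        trials += [{"cue": cue, "voice": v, "congruent": v == congruent_voice}
                   for v in voices]
    random.shuffle(trials)                         # randomize trial order within the block
    return trials

blocks = [make_block() for _ in range(5)]          # 5 blocks, as in Experiment 1
```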
2.3.2 Experiment 2
Experiment 2 is identical to Experiment 1 except for the visual
cue screen. As a visual cue, subjects were shown the image of
either a human or a human-like robot (Human and Android
agents in Fig. 4).
Similar to Experiment 1, there were two types of trials:
congruent trials and incongruent trials. In congruent trials
(80% of total trials), the category of visual cue matched the
category of the auditory stimulus (e.g., human image and
human voice, or a robot image and robotic voice). In incon-
gruent trials, the category of the visual cue did not match
the category of the auditory target (e.g., human appearance
and robotic voice, or robotic appearance and human voice,
see Fig. 5). The total number of trials was the same as in
Experiment 1.
Note that we did not include the human, android, and robot
conditions in a single experiment as it would require the
generation of three levels of voice stimuli, which may not
necessarily match perfectly with the human-likeness level of
the images. So, for the sake of simplicity and interpretability,
and following the two-category structure of previous predic-
tion paradigms [12,42], we conducted two experiments in
which we compared a human and a robot, and across the two
experiments, we compared the effect of human-likeness of
the robot.

Fig. 3 There are two types of trials in Experiment 1. (A) Congruent trials in which the category of the visual cue and the auditory target match (e.g., human appearance (Human) and human voice, or robotic appearance (Robot) and synthetic (robotic) voice). (B) Incongruent trials in which the category of the visual cue and the auditory target do not match (e.g., human appearance (Human) and robotic voice, or robotic appearance (Robot) and human voice)
Fig. 4 Each trial in Experiment 2 consists of a fixation screen, a visual
cue (Human or Android), and an auditory target (human or robotic
voice) after which subjects need to respond with a keypress
Fig. 5 There are two types of trials in Experiment 2. (A) Congruent trials
in which the category of the visual cue and the auditory target match
(e.g., human appearance (Human) and human voice, or robotic appear-
ance (Android) and robotic voice). (B) Incongruent trials in which the
category of the visual cue and the auditory target do not match (e.g.,
human appearance (Human) and robotic voice, or robotic appearance
(Android) and human voice)
2.3.3 Statistical Analysis
We conducted separate ANOVAs for Experiment 1 and
Experiment 2, and an additional ANOVA to compare the
results of Experiment 1 and Experiment 2.
Experiment 1 We conducted a 2 (Congruency: Congru-
ent, Incongruent) × 2 (Visual Cue: Human, Robot) mixed
ANOVA to investigate the effects of congruency and visual
cue (agent) on reaction times and accuracy. Congruency
was taken as a between-subjects factor due to the unbalanced
number of trials between its levels, and the visual cue was
taken as a within-subject variable.
Experiment 2 We conducted a 2 (Congruency: Congruent,
Incongruent) × 2 (Visual Cue: Human, Android) mixed
ANOVA to investigate the effects of congruency
and visual cue (agent) on reaction times and accuracy.
Congruency was taken as a between-subjects factor due to
the unbalanced number of trials between its levels, and the
visual cue was taken as a within-subject variable.
Comparison of Experiment 1 and Experiment 2 We con-
ducted a 4 (Visual Cue: Human 1 (Experiment 1), Robot,
Human 2 (Experiment 2), Android) × 2 (Congruency: Con-
gruent, Incongruent) × 2 (Experiment Order: 1, 2) mixed
ANOVA to investigate whether the congruency, the human-
likeness of the agent, the order of Experiments 1 and 2 (Robot
or Android first), and their interactions affect reaction times or
accuracy.
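For illustration, the Experiment 1 analysis could be run in Python with the pingouin package as sketched below. The data layout, file name, and column names are assumptions rather than the authors' actual analysis script; following the paper's treatment of congruency as a between-subjects factor, each participant's congruent and incongruent trial sets are coded here as separate observational units.

```python
# Hedged sketch of the 2 (Congruency) x 2 (Visual Cue) mixed ANOVA on reaction
# times of correct trials (assumed data layout, not the authors' script).
import pandas as pd
import pingouin as pg

# Long format, one row per unit x cue cell; "unit" identifies a participant's
# congruent or incongruent trial set (hypothetical CSV columns:
# unit, congruency, cue, rt).
df = pd.read_csv("exp1_rt_long.csv")

aov = pg.mixed_anova(data=df, dv="rt",
                     within="cue",           # Visual Cue: Human vs. Robot
                     between="congruency",   # Congruent vs. Incongruent
                     subject="unit")
print(aov[["Source", "F", "p-unc", "np2"]])  # F, uncorrected p, partial eta squared
```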
3 Results
3.1 Experiment 1 (Human, Robot)
3.1.1 Accuracy
The data met the assumptions of ANOVA, so we ran a 2 ×
2 mixed-design ANOVA with a within-subjects factor of
visual cue (human, robot) and a between-subjects factor of
congruency (congruent, incongruent) on the accuracy scores.
There was a main effect of congruency on the accuracy scores
(F(1,58) = 22.66, p < 0.05, η² = 0.28). Congruent trials were
overall more accurate than incongruent trials (Fig. 6). There
was no significant effect of visual cue (F(1,58) = 0.13, p =
0.72), nor of the interaction between congruency and visual cue
(F(1,58) = 0.09, p = 0.76), on the accuracy scores.
Fig. 6 Reaction Times (RT) on correct trials (left) and accuracy (%) results (right) of Experiment 1. Error bars show the standard error of the mean
(SEM)
3.1.2 Reaction Times (Correct Trials)
The data met the assumptions of ANOVA, so we ran a 2 ×
2 mixed-design ANOVA with a within-subjects factor of
visual cue (human, robot) and a between-subjects factor of
congruency (congruent, incongruent) on the reaction times
of correct trials. There was a main effect of congruency on
the reaction times of correct trials (F(1,58) = 21.27, p < 0.05,
η² = 0.27). Subjects were significantly faster in congruent
trials than they were in incongruent trials (Fig. 6). There was
also a main effect of the visual cue on reaction times (F(1,58)
= 16.20, p < 0.05, η² = 0.22). Subjects were significantly
faster when the visual cue was Human than when it was Robot
(Fig. 6). There was no interaction between the congruency
and the visual cue on the reaction times (F(1,58) = 0.04, p
= 0.83).
3.2 Experiment 2 (Human, Android)
3.2.1 Accuracy
The data met the assumptions of ANOVA, so we ran a 2 × 2
mixed-design ANOVA with a within-subjects factor of visual
cue (human, android) and a between-subjects factor of con-
gruency (congruent, incongruent) on the accuracy scores.
There was a main effect of congruency on accuracy scores
(F(1,58) = 21.61, p < 0.05, η² = 0.27). Subjects were sig-
nificantly more accurate in congruent trials than incongruent
trials (Fig. 7). There was no significant effect of the visual
cue on accuracy scores (F(1,58) = 0.41, p = 0.53). There
was no interaction between congruency and visual cue either
(F(1,58) = 0.04, p = 0.85).
3.2.2 Reaction Times (Correct Trials)
The data met the assumptions of ANOVA, so we ran a 2 × 2
mixed-design ANOVA with a within-subjects factor of visual
cue (human, android) and a between-subjects factor of con-
gruency (congruent, incongruent) on the reaction times of
correct trials. There was a main effect of congruency on the
reaction times of correct trials (F(1,58) = 20.48, p < 0.05, η²
= 0.26). Subjects were significantly faster in congruent trials
than they were in incongruent trials (Fig. 7). There was also
a main effect of the visual cue on reaction times (F(1,58) =
25.57, p < 0.05, η² = 0.31). Subjects were significantly faster
when the visual cue was Human than when it was Android (Fig. 7).
There was no significant interaction between congruency and
visual cue (F(1,58) = 1.16, p = 0.29).
3.3 The Human-Likeness Dimension: The
Comparison of Experiment 1 and Experiment 2
In addition to the main analyses reported above, we explored
whether the human-likeness of the agent in the spectrum
of Human-Android-Robot affected the reaction times or
accuracy. To this end, we compared the reaction times and
accuracy scores of Experiment 1 and Experiment 2. Since we have four
agents (visual cues) in the two experiments (Experiment
1: Human–Robot and Experiment 2: Human–Android), we
included all of them as Human 1, Robot, Human 2, and
Android. In addition to the visual cue, we also included the
congruency and the order of the experiments in a 4 × 2 × 2
mixed ANOVA.
Fig. 7 Reaction Times (RT) on correct trials (left) and Accuracy (%) results (right) of Experiment 2. Error bars show the standard error of the mean
(SEM)
3.3.1 Accuracy
Data met the assumptions of ANOVA. There was a
main effect of congruency on the accuracy scores (F(1,56) =
23.93, p < 0.05, η² = 0.30). Subjects were significantly more
accurate in congruent trials than incongruent trials. There
was no significant effect of the visual cue on the accuracy
scores (F(3,168) = 0.57, p = 0.64). There was no significant
effect of experiment order either (F(1,56) = 0.01, p = 0.93).
None of the interactions were significant (Cue × Congru-
ency: F(3,168) = 0.27, p = 0.85; Cue × Order: F(1,168) =
0.80, p = 0.49; Congruency × Order: F(1,56) = 0.01, p =
0.93; Cue × Congruency × Order: F(3,168) = 0.07, p =
0.98).
3.3.2 Reaction Times (Correct Trials)
Data met the assumption of homogeneity of variance (Levene's test, p >
0.05) but violated the assumption of sphericity (Mauchly's
test, χ²(5) = 65.88, p < 0.05). Therefore, we used Greenhouse–
Geisser correction wherever needed. There was a main
effect of the visual cue on the reaction times (F(1.74, 97.24)
= 6.87, p < 0.05, η² = 0.11). Planned contrasts showed that
the reaction times for Human 1 were significantly faster than for
Robot (p = 0.01, η² = 0.11) and Android (p < 0.05, η² =
0.31) but did not differ from Human 2 (p = 0.72); the reac-
tion times for Robot were significantly slower than for Human 2
(p < 0.05, η² = 0.22) but did not differ from Android (p =
0.97); and the reaction times for Human 2 were significantly
faster than for Android (p < 0.05, η² = 0.15).
There was a main effect of congruency on reaction times
(F(1,56) = 23.36, p < 0.05, η² = 0.29). Subjects were sig-
nificantly faster in congruent trials than in incongruent
trials. There was no significant effect of the experiment order
on reaction times (F(1,56) = 0.01, p = 0.96). None of the
interactions were significant (Cue × Congruency: F(1.74,
97.24) = 0.27, p = 0.85; Cue × Order: F(1.74, 97.24) =
0.10, p = 0.96; Congruency × Order: F(1,56) = 0.18, p =
0.68; Cue × Congruency × Order: F(1.74, 97.24) = 0.20, p
= 0.79).
4 Discussion
We investigated whether expectations about artificial agents
affect our perception. To this end, we used a well-
known prediction paradigm from cognitive psychology in
a human–robot interaction context. We hypothesized that
people would be faster at judging how an agent sounds
(human-like or synthetic) when the voice was preceded by a congru-
ent visual cue (e.g., a robot picture for a robot-like voice)
than by an incongruent visual cue (e.g., a human picture for a
synthetic, robot-like voice).
Our results suggest that people form expectations about
how an agent sounds based on the visual appearance of the
agent. If the visual cue is a robot, people expect that it would
sound synthetic, as demonstrated by shorter reaction times
and more accurate responses when the appearance and voice
were congruent than when they were incongruent. This was
true whether the robot has a more human-like appearance or
a less human-like appearance. These results are consistent
with previous work suggesting that predictive processes
underlie our perception [9–11, 13, 14], including the perception of humans and
robots. In other words, it seems that we can extend our predic-
tive capabilities to the perception of artificial agents, and just like our
interaction with other humans, our expectations can affect
how we perceive non-human agents.
An important contribution of our study to the previous
work on prediction in HRI is its multimodal nature. Although
there are studies that examined the effect of expectations on
the perception of robots, most of these studies were done
in the visual modality [5,6,17]. Given the recent work that
highlights the importance of voice in HRI [18] and the devel-
opments in text-to-speech technology, it has become essential
to go beyond the visual modality and incorporate the effects
of voice on communication and interaction with artificial
agents. Studies that support this effort usually manipulate
the congruity of voice and appearance cues and measure a
variety of attributes of the artificial agents, such as their
attractiveness and credibility [22], likeability and believabil-
ity [21], perceived lifelikeness and politeness [20], as well as
emotion recognition [30], embodiment [29], and the uncanny
valley [19]. Our study extends this body of work in two
ways. First, rather than presenting the visual and auditory
aspects of the stimuli simultaneously, it presents them con-
secutively in a prediction paradigm where the visual stimulus
serves as a prior (cue) for the upcoming auditory stimulus.
The advantage of this method is that it allows us to study
the effect of expectations on perception more directly, by
involving explicit priors, rather than making assumptions or
post hoc conclusions about predictive mechanisms. Second,
unlike previous work that used explicit measures in the form
of self-reports in the multimodal perception of robots, we
used implicit measures such as reaction times and accuracy.
One advantage of implicit measures is that they are more
objective and less susceptible to the participants’ awareness
and the ability to express their introspective states [31]. More
importantly, they are much better at providing a mechanistic
understanding of human behavior and cognition than self-
reports [31,34], thus allowing us to make more direct links
with the perception literature in cognitive sciences. Con-
sistent with previous work on predictive processing in the
perception of simple or complex object stimuli [9–11], we
found that reaction times get longer and accuracy scores
get lower when we encounter artificial agents that we do not
expect. This in turn suggests that our expectations affect how
we perceive non-human agents as they do with other natural
object categories.
Our study has several implications for various fields of HRI
that intersect with predictive processing. One implication
concerns the design of robots and the successful interaction
between humans and robots. Previous work suggests that it
is better to design artificial agents that do not violate our
expectations because doing otherwise may elicit undesirable
responses in humans while they interact with those agents,
such as the uncanny valley [5, 6, 19, 43–48], impairments in
emotion recognition [30], and decreased likeability [21] and
credibility [22]. A second implication concerns the specific
user groups for which the robots will be developed. Predic-
tions stem from prior knowledge, which in turn implies that
any variability in prior knowledge about robots can affect to
what extent predictive mechanisms are utilized. For instance,
an engineer in Japan who is heavily exposed to robots may
not be surprised by a metallic-looking robot speaking with a
humanlike voice, unlike a person who has never interacted
with a robot. The person in the former case would generate
minimal prediction errors while the latter would have large
prediction errors. Similar concerns may apply when we con-
sider different generations. For instance, children who are
born in the last decade, in the technology era, may have dif-
ferent expectations of robots compared to older adults who
first encountered robots in their adult life. Future work should investigate
how familiarity with robots can affect our prediction abili-
ties and their consequences. This will enable the design of
customized robots for different end users.
Our study has several limitations. The first concerns the
choice of voice stimuli. To create the synthetic, what we
called ‘robotic’, voice, we recorded and modified a natural
human voice using a variety of sound parameters (frequency
and echo). We acknowledge that there is not a natural ‘robotic
voice’ category out there. So, we based our manipulation
on what we consider a typical robot voice to sound like,
drawing on our experience with voice assistants, smart devices,
science fiction movies, and video games. We also acknowl-
edge that not all robot voices are the same but rather there may
be a family of synthetic voices that are associated with robots.
An inspiration for us in creating such synthetic voices was
some available software libraries that modify sound stimuli
to create a variety of non-human-like sounds, e.g., ghost-
like, robotic, etc. While we did not run a separate study
in which we examined the discriminability of the modified
voices from a natural human voice, our pilot study gave us
some insights into which parameter combinations elicited the
most synthetic responses. Although we consider this method
the most systematic way of manipulating the voice stimuli,
it has some shortcomings. First, the synthetic voice trans-
formed from a real human voice may inherently include some
human cues as compared to voices that are completely syn-
thetic, e.g., the ones generated with text-to-speech or voice
synthesis methods [49]. So, it may be difficult to categorize
them as non-human. Since the robots we used as stimuli in the
present study were humanoid in nature, it may not be unrea-
sonable to have some human cues in the voice. Nevertheless,
text-to-speech or voice/speech synthesis methods can be con-
sidered in future studies as an alternative. Second, it may be
the way the stimuli were presented to the participants before
the experiment that biased their perception of the “modified
human voice” as a “robotic” voice. Future work can inves-
tigate what kind of deviations from a natural human voice
would lead people to categorize voices as non-human-like
or synthetic in comparison to the natural human voices and
completely synthetic voices generated with text-to-speech or
voice/speech synthesis methods.
A related second limitation of the study is the lack of
a variety of voice stimuli that come in different degrees of
human likeness, unlike the visual image stimuli. Since, to the
best of our knowledge, this is the first study that employs an
explicit prediction paradigm in a multimodal context in HRI,
we wanted to keep things simple and follow similar binary
paradigms in cognitive sciences [12,41,42]. Future work
can extend this work using a variety of mismatches between
voice and appearance, similar to [50].
Another line of work that was not addressed in the present
study but is worth pursuing is reversing the order of the
modalities in the prediction paradigm. That is, one can use
the auditory stimuli as the cue (prior) and the image stim-
uli as the target and investigate whether voices can influence
how we perceive the bodies of agents. This work could show
whether the predictive processing in a multimodal context is
reciprocal in nature across the two modalities involved.
5 Conclusion
We investigated whether the expectations about an agent
affect how we perceive that agent. More specifically, we
examined how we perceive the voice of an agent if our expec-
tations based on what we see are not met. Our results provide
insights into how we perceive and interact with robots. It seems that we can extend
our predictive capabilities to the perception of robots, and
just like our interaction with other humans, our expectations
can affect how we perceive robots. In sum, we would inter-
act with artificial agents much more efficiently if they are
designed in such a way that they do not violate our expec-
tations. The use of a well-established prediction paradigm
from cognitive sciences in the present study has opened a
new avenue of research in human–robot interaction. Appear-
ance and voice are only two of the many features for which
we seek a match in agent perception. Future work should
investigate what features of artificial agents make us form
expectations, how we do that, and under what conditions
these expectations are violated.
The present study sets a good example of how the col-
laboration between human–robot interaction and cognitive
sciences can be fruitful and useful for both sides [3,51,52].
Our study not only suggests possible principles for robot
design but also shows how fundamental cognitive mecha-
nisms such as prediction can generalize to agents that we
have not evolved with over many generations. As such,
our study shows that artificial agents such as robots can be
great experimental tools for cognitive science to improve our
understanding of the human mind.
Acknowledgements All listed authors offered substantial contributions
to the study conception and drafting of the manuscript. Furthermore, all
authors approved the final submission and agreed to be accountable
for all aspects of the work. The authors would like to thank Gaye skın
for help with the audio stimuli.
Funding Open Access funding enabled and organized by Projekt
DEAL. There is no funding information to declare.
Data Availability We provide all data and materials of this study in an
Open Science Framework repository (https://osf.io/2wsug/).
Declarations
Conflict of interest The authors declare that they have no conflict of
interest.
Ethics Approval Approval of the university’s ethical committee was
obtained before the study started (ID: 2019_01_29_01).
Consent to Participate Participants were informed about the study’s
procedure in the first stage and that they should only begin the experi-
ment if they agree to the conditions described there.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing, adap-
tation, distribution and reproduction in any medium or format, as
long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence, and indi-
cate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence,
unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your
intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copy-
right holder. To view a copy of this licence, visit http://creativecomm
ons.org/licenses/by/4.0/.
References
1. Norman D (2013) The design of everyday things: revised and
expanded edition. Basic books
2. MacDorman KF, Ishiguro H (2006) The uncanny advantage of
using androids in cognitive and social science research. Interact
Stud 7:297–337. https://doi.org/10.1075/is.7.3.03mac
3. Cross ES, Hortensius R, Wykowska A (2019) From social brains
to social robots: applying neurocognitive insights to human–robot
interaction. Philos Trans R Soc B 374(1771):20180024. https://doi.
org/10.1098/rstb.2018.0024
4. Cross ES, Ramsey R (2021) Mind meets machine: towards a cog-
nitive science of human–machine interactions. Trends Cogn Sci
25(3):200–212. https://doi.org/10.1016/j.tics.2020.11.009
5. Saygin AP, Chaminade T, Ishiguro H, Driver J, Frith C (2012) The
thing that should not be: predictive coding and the uncanny valley
in perceiving human and humanoid robot actions. Soc Cogn Affect
Neurosci 7:413–422. https://doi.org/10.1093/scan/nsr025
6. Urgen BA, Kutas M, Saygin AP (2018) Uncanny val-
ley as a window into predictive processing in the social
brain. Neuropsychologia 114:181–185. https://doi.org/10.1016/j.
neuropsychologia.2018.04.027
7. Kutas M, Hillyard SA (1980) Reading senseless sentences: brain
potentials reflect semantic incongruity. Science 207:203–205.
https://doi.org/10.1126/science.7350657
8. Kutas M, Federmeier KD (2011) Thirty years and counting: find-
ing meaning in the N400 component of the event-related brain
potential (ERP). Annu Rev Psychol 62:621–647. https://doi.org/
10.1146/annurev.psych.093008.131123
9. Kok P, Brouwer GJ, van Gerven MA, de Lange FP (2013) Prior
expectations bias sensory representations in visual cortex. J Neu-
rosci Res 33(41):16275–16284. https://doi.org/10.1523/jneurosci.
0742-13.2013
10. Kok P, de Lange FP (2015) Predictive coding in sensory cortex. In:
An introduction to model-based cognitive neuroscience, Springer,
New York, pp 221–244
11. De Lange FP, Heilbron M, Kok P (2018) How do expectations
shape perception? Trends Cogn Sci 22(9):764–779. https://doi.org/
10.1016/j.tics.2018.06.002
12. Urgen BM, Boyaci H (2021) Unmet expectations delay sensory
processes. Vis Res 181:1–9. https://doi.org/10.1016/j.visres.2020.
12.004
13. Friston K (2010) The free-energy principle: a unified brain theory?
Nat Rev Neurosci 11(2):127–138. https://doi.org/10.1038/nrn2787
14. Clark A (2013) Whatever next? Predictive brains, situated agents,
and the future of cognitive science. Behav Brain Sci 36(3):181–204.
https://doi.org/10.1017/s0140525x12000477
15. Heeger DJ (2017) Theory of cortical function. Proc Natl Acad Sci
114(8):1773–1782. https://doi.org/10.1073/pnas.1619788114
16. Ho CC, MacDorman KF, Pramono ZD (2008) Human emotion and
the uncanny valley: a GLM, MDS, and Isomap analysis of robot
video ratings. In: 2008 3rd ACM/IEEE international conference
on human–robot interaction (HRI), IEEE, pp. 169–176. https://doi.
org/10.1145/1349822.1349845
17. Ciardo F, De Tommaso D, Wykowska A (2022) Joint action with
artificial agents: human-likeness in behaviour and morphology
affects sensorimotor signaling and social inclusion. Comput Hum
Behav 132:107237
18. Seaborn K, Miyake NP, Pennefather P, Otake-Matsuura M (2021)
Voice in human–agent interaction: a survey. ACM Comput Surv
(CSUR) 54(4):1–43
19. Mitchell WJ, Szerszen SKA, Lu AS, Schermerhorn PW, Scheutz
M, MacDorman KF (2011) A mismatch in the human realism of
face and voice produces an uncanny valley. i-Perception 2(1):10–12.
https://doi.org/10.1068/i0415
20. Hastie H, Lohan K, Deshmukh A, Broz F, Aylett R (2017) The inter-
action between voice and appearance in the embodiment of a robot
tutor. In: International conference on social robotics, Springer,
Cham. pp 64–74. https://doi.org/10.1007/978-3-319-70022-9_7
21. Cabral JP, Cowan BR, Zibrek K, McDonnell R (2017) The
influence of synthetic voice on the evaluation of a virtual char-
acter. In: INTERSPEECH, pp 229–233. https://doi.org/10.21437/
Interspeech.2017-325
22. Stein JP, Ohler P (2018) Uncanny… but convincing? Inconsis-
tency between a virtual agent’s facial proportions and vocal realism
reduces its credibility and attractiveness, but not its persuasive suc-
cess. Interact Comput 30(6):480–491. https://doi.org/10.1093/iwc/
iwy023
23. McGinn C, Torre I (2019) Can you tell the robot by the voice? An
exploratory study on the role of voice in the perception of robots.
In: 2019 14th ACM/IEEE international conference on human-robot
interaction (HRI), IEEE, pp 211–221. https://doi.org/10.1109/HRI.
2019.8673279
24. Doehrmann O, Naumer MJ (2008) Semantics and the multisensory
brain: how meaning modulates processes of audio–visual integra-
tion. Brain Res 1242:136–150. https://doi.org/10.1016/j.brainres.
2008.03.071
25. Hein G, Doehrmann O, Müller NG, Kaiser J, Muckli L, Naumer
MJ (2007) Object familiarity and semantic congruency modu-
late responses in cortical audiovisual integration areas. J Neu-
rosci Res 27(30):7881–7887. https://doi.org/10.1523/jneurosci.
1740-07.2007
26. Laurienti PJ, Kraft RA, Maldjian JA, Burdette JH, Wallace MT
(2004) Semantic congruence is a critical factor in multisensory
behavioral performance. Exp Brain Res 158(4):405–414. https://
doi.org/10.1007/s00221-004-1913-2
27. Talsma D (2015) Predictive coding and multisensory integration: an
attentional account of the multisensory mind. Front Integr Neurosci
9(19):19. https://doi.org/10.3389/fnint.2015.00019
28. Nie J, Park M, Marin AL, Sundar SS (2012) Can you hold
my hand? Physical warmth in human-robot interaction. In: 2012
7th ACM/IEEE international conference on human–robot interac-
tion (HRI), IEEE, pp 201–202. https://doi.org/10.1145/2157689.
2157755
29. Mara M, Schreibelmayr S, Berger F (2020) Hearing a nose? User
expectations of robot appearance induced by different robot voices.
In: Companion of the 2020 ACM/IEEE international conference
on human–robot interaction, pp 355–356, https://doi.org/10.1145/
3371382.3378285
30. Tsiourti C, Weiss A, Wac K, Vincze M (2019) Multimodal inte-
gration of emotional signals from voice, body, and context: effects
of (in) congruence on emotion recognition and attitudes towards
robots. Int J Soc Robot 11(4):555–573. https://doi.org/10.1007/
s12369-019-00524-z
31. Nosek BA, Hawkins CB, Frazier RS (2011) Implicit social
cognition: from measures to mechanisms. Trends Cogn Sci
15(4):152–159
32. Kompatsiari K, Ciardo F, De Tommaso D, Wykowska A (2019)
Measuring engagement elicited by eye contact in human–robot
interaction. In: 2019 IEEE/RSJ international conference on intel-
ligent robots and systems (IROS), IEEE, pp 6979–6985
33. Greenwald AG, Banaji MR (1995) Implicit social cognition: atti-
tudes, self-esteem, and stereotypes. Psychol Rev 102(1):4
34. Fazio RH, Olson MA (2003) Implicit measures in social cognition
research: their meaning and use. Annu Rev Psychol 54(1):297–327
35. Li Z, Terfurth L, Woller JP, Wiese E (2022) Mind the
machines: applying implicit measures of mind perception to social
robotics. In: 2022 17th ACM/IEEE international conference on
human–robot interaction (HRI), IEEE, pp 236–245
36. Saltık İ (2022) Explicit and implicit measurement of mind per-
ception in social robots through individual differences modulation.
MS thesis, Bilkent University
37. Willenbockel V, Sadr J, Fiset D, Horne GO, Gosselin F, Tanaka JW
(2010) Controlling low-level image properties: the SHINE tool-
box. Behav Res Methods 42(3):671–684. https://doi.org/10.1167/
10.7.653
38. Peirce J, Gray J, Halchenko Y, Britton D, Rokem A, Strangman G
(2011) PsychoPy: a psychology software in Python. https://media.
readthedocs.org/pdf/psychopy-hoechenberger/latest/psychopy-
hoechenberger.pdf
39. Brainard DH (1997) The psychophysics toolbox. Spat
Vis 10(4):433–436. https://doi.org/10.1163/156856897X00357
40. Pelli DG (1997) The VideoToolbox software for visual
psychophysics: transforming numbers into movies. Spat Vis
10:437–442. https://doi.org/10.1163/156856897X00366
41. Kok P, Jehee JF, De Lange FP (2012) Less is more: expecta-
tion sharpens representations in the primary visual cortex. Neuron
75(2):265–270. https://doi.org/10.1016/j.neuron.2012.04.034
42. De Loof E, Van Opstal F, Verguts T (2016) Predictive information
speeds up visual awareness in an individuation task by modulating
threshold setting, not processing efficiency. Vis Res 121:104–112.
https://doi.org/10.1016/j.visres.2016.03.002
43. Yamamoto K, Tanaka S, Kobayashi H, Kozima H, Hashiya K
(2009) A non-humanoid robot in the “uncanny valley”: experi-
mental analysis of the reaction to behavioral contingency in 2–3
year old children. PLoS ONE 4(9):e6974. https://doi.org/10.1371/
journal.pone.0006974
44. Cheetham M, Pavlovic I, Jordan N, Suter P, Jancke L (2013) Cate-
gory processing and the human likeness dimension of the uncanny
valley hypothesis: eye-tracking data. Front Psychol 4:108. https://
doi.org/10.3389/fnhum.2011.00126
45. Tinwell A, Grimshaw M, Williams A (2010) Uncanny behaviour in
survival horror games. J Gaming Virtual Worlds 2(1):3–25. https://
doi.org/10.1386/jgvw.2.1.3_1
46. MacDorman KF, Chattopadhyay D (2016) Reducing consistency
in human realism increases the uncanny valley effect; increasing
category uncertainty does not. Cognition 146:190–205. https://doi.
org/10.1016/j.cognition.2015.09.019
47. Tinwell A, Grimshaw M, Nabi DA (2015) The effect of onset asyn-
chrony in audio–visual speech and the Uncanny Valley in virtual
characters. Int J Mech Robot 2(2):97–110. https://doi.org/10.1504/
IJMRS.2015.068991
48. Lee EJ (2010) The more humanlike, the better? How speech type
and users’ cognitive style affect social responses to computers.
Comput Hum Behav 26(4):665–672. https://doi.org/10.1016/j.chb.
2010.01.003
49. Li M, Guo F, Chen J, Duffy VG (2022) Evaluating users’ audi-
tory affective preference for humanoid robot voices through neural
dynamics. Int J Human–Comput Interact. https://doi.org/10.1080/
10447318.2022.2108586
50. Yorgancigil E, Yildirim F, Urgen BA, Erdogan SB (2022) An
exploratory analysis of the neural correlates of human–robot inter-
actions with functional near infrared spectroscopy. Front Human
Neurosci. https://doi.org/10.3389/fnhum.2022.883905
51. Saygin A, Thierry C, Urgen B, Ishiguro H (2011) Cognitive neu-
roscience and robotics: a mutually beneficial joining of forces. In:
Robotics: science and systems (RSS)
52. Wiese E, Metta G, Wykowska A (2017) Robots as intentional
agents: using neuroscientific methods to make robots appear more
social. Front Psychol 8:1663
Publisher’s Note Springer Nature remains neutral with regard to juris-
dictional claims in published maps and institutional affiliations.
Busra Sarigul is a research associate in Everyday Media Lab, Leibniz
Institut für Wissensmedien (IWM), Germany. She is currently pursu-
ing a doctoral degree in the Department of Psychology, University of
Tübingen. She received her MS in Interdisciplinary Social Psychiatry
and BA in Psychology from Ankara University. Her research interests
include Human-Agent Interaction, Smart Speakers, and Multisensory
Integration. She is currently working on her PhD thesis investigating
the communicative qualities of human-agent interaction based on the
relationship between speech styles and gender.
Burcu A. Urgen is an Assistant Professor at the Department of Psy-
chology, Bilkent University. She is also affiliated with Aysel Sabuncu
Brain Research Center and National Magnetic Resonance Research
Center (UMRAM). She received her PhD in Cognitive Science from
University of California, San Diego (USA) in 2015. Prior to her PhD,
she did her BS in Computer Engineering at Bilkent University, and MS
in Cognitive Science at Middle East Technical University. Following
her PhD, she worked as a postdoctoral researcher at the Department
of Neuroscience, University of Parma (Italy), with Professor Guy A.
Orban. Dr. Ürgen’s primary research area is human visual perception
with a focus on biological motion and action perception. In addition
to behavioral methods, she uses a wide range of invasive and non-
invasive neuroimaging techniques including fMRI, EEG, and intracra-
nial recordings to study the neural basis of visual perception. Her
research commonly utilizes state-of-the-art computational techniques
including machine learning, computer vision, and effective connectiv-
ity. Besides her basic cognitive neuroscience research, Dr. Ürgen also
pursues interdisciplinary research between social robotics and cogni-
tive neuroscience to investigate the human factors that lead to suc-
cessful interaction with artificial agents such as robots. Dr. Ürgen’s
research is supported by TÜBİTAK and TÜSEB grants.
123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
not:
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
control;
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
writing;
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
onlineservice@springernature.com
... Moreover, Sarigul and Urgen [46] showed that participants had shorter reaction times and more accurate responses when interacting with a robot that had a synthetic voice matching its mechanical appearance rather than a human voice. This congruence effect is evidence that matching a robot's voice and appearance improve participants' behavioral responses. ...
... Two different hypotheses can be derived from the literature. On the one hand, it has already been shown that an anthropomorphic appearance should match its communication [45,46] and the emotionality it contains [44]. As a consequence, affective speech, which is a human-like trait, should align better with a human-like robot according to the congruence effect. ...
... The instructions of study II stated that the appearance of the robots should not be considered when evaluating the audio recordings, and at least for the perceived attribution of a mind, the appearance of the robot had neither a positive nor a negative influence. Nevertheless, with this first approach, it was possible to investigate whether the matching hypothesis applies [46,52], which in our case showed that a human-like voice is even more beneficial for a technical appearance, at least for social presence ratings. ...
Article
Full-text available
The attribution of mind to others, either humans or artificial agents, can be conceptualized along two dimensions: experience and agency. These dimensions are crucial in interactions with robots, influencing how they are perceived and treated by humans. Specifically, a higher attribution of agency to robots is associated with greater perceived responsibility, while a higher attribution of experience enhances sympathy towards them. One potential strategy to increase the attribution of experience to robots is the application of affective communication induced via prosody and verbal content such as emotional words and speech style. In two online studies (N = 30 and N = 60, respectively), participants listened to audio recordings in which robots introduced themselves. In study II, robot pictures were additionally presented to investigate potential matching effects between appearance and speech. Our results showed that both the use of emotional words and speaking expressively significantly increased the attributed experience of robots, whereas the attribution of agency remained unaffected. Findings further indicate that speaking expressively and using emotional words enhanced the perception of human-like qualities in artificial communication partners, with a more pronounced effect observed for technical robots compared to human-like robots. These insights can be used to improve the affective impact of synthesized robot speech and thus potentially increase the acceptance of robots to ensure long-term use.
... Ultimately, naturalness research should also systematically consider interactions between vocal and visual aspects of naturalness in combination. Indeed, accumulating evidence suggests a complex interplay of visual appearance, vocal features, behavior, and the interactional context for the acceptance of virtual agents [28, 31–33, 106–113]. ...
... Humanoid robots have a strong representation in the media, which might shape the public's perception and assumed capabilities. For different robot appearances, the expected voice and resulting effects might differ [97]. ...
Article
Full-text available
With the increasing performance of text-to-speech systems, whose generated voices are becoming indistinguishable from natural human speech, the use of these systems for robots raises ethical and safety concerns. A robot with a natural voice could increase trust, which might result in over-reliance despite evidence of robot unreliability. To estimate the influence of a robot's voice on trust and compliance, we design a study that consists of two experiments. In a pre-study (N1 = 60), the most suitable natural and mechanical voices are selected for the main study. Afterward, in the main study (N2 = 68), the influence of a robot's voice on trust and compliance is evaluated in a cooperative game of Battleship with a robot as an assistant. During the experiment, the acceptance of the robot's advice and the response time are measured, which indicate trust and compliance, respectively. The results show that participants expect robots to sound human-like and that a robot with a natural voice is perceived as safer. Additionally, a natural voice can affect compliance: despite repeated incorrect advice, participants are more likely to rely on the robot with the natural voice. The results do not show a direct effect on trust. Natural voices provide increased intelligibility, and while they can increase compliance with the robot, the results indicate that natural voices might not lead to over-reliance. The results highlight the importance of incorporating voice into the design of social robots to improve communication, avoid adverse effects, and increase acceptance and adoption in society.
Article
Full-text available
Functional near infrared spectroscopy (fNIRS) has been gaining increasing interest as a practical mobile functional brain imaging technology for understanding the neural correlates of social cognition and emotional processing in the human prefrontal cortex (PFC). Considering the cognitive complexity of human-robot interactions, the aim of this study was to explore the neural correlates of emotional processing of congruent and incongruent pairs of human and robot audio-visual stimuli in the human PFC with fNIRS methodology. Hemodynamic responses from the PFC region of 29 subjects were recorded with fNIRS during an experimental paradigm which consisted of auditory and visual presentation of human and robot stimuli. Distinct neural responses to human and robot stimuli were detected at the dorsolateral prefrontal cortex (DLPFC) and orbitofrontal cortex (OFC) regions. Presentation of robot voice elicited significantly less hemodynamic response than presentation of human voice in a left OFC channel. Meanwhile, processing of human faces elicited significantly higher hemodynamic activity when compared to processing of robot faces in two left DLPFC channels and a left OFC channel. Significant correlation between the hemodynamic and behavioral responses for the face-voice mismatch effect was found in the left OFC. Our results highlight the potential of fNIRS for unraveling the neural processing of human and robot audio-visual stimuli, which might enable optimization of social robot designs and contribute to elucidation of the neural processing of human and robot stimuli in the PFC in naturalistic conditions.
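As a rough, hypothetical illustration of the kind of congruence analysis described in that abstract (this is not the authors' actual fNIRS pipeline, and all data, channel assumptions, and values below are invented), the following Python sketch computes a per-participant face–voice mismatch effect as the difference between incongruent and congruent responses and correlates it with a behavioral mismatch cost:

```python
import numpy as np
from scipy.stats import pearsonr, ttest_rel

rng = np.random.default_rng(1)
n_subjects = 29  # sample size mentioned in the abstract above

# Invented example data: mean hemodynamic response (arbitrary units) per condition
# in one channel, and a behavioral mismatch cost (incongruent RT minus congruent RT, ms).
congruent = rng.normal(0.10, 0.05, n_subjects)
incongruent = congruent + rng.normal(0.03, 0.05, n_subjects)
rt_cost = rng.normal(40, 20, n_subjects)

# Mismatch effect: incongruent minus congruent response, per participant.
mismatch_effect = incongruent - congruent

# Paired comparison of the two conditions, and a brain-behavior correlation.
t_stat, p_cond = ttest_rel(incongruent, congruent)
r, p_corr = pearsonr(mismatch_effect, rt_cost)

print(f"condition effect: t = {t_stat:.2f}, p = {p_cond:.3f}")
print(f"brain-behavior correlation: r = {r:.2f}, p = {p_corr:.3f}")
```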
Article
Full-text available
Social robots, conversational agents, voice assistants, and other embodied AI are increasingly a feature of everyday life. What connects these various types of intelligent agents is their ability to interact with people through voice. Voice is becoming an essential modality of embodiment, communication, and interaction between computer-based agents and end-users. This survey presents a meta-synthesis on agent voice in the design and experience of agents from a human-centered perspective: voice-based human–agent interaction (vHAI). Findings emphasize the social role of voice in HAI as well as circumscribe a relationship between agent voice and body, corresponding to human models of social psychology and cognition. Additionally, changes in perceptions of and reactions to agent voice over time reveal a generational shift coinciding with the commercial proliferation of mobile voice assistants. The main contributions of this work are a vHAI classification framework for voice across various agent forms, contexts, and user groups, a critical analysis grounded in key theories, and an identification of future directions for the oncoming wave of vocal machines.
Article
Full-text available
Expectations strongly affect and shape our perceptual decision-making processes. Specifically, valid expectations speed up perceptual decisions and determine what we see in a noisy stimulus. Despite the well-established effects of expectations on decision-making, whether and how they affect low-level sensory processes remains elusive. To address this problem, we investigated the effect of expectation on temporal thresholds in an individuation task (detecting the position of an intact image, a house or a face). We found that, compared to a neutral baseline, thresholds increase when the intact images are of the unexpected category, but remain unchanged when they are of the expected category. Using a recursive Bayesian model with dynamic priors, we show that the delay in sensory processes results from the further processing, and hence the longer time, required when expectations are violated. Expectations, however, do not alter internal parameters of the system. These results reveal that sensory processes are delayed when expectations are not met, and a simple, parsimonious computational model can successfully explain this effect.
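To make the recursive-Bayesian account above concrete, here is a minimal, hypothetical Python sketch (not the model reported in that study): it updates a posterior over two hypotheses, "human voice" versus "robot voice", from noisy samples, and shows that a congruent prior (a valid expectation) typically reaches a decision threshold in fewer samples than an incongruent one, mirroring the congruence effect reported in the present article.

```python
import numpy as np

def samples_to_decision(prior_human, true_is_human=True,
                        threshold=0.95, noise=1.0, seed=0):
    """Recursive Bayesian updating over two hypotheses (human vs. robot voice).

    Hypothetical illustration: each noisy sample is drawn from N(+1, noise)
    if the voice is human and N(-1, noise) otherwise; the posterior is
    updated sample by sample until it crosses the decision threshold.
    """
    rng = np.random.default_rng(seed)
    p_human = prior_human                      # initial prior set by the visual cue (assumption for this sketch)
    mean = 1.0 if true_is_human else -1.0
    n = 0
    while max(p_human, 1 - p_human) < threshold:
        x = rng.normal(mean, noise)            # one noisy auditory sample
        like_h = np.exp(-(x - 1.0) ** 2 / (2 * noise ** 2))
        like_r = np.exp(-(x + 1.0) ** 2 / (2 * noise ** 2))
        p_human = like_h * p_human / (like_h * p_human + like_r * (1 - p_human))
        n += 1
    return n

# Congruent cue (expect human, hear human) vs. incongruent cue (expect robot, hear human)
print(samples_to_decision(prior_human=0.8))   # typically fewer samples needed
print(samples_to_decision(prior_human=0.2))   # typically more samples needed
```

The sketch only varies the starting prior; the sensory evidence and internal parameters are identical in both calls, which is the sense in which a violated expectation costs extra processing time rather than changing the system itself.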
Article
Full-text available
As robots advance from the pages and screens of science fiction into our homes, hospitals, and schools, they are poised to take on increasingly social roles. Consequently, the need to understand the mechanisms supporting human–machine interactions is becoming increasingly pressing. We introduce a framework for studying the cognitive and brain mechanisms that support human–machine interactions, leveraging advances made in cognitive neuroscience to link different levels of description with relevant theory and methods. We highlight unique features that make this endeavour particularly challenging (and rewarding) for brain and behavioural scientists. Overall, the framework offers a way to study the cognitive science of human–machine interactions that respects the diversity of social machines, individuals’ expectations and experiences, and the structure and function of multiple cognitive and brain systems.
Conference Paper
Full-text available
It is well established that a robot's visual appearance plays a significant role in how it is perceived. Considerable time and resources are usually dedicated to ensuring that the visual aesthetics of social robots are pleasing to users and help facilitate clear communication. However, relatively little consideration is given to how the voice of the robot should sound, which may have adverse effects on acceptance and clarity of communication. In this study, we explore the mental images people form when they hear robots speaking. In our experiment, participants listened to several voices, and for each voice they were asked to choose a robot, from a selection of eight commonly used social robot platforms, that was best suited to have that voice. The voices were manipulated in terms of naturalness, gender, and accent. Results showed that (a) participants seldom matched robots with the voices that were used in previous HRI studies, (b) the gender and naturalness vocal manipulations strongly affected participants' selections, and (c) the linguistic content of the utterances spoken by the voices did not affect people's selections. This finding suggests that people associate voices with robot pictures even when the content of the spoken utterances is unintelligible. Our findings indicate that both a robot's voice and its appearance contribute to robot perception. Thus, giving a mismatched voice to a robot might introduce a confounding effect in HRI studies. We therefore suggest that voice design should be considered more thoroughly when planning spoken human-robot interactions.
Article
Users' affective preference for voices has become a topic of great interest with the prevalence of humanoid robots. Nevertheless, how affective preferences for humanoid voices are formed remains unknown, and their evaluation lacks objective methods. Consequently, we conducted an EEG experiment to unravel the underlying neural dynamics and evaluate users' affective preference for humanoid robot voices. Significantly larger P2, P3, and LPP amplitudes, enhanced theta oscillations, and decreased alpha oscillations were observed when users affectively preferred humanoid robot voices. The results suggest that the neural dynamics underlying users' affective preference for humanoid robot voices might primarily consist of early detection of affective information in voices, further processing of affective information, and later evaluative categorization of affective preference. Moreover, the neural indicators could distinguish users' affective preferences for humanoid robot voices. The study contributes to understanding how auditory affective preferences for humanoid robot voices are formed and provides a neurological evaluation method.
Article
Sensorimotor signaling is a key mechanism underlying coordination in humans. The increasing presence of artificial agents, including robots, in everyday contexts, will make joint action with them as common as a joint action with other humans. The present study investigates under which conditions sensorimotor signaling emerges when interacting with them. Human participants were asked to play a musical duet either with a humanoid robot or with an algorithm run on a computer. The artificial agent was programmed to commit errors. Those were either human-like (simulating a memory error) or machine-like (a repetitive loop of back-and-forth taps). At the end of the task, we tested the social inclusion toward the artificial partner by using a ball-tossing game. Our results showed that when interacting with the robot, participants showed lower variability in their performance when the error was human-like, relative to a mechanical failure. When the partner was an algorithm, the pattern was reversed. Social inclusion was affected by human-likeness only when the partner was a robot. Taken together, our findings showed that coordination with artificial agents, as well as social inclusion, are influenced by how human-like the agent appears, both in terms of morphological traits and in terms of behaviour.