Human-aware Robot Social Behaviour Dynamic for
Trustworthy Human-Robot Interaction
Chuang Yu
Department of Computer Science
University of Manchester
Manchester, United Kingdom
chuang.yu@manchester.ac.uk
Helen Hastie
Department of Computer Science
Heriot-Watt University
Edinburgh, United Kingdom
h.hastie@hw.ac.uk
Angelo Cangelosi
Department of Computer Science
University of Manchester
Manchester, United Kingdom
angelo.cangelosi@manchester.ac.uk
Abstract—Robots with multimodal social cues can be widely applied in natural human-robot interaction. The physical presence of such robots can be used to explore whether and how a robot can relieve the loneliness and social isolation of older adults. Natural and trustworthy interpersonal communication involves multimodal social cues with verbal and nonverbal behaviors, for example, speech, body language, facial expressions, and gaze. Humans continuously take the attention, intention, and preferences of their interaction partner into account to adjust their own behavior dynamically. Accordingly, a social robot should factor human states into the loop when generating and executing multimodal behaviors for natural and trustworthy human-robot interaction. In this abstract, we explore how to endow a social robot with dynamic social behaviors with the human in the loop, and whether such human-aware robot social behavior dynamics make a difference in trustworthy human-robot interaction.
Index Terms—multimodal social behavior, human-in-the-loop
I. INTRODUCTION
Can human-aware robot social behavior dynamics make a difference in trustworthy HRI? We will explore human-aware multimodal robot behavior generation models, in which robot behavior generation also takes the human interactor's behavior into consideration. Multimodal robot behavior includes verbal behavior (age/gender-controllable robot speech) and nonverbal behavior (speech-driven robot gestures and facial expressions). The pipeline is shown in Fig. 1. These models will be integrated into the robot behavior architecture to validate their effects on human-robot trust.
II. METHODOLOGY AND DISCUSSION
With regard to non-verbal behavior, we will explore interactive, speech-driven robot gesture and facial expression generation with the human state in the loop. Most past work used only the robot's speech to drive gesture/face generation, without considering the state of the human interactor [2]. In real-time human-human interaction, however, a speaker's behavior patterns reflect the interactor's cognitive load, emotion, attention, preferences, and multimodal behaviors. Similarly, a human-in-the-loop robot gesture generation model may make a difference for trustworthy human-robot interaction. This work will build on our previous speech-driven robot gesture generation work [4] to explore its effectiveness; a code sketch of such a human-state-conditioned generator is given after Fig. 1.
We thank UKRI Node on Trust (EP/V026682/1) for support.
Fig. 1. Pipeline of human-in-the-loop robot behavior dynamics.
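The sketch below illustrates one possible form of such a generator: a PyTorch sequence model that maps acoustic speech features, a human-state vector, and latent noise to a gesture sequence. It only loosely follows the GAN generator of [4] (the adversarial training loop is omitted), and the layer sizes, feature dimensions, and conditioning scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    def __init__(self, speech_dim=64, human_dim=8, noise_dim=16, hidden=128, pose_dim=30):
        super().__init__()
        self.rnn = nn.GRU(speech_dim + human_dim + noise_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)   # joint angles / pose parameters per frame

    def forward(self, speech_feats, human_state, noise):
        # speech_feats: (B, T, speech_dim)  acoustic features, e.g. MFCC frames
        # human_state:  (B, human_dim)      estimated attention/emotion/preference vector
        # noise:        (B, noise_dim)      latent noise so one speech can map to many gestures
        T = speech_feats.size(1)
        cond = torch.cat([human_state, noise], dim=-1).unsqueeze(1).expand(-1, T, -1)
        h, _ = self.rnn(torch.cat([speech_feats, cond], dim=-1))
        return self.out(h)                   # (B, T, pose_dim) gesture sequence

# Toy forward pass with random inputs.
gen = GestureGenerator()
poses = gen(torch.randn(2, 100, 64), torch.randn(2, 8), torch.randn(2, 16))
print(poses.shape)  # torch.Size([2, 100, 30])
```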
With regard to verbal behaviour, we focus on speech synthesis, specifically gender-style-controllable robot speech that takes into account both the human's preference and the robot's gender characteristics. Firstly, a robot's gender identity based on its physical appearance may be male, female, or non-binary [1], so robot speech generation should account for this appearance-based gender identity when setting the gender style. Secondly, the robot's gender identity is perceived very subjectively by each human interactor, so gender-based speech generation should also adapt to individual human preferences. This work will extend our previous genderless robot speech generation work [3]; a sketch of the style-embedding control is given below.
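As a concrete illustration, the following NumPy sketch interpolates between a male and a female speech-style embedding, in the spirit of the gender-free embedding f(Z1, Z2) used in [3], with the mixing weight driven by the robot's appearance-based gender and an individual user-preference offset. The interpolation function, the embedding dimensionality, and how preferences are obtained are assumptions rather than the method of this abstract.

```python
import numpy as np

def gender_style_embedding(z_male: np.ndarray, z_female: np.ndarray, alpha: float) -> np.ndarray:
    """alpha = 0 -> male style, alpha = 1 -> female style, alpha = 0.5 -> gender-free midpoint."""
    return (1.0 - alpha) * z_male + alpha * z_female

# z_male / z_female would come from a speaker-gender encoder; random vectors stand in here.
rng = np.random.default_rng(0)
z_male, z_female = rng.normal(size=256), rng.normal(size=256)

robot_appearance_alpha = 0.5   # e.g. a robot with no clear gendered appearance starts at the midpoint
user_preference_shift = 0.1    # hypothetical per-user adjustment elicited during interaction
alpha = float(np.clip(robot_appearance_alpha + user_preference_shift, 0.0, 1.0))

z_style = gender_style_embedding(z_male, z_female, alpha)   # fed to the TTS synthesizer as its style input
print(z_style.shape)
```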
REFERENCES
[1] Perugia, G., Guidi, S., Bicchi, M., Parlangeli, O.: The shape of our bias: Perceived age and gender in the humanoid robots of the ABOT database. In: Proceedings of the 2022 ACM/IEEE International Conference on Human-Robot Interaction. pp. 110–119 (2022)
[2] Yoon, Y., Cha, B., Lee, J.H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics (TOG) 39(6), 1–16 (2020)
[3] Yu, C., Fu, C., Chen, R., Tapus, A.: First attempt of gender-free speech style transfer for genderless robot. In: Proceedings of the 2022 ACM/IEEE International Conference on Human-Robot Interaction. pp. 1110–1113 (2022)
[4] Yu, C., Tapus, A.: SRG3: Speech-driven robot gesture generation with GAN. In: 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV). pp. 759–766. IEEE (2020)