
Towards a Framework for Social Robot Co-speech Gesture Generation with Semantic Expression


Heng Zhang1, Chuang Yu2, and Adriana Tapus1
1Autonomous Systems and Robotics Lab/U2IS, ENSTA Paris, Institut
Polytechnique de Paris, Paris, France {heng.zhang,adriana.tapus}
2Cognitive Robotics Laboratory, University of Manchester, Manchester, UK
Abstract. The ability of a robot to express semantic co-speech gestures in an
appropriate manner is needed to enhance the interaction between humans and
social robots. However, most learning-based methods for robot gesture
generation are unsatisfactory at expressing semantic gestures. Many generated
gestures are ambiguous, which makes it difficult for them to deliver semantic
meaning accurately. In this paper, we propose a robot gesture generation
framework that can effectively improve the semantic gesture expression ability
of social robots. In this framework, the semantic words in a sentence are
selected and expressed by clear and understandable co-speech gestures with
appropriate timing. To test the proposed method, we designed an experiment and
conducted a user study. The results show that the gestures generated by the
proposed method are rated significantly higher than the baseline gestures on
three evaluation factors: human-likeness, naturalness, and easiness to
understand.
Keywords: Social Robot, Semantic, Robot gesture, Human-Robot Interaction
1 Introduction
Co-speech gestures play an important role in human communication [1]. They are
semantically related to speech and increase the efficiency of information
exchange [2]. They can also convey the speaker’s emotion and attitude, which
makes the communication be perceived as more natural [3]. As in interaction
between humans, a social robot is expected to behave like a human by using
co-speech gestures when it communicates with people [4, 5], which helps build
users’ confidence in long-term interaction with the robot.
Psychological research on co-speech gestures has been quite fruitful. One of
the most widely known works is the theory of David McNeill [6, 8]. According to
his theory [6], gestures can be classified into two basic types: the imagistic
gesture and the beat gesture. The imagistic gesture includes three subtypes:
the iconic gesture, the metaphoric gesture, and the deictic gesture.
As mentioned before, gestures play important roles in face-to-face conversation.
The imagistic gesture is usually related to the semantic content of speech. For
example, people show two empty, open palms to reinforce the “no idea” in their
words [6]. Unlike the imagistic gesture, the beat gesture is usually not related
to semantic content but is used to emphasize the intonation of speech, thus
conveying the speaker’s emotions [7]. Both the imagistic gesture and the beat
gesture help the speaker to express themselves more clearly or in a more subtle
way.

Supported by ENSTA Paris, Institut Polytechnique de Paris, France and the CSC
PhD Scholarship.

According to the Computers Are Social Actors (CASA) paradigm [9], robots are
also expected to be endowed with human social capabilities even though they are
man-made intelligent agents. Consequently, a natural and human-like co-speech
gesture generation method for social robots should take both the imagistic
gesture and the beat gesture into account. In this paper, we introduce a
co-speech gesture generation framework that considers both kinds of gestures.
The rest of this paper is structured as follows. Section 2 discusses related
work on co-speech gesture generation. Section 3 introduces the proposed
co-speech gesture generation framework. Sections 4 and 5 present the
experimental design and the results, respectively. Finally, Section 6 gives the
conclusions and future work.
2 Related Work
There is a body of related work on co-speech gesture generation. Although some
methods were originally intended to generate actions for virtual agents, they
can also be applied to humanoid social robots. According to the way gestures
are generated, most methods fall into two categories: rule-based methods
[11, 19] and data-driven methods [17, 18].
The rule-based method uses handcrafted rules to generate gestures. Once the
rules are designed, it is a simple and convenient way for users to generate
gestures. Kopp et al. proposed a behavior generation framework that includes an
XML-based language, BML (Behavior Markup Language). It defines rules of gesture
expression and can be coupled with other gesture generation systems [13]. The
BEAT toolkit developed by the MIT Media Lab is a well-known rule-based gesture
generation system. It integrates rules derived from prior knowledge of human
conversation. Given an input text, BEAT outputs a description of the co-speech
gesture, which can be used in an animation system to generate the specific
gesture sequence [10]. Some commercial social robots also provide rule-based
gesture generation methods that let users design robot gestures quickly, such
as the AnimatedSpeech function of the Pepper and Nao robots [12]. Rule-based
methods are simpler and more convenient than fully manual design, and the
generated gestures can express semantics clearly. However, because the rules
are limited, the gestures are usually monotonous and repetitive in form, which
is not conducive to long-term human-robot interaction.
Fig. 1. Overview of our proposed framework
With the development of deep learning, the data-driven method has been
increasingly used in the domain of Human-Robot Interaction [14, 15]. Yoon et al.
introduced an end-to-end co-speech gesture generation method. They used the
TED Gesture Dataset to train an Encoder-Decoder model, and the trained model
generates a gesture sequence from input text [16]. Kucherenko et al. proposed a
framework that uses both the audio and the text of speech as input, so that
both acoustic and semantic representations contribute to co-speech gesture
generation [17]. Compared to the rule-based method, the data-driven method
performs much better in terms of variety. However, data-driven methods are weak
at expressing semantics. The main reason is that the training goal in a
data-driven approach is to make the loss (the difference between the generated
angles and the actual angles) as small as possible so that the model converges.
The co-speech gestures that humans use to express semantic meaning, by
contrast, are usually clear and unique. The data-driven model cannot understand
the semantics of a gesture, so the imagistic gesture and the beat gesture carry
the same weight during training. Therefore, although the loss is small, the
generated results are ambiguous, which makes it difficult for them to deliver
semantic meaning accurately.
In view of the shortcomings of these two kinds of methods, we propose a
co-speech gesture generation framework that combines their advantages and
avoids their disadvantages, as shown in Figure 1.
In this framework, we built a semantic gesture dataset named SWAG (Semantic
Word Associated Gesture) for expressing semantics, and an Encoder-Decoder model
for generating baseline gestures. The SWAG dataset contains a list of
frequently used semantic gestures, each corresponding to a semantic word. These
gestures are extracted with BlazePose [20] and saved as TXT files. The
Encoder-Decoder neural network generates the baseline gesture from the acoustic
features of the speech audio. Moreover, a semantic gesture selection mechanism
and a co-speech gesture synthesis mechanism select the semantic gesture from
SWAG and use word-level timestamps to embed the selected gesture into the
baseline gesture appropriately. Our current work makes two contributions:
1) The first semantic gesture dataset, SWAG, which contains a list of
semantics-related gestures;
2) A co-speech gesture synthesis mechanism that uses the word-level timestamp
sequence to embed the semantic gesture into the baseline gesture at the
appropriate time.
3 Gesture Generation Framework
Our proposed gesture generation framework mainly includes four parts: an
Encoder-Decoder beat gesture generation model, a semantic word associated
gesture dataset (SWAG), a semantic gesture selection mechanism, and a co-speech
gesture synthesis mechanism. These four parts work together to generate
co-speech gestures that clearly deliver semantics.
3.1 Encoder-Decoder Model
In this paper, we used an Encoder-Decoder model to generate the baseline
gestures. The model takes the speech audio as input and outputs the
corresponding co-speech gesture. The architecture of the model is shown in
Figure 2.
The model consists of two components, the Encoder and the Decoder, and we
divided the model training into two steps. The model is trained on the gesture
dataset built by Ylva Ferstl et al. [23], which contains speech audio and
motion data in BioVision Hierarchy (BVH) format.
Fig. 2. The Encoder-Decoder model used to generate baseline gesture
Firstly, we used an Autoencoder model to learn the gesture representation. The
Autoencoder has a bottleneck layer that forces the network to compress the
representation of the original input data, thus obtaining the representation of
the gesture $g$ in a lower dimensionality. In the Autoencoder model, the input
is the motion $m$ and the output is the reconstructed motion $\hat{m}$. We
defined the loss function as:
$$\mathcal{L}(m, \hat{m}) = \arg\min \lVert \hat{m} - m \rVert_2^2$$
In this step, the gesture representations $g$ and the parameters of the Decoder
were kept. In the next step, we trained an Encoder to map the speech audio to
the gesture representation $g$. The input is the Mel-Frequency Cepstral
Coefficients (MFCCs) extracted from the speech audio, and the target output is
the gesture representation obtained in the previous step. In this step, the
parameters of the Encoder were kept. This model is partly based on the work of
Kucherenko et al. [22].
After training these two parts separately, we connected them to form a complete
Encoder-Decoder model.
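The two training steps and the way the parts are connected can be summarized
compactly. This is only a notational sketch: the symbols $E_m$ and $D_m$ (the
encoder and decoder halves of the Autoencoder), $E_s$ (the speech Encoder), $s$
(the MFCC features) and the parameter sets $\theta$ are introduced here for
illustration, and the squared-error objective in the second step is an
assumption, since the text does not state which loss was used there.

```latex
\begin{align}
\text{Step 1:}\quad & (\theta_{E_m}^{*}, \theta_{D_m}^{*})
    = \arg\min_{\theta_{E_m},\,\theta_{D_m}}
      \lVert D_m(E_m(m)) - m \rVert_2^2,
    \qquad g = E_m(m) \\
\text{Step 2:}\quad & \theta_{E_s}^{*}
    = \arg\min_{\theta_{E_s}} \lVert E_s(s) - g \rVert_2^2 \\
\text{Inference:}\quad & \hat{m} = D_m(E_s(s))
\end{align}
```

In this reading, $E_m$ is only needed during training to produce the targets
$g$; at generation time the speech Encoder feeds the kept Decoder directly.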
3.2 SWAG Dataset
In this work, we built an extensible semantic gesture dataset named SWAG. The
dataset contains a list of semantic gestures (joint angles) stored as TXT
files, and each gesture corresponds to a semantic word. The dataset is
preliminary and can easily be expanded in future work.
Fig. 3. The Joints recognized by the BlazePose
We divided the process of building the dataset into three steps. Firstly, we
selected semantic words from the semantic word list of the AnimatedSpeech
function of the Pepper robot [21]. At this early stage, we listed 67 frequently
used semantic words. Secondly, a volunteer performed co-speech gestures for all
of these words. His gestures were recorded with a Logitech HD Webcam C930e
(1080p, 30 FPS), and the video was split into 67 clips. Thirdly, the 3D
coordinates of the joints were extracted with BlazePose, as shown in Figure 3.
Since we only need the motion data of the upper body, we saved only the 3D
coordinates of the following joints: wrist, elbow, shoulder, and hip.
Considering the actual degrees of freedom (DOF) of the Pepper robot, the
rotation angles of each joint were calculated from the 3D coordinates in every
frame. In addition, we used the Dlib toolkit and the solvePnP function of
OpenCV to estimate the head pose. To smooth the gestures, all the extracted
motion data was processed by a median filter (kernel = 5).
Finally, we obtained a dataset of 67 TXT-based gesture files.
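As a rough illustration of the last two processing steps, the sketch below
computes the angle at one joint from three 3D points and smooths a per-frame
angle sequence with a median filter of kernel size 5. It is a minimal sketch
under stated assumptions, not the pipeline used to build SWAG: the function
names are hypothetical, the geometric angle between two limb segments is only
a stand-in for the mapping to Pepper's actual joint axes (which the text does
not detail), and truncating the filter window at the sequence ends is an
assumed edge policy.

```python
import math
from statistics import median


def angle_at_joint(a, b, c):
    """Angle in radians at point b between segments b->a and b->c.

    a, b, c are (x, y, z) tuples, e.g. shoulder, elbow and wrist.
    """
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0 or n2 == 0:
        raise ValueError("two of the points coincide")
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_angle = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.acos(cos_angle)


def median_filter(values, kernel=5):
    """Sliding median; the window is truncated at both ends (assumption)."""
    if kernel % 2 == 0:
        raise ValueError("kernel must be odd")
    half = kernel // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# Usage with made-up coordinates: a right angle at the elbow, then a
# single-frame spike removed by the filter.
elbow_angle = angle_at_joint((1, 0, 0), (0, 0, 0), (0, 1, 0))
smoothed = median_filter([0.0, 0.0, 10.0, 0.0, 0.0])
```

A median filter is a natural choice here because pose estimators tend to
produce isolated single-frame outliers, which a median removes without
blurring the rest of the motion.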
3.3 Co-speech Gesture Synthesis Mechanism
In this paper, we propose a co-speech gesture synthesis mechanism to synthesize
gestures with semantics. The mechanism addresses two main problems:
1) Which words in a sentence should be selected to be expressed by the semantic
gesture?
2) How should the timing of a semantic gesture embedding be determined?
The mechanism requires the baseline gesture, the semantic gesture, the
timestamp sequence, and the original audio. We introduced how to generate the
baseline gesture in Section 3.1. It is worth mentioning that the baseline
gestures generated by the NN model are in the form of Euler angles. To retarget
them to the robot, we further converted them into joint angles. Next, we
explain how to select the appropriate gestures from SWAG for the semantic words
of a sentence.
Fig. 4. The interpolation
After the speech audio is input into our framework, it is converted to text
through the Audio2Text program. Meanwhile, we obtain a list of spoken words and
a word-level timestamp sequence by using the Microsoft Azure SDK. Firstly, we
used the rule-based toolkit BEAT to determine the words to be selected to
express semantics. These words are then filtered according to whether
corresponding motion data exists in the SWAG dataset. Secondly, the timestamps
of the selected words are used to determine the gesture embedding timing. To
connect the semantic gesture and the baseline gesture smoothly, we interpolated
between the joint angle data of the ten frames before timestamp T of the
baseline motion and the beginning of the semantic motion. The same operation is
done at the end of the embedded semantic motion, as shown in Fig. 4. In this
way, the semantic gesture data replaces the baseline data at the appropriate
position. Finally, the input audio is also used as the speech audio of the
robot.
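The selection and embedding steps can be sketched as follows. This is a
minimal sketch under several assumptions rather than the implementation used
in this work: all function and parameter names are hypothetical, the set
`candidates` merely stands in for the words proposed by BEAT (whose interface
is not reproduced here), the interpolation is assumed to be linear, the
blend-out is assumed to span ten frames like the blend-in, and the baseline is
assumed to run at 30 frames per second. Each frame is a list of joint angles.

```python
def select_semantic_words(words, timestamps, candidates, swag):
    """Return (word, start_time) pairs for candidate words that have a gesture.

    `candidates` stands in for the rule-based selection; `swag` maps a
    lowercase word to its gesture frames.
    """
    return [
        (w, t)
        for w, t in zip(words, timestamps)
        if w.lower() in candidates and w.lower() in swag
    ]


def interior_points(start, end, steps):
    """`steps` evenly spaced frames strictly between `start` and `end`."""
    return [
        [s + (e - s) * k / (steps + 1) for s, e in zip(start, end)]
        for k in range(1, steps + 1)
    ]


def embed_gesture(baseline, semantic, t_sec, fps=30, blend=10):
    """Overwrite baseline frames with `semantic`, starting at time t_sec.

    The `blend` frames before the gesture and the `blend` frames after it
    are replaced by linear transitions (assumed interpolation scheme).
    """
    if not semantic:
        raise ValueError("semantic gesture is empty")
    start = round(t_sec * fps)
    end = start + len(semantic)
    if start - blend < 1 or end + blend >= len(baseline):
        raise ValueError("not enough baseline frames around the gesture")
    out = [list(frame) for frame in baseline]
    # Blend in: from the last untouched baseline frame to the first semantic frame.
    out[start - blend:start] = interior_points(
        baseline[start - blend - 1], semantic[0], blend
    )
    out[start:end] = [list(frame) for frame in semantic]
    # Blend out: from the last semantic frame back to the baseline.
    out[end:end + blend] = interior_points(
        semantic[-1], baseline[end + blend], blend
    )
    return out


# Usage with made-up data: one joint, a flat baseline, a 5-frame gesture.
swag = {"no": [[1.0]] * 5}
picked = select_semantic_words(
    ["I", "have", "no", "idea"], [0.0, 0.2, 0.5, 0.7], {"no", "idea"}, swag
)
baseline = [[0.0] for _ in range(40)]
for word, t in picked:
    baseline = embed_gesture(baseline, swag[word.lower()], t)
```

The output has the same length as the baseline, so the motion stays aligned
with the speech audio; only the frames around each selected word change.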
4 Experimental Design
The previous parts explained the principles of the co-speech gesture generation
method. In this section, we introduce the experiment that we conducted to test
the performance of our proposed method.
4.1 Videos
We used the proposed method and the baseline method (only the Encoder-Decoder
model) to generate gestures for 5 fixed short paragraphs. The resulting set of
10 gesture sequences was played on the Pepper robot, a humanoid social robot
developed by SoftBank Robotics. We recorded Pepper’s gestures in 10 videos with
a camera and generated the corresponding subtitles for each video.
4.2 Questionnaire and Participants
These videos were then used to build a questionnaire, in which the videos were
presented in random order. After each video, we asked the participants to
answer three questions about the degree of human-likeness, naturalness, and
easiness of understanding of the speech:
1) To what extent are the gestures of the robot similar to those of a human?
2) How natural are the robot’s gestures?
3) To what extent do the robot’s gestures make the speech easier to understand?
All questions used a 10-point Likert scale, where 10 indicates the best
performance. The questionnaire was distributed online to students at Institut
Polytechnique de Paris and the University of Manchester. 50 participants took
part in our online study.
5 Results
Participants’ ratings of the robot’s 10 performances are summarized in Figure 5.
We used the Shapiro-Wilk test to check the normality of the data, and the data
did not conform to a normal distribution. Consequently, we used the
Kruskal-Wallis test to check for statistical differences between the gestures
generated by the two methods on each of the three evaluation factors (i.e.,
human-likeness, naturalness, and easiness to understand).
Fig. 5. The results obtained by the 2 methods with respect to the three factors (i.e.,
human-likeness, naturalness and easiness to understand)
For the factor “human-likeness”, $\chi^2 = 85.9$, $p < 0.01$; for
“naturalness”, $\chi^2 = 65.7$, $p < 0.01$; and for “easiness to understand”,
$\chi^2 = 70.8$, $p < 0.01$. These results confirm that there are significant
differences between participants’ ratings of the gestures generated by the
proposed method and those generated by the baseline method on all three
factors. In addition, on all three factors, the ratings of the gestures
generated by the proposed method are significantly higher than those of the
gestures generated by the baseline method.
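For readers unfamiliar with the test, the sketch below computes the
tie-corrected Kruskal-Wallis H statistic, which is approximately
$\chi^2$-distributed and is the quantity reported above. It is an
illustrative sketch, not the analysis script of this study: the function names
are hypothetical and the ratings in the usage lines are invented for
demonstration. With two groups there is one degree of freedom, in which case
the $\chi^2$ tail probability reduces to `erfc(sqrt(H / 2))`.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Tie-corrected Kruskal-Wallis H statistic for two or more groups."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, pos = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[pos:pos + len(g)])
        total += rank_sum ** 2 / len(g)
        pos += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    return h / correction


def p_value_two_groups(h):
    """Chi-square tail probability with 1 degree of freedom."""
    return math.erfc(math.sqrt(h / 2))


# Usage with invented 10-point ratings (not the data of this study).
proposed = [8, 7, 9, 8, 6, 9, 7]
baseline = [5, 6, 4, 6, 5, 7, 3]
h = kruskal_wallis_h(proposed, baseline)
p = p_value_two_groups(h)
```

The tie correction matters for Likert data, where many participants give the
same score; without it, H is underestimated.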
6 Conclusions and Future Work
In this work, we proposed a robot co-speech gesture generation framework that
combines the rule-based method and the learning-based method to improve the
robot’s ability to express semantics by embedding semantic gestures into the
baseline gesture. The user study showed that the gestures generated by the
proposed method are rated significantly higher than the baseline gestures with
respect to “human-likeness”, “naturalness”, and “easiness to understand”.
Some parts of our current work could still be improved. Firstly, the SWAG
dataset is small and should be further expanded to cover more dialogue
scenarios. Secondly, the semantic word selection mechanism is still rudimentary
and not conducive to the goal of end-to-end gesture generation. Future work
will focus on training a neural network to replace the current mechanism.
References
1. J. A. Graham and M. Argyle, “A cross-cultural study of the communication of
extra-verbal meaning by gestures (1),” International Journal of Psychology,
vol. 10, no. 1, pp. 57–67, 1975.
2. J. Holler and K. Wilkin, “Communicating common ground: How mutually shared
knowledge influences speech and gesture in a narrative task,” Language and
Cognitive Processes, vol. 24, no. 2, pp. 267–289, 2009.
3. S. Mozziconacci, “Emotion and attitude conveyed in speech by means of
prosody,” in For the 2nd Workshop on Attitude, Personality and Emotions in
User-Adapted Interaction, 2001, pp. 1–10.
4. S. Rossi, A. Rossi, and K. Dautenhahn, “The secret life of robots:
Perspectives and challenges for robot’s behaviours during non-interactive
tasks,” International Journal of Social Robotics, vol. 12, no. 6,
pp. 1265–1278, 2020.
5. A. Tapus, M. J. Matarić, and B. Scassellati, “The grand challenges in
socially assistive robotics,” IEEE Robotics and Automation Magazine, vol. 14,
no. 1, pp. N–A, 2007.
6. D. McNeill, “Hand and mind,” Advances in Visual Semiotics, p. 351, 1992.
7. E. Bozkurt, Y. Yemez, and E. Erzin, “Multimodal analysis of speech and arm
motion for prosody-driven synthesis of beat gestures,” Speech Communication,
vol. 85, pp. 29–42, 2016.
8. D. McNeill, “So you think gestures are nonverbal?” Psychological Review,
vol. 92, no. 3, p. 350, 1985.
9. B. Reeves and C. Nass, “The media equation: How people treat computers,
television, and new media like real people,” Cambridge, UK, vol. 10,
p. 236605, 1996.
10. J. Cassell, H. H. Vilhjálmsson, and T. Bickmore, “Beat: the behavior
expression animation toolkit,” in Life-Like Characters. Springer, 2004,
pp. 163–185.
11. P. Bremner, A. G. Pipe, M. Fraser, S. Subramanian, and C. Melhuish, “Beat
gesture generation rules for human-robot interaction,” in RO-MAN 2009-The 18th
IEEE International Symposium on Robot and Human Interactive Communication.
IEEE, 2009, pp. 1029–1034.
12. A. K. Pandey and R. Gelin, “A mass-produced sociable humanoid robot: Pepper:
The first machine of its kind,” IEEE Robotics & Automation Magazine, vol. 25,
no. 3, pp. 40–48, 2018.
13. S. Kopp, B. Krenn, S. Marsella, A. N. Marshall, C. Pelachaud, H. Pirker,
K. R. Thórisson, and H. Vilhjálmsson, “Towards a common framework for
multimodal generation: The behavior markup language,” in International
Workshop on Intelligent Virtual Agents. Springer, 2006, pp. 205–217.
14. C. Yu and A. Tapus, “Interactive robot learning for multimodal emotion
recognition,” in International Conference on Social Robotics. Springer, 2019,
pp. 633–642.
15. C. Yu, C. Fu, R. Chen, and A. Tapus, “First attempt of gender-free speech
style transfer for genderless robot,” in Proceedings of the 2022 ACM/IEEE
International Conference on Human-Robot Interaction, 2022, pp. 1110–1113.
16. Y. Yoon, W.-R. Ko, M. Jang, J. Lee, J. Kim, and G. Lee, “Robots learn social
skills: End-to-end learning of co-speech gesture generation for humanoid
robots,” in 2019 International Conference on Robotics and Automation (ICRA).
IEEE, 2019, pp. 4303–4309.
17. T. Kucherenko, P. Jonell, S. van Waveren, G. E. Henter, S. Alexandersson,
I. Leite, and H. Kjellström, “Gesticulator: A framework for semantically-aware
speech-driven gesture generation,” in Proceedings of the 2020 International
Conference on Multimodal Interaction, 2020, pp. 242–250.
18. C. Yu and A. Tapus, “Srg 3: Speech-driven robot gesture generation with
gan,” in 2020 16th International Conference on Control, Automation, Robotics
and Vision (ICARCV). IEEE, 2020, pp. 759–766.
19. C.-M. Huang and B. Mutlu, “Robot behavior toolkit: generating effective
social behaviors for robots,” in 2012 7th ACM/IEEE International Conference on
Human-Robot Interaction (HRI). IEEE, 2012, pp. 25–32.
20. V. Bazarevsky, I. Grishchenko, K. Raveendran, T. Zhu, F. Zhang, and
M. Grundmann, “Blazepose: On-device real-time body pose tracking,” arXiv
preprint arXiv:2006.10204, 2020.
21. ALAnimatedSpeech,
22. T. Kucherenko, D. Hasegawa, G. E. Henter, N. Kaneko, and H. Kjellström,
“Analyzing input and output representations for speech-driven gesture
generation,” in Proceedings of the 19th ACM International Conference on
Intelligent Virtual Agents, 2019, pp. 97–104.
23. Y. Ferstl and R. McDonnell, “Investigating the use of recurrent motion
modelling for speech gesture generation,” in Proceedings of the 18th
International Conference on Intelligent Virtual Agents, 2018, pp. 93–98.
ResearchGate has not been able to resolve any citations for this publication.
Conference Paper
Full-text available
The pipeline of gender-free robot speech synthesis. The text-to-speech (TTS) synthesizer takes the genderless speech style embedding and text as inputs to output the gender-free voice, which can be used in the genderless robot, for example, the Pepper robot. The genderless speech style embedding is a function of the male and female speech style embedding distributions. The male and female embedding are extracted from a speaker gender encoder. The speaker gender encoder is independent from TTS synthesizer during training. Z 1 is the male speech style embedding and Z 2 is the female speech style embedding. f (Z 1 , Z 2) is the gender-free speech style embedding, which is obtained from the existed female and male speech style embedding. Abstract-Some robots for human-robot interaction are designed with female or male physical appearance. Other robots are endowed with no gender characteristics, namely genderless robots, such as Pepper and NAO robot. A robot with male or female physical appearance should possess the mapped speech gender style during a natural human-robot interaction, which can be learned from humans' male or female speech. In this paper, we make a new trial to synthesis gender-free speeches for physically genderless robots, which is promising in order to improve a more natural human-robot interaction with genderless robots. Our gender style-controlled speech synthesizer takes the speech text and gender style embedding as inputs to generate speech audio. A speech gender encoder network is used to extract the embedding of the speech gender style with female and male speeches as input. Based on the distribution of the female and male gender style embedding, we explore the gender-free speech style embedding space where we sample some gender-free embedding vectors to generate genderless speech audio. This is a preliminary work where we show how the genderless speech audio wave will be synthesized from text. 
Index Terms-genderless robot speech, speech style transfer, text-to-speech synthesis I. I NTRODUCTION Gender is an essential characteristic of humans. How about robot gender identity? Robot gender characteristics in robotics design, including robot behavior design and appearance design , make a big difference in human-robot interaction [1]. Several works show that robot's behaviors, including verbal and non-verbal behaviors, should match its gender to make an appropriate human-robot interaction [2] [3]. How about the genderless robots, for example, the Pepper robot and NAO robot, who have no clear gender identity [4] [5]? It is not suitable to use the speech with male or female style on a genderless robot. We explore here the area of gender-free speech synthesis for the genderless robots. In the paper, the terminology genderless and gender free mean that a robot does not identify with male or female and that it is difficult or not possible to recognize it as male or female from the generated speech. Recently, supervised and unsupervised text-to-speech (TTS)
Conference Paper
Full-text available
The human gestures occur spontaneously and usually they are aligned with speech, which leads to a natural and expressive interaction. Speech-driven gesture generation is important in order to enable a social robot to exhibit social cues and conduct a successful human-robot interaction. In this paper, the generation process involves mapping acoustic speech representation to the corresponding gestures for a humanoid robot. The paper proposes a new GAN (Generative Adversarial Network) architecture for speech to gesture generation. Instead of the fixed mapping from one speech to one gesture pattern, our end-to-end GAN structure can generate multiple mapped gestures patterns from one speech (with multiple noises) just like humans do. The generated gestures can be applied to social robots with arms. The evaluation result shows the effectiveness of our generative model for speech-driven robot gesture generation.
Full-text available
Some applications of service robots within domestic and working environments are envisaged to be a significant part of our lives in the not too distant future. They are developed to autonomously accomplish different tasks either on behalf of or in collaboration with a human being. Robots can perceive and interpret data from the external environment, so they also collect personal information and habits; they can plan, navigate, and manipulate objects, eventually intruding in our personal space and disturbing us in the current activities. Indeed, such capabilities need to be socially enhanced to ensure their effective deployment and to favour a significant social impact. The modelling and evaluation of a service robot’s behaviour, while not interacting with a human, have only been marginally considered in the last few years. But these can be expected to play a key role in developing socially acceptable robotic applications that can be used widely. To explore this research direction, we present research objectives related to the effective development of socially-aware service robots that are not involved in tasks that require explicit interaction with a person. Such discussion aims at highlighting some of the future challenges that will be posed for the social robotics community in the next years.
Full-text available
As robotics technology evolves, we believe that personal social robots will be one of the next big expansions in the robotics sector. Based on the accelerated advances in this multidisciplinary domain and the growing number of use cases, we can posit that robots will play key roles in everyday life and will soon coexist with us, leading all people to a smarter, safer, healthier, and happier existence.
Conference Paper
Interaction plays a critical role in skills learning for natural communication. In human-robot interaction (HRI), robots can get feedback during the interaction to improve their social abilities. In this context, we propose an interactive robot learning framework using mul-timodal data from thermal facial images and human gait data for online emotion recognition. We also propose a new decision-level fusion method for the multimodal classification using Random Forest (RF) model. Our hybrid online emotion recognition model focuses on the detection of four human emotions (i.e., neutral, happiness, angry, and sadness). After conducting offline training and testing with the hybrid model, the accuracy of the online emotion recognition system is more than 10% lower than the offline one. In order to improve our system, the human verbal feedback is injected into the robot interactive learning. With the new online emotion recognition system, a 12.5% accuracy increase compared with the online system without interactive robot learning is obtained.
Conference Paper
This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.
Conference Paper
The growing use of virtual humans demands generating increasingly realistic behavior for them while minimizing cost and time. Gestures are a key ingredient for realistic and engaging virtual agents, and consequently, automated gesture generation has been a popular area of research. So far, good gesture generation has relied on explicit formulation of if-then rules and probabilistic modelling of annotated features. Machine learning approaches have yielded only marginal success, indicating a high complexity of the speech-to-motion learning task. In this work, we explore transfer learning from previous motion modelling research to improve learning outcomes for gesture generation from speech. We use a recurrent network with an encoder-decoder structure that takes in prosodic speech features and generates a short sequence of gesture motion. We pre-train the network with a motion modelling task. We recorded a large multimodal database of conversational speech for the purpose of this work.
We propose a framework for joint analysis of speech prosody and arm motion towards automatic synthesis and realistic animation of beat gestures from speech prosody and rhythm. In the analysis stage, we first segment motion capture data and speech audio into gesture phrases and prosodic units via temporal clustering, and assign a class label to each resulting gesture phrase and prosodic unit. We then train a discrete hidden semi-Markov model (HSMM) over the segmented data, where gesture labels are hidden states with duration statistics and frame-level prosody labels are observations. The HSMM structure allows us to effectively map sequences of shorter-duration prosodic units to longer-duration gesture phrases. Also in the analysis stage, we construct a gesture pool consisting of gesture phrases segmented from the available dataset, where each gesture phrase is associated with a class label and a speech rhythm representation. In the synthesis stage, we use a modified Viterbi algorithm with a duration model that decodes the optimal gesture label sequence with duration information over the HSMM, given a sequence of prosody labels. In the animation stage, the synthesized gesture label sequence with duration and speech rhythm information is mapped into a motion sequence by using a multiple-objective unit selection algorithm. Our framework is tested using two multimodal datasets in speaker-dependent and speaker-independent settings. The resulting motion sequence, when accompanied by the speech input, yields natural-looking and plausible animations. We use objective evaluations to set parameters of the proposed prosody-driven gesture animation system, and subjective evaluations to assess the quality of the resulting animations. The conducted subjective evaluations show that the difference between the proposed HSMM-based synthesis and the motion capture synthesis is not statistically significant. Furthermore, the proposed HSMM-based synthesis is rated significantly better than a baseline synthesis which animates random gestures based only on joint angle continuity.
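The synthesis step described above, decoding gesture labels with durations from prosody labels, can be sketched as a textbook explicit-duration Viterbi decoder. This is not the authors' modified algorithm; the state names, observation labels, and all probabilities below are hypothetical toy values.

```python
import math

NEG_INF = float("-inf")


def hsmm_viterbi(obs, states, log_init, log_trans, log_dur, log_emit):
    """Return the best [(state, duration), ...] segmentation of obs.

    log_dur[s][d - 1] is the log-probability that state s lasts d frames;
    self-transitions are disallowed because durations are modelled explicitly.
    """
    T = len(obs)
    # best[t][s]: best log-score for obs[:t] when the last segment is in state s.
    best = [{s: NEG_INF for s in states} for _ in range(T + 1)]
    back = [{s: None for s in states} for _ in range(T + 1)]
    for t in range(1, T + 1):
        for s in states:
            emit_sum = 0.0
            for d in range(1, min(len(log_dur[s]), t) + 1):
                emit_sum += log_emit[s][obs[t - d]]  # extend segment backwards
                start = t - d
                if start == 0:
                    entry, prev = log_init[s], None
                else:
                    entry, prev = max(
                        ((best[start][p] + log_trans[p][s], p)
                         for p in states if p != s),
                        default=(NEG_INF, None),
                    )
                score = entry + log_dur[s][d - 1] + emit_sum
                if score > best[t][s]:
                    best[t][s] = score
                    back[t][s] = (d, prev)
    # Backtrace from the best final state.
    state = max(states, key=lambda s: best[T][s])
    segments, t = [], T
    while t > 0:
        d, prev = back[t][state]
        segments.append((state, d))
        t -= d
        state = prev
    return segments[::-1]


# Hypothetical toy model: two gesture labels, two prosody labels.
states = ["rest", "beat"]
log_init = {"rest": math.log(0.5), "beat": math.log(0.5)}
log_trans = {"rest": {"beat": 0.0}, "beat": {"rest": 0.0}}
log_dur = {s: [math.log(1 / 3)] * 3 for s in states}  # uniform over 1..3 frames
log_emit = {
    "rest": {"low": math.log(0.9), "high": math.log(0.1)},
    "beat": {"low": math.log(0.1), "high": math.log(0.9)},
}
prosody = ["low", "low", "low", "high", "high", "low"]
segments = hsmm_viterbi(prosody, states, log_init, log_trans, log_dur, log_emit)
```

Each decoded pair carries a label and a duration, which is the information the animation stage would then use to select a matching gesture phrase from the pool.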
English and Italian encoders were asked to communicate two-dimensional shapes to decoders of their own culture, with and without the use of hand gestures, for materials of high and low verbal codability. The decoders drew what they thought the shapes were and these were rated by English and Italian judges, for similarity to the originals. Higher accuracy scores were obtained by both the English and the Italians, when gestures were allowed, for materials of both high and low codability; but the effect of using gestures was greater for materials of low codability. Improvement in performance when gestures were allowed was greater for the Italians than for the English for both levels of codability. An analysis of the recorded verbal utterances has shown that the detriment in communication accuracy with the elimination of gestures cannot be attributed to disruption of speech performance; rather, changes in speech content occur indicating an increased reliance on verbal means of conveying spatial information. Nevertheless, gestures convey this kind of semantic information more accurately and evidence is provided for the gestures of the Italians communicating this information more effectively than those of the English.