Conference Paper

Human Gesture Recognition with a Flow-based Model for Human Robot Interaction


Abstract

Human skeleton-based gesture classification plays a dominant role in social robotics. Learning the variety of human skeleton-based gestures can help the robot continuously interact in an appropriate manner in natural human-robot interaction (HRI). In this paper, we propose a flow-based model to classify human gesture actions from skeletal data. Instead of requiring a retrained model to infer new human skeleton actions from noisy data, our end-to-end model can expand the set of gesture labels from noisy data without retraining. Initially, our model focuses on detecting five human gesture actions (i.e., come on, right up, left up, hug, and noise-random action). The accuracy of our online human gesture recognition system matches that of the offline one, and both attain 100% accuracy on the first four actions. Our proposed method also infers new human gesture actions efficiently without retraining, achieving about 90% accuracy on the noise-random action. The gesture recognition system has been applied to drive the robot's reactions to human gestures, which promises to facilitate natural human-robot interaction.
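To make the flow-based classification idea concrete, below is a minimal sketch (not the authors' implementation) of a class-conditional coupling flow that scores a skeleton feature vector under each gesture class and reports the noise-random action when no class explains the input well. The feature dimension, network sizes, number of layers, and likelihood threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling layer: transforms half the features conditioned on the other half."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, (dim - self.half) * 2))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                               # keep scales bounded for stability
        z2 = x2 * torch.exp(s) + t
        return torch.cat([x1, z2], dim=-1), s.sum(dim=-1)

class ConditionalFlow(nn.Module):
    """One small flow per gesture class, so log p(x | class) is tractable."""
    def __init__(self, dim, n_classes, n_layers=4):
        super().__init__()
        self.flows = nn.ModuleList(
            nn.ModuleList(AffineCoupling(dim) for _ in range(n_layers))
            for _ in range(n_classes))
        self.base = torch.distributions.Normal(0.0, 1.0)  # factorized latent prior

    def log_prob(self, x, c):
        z, log_det = x, torch.zeros(x.shape[0])
        for layer in self.flows[c]:
            z, ld = layer(z)
            z = z.flip(dims=[-1])                        # alternate which half is transformed
            log_det = log_det + ld
        return self.base.log_prob(z).sum(dim=-1) + log_det

def classify(flow, x, n_classes, noise_threshold=-200.0):
    """Pick the best-scoring class; inputs no class explains well get the noise label."""
    scores = torch.stack([flow.log_prob(x, c) for c in range(n_classes)], dim=-1)
    best = scores.max(dim=-1)
    noise_label = torch.full_like(best.indices, n_classes)
    return torch.where(best.values > noise_threshold, best.indices, noise_label)

# Each class-conditional flow would be trained by maximizing flow.log_prob(x, c)
# on that class's skeleton samples (exact log-likelihood, no retraining needed
# to reject out-of-distribution inputs as noise).
```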

References
Conference Paper
Human gestures occur spontaneously and are usually aligned with speech, which leads to a natural and expressive interaction. Speech-driven gesture generation is important in order to enable a social robot to exhibit social cues and conduct a successful human-robot interaction. In this paper, the generation process involves mapping an acoustic speech representation to the corresponding gestures for a humanoid robot. The paper proposes a new GAN (Generative Adversarial Network) architecture for speech-to-gesture generation. Instead of a fixed mapping from one speech input to one gesture pattern, our end-to-end GAN structure can generate multiple mapped gesture patterns from one speech input (with multiple noise samples), just as humans do. The generated gestures can be applied to social robots with arms. The evaluation result shows the effectiveness of our generative model for speech-driven robot gesture generation.
Chapter
Interaction plays a critical role in skills learning for natural communication. In human-robot interaction (HRI), robots can get feedback during the interaction to improve their social abilities. In this context, we propose an interactive robot learning framework using multimodal data from thermal facial images and human gait data for online emotion recognition. We also propose a new decision-level fusion method for the multimodal classification using a Random Forest (RF) model. Our hybrid online emotion recognition model focuses on the detection of four human emotions (i.e., neutral, happiness, anger, and sadness). After conducting offline training and testing with the hybrid model, the accuracy of the online emotion recognition system is more than 10% lower than the offline one. In order to improve our system, human verbal feedback is injected into the interactive robot learning. With the new online emotion recognition system, a 12.5% accuracy increase is obtained compared with the online system without interactive robot learning.
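As an illustration of decision-level fusion with Random Forests (the entry above does not spell out the exact fusion rule, so this is a hedged sketch under assumed feature layouts): one forest per modality produces class probabilities, and a third forest is fit on the concatenated probability vectors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_fusion(face_X, gait_X, y):
    """face_X: thermal-face features, gait_X: gait features, y: emotion labels."""
    face_rf = RandomForestClassifier(n_estimators=100).fit(face_X, y)
    gait_rf = RandomForestClassifier(n_estimators=100).fit(gait_X, y)
    # Decision-level fusion: a third forest learns from the per-modality class
    # probabilities (ideally computed on held-out data to avoid overconfidence).
    fused = np.hstack([face_rf.predict_proba(face_X),
                       gait_rf.predict_proba(gait_X)])
    fusion_rf = RandomForestClassifier(n_estimators=100).fit(fused, y)
    return face_rf, gait_rf, fusion_rf

def predict_emotion(models, face_x, gait_x):
    face_rf, gait_rf, fusion_rf = models
    fused = np.hstack([face_rf.predict_proba(face_x),
                       gait_rf.predict_proba(gait_x)])
    return fusion_rf.predict(fused)   # neutral, happiness, anger, or sadness
```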
Article
Gesturing is an important modality in human–robot interaction. To date, gestures have often been implemented for a specific robot configuration and are therefore not easily transferable to other robots. To cope with this issue, we presented a generic method to calculate gestures for social robots. The method was designed to work in two modes to allow the calculation of different types of gestures. In this paper, we present the new developments of the method. We discuss how the two working modes can be combined to generate blended emotional expressions and deictic gestures. In certain situations, it is desirable to express an emotional condition through an ongoing functional behavior. Therefore, we implemented the possibility of modulating a pointing or reaching gesture into an affective gesture by influencing the motion speed and amplitude of the posture. The new implementations were validated on virtual models with different configurations, including those of the robots NAO and Justin.
Article
We propose a deep learning framework for modeling complex high-dimensional densities via Non-linear Independent Component Estimation (NICE). It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. We parametrize this transformation so that computing the determinant of the Jacobian and inverse Jacobian is trivial, yet we maintain the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood, which is tractable, and unbiased ancestral sampling is also easy. We show that this approach yields good generative models on four image datasets and can be used for inpainting.
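A minimal sketch of the NICE idea described above, under the assumption of a standard-normal factorized prior: additive coupling layers have a unit Jacobian determinant, so the exact log-likelihood is the prior log-density of the transformed data plus the log-determinant of a final diagonal scaling layer. Layer counts and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """NICE-style coupling: shift half the features by a function of the other half."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.m = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim - self.half))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        return torch.cat([x1, x2 + self.m(x1)], dim=-1)   # unit Jacobian determinant

class NICE(nn.Module):
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(AdditiveCoupling(dim) for _ in range(n_layers))
        self.log_scale = nn.Parameter(torch.zeros(dim))    # final diagonal scaling layer

    def log_prob(self, x):
        z = x
        for layer in self.layers:
            z = layer(z).flip(dims=[-1])                   # alternate the split between layers
        z = z * torch.exp(self.log_scale)
        prior = torch.distributions.Normal(0.0, 1.0)       # factorized prior
        return prior.log_prob(z).sum(dim=-1) + self.log_scale.sum()

# The training criterion is simply the negative exact log-likelihood:
#   loss = -model.log_prob(batch).mean()
```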
Article
Action recognition has become a very important topic in computer vision, with many fundamental applications in robotics, video surveillance, human-computer interaction, and multimedia retrieval, among others, and a large variety of approaches have been described. The purpose of this survey is to give an overview and categorization of the approaches used. We concentrate on approaches that aim at the classification of full-body motions, such as kicking, punching, and waving, and we categorize them according to how they represent the spatial and temporal structure of actions, how they segment actions from an input stream of visual data, and how they learn a view-invariant representation of actions.
Thesis
Having a natural interaction makes a significant difference in a successful human-robot interaction (HRI). Natural HRI refers to both human multimodal behavior understanding and robot verbal or non-verbal behavior generation. Humans can naturally communicate through spoken dialogue and non-verbal behaviors. Hence, a robot should perceive and understand human behaviors so as to be capable of producing natural multimodal and spontaneous behavior that matches the social context. In this thesis, we explore human behavior understanding and robot behavior generation for natural HRI. This includes multimodal human emotion recognition with visual information extracted from RGB-D and thermal cameras, and non-verbal multimodal robot behavior synthesis.

Emotion recognition based on multimodal human behaviors during HRI can help robots understand user states and exhibit a natural social interaction. In this thesis, we explored multimodal emotion recognition with thermal facial information and 3D gait data in the HRI scene, where the emotion cues from thermal face and gait data are difficult to disguise. A multimodal database with thermal face images and 3D gait data was built through the HRI experiments. We tested various unimodal emotion classifiers (i.e., CNN, HMM, Random Forest model, SVM) and one decision-based hybrid emotion classifier on the database for offline emotion recognition. We also explored an online emotion recognition system with limited capability in the real-time HRI setting. Interaction plays a critical role in skills learning for natural communication, and robots can get feedback during the interaction to improve their social abilities in HRI. To improve our online emotion recognition system, we developed an interactive robot learning (IRL) model with the human in the loop. The IRL model can apply human verbal feedback to label or relabel the data for retraining the emotion recognition model in a long-term interaction situation. After using the interactive robot learning model, the robot could obtain better emotion recognition accuracy in real-time HRI.

Human non-verbal behaviors such as gestures and facial actions occur spontaneously with speech, which leads to a natural and expressive interaction. Speech-driven gesture and facial action generation are vital to enable a social robot to exhibit social cues and conduct a successful HRI. This thesis proposes a new temporal GAN (Generative Adversarial Network) architecture for a one-to-many mapping from an acoustic speech representation to the humanoid robot's corresponding gestures. We also developed an audio-visual database to train the speaking gesture generation model. The database includes speech audio data extracted directly from the videos and the associated 3D human pose data extracted from 2D RGB images. The generated gestures from the trained co-speech gesture synthesizer can be applied to social robots with arms. The evaluation result shows the effectiveness of our generative model for speech-driven robot gesture generation. Moreover, we developed an effective speech-driven facial action synthesizer based on GAN, i.e., given an acoustic speech signal, a synchronous and realistic 3D facial action sequence is generated. A mapping from the 3D human facial actions to real robot facial actions that regulate the Zeno robot's facial expression is completed. The application of co-speech non-verbal robot behavior (gesture and facial action) synthesis for the social robot can make for a friendly and natural human-robot interaction.
Article
Recent advances in device-free wireless sensing have created the emerging technique of device-free human gesture recognition (DFHGR), which recognizes human gestures by analyzing their shadowing effect on surrounding wireless signals. DFHGR has many potential applications in human-machine interaction, smart homes, intelligent spaces, etc. State-of-the-art work has achieved satisfactory recognition accuracy when there is a sufficient number of training samples. However, collecting samples is time-consuming and labor-intensive, so realizing DFHGR with a small training sample set has become an urgent problem to solve. Motivated by the excellent ability of generative adversarial networks to synthesize samples, in this paper we explore and exploit the idea of leveraging them for virtual sample augmentation. Specifically, we first design a single-scenario network with a new architecture and a better-designed loss function to generate virtual samples from a small number of real samples. Then, we further develop a scenario-transferring network that generates virtual samples by utilizing real samples not only from the current scenario but also from another available scenario, which improves the quality of the synthesized samples with the extra knowledge learned from the other scenario. We designed an mmWave-based DFHGR testbed to test the proposed networks; extensive experimental results demonstrate that the augmented virtual samples are of high quality and help DFHGR systems achieve better accuracy.
Conference Paper
Human emotion detection is an important aspect of social robotics and HRI. In this paper, we propose a vision-based multimodal emotion recognition method based on gait data and facial thermal images designed for social robots. Our method can detect four human emotional states (i.e., neutral, happiness, anger, and sadness). We gathered data from 25 participants in order to build up an emotion database for training and testing our classification models. We implemented and tested several approaches, such as Convolutional Neural Network (CNN), Hidden Markov Model (HMM), Support Vector Machine (SVM), and Random Forest (RF), which were trained and tested in order to compare their emotion recognition ability and to find the best approach. We designed a hybrid model using both the gait and the thermal data; on our emotion database, its accuracy shows an improvement of 10% over the other models. This will be explored in a real-time HRI scenario.
Article
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that using a PAF-only refinement is able to achieve a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
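For readers who want to feed such 2D poses into a gesture classifier, the following hedged sketch converts a per-frame OpenPose JSON file into a normalized skeleton feature vector. The JSON layout (a "people" list holding a flat "pose_keypoints_2d" array of x, y, confidence triples, as written by the --write_json option) and the BODY_25 joint indices are assumptions to verify against your OpenPose version.

```python
import json
import numpy as np

def load_skeleton(json_path, n_keypoints=25):
    """Return a normalized, flattened 2D skeleton from one OpenPose JSON frame."""
    with open(json_path) as f:
        frame = json.load(f)
    if not frame.get("people"):
        return None                                        # no person detected
    kp = np.array(frame["people"][0]["pose_keypoints_2d"], dtype=np.float32)
    kp = kp.reshape(n_keypoints, 3)                         # (x, y, confidence) per joint
    xy, conf = kp[:, :2], kp[:, 2]
    # Normalize: center on the mid-hip joint and scale by the torso length so the
    # features are invariant to the person's position and distance to the camera.
    center = xy[8] if conf[8] > 0 else xy.mean(axis=0)      # 8 = MidHip in BODY_25 (assumed)
    xy = xy - center
    scale = np.linalg.norm(xy[1]) or 1.0                    # 1 = Neck in BODY_25 (assumed)
    return (xy / scale).flatten()                           # 2 * n_keypoints feature vector
```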
Article
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
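The adversarial training described above can be summarized in a short, illustrative PyTorch loop (architectures and hyperparameters are arbitrary placeholders, not tied to any paper in this list): the discriminator is pushed toward D(real) = 1 and D(G(z)) = 0, while the generator is updated so that D mistakes its samples for real data.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(real):
    n = real.shape[0]
    # Discriminator step: real samples toward label 1, generated samples toward 0.
    fake = G(torch.randn(n, latent_dim)).detach()
    loss_d = bce(D(real), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: make the discriminator label generated samples as real.
    loss_g = bce(D(G(torch.randn(n, latent_dim))), torch.ones(n, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```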
Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
George Papamakarios, Eric T Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2021. Normalizing Flows for Probabilistic Modeling and Inference. J. Mach. Learn. Res. 22, 57 (2021), 1-64.
Eunil Park, Ki Joon Kim, and Angel P del Pobil. 2011. The effects of robot's body gesture and gender in human-robot interaction. Human-Computer Interaction 6 (2011), 91-96.
Danilo Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In International conference on machine learning. PMLR, 1530-1538.
Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
Uttaran Bhattacharya, Nicholas Rewkowski, Abhishek Banerjee, Pooja Guhan, Aniket Bera, and Dinesh Manocha. 2021. Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents.