EUROGRAPHICS 2023
A. Bousseau and C. Theobalt
(Guest Editors)
COMPUTER GRAPHICS forum
Volume 42 (2023), Number 2
STAR State of The Art Report
A Comprehensive Review of
Data-Driven Co-Speech Gesture Generation
S. Nyatsanga¹, T. Kucherenko², C. Ahuja³, G. E. Henter⁴, M. Neff¹
1University of California, Davis, USA
2SEED - Electronic Arts, Stockholm, Sweden
3Meta AI, USA
4Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
Figure 1: Co-speech gesture generation approaches can be divided into rule-based and data-driven. Rule-based systems use carefully
designed heuristics to associate speech with gesture (Section 4). Data-driven approaches associate speech and gesture through statistical
modeling (Section 5.2), or by learning multimodal representations using deep generative models (Section 5.3). The main input modalities
are speech audio in an intermediate representation; text transcript of speech; humanoid pose in joint position or angle form; and control
parameters for motion design intent. Virtual agents and social robotics are the main research applications, although the approaches are
also applicable to games and film VFX.
Abstract
Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic
generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling
technology for creating believable characters in film, games, and virtual social spaces, as well as for interaction with social
robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion,
and by the great diversity of communicative functions that gestures encompass. The field of gesture generation has seen
surging interest in the last few years, owing to the emergence of more and larger datasets of human gesture motion, combined
with strides in deep-learning-based generative models that benefit from the growing availability of data. This review article
summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate
the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical
statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an
organizing principle, examining systems that generate gestures from audio, text and non-linguistic input. Concurrent with the
exposition of deep learning approaches, we chronicle the evolution of the related training data sets in terms of size, diversity,
motion quality, and collection method (e.g., optical motion capture or pose estimation from video). Finally, we identify key
research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the
gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation;
and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges,
as well as the limitations of these approaches, and point toward areas of future development.
Keywords: co-speech gestures, gesture generation, deep learning, virtual agents, social robotics
CCS Concepts
Computing methodologies → Animation; Machine learning; Human-centered computing → Human computer interaction (HCI);
© 2023 Eurographics - The European Association
for Computer Graphics and John Wiley & Sons Ltd.
DOI: 10.1111/cgf.14776
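To make the input and output modalities enumerated in Figure 1 and the abstract concrete, the following Python sketch defines minimal containers for them. The field names, shapes, and the generate_gestures signature are illustrative assumptions, not an interface taken from any surveyed system.

from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class SpeechAudio:
    """Speech audio in an intermediate representation, e.g. a mel spectrogram."""
    mel_spectrogram: np.ndarray  # shape (n_frames, n_mels)
    frame_rate_hz: float         # spectrogram frames per second

@dataclass
class Transcript:
    """Time-aligned text transcript of the speech."""
    words: List[str]
    word_onsets_s: List[float]   # one onset per word, in seconds

@dataclass
class PoseSequence:
    """Humanoid pose in joint-angle (or joint-position) form."""
    joint_rotations: np.ndarray  # shape (n_frames, n_joints, 3)
    fps: float

@dataclass
class ControlParams:
    """Optional control parameters expressing motion-design intent."""
    style_label: Optional[str] = None      # e.g. "calm" or "energetic"
    gesture_velocity: Optional[float] = None

def generate_gestures(audio: SpeechAudio,
                      text: Optional[Transcript] = None,
                      control: Optional[ControlParams] = None) -> PoseSequence:
    """Placeholder signature for a data-driven gesture generator mapping speech
    (and optional text/control inputs) to a pose sequence."""
    raise NotImplementedError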
... This richness makes gesture selection and animation a fundamental challenge in embodied agent research. To solve this challenge, researchers have focused on various automated approaches to select and generate virtual human gestures, often relying on data-driven or analysis-driven approaches [for an overview, see 34,37]. ...
... Work in this area relies on, for example, deep learning techniques. A recent review of this deep learning work [34] has identified several key limitations, including a lack of designer control over the performance and the limited ability of current approaches to realize semantically meaningful gestures such as metaphoric gestures. Recent work has become more focused on the use of Transformer-based and diffusion-based generative models for the selection of gestures. ...
Preprint
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions. These gestures significantly influence the addressee's engagement, recall, comprehension, and attitudes toward the speaker. Similarly, they impact interactions between humans and embodied virtual agents. The process of selecting and animating meaningful gestures has thus become a key focus in the design of these agents. However, automating this gesture selection process poses a significant challenge. Prior gesture generation techniques have varied from fully automated, data-driven methods, which often struggle to produce contextually meaningful gestures, to more manual approaches that require specific gesture-crafting expertise, are time-consuming, and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures. We first describe how information on gestures is encoded into GPT-4. Then, we conduct a study to evaluate alternative prompting approaches for their ability to select meaningful, contextually relevant gestures and to align them appropriately with the co-speech utterance. Finally, we detail and demonstrate how this approach has been implemented within a virtual agent system, automating the selection and subsequent animation of the selected gestures for enhanced human-agent interactions.
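As a purely illustrative sketch of the kind of LLM prompting such a gesture-selection approach involves, the snippet below builds a selection prompt. The gesture inventory, prompt wording, and requested output format are hypothetical assumptions, not taken from the cited paper.

# Hypothetical sketch of prompting an LLM to select a co-speech gesture for an
# utterance. The gesture inventory and output format are illustrative only.

GESTURE_INVENTORY = {
    "container": "both hands form a bounded space, for 'holding' an abstract idea",
    "dismiss": "a lateral flick of the hand, for waving an idea away",
    "enumerate": "counting on the fingers, for listing items",
}

def build_selection_prompt(utterance: str) -> str:
    gesture_list = "\n".join(f"- {name}: {desc}" for name, desc in GESTURE_INVENTORY.items())
    return (
        "You select co-speech gestures for a virtual agent.\n"
        f"Available gestures:\n{gesture_list}\n\n"
        f'Utterance: "{utterance}"\n'
        "Reply with the most appropriate gesture name and the word it should "
        'align with, as JSON: {"gesture": ..., "anchor_word": ...}.'
    )

# The resulting string would be sent to the LLM; the parsed reply then schedules
# the chosen gesture clip on the anchor word.
print(build_selection_prompt("There are three reasons to set this idea aside."))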
... Data-driven approaches have gained popularity due to their ability to reduce manual efforts in designing rules, unlike rule-based methods. Nyatsanga et al. (2023) provide a comprehensive survey of these methods. Data-driven approaches are categorized into statistical and learning-based methods. ...
... Gestures in human communication significantly enhance semantic transmission, emotional expression, and conversational flow [1]. Co-speech gesture generation focuses on enabling virtual characters or robots to produce synchronized, natural gestures alongside speech, advancing human-computer interaction towards greater naturalness and intelligence [2] [3]. This is crucial in applications requiring enhanced realism and immersion, such as virtual agents [4] and humanoid robots [5], where natural and context-appropriate gestures greatly improve interaction quality. ...
Preprint
Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we constructed an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model, which then guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically-aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.
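In the simplest case, a "lightweight gesture label generation model" could be a small text classifier mapping an utterance to a semantic gesture label. The sketch below is a toy stand-in under that assumption; the label set and training pairs are invented for illustration and are not those of the preprint.

# Toy stand-in for a lightweight text-to-gesture-label model: a TF-IDF plus
# logistic-regression classifier trained on invented utterance/label pairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "we should push that whole idea away",
    "first, second, and third of all",
    "the concept holds everything together",
    "let me count the main points for you",
]
labels = ["dismiss", "enumerate", "container", "enumerate"]

label_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
label_model.fit(utterances, labels)

# Predict a semantic gesture label for a new utterance; the label would then
# condition a downstream gesture synthesis module.
print(label_model.predict(["there are several points to go through"]))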
... Digital human animation has become a foundational technology across various domains, including virtual avatars [32], human-computer interaction [42], and creative content generation [37]. Among these, audio-driven gesture synthesis, the generation of human-like gestures synchronized with input audio, plays a vital role in creating immersive and realistic virtual characters. ...
Preprint
Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation. Despite notable progress, existing methods often produce gestures that are coarse, lack expressiveness, and fail to fully align with audio semantics. To address these challenges, we propose ExGes, a novel retrieval-enhanced diffusion framework with three key designs: (1) a Motion Base Construction module, which builds a gesture library from the training dataset; (2) a Motion Retrieval Module, employing contrastive learning and momentum distillation for fine-grained reference pose retrieval; and (3) a Precision Control Module, integrating partial masking and stochastic masking to enable flexible and fine-grained control. Experimental evaluations on BEAT2 demonstrate that ExGes reduces Fréchet Gesture Distance by 6.2% and improves motion diversity by 5.3% over EMAGE, with user studies revealing a 71.3% preference for its naturalness and semantic relevance. Code will be released upon acceptance.
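As a rough illustration of retrieval over a motion library (not the ExGes implementation itself), the sketch below returns the reference clips whose embeddings are closest to an audio-segment embedding by cosine similarity, assuming both have already been projected into a shared space by contrastively trained encoders.

import numpy as np

def retrieve_reference_clips(query_embedding: np.ndarray,
                             library_embeddings: np.ndarray,
                             top_k: int = 3) -> np.ndarray:
    """Return indices of the top-k motion clips whose embeddings are most
    similar (by cosine similarity) to a query audio-segment embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    lib = library_embeddings / np.linalg.norm(library_embeddings, axis=1, keepdims=True)
    return np.argsort(-(lib @ q))[:top_k]

# Toy usage: a library of 100 clip embeddings and one audio query, both 64-D.
rng = np.random.default_rng(0)
library = rng.normal(size=(100, 64))
query = rng.normal(size=64)
print(retrieve_reference_clips(query, library))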
Article
Purpose: Human behavior recognition poses a pivotal challenge in intelligent computing and cybernetics, significantly impacting engineering and management systems. With the rapid advancement of autonomous systems and intelligent manufacturing, there is an increasing demand for precise and efficient human behavior recognition technologies. However, traditional methods often suffer from insufficient accuracy and limited generalization ability when dealing with complex and diverse human actions. Therefore, this study aims to enhance the precision of human behavior recognition by proposing an innovative framework, dynamic graph convolutional networks with multi-scale position attention (DGCN-MPA) to sup.
Design/methodology/approach: The primary applications are in autonomous systems and intelligent manufacturing. The main objective of this study is to develop an efficient human behavior recognition framework that leverages advanced techniques to improve the prediction and interpretation of human actions. This framework aims to address the shortcomings of existing methods in handling the complexity and variability of human actions, providing more reliable and precise solutions for practical applications. The proposed DGCN-MPA framework integrates the strengths of convolutional neural networks and graph-based models. It innovatively incorporates wavelet packet transform to extract time-frequency characteristics and an MPA module to enhance the representation of skeletal node positions. The core innovation lies in the fusion of dynamic graph convolution with hierarchical attention mechanisms, which selectively attend to relevant features and spatial relationships, adjusting their importance across scales to address the variability in human actions.
Findings: To validate the effectiveness of the DGCN-MPA framework, rigorous evaluations were conducted on benchmark datasets such as NTU RGB+D and Kinetics-Skeleton. The results demonstrate that the framework achieves an F1 score of 62.18% and an accuracy of 75.93% on NTU RGB+D, and an F1 score of 69.34% and an accuracy of 76.86% on Kinetics-Skeleton, outperforming existing models. These findings underscore the framework's capability to capture complex behavior patterns with high precision.
Originality/value: By introducing a dynamic graph convolutional approach combined with multi-scale position attention mechanisms, this study represents a significant advancement in human behavior recognition technologies. The innovative design and superior performance of the DGCN-MPA framework contribute to its potential for real-world applications, particularly in integrating behavior recognition into engineering and autonomous systems. In the future, this framework has the potential to further propel the development of intelligent computing, cybernetics and related fields.
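For readers unfamiliar with graph-based skeleton models, the sketch below shows a single standard graph-convolution step over skeleton joints, the basic building block that architectures such as DGCN-MPA extend with dynamic adjacency and multi-scale attention. The toy skeleton and feature sizes are assumptions for illustration.

import numpy as np

def skeleton_graph_conv(x: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One graph-convolution step over skeleton joints.

    x:      (n_joints, in_dim) per-joint features
    adj:    (n_joints, n_joints) adjacency matrix with self-loops
    weight: (in_dim, out_dim) learned projection
    """
    # Symmetric normalisation of the adjacency, as in standard GCNs.
    deg = adj.sum(axis=1)
    norm = adj / np.sqrt(np.outer(deg, deg))
    return np.maximum(norm @ x @ weight, 0.0)  # ReLU activation

# Toy skeleton with 3 joints in a chain: hip - spine - head.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)
x = np.random.default_rng(1).normal(size=(3, 4))
w = np.random.default_rng(2).normal(size=(4, 8))
print(skeleton_graph_conv(x, adj, w).shape)  # (3, 8)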
Article
Modeling virtual agents with behavior style is one factor for personalizing human-agent interaction. We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers including those unseen during training. Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database containing videos of various speakers. We view style as being pervasive; while speaking, style colors the expressivity of communicative behaviors, whereas speech content is carried by multimodal signals and text. This disentanglement scheme of content and style allows us to directly infer the style embedding even of a speaker whose data are not part of the training phase, without requiring any further training or fine-tuning. The first goal of our model is to generate the gestures of a source speaker based on the content of two input modalities, Mel spectrogram and text semantics. The second goal is to condition the source speaker's predicted gestures on the multimodal behavior style embedding of a target speaker. The third goal is to allow zero-shot style transfer of speakers unseen during training without re-training the model. Our system consists of two main components: (1) a speaker style encoder network that learns to generate a fixed-dimensional speaker style embedding from a target speaker's multimodal data (mel-spectrogram, pose, and text) and (2) a sequence-to-sequence synthesis network that synthesizes gestures based on the content of the input modalities (text and mel-spectrogram) of a source speaker and conditioned on the speaker style embedding. We show that our model is able to synthesize gestures of a source speaker given the two input modalities and to transfer the knowledge of target speaker style variability learned by the speaker style encoder to the gesture generation task in a zero-shot setup, indicating that the model has learned a high-quality speaker representation. We conduct objective and subjective evaluations to validate our approach and compare it with baselines.
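One common way to realize this kind of conditioning is to concatenate the fixed-dimensional style embedding with the content features at every frame before a sequence decoder predicts poses. The sketch below illustrates that generic pattern with toy dimensions; it is not the cited architecture.

import torch
import torch.nn as nn

class StyleConditionedDecoder(nn.Module):
    """Illustrative decoder: content features (e.g. audio + text) are concatenated
    with a fixed-dimensional style embedding at every frame before a recurrent
    layer predicts poses. Sizes are toy values, not those of the cited system."""

    def __init__(self, content_dim=64, style_dim=16, pose_dim=45, hidden=128):
        super().__init__()
        self.gru = nn.GRU(content_dim + style_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (batch, frames, content_dim); style: (batch, style_dim)
        style_per_frame = style.unsqueeze(1).expand(-1, content.shape[1], -1)
        h, _ = self.gru(torch.cat([content, style_per_frame], dim=-1))
        return self.out(h)  # (batch, frames, pose_dim)

# Toy usage: 2 sequences of 100 frames.
decoder = StyleConditionedDecoder()
poses = decoder(torch.randn(2, 100, 64), torch.randn(2, 16))
print(poses.shape)  # torch.Size([2, 100, 45])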
Article
We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip, even for motion styles unseen during training. Our model uses a variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. In a series of experiments, we first demonstrate the flexibility and generalizability of our model to new speakers and styles. In a user study, we then show that our model outperforms previous state-of-the-art techniques in naturalness of motion, appropriateness for speech, and style portrayal. Finally, we release a high-quality dataset of full-body gesture motion including fingers, with speech, spanning 19 different styles. Our code and data are publicly available at https://github.com/ubisoft/ubisoft-laforge-ZeroEGGS.
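The latent-space manipulation described here (blending and scaling style embeddings) amounts to simple vector arithmetic in the learned style space. The sketch below illustrates it with made-up embeddings rather than the released ZeroEGGS model.

import numpy as np

def blend_styles(style_a: np.ndarray, style_b: np.ndarray, alpha: float) -> np.ndarray:
    """Linear interpolation between two style embeddings (alpha=0 gives a, alpha=1 gives b)."""
    return (1.0 - alpha) * style_a + alpha * style_b

def scale_style(style: np.ndarray, neutral: np.ndarray, gain: float) -> np.ndarray:
    """Exaggerate (gain > 1) or attenuate (gain < 1) a style relative to a neutral embedding."""
    return neutral + gain * (style - neutral)

# Toy 32-D embeddings standing in for encoder outputs.
rng = np.random.default_rng(3)
calm, agitated, neutral = rng.normal(size=(3, 32))
half_way = blend_styles(calm, agitated, alpha=0.5)
exaggerated = scale_style(agitated, neutral, gain=1.5)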
Chapter
Achieving realistic, vivid, and human-like synthesized conversational gestures conditioned on multi-modal data is still an unsolved problem due to the lack of available datasets, models and standard evaluation metrics. To address this, we build the Body-Expression-Audio-Text dataset, BEAT, which has i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Based on this observation, we propose a baseline model, Cascaded Motion Network (CaMN), which consists of the above six modalities modeled in a cascaded architecture for gesture synthesis. To evaluate the semantic relevancy, we introduce a metric, Semantic Relevance Gesture Recall (SRGR). Qualitative and quantitative experiments demonstrate the metrics' validity, the quality of the ground-truth data, and the baseline's state-of-the-art performance. To the best of our knowledge, BEAT is the largest motion capture dataset for investigating human gestures, which may contribute to a number of different research fields, including controllable gesture synthesis, cross-modality analysis, and emotional gesture recognition. The data, code and model are available on https://pantomatrix.github.io/BEAT/.
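The cascaded idea can be illustrated with a toy three-stage fusion in which each stage consumes its own modality together with the running representation from earlier stages; CaMN cascades six modalities, and the layer types, ordering, and sizes below are illustrative assumptions only.

import torch
import torch.nn as nn

class CascadedFusion(nn.Module):
    """Toy cascade: text features are refined by audio, then by speaker identity.
    Each stage concatenates its own modality with the running representation."""

    def __init__(self, text_dim=32, audio_dim=64, id_dim=8, hidden=128, pose_dim=45):
        super().__init__()
        self.text_stage = nn.Linear(text_dim, hidden)
        self.audio_stage = nn.Linear(hidden + audio_dim, hidden)
        self.id_stage = nn.Linear(hidden + id_dim, hidden)
        self.pose_head = nn.Linear(hidden, pose_dim)

    def forward(self, text, audio, speaker_id):
        h = torch.relu(self.text_stage(text))
        h = torch.relu(self.audio_stage(torch.cat([h, audio], dim=-1)))
        h = torch.relu(self.id_stage(torch.cat([h, speaker_id], dim=-1)))
        return self.pose_head(h)

# Toy usage on a batch of 10 frames' worth of features.
model = CascadedFusion()
out = model(torch.randn(10, 32), torch.randn(10, 64), torch.randn(10, 8))
print(out.shape)  # torch.Size([10, 45])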
Article
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field. The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around -0.5. Based on the challenge results we formulate numerous recommendations for system building and evaluation.
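Since the Fréchet gesture distance (FGD) is singled out as the one objective metric that tracked human ratings, a sketch of how such a metric is typically computed may be useful: a Fréchet distance between Gaussians fitted to features of real and generated gestures, analogous to FID. The pretrained gesture feature extractor is assumed to exist and is not shown.

import numpy as np
from scipy.linalg import sqrtm

def frechet_gesture_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to real and generated gesture
    features (rows are feature vectors from a pretrained gesture encoder)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can give tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random stand-in features (500 samples, 32-D each).
rng = np.random.default_rng(4)
print(frechet_gesture_distance(rng.normal(size=(500, 32)),
                               rng.normal(loc=0.3, size=(500, 32))))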
Chapter
This book describes research in all aspects of the design, implementation, and evaluation of embodied conversational agents as well as details of specific working systems. Embodied conversational agents are computer-generated cartoonlike characters that demonstrate many of the same properties as humans in face-to-face conversation, including the ability to produce and respond to verbal and nonverbal communication. They constitute a type of (a) multimodal interface where the modalities are those natural to human conversation: speech, facial displays, hand gestures, and body stance; (b) software agent, insofar as they represent the computer in an interaction with a human or represent their human users in a computational environment (as avatars, for example); and (c) dialogue system where both verbal and nonverbal devices advance and regulate the dialogue between the user and the computer. With an embodied conversational agent, the visual dimension of interacting with an animated character on a screen plays an intrinsic role. Not just pretty pictures, the graphics display visual features of conversation in the same way that the face and hands do in face-to-face conversation among humans. Many of the chapters are written by multidisciplinary teams of psychologists, linguists, computer scientists, artists, and researchers in interface design. The authors include Elisabeth André, Norm Badler, Gene Ball, Justine Cassell, Elizabeth Churchill, James Lester, Dominic Massaro, Cliff Nass, Sharon Oviatt, Isabella Poggi, Jeff Rickel, and Greg Sanders.
Article
Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in artificial embodied agent creation. Previous systems mainly focus on generating gestures in an end-to-end manner, which leads to difficulties in mining the clear rhythm and semantics due to the complex yet subtle harmony between speech and gestures. We present a novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics. For the rhythm, our system contains a robust rhythm-based segmentation pipeline to ensure the temporal coherence between the vocalization and gestures explicitly. For the gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory. The high-level embedding corresponds to semantics, while the low-level embedding relates to subtle variations. Lastly, we build correspondence between the hierarchical embeddings of the speech and the motion, resulting in rhythm- and semantics-aware gesture synthesis. Evaluations with existing objective metrics, a newly proposed rhythmic metric, and human feedback show that our method outperforms state-of-the-art systems by a clear margin.
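A crude stand-in for the rhythm-based segmentation idea is to split the speech at detected audio onsets and pair each segment with one gesture phrase. The sketch below does this with librosa's onset detector; it is only an illustration, not the cited system's segmentation pipeline.

import librosa

def rhythm_segments(audio_path: str):
    """Split an utterance into segments at detected audio onsets, as a crude
    stand-in for rhythm-based speech/gesture segmentation."""
    y, sr = librosa.load(audio_path, sr=None)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    boundaries = [0.0, *onsets, len(y) / sr]
    return list(zip(boundaries[:-1], boundaries[1:]))  # (start, end) pairs in seconds

# Each (start, end) window could then be paired with one gesture phrase, e.g.:
# segments = rhythm_segments("speech.wav")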
Conference Paper
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. This year’s dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which previously was a major challenge in the field. The evaluation results are a revolution, and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings.