Gérard Bailly
  • Directeur de Recherches CNRS
  • Managing Director at GIPSA-lab

Cognitive Robotics; Expressive Audiovisual TTS for Conversational Avatars; Computer-Assisted Training of Reading Fluency

About

357
Publications
61,905
Reads
4,482
Citations
Introduction
Gérard Bailly is a specialist in speech communication. He is a senior CNRS researcher at GIPSA-Lab, Grenoble, France, and was deputy director of the lab (2008-12). He has supervised 35 PhD theses, authored more than 50 journal papers and 250 papers in major international conferences, and co-edited “Talking Machines: Theories, Models & Designs” (1992), “Improvements in Speech Synthesis” (2002) and “Audiovisual Speech Processing” (2012). His interest is multimodal interaction with conversational agents (incl. the humanoid robot iCub).
Current institution
GIPSA-lab
Current position
  • Managing Director
Additional affiliations
January 2015 - present
GIPSA-lab
Position
  • Managing Director
Description
  • CRISSP stands for Cognitive Robotics, Interactive Systems & Speech Processing (see http://www.gipsa-lab.grenoble-inp.fr/en/crissp.php)
January 1984 - June 1986
National Institute of Scientific Research
Position
  • Research Assistant
January 1998 - January 2015
GIPSA-lab
Position
  • Managing Director
Education
October 1998 - October 2000
Grenoble Institute of Technology
Field of study
  • Speech communication
September 1981 - December 1983
Grenoble Institute of Technology
Field of study
  • Speech & Signal Processing

Publications

Publications (357)
Article
Full-text available
The importance of gaze in human-robot interaction (HRI) is well documented, particularly for its contribution in regulating turn-taking and managing roles in conversations. However, few gaze models have yet been proposed for multi-party interactions, i.e. several people facing the robot. In this paper, we propose to build and evaluate a gaze contro...
Preprint
Full-text available
This paper presents a novel approach for the automatic generation of Cued Speech (ACSG), a visual communication system used by people with hearing impairment to better elicit the spoken language. We explore transfer learning strategies by leveraging a pre-trained audiovisual autoregressive text-to-speech model (AVTacotron2). This model is reprogram...
Article
Full-text available
We present THERADIA WoZ, an ecological corpus designed for audiovisual research on affect in healthcare. Two groups of senior individuals, consisting of 52 healthy participants and 9 individuals with Mild Cognitive Impairment (MCI), performed Computerised Cognitive Training (CCT) exercises while receiving support from a virtual assistant, tele-oper...
Chapter
Full-text available
This article presents a characterisation of formant trajectories based on the tracking of each resonance of the vocal tract. Thanks to an original method called nomograms with decoupled cavities, the optimal constriction locations of the area functions of ten prototypical French vowels are given, together with the main affiliations of each formant:...
Conference Paper
Full-text available
As autonomous interactive agents become increasingly prevalent, it is crucial for these virtual agents to understand and respond to both our verbal content and emotions, enabling deeper interactions. Despite significant advancements in the automatic recognition and understanding of human speech, challenges remain in accurately identifying and addre...
Article
Full-text available
The Blizzard Challenge has benchmarked progress in Text-to-Speech (TTS) since 2005. The Challenge has seen important milestones passed, with results suggesting that synthetic speech was indistinguishable from natural speech in terms of intelligibility in 2021 and that by that same year it was perhaps even indistinguishable in naturalness. The high...
Conference Paper
Full-text available
The aims of this study are: 1) to identify respiratory features that could serve as objective markers of fluency in reading aloud; 2) to investigate the effects of computer-assisted training on the improvement of speech-breathing coordination. Our training method combines the principles of repeated and assisted close-shadowed reading. Reading assistance take...
Presentation
Full-text available
Several training programs have been shown to significantly improve abilities in reading rate, prosody, and comprehension. Among them are repeated reading focusing on prosody (Calet et al., 2017), choral reading with an expert model (Chard, 2002), and visual cueing of syntactic boundaries (Levasseur, 2008). In this study, we used a reading karaoke a...
Poster
Full-text available
The ability to coordinate phonation and breathing is a key challenge for reading aloud. Young readers frequently struggle to properly place respiratory pauses and often end phonation in apnea (Godde 2022). The aim is to provide a game involving a repeated reading-while-listening task to improve sensory-motor planning (Godde, 2017). There are four readi...
Conference Paper
Full-text available
This study aims, on one hand, to identify respiratory indicators that could be considered as the hallmark of fluency improvement, and on the other hand, to examine the effects of computer-assisted reading training on the progression of respiratory/speech coordination. 66 students (grades 3-5) were divided into three groups according to the training...
Article
Full-text available
Background Impaired cognitive function is observed in many pathologies, including neurodegenerative diseases such as Alzheimer disease. At present, the pharmaceutical treatments available to counter cognitive decline have only modest effects, with significant side effects. A nonpharmacological treatment that has received considerable attention is c...
Conference Paper
Full-text available
We developed a web app for ascribing verbal descriptions to expressive audiovisual utterances. These descriptions are limited to lists of adjective tags that are either suggested via a navigation in emotional latent spaces built using discriminant analysis of BERT embeddings, or entered freely by participants. We show that such verbal descriptions...
Chapter
Given the importance of gaze in Human-Robot Interactions (HRI), many gaze control models have been developed. However, these models are mostly built for dyadic face-to-face interaction. Gaze control models for multiparty interaction are more scarce. We here propose and evaluate data-driven gaze control models for a robot game animator in a three-pa...
Chapter
The human gaze direction is the sum of the head and eye movements. The coordination of these two segments has been studied and models of the contribution of head movement to the gaze of virtual agents or robots have been proposed. However, these coordination models are mostly not trained nor evaluated in an interaction context, and may underestimat...
Preprint
BACKGROUND Impaired cognitive function is observed in many pathologies, including neurodegenerative diseases such as Alzheimer disease. At present, the pharmaceutical treatments available to counter cognitive decline have only modest effects, with significant side effects. A nonpharmacological treatment that has received considerable attention is c...
Chapter
This paper presents a study on different NLP solutions for French homographs disambiguation for text-to-speech systems. Solutions are compared using a home-made corpus of 8137 sentences extracted from the Web, comprising roughly one hundred instances of each of 34 pairs of prototypical words. A disambiguation system based on per-case Linear Discrim...
Article
Full-text available
We propose a computational framework for estimating multidimensional subjective ratings of the reading performance of young readers from speech-based objective measures. We combine linguistic features (number of correct words, repetitions, deletions, insertions uttered per minute, etc.) with prosodic features. Expressivity is particularly difficult...
Article
Full-text available
Pauses when reading aloud play an essential role in reading and listening comprehension (for a review: Godde et al., 2020). Among the various types of pauses, breathing pauses during oral reading are particularly important. Their placement, frequency and duration tell us about breath and voice coordination as well as articulatory planning. These sk...
Chapter
Full-text available
An emerging research trend associating social robotics and social-cognitive psychology offers preliminary evidence that the mere presence of humanoid robots may have the same effects as human presence on human performance, provided the robots are anthropomorphized to some extent (attribution of mental states to the robot being present). However, wh...
Conference Paper
Full-text available
Neural vocoders are systematically evaluated on homogeneous train and test databases. This kind of evaluation is efficient to compare neural vocoders in their "comfort zone", yet it hardly reveals their limits towards unseen data during training. To compare their extrapolation capabilities, we introduce a methodology that aims at quantifying the ro...
Chapter
Full-text available
Digital technology plays a key role in the transformation of medicine. Beyond the simple computerisation of healthcare systems, many non-drug treatments are now possible thanks to digital technology. Thus, interactive stimulation exercises can be offered to people suffering from cognitive disorders, such as developmental disorders, neurodegenerative diseases,...
Conference Paper
Full-text available
The FLUENCE project (e-Fran, PIA2) led to the development of three mobile applications designed to improve pupils' performance in reading (EVASION and ELARGIR) and in English listening comprehension (LUCIOLE). The applications were deployed in classrooms of the Grenoble school district (France) with 722 pupils followed from grade 1 (CP) to grade 3 (CE2)...
Article
Full-text available
The present work reviews the current knowledge of the development of reading prosody, or reading aloud with expression, in young children. Prosody comprises the variables of timing, phrasing, emphasis and intonation that speakers use to convey meaning. We detail the subjective rating scales proposed as a means of assessing performance in young read...
Article
Full-text available
Échelle Multi-Dimensionnelle de Fluence : A new tool to assess reading fluency including prosody in French, calibrated for grade 2 to 5 Children’s reading fluency is usually assessed in classrooms using the instruction: ‘‘read as fast as you can.’’ This instruction tends to perpetuate the confusion between fluency and speed. However, reading fast i...
Poster
Full-text available
Pausing when reading aloud is essential to comprehension of both listeners and readers. This skill evolves from the early stage of reading acquisition to reading expertise. The placement and duration of respiratory pauses tell us about the breath-voice coordination and so the planning when reading aloud. In a developmental perspective, this study a...
Conference Paper
Full-text available
Reading prosody is a skill developing from the early reading acquisition to the end of education. Prosody development has an impact on reading comprehension. The RAKE app is a reading karaoke used to perform an audiovisual enhanced reading-while-listening task. In this study we used RAKE in a 10-week training program with grade 3 to grade 5 pupils...
Poster
Full-text available
Reading prosody is a skill developing from the early reading acquisition to the end of education. Prosody development has an impact on reading comprehension. The RAKE app is a reading karaoke used to perform an audio-visual enhanced reading-while-listening task. In this study we used RAKE in a 10-week training program with grade 3 to grade 5 pupil...
Preprint
Full-text available
How can we learn, transfer and extract handwriting styles using deep neural networks? This paper explores these questions using a deep conditioned autoencoder on the IRON-OFF handwriting data-set. We perform three experiments that systematically explore the quality of our style extraction procedure. First, we compare our model to handwriting benchm...
Conference Paper
Full-text available
Prosody in speech is used to communicate a variety of linguistic, paralinguistic and non-linguistic information via multiparametric contours. The Superposition of Functional Contours (SFC) model is capable of extracting the average shape of these elementary contours through iterative analysis-by-synthesis training of neural network contour generato...
Chapter
This volume presents a state-of-the-art of current research on the role of eye gaze in different types of interaction, including human-human and human-computer interaction. Approaching the phenomenon from different disciplinary and methodological angles, the chapters in the volume are united through a shared technological approach, viz. the use of...
Conference Paper
Full-text available
We consider the problem of learning to localize a speech source using a humanoid robot equipped with a binaural hearing system. We aim to map binaural audio features into the relative angle between the robot’s head direction and the target source direction based on a sensorimotor training framework. To this end, we make the following contributions:...
Preprint
Full-text available
Evaluating the style of handwriting generation is a challenging problem, since it is not well defined. It is a key component in developing systems with more personalized experiences with humans. In this paper, we propose baseline benchmarks, in order to set anchors to estimate the relative quality of different handwriting style...
Conference Paper
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Conference Paper
Full-text available
This paper presents a new teleoperation system – called stereo gaze-contingent steering (SGCS) – able to seamlessly control the vergence, yaw and pitch of the eyes of a humanoid robot – here an iCub robot – from the actual gaze direction of a remote pilot. The video stream captured by the cameras embedded in the mobile eyes of the iCub are fed into...
Preprint
Full-text available
The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques for training model-free, end-...
Preprint
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Conference Paper
Full-text available
The Superposition of Functional Contours (SFC) prosody model decomposes the intonation and duration contours into elementary contours that encode specific linguistic functions. It can be used to extract these functional contours at multiple linguistic levels. The PySFC system, which incorporates the SFC, can thus be used to analyse the significance...
Article
We speak to express ourselves. Sometimes words can capture what we mean; sometimes we mean more than can be said. This is where our visible gestures - those dynamic oscillations of our gaze, face, head, hand, arms and bodies – help. Not only do these co-verbal visual signals help express our intentions, attitudes and emotion, they also help us enga...
Article
Full-text available
An important problem in computer animation of virtual characters is the expression of complex mental states during conversation using the coordinated prosody of voice, rhythm, facial expressions, and head and gaze motion. In this work, the authors propose an expressive conversion method for generating natural speech and facial animation in a variet...
Article
Human interactions are driven by multi-level perception-action loops. Interactive behavioral models are typically built using rule-based methods or statistical approaches such as Hidden Markov Model (HMM), Dynamic Bayesian Network (DBN), etc. In this paper, we present the multimodal interactive data and our behavioral model based on recurrent neura...
Poster
Full-text available
Telepresence refers to a set of tools that allows a person to be “present” in a distant environment, by a sufficiently realistic representation of it through a set of multimodal stimuli experienced by the distant devices via its sensors. Immersive Telepresence follows this trend and, thanks to the capabilities given by virtual reality devices, repl...
Article
In this work we explore the capability of audiovisual prosodic features (such as fundamental frequency, head motion or facial expressions) to discriminate among different dramatic attitudes. We extract the audiovisual parameters from an acted corpus of attitudes and structure them as frame, syllable and sentence-level features. Using Linear Discrim...
Article
Reading while listening to texts (RWL) is a promising way to improve the learning benefits provided by a reading experience. In an exploratory study, we investigated the effect of synchronizing the highlighting of words (visual) with their auditory (speech) counterpart during a RWL task. Forty French children from 3rd to 5th grade read short storie...
Conference Paper
Full-text available
Robotics is a reasonably mature technology when robots are restricted to operating with well-known and well-engineered environments, e.g. in manufacturing robotics or domestic applications such as vacuum cleaning or lawn mowing. For more diverse tasks and open-ended environments, robotic behaviours are mainly hand-tuned: for most of the one million...
Conference Paper
Full-text available
Socially assistive robots with interactive behavioral capabilities have been improving quality of life for a wide range of users by taking care of the elderly, training individuals with cognitive disabilities, assisting physical rehabilitation, etc. While the interactive behavioral policies of most systems are scripted, we discuss here key features of a new met...
Conference Paper
Full-text available
Incremental text-to-speech systems aim at synthesizing a text 'on-the-fly', while the user is typing a sentence. In this context, this article addresses the problem of the part-of-speech tagging (POS, i.e. lexical category) which is a critical step for accurate grapheme-to-phoneme conversion and prosody estimation. Here, the main challenge is to es...
Conference Paper
Full-text available
Several socially assistive robot (SAR) systems have been proposed and designed to engage people into various interactive exercises such as physical training [1], neuropsychological rehabilitation [2] or cognitive assistance [3]. While the interactive behavioral policies of most systems are scripted, we discuss here key features of a new methodology...
Article
This article investigates the use of statistical mapping techniques for the conversion of articulatory movements into audible speech with no restriction on the vocabulary, in the context of a silent speech interface driven by ultrasound and video imaging. As a baseline, we first evaluated the GMM-based mapping considering dynamic features, proposed...
Article
The goal of this paper is to model the coverbal behavior of a subject involved in face-to-face social interactions. To this end, we present a multimodal behavioral model based on a Dynamic Bayesian Network (DBN). The model was inferred from multimodal data of interacting dyads in a specific scenario designed to foster mutual attention and multimod...
Book
Full-text available
This volume brings together the study and research work carried out as part of the course «Cognition, Affects et Interaction» that we ran during the first semester of 2015-2016. This second edition of the course continues the principle inaugurated in 2014: alongside the lectures given on the theme "Cognition, Interaction & Affects", which provide the...
Article
Full-text available
This paper addresses the adaptation of an acoustic-articulatory model of a reference speaker to the voice of another speaker, using a limited amount of audio-only data. In the context of pronunciation training, a virtual talking head displaying the internal speech articulators (e.g., the tongue) could be automatically animated by means of such a mo...
Article
Recent developments in human-robot interaction show how the ability to communicate with people in a natural way is of great importance for artificial agents. The implementation of facial expressions has been found to significantly increase the interaction capabilities of humanoid robots. For speech, displaying a correct articulation with sound is m...
Conference Paper
Full-text available
The focus of this study is the generation of expressive audiovisual speech from neutral utterances for 3D virtual actors. Taking into account the segmental and suprasegmental aspects of audiovisual speech, we propose and compare several computational frameworks for the generation of expressive speech and face animation. We notably evaluate a standa...
Conference Paper
Full-text available
This article reports the use of a karaoke technique to drive the visual attention span (VAS) of subjects reading a text while listening to the text spelled aloud by a reading tutor. We tested the impact of computer-assisted synchronous reading (S+) that emphasizes words when they are uttered, vs. non-synchronous reading (S-) in a reading while list...
Conference Paper
Full-text available
Incremental speech synthesis aims at delivering the synthetic voice while the sentence is still being typed. One of the main challenges is the online estimation of the target prosody from a partial knowledge of the sentence's syntactic structure. In the context of HMM-based speech synthesis, this typically results in missing segmental and suprasegm...
Article
Full-text available
The aim of this paper is to model multimodal perception-action loops of human behavior in face-to-face interactions. To this end, we propose trainable behavioral models that predict the optimal actions for one specific person given others’ perceived actions and the joint goals of the interlocutors. We first compare sequential models—in particular d...
Article
Full-text available
We here propose to use immersive teleoperation of a humanoid robot by a human pilot for artificially providing the robot with social skills. This so-called beaming approach of learning by demonstration (the robot passively experiences social behaviors that can be further modeled and used for autonomous control) offers a unique way to study embodied...
Conference Paper
Full-text available
Recent developments in human-robot interaction show how the ability to communicate with people in a natural way is of great importance for artificial agents. The implementation of facial expressions has been found to significantly increase the interaction capabilities of humanoid robots. For speech, displaying a correct articulation with sound is m...
Conference Paper
Full-text available
The purpose of this work is to evaluate the contribution of audio-visual prosody to the perception of complex mental states of virtual actors. We propose that global audio-visual prosodic contours - i.e. melody, rhythm and head movements over the utterance - constitute discriminant features for both the generation and recognition of social attitude...
Conference Paper
Full-text available
Modeling multimodal perception-action loops in face-to-face interactions is a crucial step in the process of building sensory-motor behaviors for social robots or users-aware Embodied Conversational Agents (ECA). In this paper, we compare trainable behavioral models based on sequential models (HMMs) and classifiers (SVMs and Decision Trees) inherent...
Article
Full-text available
Contents: «La Singularité technologique» (The technological singularity) by Romain Astouric & Henri Aribert-Desjardins, p. 1; «Ethique, responsabilité et statut juridique du robot compagnon: revue et perspectives» (Ethics, responsibility and legal status of the companion robot: review and perspectives) by Anne Boulange & Carole Jaggie, p. 13; «Robot compagnon et fiction» (Companion robot and fiction) by Yannick Bourrier &...
Article
Full-text available
This paper focuses on the study of the convergence between characteristics of speech segments- i.e. spectral characteristics of speech sounds - during live interactions between speaking dyads. The interaction data has been collected using an original verbal game called 'verbal dominoes' that provides a dense sampling of the acoustic spaces of the i...

Questions

Questions (2)
Question
We recently bought an iCub2 with enhanced communication abilities (see Nina below). We are currently working on visual attention and trying to characterize how human observers perceive the robot's gaze direction. We were surprised! The morphology of robotic eyes, with no deformation of the eyelids and palpebral commissure, strongly biases the estimation of eye direction as soon as the gaze is averted.
Are you aware of any study on robots similar to what Samer Al Moubayed and colleagues at KTH have done with Furhat?
Thank you in advance for your help!
Question
Automatic dictation challenges text-to-speech synthesis in several aspects: pausing should allow trainees to comfortably write down the text (taking into account orthographic, lexical, morpho-syntactic difficulties, etc.). The prosody of dictation is also very particular: clear articulation and ample prosodic patterns should highlight grammatical structure, etc. I would be pleased to get references and comments.

Network

Cited By