Gérard Bailly
GIPSA-lab · Speech & Cognition dpt.

Directeur de Recherches CNRS
Computer-assisted training of reading fluency for young readers; expressive audiovisual TTS for conversational avatars

About

330
Publications
53,177
Reads
4,150
Citations
Introduction
Gérard Bailly is a specialist in speech communication. He is a senior CNRS researcher at GIPSA-lab, Grenoble, France, and was deputy director of the lab (2008-12). He has supervised 35 PhD theses, authored over 50 journal papers and 250 papers at major international conferences, and co-edited “Talking Machines: Theories, Models & Designs” (1992), “Improvements in Speech Synthesis” (2002) and “Audiovisual Speech Processing” (2012). His research interest is multimodal interaction with conversational agents (including the humanoid robot iCub).
Additional affiliations
January 2015 - present
GIPSA-lab
Position
  • Managing Director
Description
  • CRISSP stands for Cognitive Robotics, Interactive Systems & Speech Processing (see http://www.gipsa-lab.grenoble-inp.fr/en/crissp.php)
January 1998 - January 2015
GIPSA-lab
Position
  • Managing Director
June 1986 - present
French National Centre for Scientific Research
Position
  • Directeur de Recherches
Education
October 1998 - October 2000
Grenoble Institute of Technology
Field of study
  • Speech communication
September 1981 - December 1983
Grenoble Institute of Technology
Field of study
  • Speech & Signal Processing

Publications

Publications (330)
Chapter
Given the importance of gaze in Human-Robot Interactions (HRI), many gaze control models have been developed. However, these models are mostly built for dyadic face-to-face interaction; gaze control models for multiparty interaction remain scarce. We here propose and evaluate data-driven gaze control models for a robot game animator in a three-pa...
Chapter
The human gaze direction is the sum of the head and eye movements. The coordination of these two segments has been studied and models of the contribution of head movement to the gaze of virtual agents or robots have been proposed. However, these coordination models are mostly not trained nor evaluated in an interaction context, and may underestimat...
Chapter
This paper presents a study on different NLP solutions for French homographs disambiguation for text-to-speech systems. Solutions are compared using a home-made corpus of 8137 sentences extracted from the Web, comprising roughly one hundred instances of each of 34 pairs of prototypical words. A disambiguation system based on per-case Linear Discrim...
Article
Full-text available
We propose a computational framework for estimating multidimensional subjective ratings of the reading performance of young readers from speech-based objective measures. We combine linguistic features (number of correct words, repetitions, deletions, insertions uttered per minute, etc.) with prosodic features. Expressivity is particularly difficult...
Article
Full-text available
Pauses when reading aloud play an essential role in reading and listening comprehension (for a review: Godde et al., 2020). Among the various types of pauses, breathing pauses during oral reading are particularly important. Their placement, frequency and duration tell us about breath and voice coordination as well as articulatory planning. These sk...
Chapter
Full-text available
An emerging research trend associating social robotics and social-cognitive psychology offers preliminary evidence that the mere presence of humanoid robots may have the same effects as human presence on human performance, provided the robots are anthropomorphized to some extent (attribution of mental states to the robot being present). However, wh...
Conference Paper
Full-text available
Neural vocoders are systematically evaluated on homogeneous train and test databases. This kind of evaluation is efficient to compare neural vocoders in their "comfort zone", yet it hardly reveals their limits towards unseen data during training. To compare their extrapolation capabilities, we introduce a methodology that aims at quantifying the ro...
Chapter
Full-text available
Digital technology plays a key role in the transformation of medicine. Beyond the simple computerisation of healthcare systems, many non-drug treatments are now possible thanks to digital technology. Thus, interactive stimulation exercises can be offered to people suffering from cognitive disorders, such as developmental disorders, neurodegenerative diseases,...
Conference Paper
Full-text available
The FLUENCE project (e-Fran, PIA2) led to the development of three mobile applications aimed at improving pupils' performance in reading (EVASION and ELARGIR) and in English listening comprehension (LUCIOLE). The applications were deployed in classrooms of the Grenoble school district (France) with 722 pupils followed from grade 1 (CP) to grade 3 (CE2). ...
Article
Full-text available
The present work reviews the current knowledge of the development of reading prosody, or reading aloud with expression, in young children. Prosody comprises the variables of timing, phrasing, emphasis and intonation that speakers use to convey meaning. We detail the subjective rating scales proposed as a means of assessing performance in young read...
Article
Full-text available
Échelle Multi-Dimensionnelle de Fluence : A new tool to assess reading fluency including prosody in French, calibrated for grade 2 to 5 Children’s reading fluency is usually assessed in classrooms using the instruction: ‘‘read as fast as you can.’’ This instruction tends to perpetuate the confusion between fluency and speed. However, reading fast i...
Poster
Full-text available
Pausing when reading aloud is essential to comprehension for both listeners and readers. This skill evolves from the early stage of reading acquisition to reading expertise. The placement and duration of respiratory pauses tell us about breath-voice coordination and hence about planning when reading aloud. From a developmental perspective, this study a...
Conference Paper
Full-text available
Reading prosody is a skill developing from the early reading acquisition to the end of education. Prosody development has an impact on reading comprehension. The RAKE app is a reading karaoke used to perform an audiovisual enhanced reading-while-listening task. In this study we used RAKE in a 10 weeks training program with grade 3 to grade 5 pupils...
Poster
Full-text available
Reading prosody is a skill developing from the early reading acquisition to the end of education. Prosody development has an impact on reading comprehension. The RAKE app is a reading karaoke used to perform an audio-visual enhanced reading-while-listening task. In this study we used RAKE in a 10 weeks training program with grade 3 to grade 5 pupil...
Preprint
Full-text available
How can we learn, transfer and extract handwriting styles using deep neural networks? This paper explores these questions using a deep conditioned autoencoder on the IRON-OFF handwriting dataset. We perform three experiments that systematically explore the quality of our style extraction procedure. First, we compare our model to handwriting benchm...
Conference Paper
Full-text available
Prosody in speech is used to communicate a variety of linguistic, paralinguistic and non-linguistic information via multiparametric contours. The Superposition of Functional Contours (SFC) model is capable of extracting the average shape of these elementary contours through iterative analysis-by-synthesis training of neural network contour generato...
Conference Paper
Full-text available
We consider the problem of learning to localize a speech source using a humanoid robot equipped with a binaural hearing system. We aim to map binaural audio features into the relative angle between the robot’s head direction and the target source direction based on a sensorimotor training framework. To this end, we make the following contributions:...
Preprint
Full-text available
Evaluating the style of handwriting generation is a challenging problem, since it is not well defined. It is a key component in developing systems that offer more personalized experiences with humans. In this paper, we propose baseline benchmarks in order to set anchors to estimate the relative quality of different handwriting style...
Conference Paper
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Conference Paper
Full-text available
This paper presents a new teleoperation system – called stereo gaze-contingent steering (SGCS) – able to seamlessly control the vergence, yaw and pitch of the eyes of a humanoid robot – here an iCub robot – from the actual gaze direction of a remote pilot. The video stream captured by the cameras embedded in the mobile eyes of the iCub are fed into...
Preprint
Full-text available
The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques for training model-free, end-...
Preprint
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Conference Paper
Full-text available
The Superposition of Functional Contours (SFC) prosody model decomposes the intonation and duration contours into elementary contours that encode specific linguistic functions. It can be used to extract these functional contours at multiple linguistic levels. The PySFC system, which incorporates the SFC, can thus be used to analyse the significance...
Article
We speak to express ourselves. Sometimes words can capture what we mean; sometimes we mean more than can be said. This is where our visible gestures – those dynamic oscillations of our gaze, face, head, hands, arms and bodies – help. Not only do these co-verbal visual signals help express our intentions, attitudes and emotions, they also help us enga...
Article
Full-text available
An important problem in computer animation of virtual characters is the expression of complex mental states during conversation using the coordinated prosody of voice, rhythm, facial expressions, and head and gaze motion. In this work, the authors propose an expressive conversion method for generating natural speech and facial animation in a variet...
Article
Human interactions are driven by multi-level perception-action loops. Interactive behavioral models are typically built using rule-based methods or statistical approaches such as Hidden Markov Model (HMM), Dynamic Bayesian Network (DBN), etc. In this paper, we present the multimodal interactive data and our behavioral model based on recurrent neura...
Poster
Full-text available
Telepresence refers to a set of tools that allow a person to be “present” in a distant environment, through a sufficiently realistic representation of it conveyed by multimodal stimuli captured by the distant devices via their sensors. Immersive telepresence follows this trend and, thanks to the capabilities given by virtual reality devices, repl...
Article
In this work we explore the capability of audiovisual prosodic features (such as fundamental frequency, head motion or facial expressions) to discriminate among different dramatic attitudes. We extract the audiovisual parameters from an acted corpus of attitudes and structure them as frame, syllable and sentence-level features. Using Linear Discrim...
Article
Reading while listening to texts (RWL) is a promising way to improve the learning benefits provided by a reading experience. In an exploratory study, we investigated the effect of synchronizing the highlighting of words (visual) with their auditory (speech) counterpart during a RWL task. Forty French children from 3rd to 5th grade read short storie...
Conference Paper
Full-text available
Robotics is a reasonably mature technology when robots are restricted to operating with well-known and well-engineered environments, e.g. in manufacturing robotics or domestic applications such as vacuum cleaning or lawn mowing. For more diverse tasks and open-ended environments, robotic behaviours are mainly hand-tuned: for most of the one million...
Conference Paper
Full-text available
Socially assistive robots with interactive behavioral capabilities have been improving quality of life for a wide range of users by taking care of the elderly, training individuals with cognitive disabilities, supporting physical rehabilitation, etc. While the interactive behavioral policies of most systems are scripted, we discuss here key features of a new met...
Conference Paper
Full-text available
Incremental text-to-speech systems aim at synthesizing a text 'on-the-fly', while the user is typing a sentence. In this context, this article addresses the problem of the part-of-speech tagging (POS, i.e. lexical category) which is a critical step for accurate grapheme-to-phoneme conversion and prosody estimation. Here, the main challenge is to es...
Conference Paper
Full-text available
Several socially assistive robot (SAR) systems have been proposed and designed to engage people in various interactive exercises such as physical training [1], neuropsychological rehabilitation [2] or cognitive assistance [3]. While the interactive behavioral policies of most systems are scripted, we discuss here key features of a new methodology...
Article
This article investigates the use of statistical mapping techniques for the conversion of articulatory movements into audible speech with no restriction on the vocabulary, in the context of a silent speech interface driven by ultrasound and video imaging. As a baseline, we first evaluated the GMM-based mapping considering dynamic features, proposed...
Article
The goal of this paper is to model the coverbal behavior of a subject involved in face-to-face social interactions. For this end, we present a multimodal behavioral model based on a Dynamic Bayesian Network (DBN). The model was inferred from multimodal data of interacting dyads in a specific scenario designed to foster mutual attention and multimod...
Book
Full-text available
This volume gathers the study and research work carried out as part of the course «Cognition, Affects et Interaction» that we taught during the first semester of 2015-2016. This second edition of the course continues the principle inaugurated in 2014: the lectures given on the theme "Cognition, Interaction & Affects", which provide the tools...
Article
Full-text available
This paper addresses the adaptation of an acoustic-articulatory model of a reference speaker to the voice of another speaker, using a limited amount of audio-only data. In the context of pronunciation training, a virtual talking head displaying the internal speech articulators (e.g., the tongue) could be automatically animated by means of such a mo...
Article
Recent developments in human-robot interaction show how the ability to communicate with people in a natural way is of great importance for artificial agents. The implementation of facial expressions has been found to significantly increase the interaction capabilities of humanoid robots. For speech, displaying a correct articulation with sound is m...
Conference Paper
Full-text available
The focus of this study is the generation of expressive audiovisual speech from neutral utterances for 3D virtual actors. Taking into account the segmental and suprasegmental aspects of audiovisual speech, we propose and compare several computational frameworks for the generation of expressive speech and face animation. We notably evaluate a standa...
Conference Paper
Full-text available
This article reports the use of a karaoke technique to drive the visual attention span (VAS) of subjects reading a text while listening to the text spelled aloud by a reading tutor. We tested the impact of computer-assisted synchronous reading (S+) that emphasizes words when they are uttered, vs. non-synchronous reading (S-) in a reading while list...
Conference Paper
Full-text available
Incremental speech synthesis aims at delivering the synthetic voice while the sentence is still being typed. One of the main challenges is the online estimation of the target prosody from a partial knowledge of the sentence's syntactic structure. In the context of HMM-based speech synthesis, this typically results in missing segmental and suprasegm...
Article
Full-text available
The aim of this paper is to model multimodal perception-action loops of human behavior in face-to-face interactions. To this end, we propose trainable behavioral models that predict the optimal actions for one specific person given others’ perceived actions and the joint goals of the interlocutors. We first compare sequential models—in particular d...
Article
Full-text available
We here propose to use immersive teleoperation of a humanoid robot by a human pilot for artificially providing the robot with social skills. This so-called beaming approach of learning by demonstration (the robot passively experience social behaviors that can be further modeled and used for autonomous control) offers a unique way to study embodied...