Frédéric Elisei
GIPSA-lab · Département Parole et Cognition
PhD in Computer Science
About
107 Publications · 16,075 Reads
1,396 Citations
Introduction
Frédéric Elisei currently works at the Département Parole et Cognition, GIPSA-lab. His research focuses on face-to-face interaction (the synchronization of speech with gaze and gestures), both in the human-human case and in relation to cognitive robotics.
Additional affiliations
Position: Research Assistant
Publications (107)
Background
Impaired cognitive function is observed in many pathologies, including neurodegenerative diseases such as Alzheimer disease. At present, the pharmaceutical treatments available to counter cognitive decline have only modest effects, with significant side effects. A nonpharmacological treatment that has received considerable attention is c...
We developed a web app for ascribing verbal descriptions to expressive audiovisual utterances. These descriptions are limited to lists of adjective tags that are either suggested via a navigation in emotional latent spaces built using discriminant analysis of BERT embeddings, or entered freely by participants. We show that such verbal descriptions...
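The entry above mentions emotional latent spaces built by discriminant analysis of BERT embeddings. As an illustration only, here is a minimal sketch of that general idea, assuming a Hugging Face BERT model and scikit-learn's LDA; the texts, tags and pooling strategy are placeholders, not the paper's actual pipeline.

```python
# Minimal sketch: project sentence-level BERT embeddings into a low-dimensional
# "emotional" space with Linear Discriminant Analysis. All texts, tags and the
# pooling strategy are toy placeholders, not the paper's actual data or pipeline.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

texts = ["I am so happy today", "What wonderful news", "This is terrifying",
         "I am really scared now", "What a boring afternoon", "Nothing ever happens here"]
labels = ["joy", "joy", "fear", "fear", "boredom", "boredom"]  # hypothetical adjective tags

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

with torch.no_grad():
    enc = tok(texts, padding=True, return_tensors="pt")
    emb = bert(**enc).last_hidden_state.mean(dim=1).numpy()   # mean-pooled sentence vectors

# LDA keeps at most (number of classes - 1) axes; on real data these are the
# discriminant directions along which the emotional categories separate best.
lda = LinearDiscriminantAnalysis(n_components=2)
coords = lda.fit_transform(emb, labels)
print(coords)   # 2-D coordinates usable for navigating the "emotional" space
```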
Given the importance of gaze in Human-Robot Interactions (HRI), many gaze control models have been developed. However, these models are mostly built for dyadic face-to-face interaction. Gaze control models for multiparty interaction are more scarce. We here propose and evaluate data-driven gaze control models for a robot game animator in a three-pa...
The human gaze direction is the sum of the head and eye movements. The coordination of these two segments has been studied and models of the contribution of head movement to the gaze of virtual agents or robots have been proposed. However, these coordination models are mostly not trained nor evaluated in an interaction context, and may underestimat...
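The abstract above describes gaze as the sum of head and eye movements, with models of the head's contribution. The sketch below illustrates that decomposition with a constant head-contribution ratio and an eye-in-head limit; both values are illustrative assumptions, not the coordination model evaluated in the paper.

```python
# Minimal sketch of the head/eye decomposition mentioned above: the gaze angle is
# split into a head part and an eye-in-head part using a constant head-contribution
# ratio, with the eye clamped to a plausible oculomotor range. The ratio and the
# limit are illustrative values, not parameters from the paper.

def split_gaze(gaze_deg, head_ratio=0.6, eye_limit_deg=35.0):
    """Return (head_deg, eye_deg) such that head_deg + eye_deg == gaze_deg."""
    head = head_ratio * gaze_deg
    eye = gaze_deg - head
    if abs(eye) > eye_limit_deg:                     # eye saturates, head absorbs the rest
        eye = eye_limit_deg if eye > 0 else -eye_limit_deg
        head = gaze_deg - eye
    return head, eye

print(split_gaze(120.0))   # large shift: the eye saturates at 35 deg, head covers 85 deg
print(split_gaze(10.0))    # small shift: head 6 deg, eye 4 deg
```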
Hard of hearing or profoundly deaf people make use of cued speech (CS) as a communication tool to understand spoken language. By delivering cues that are relevant to the phonetic information, CS offers a way to enhance lipreading. In literature, there have been several studies on the dynamics between the hand and the lips in the context of human pr...
An emerging research trend associating social robotics and social-cognitive psychology offers preliminary evidence that the mere presence of humanoid robots may have the same effects as human presence on human performance, provided the robots are anthropomorphized to some extent (attribution of mental states to the robot being present). However, wh...
Digital technology plays a key role in the transformation of medicine. Beyond the simple computerisation of healthcare systems, many non-drug treatments are now possible thanks to digital technology. Thus, interactive stimulation exercises can be offered to people suffering from cognitive disorders, such as developmental disorders, neurodegenerative diseases,...
We consider the problem of learning to localize a speech source using a humanoid robot equipped with a binaural hearing system. We aim to map binaural audio features into the relative angle between the robot’s head direction and the target source direction based on a sensorimotor training framework. To this end, we make the following contributions:...
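The localization work above maps binaural features to a relative source angle. Below is a minimal sketch of such a feature-to-angle regression on synthetic interaural cues; the features, noise levels and the regressor are placeholder assumptions, not the paper's sensorimotor training framework.

```python
# Minimal sketch of a feature-to-angle mapping: a small regressor is trained to map
# binaural cues (here a toy ITD/ILD pair generated from a crude geometric model) to
# the relative azimuth of the source.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
azimuth = rng.uniform(-90, 90, size=2000)                                     # degrees
itd = np.sin(np.radians(azimuth)) + rng.normal(0, 0.02, azimuth.shape)        # toy time difference
ild = 0.5 * np.sin(np.radians(azimuth)) + rng.normal(0, 0.05, azimuth.shape)  # toy level difference
X = np.column_stack([itd, ild])

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(X, azimuth)

test = np.array([[np.sin(np.radians(30.0)), 0.5 * np.sin(np.radians(30.0))]])
print(model.predict(test))   # should come out close to 30 degrees
```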
This paper presents a new teleoperation system – called stereo gaze-contingent steering (SGCS) – able to seamlessly control the vergence, yaw and pitch of the eyes of a humanoid robot – here an iCub robot – from the actual gaze direction of a remote pilot. The video stream captured by the cameras embedded in the mobile eyes of the iCub are fed into...
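To make the eye parameters named above (vergence, yaw, pitch) concrete, here is a small geometric sketch that derives them from a 3D fixation point, assuming symmetric eyes and an illustrative interocular distance; it is not the SGCS implementation.

```python
# Small geometric sketch relating a 3D fixation point to yaw, pitch and vergence
# angles. Conventions (meters, eyes symmetric about the head's x-axis, z pointing
# forward) and the 6.5 cm interocular distance are illustrative assumptions.
import numpy as np

def eye_angles(fixation, interocular=0.065):
    """Return (yaw, pitch, vergence) in degrees for a fixation point [x, y, z]."""
    x, y, z = fixation
    yaw = np.degrees(np.arctan2(x, z))                  # horizontal cyclopean angle
    pitch = np.degrees(np.arctan2(y, np.hypot(x, z)))   # vertical angle
    half = interocular / 2.0
    left = np.arctan2(x + half, z)                      # per-eye horizontal angles
    right = np.arctan2(x - half, z)
    vergence = np.degrees(left - right)
    return yaw, pitch, vergence

print(eye_angles([0.0, 0.0, 0.5]))   # near target straight ahead: noticeable vergence
print(eye_angles([0.0, 0.0, 5.0]))   # far target: vergence close to zero
```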
Human interactions are driven by multi-level perception-action loops. Interactive behavioral models are typically built using rule-based methods or statistical approaches such as Hidden Markov Model (HMM), Dynamic Bayesian Network (DBN), etc. In this paper, we present the multimodal interactive data and our behavioral model based on recurrent neura...
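Since the abstract above refers to a behavioral model based on recurrent neural networks, here is a minimal PyTorch sketch of that kind of sequence-to-action model; the dimensions, features and action set are placeholders rather than the paper's architecture.

```python
# Minimal sketch of a recurrent behavioral model: an LSTM reads a sequence of
# perceived multimodal features and predicts one action class per frame.
import torch
import torch.nn as nn

class BehaviorRNN(nn.Module):
    def __init__(self, n_features=16, n_actions=5, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x):                  # x: (batch, time, n_features)
        h, _ = self.rnn(x)
        return self.head(h)                # (batch, time, n_actions) logits

model = BehaviorRNN()
perceived = torch.randn(8, 100, 16)        # 8 sequences, 100 frames, 16 features
actions = model(perceived).argmax(dim=-1)  # predicted action index per frame
print(actions.shape)                       # torch.Size([8, 100])
```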
Telepresence refers to a set of tools that allows a person to be “present” in a distant environment, by a sufficiently realistic representation of it through a set of multimodal stimuli experienced by the distant devices via its sensors. Immersive Telepresence follows this trend and, thanks to the capabilities given by virtual reality devices, repl...
Socially assistive robots with interactive behavioral capabilities have been improving quality of life for a wide range of users by taking care of the elderly, training individuals with cognitive disabilities, supporting physical rehabilitation, etc. While the interactive behavioral policies of most systems are scripted, we discuss here key features of a new met...
Several socially assistive robot (SAR) systems have been proposed and designed to engage people into various interactive exercises such as physical training [1], neuropsychological rehabilitation [2] or cognitive assistance [3]. While the interactive behavioral policies of most systems are scripted, we discuss here key features of a new methodology...
The goal of this paper is to model the coverbal behavior of a subject involved in face-to-face social interactions. To this end, we present a multimodal behavioral model based on a Dynamic Bayesian Network (DBN). The model was inferred from multimodal data of interacting dyads in a specific scenario designed to foster mutual attention and multimod...
Recent developments in human-robot interaction show how the ability to communicate with people in a natural way is of great importance for artificial agents. The implementation of facial expressions has been found to significantly increase the interaction capabilities of humanoid robots. For speech, displaying a correct articulation with sound is m...
The aim of this paper is to model multimodal perception-action loops of human behavior in face-to-face interactions. To this end, we propose trainable behavioral models that predict the optimal actions for one specific person given others’ perceived actions and the joint goals of the interlocutors. We first compare sequential models—in particular d...
We here propose to use immersive teleoperation of a humanoid robot by a human pilot for artificially providing the robot with social skills. This so-called beaming approach of learning by demonstration (the robot passively experiences social behaviors that can be further modeled and used for autonomous control) offers a unique way to study embodied...
This paper presents the virtual speech cuer built in the context of the ARTUS project aiming at watermarking hand and face gestures of a virtual animated agent in a broadcast audiovisual sequence. For deaf televiewers who master cued speech, the animated agent can then be superimposed - on demand and at the reception - in the original broadcast...
The article presents a method for adapting a GMM-based acoustic-articulatory inversion model trained on a reference speaker to another speaker. The goal is to estimate the articulatory trajectories in the geometrical space of a reference speaker from the speech audio signal of another speaker. This method is developed in the context of a system of...
The article presents a statistical mapping approach for crossspeaker acoustic-to-articulatory inversion. The goal is to estimate the most likely articulatory trajectories for a reference speaker from the speech audio signal of another speaker. This approach is developed in the framework of our system of visual articulatory feedback developed for co...
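The two entries above describe GMM-based acoustic-to-articulatory mapping. The sketch below shows the generic GMM regression idea (a joint model, then posterior-weighted conditional means) on synthetic data; it does not reproduce the cross-speaker adaptation of these papers.

```python
# Minimal sketch of GMM-based regression in the spirit of acoustic-to-articulatory
# inversion: fit a joint GMM on stacked [acoustic, articulatory] vectors, then
# estimate articulatory features from acoustic ones as the posterior-weighted sum
# of the component-wise conditional means. Data here are synthetic placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(3000, 4))                 # toy "acoustic" features
articulatory = acoustic @ rng.normal(size=(4, 3)) + 0.1 * rng.normal(size=(3000, 3))

joint = np.hstack([acoustic, articulatory])
gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0).fit(joint)

def gmm_invert(x, gmm, dx):
    """MMSE estimate of the trailing block given the first dx dims (single sample x)."""
    means, covs, w = gmm.means_, gmm.covariances_, gmm.weights_
    resp, preds = [], []
    for k in range(gmm.n_components):
        mx, my = means[k, :dx], means[k, dx:]
        Sxx, Sxy = covs[k, :dx, :dx], covs[k, :dx, dx:]
        diff = x - mx
        # component responsibility evaluated on the acoustic part only
        lk = w[k] * np.exp(-0.5 * diff @ np.linalg.solve(Sxx, diff)) / np.sqrt(np.linalg.det(Sxx))
        resp.append(lk)
        preds.append(my + Sxy.T @ np.linalg.solve(Sxx, diff))   # conditional mean of y given x
    resp = np.array(resp) / np.sum(resp)
    return np.sum(resp[:, None] * np.array(preds), axis=0)

print(gmm_invert(acoustic[0], gmm, dx=4), articulatory[0])   # estimate vs. ground truth
```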
Human-human interaction in natural environments relies on a variety of perceptual cues. Humanoid robots are becoming increasingly refined in their sensorimotor capabilities, and thus should now be able to manipulate and exploit these social cues in cooperation with their human partners. Previous studies have demonstrated that people follow human an...
We present a set of acquisition campaigns - for a single speaker - of acoustic, aerodynamic and articulatory data, using various complementary devices. The speech production models developed from these data make it possible to exploit the characteristics of these devices in a coherent way, and reflect...
ISBN-13 978-2-7351-1272-2
Context: Several studies tend to show that visual articulatory feedback is useful for phonetic correction, both for speech therapy and "Computer Aided Pronunciation Training" (CAPT) [1]. In [2], we proposed a visual articulatory feedback system based on a 3D talking head used in "an augmented speech scenario", i.e. displaying all speech articulators...
This paper reviews some theoretical and practical aspects of different statistical mapping techniques used to model the relationships between the articulatory gestures and the resulting speech sound. These techniques are based on the joint modeling of articulatory and acoustic data using Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM). T...
In the present work we observe two subjects interacting in a collaborative task on a shared environment. One goal of the experiment is to measure the change in behavior with respect to gaze when one interactant is wearing dark glasses and hence his/her gaze is not visible by the other one. The results show that if one subject wears dark glasses whi...
Orofacial clones can display speech articulation in an augmented mode, i.e. display all major speech articulators, including those usually hidden such as the tongue or the velum. Besides, a number of studies tend to show that the visual articulatory feedback provided by ElectroPalatoGraphy or ultrasound echography is useful for speech therapy. This...
In this paper, we describe two series of experiments that examine audiovisual face-to-face interaction between naive human viewers and either a human interlocutor or a virtual conversational agent. The main objective is to analyze the interplay between speech activity and mutual gaze patterns during mediated face-to-face interactions. We first quan...
Lip reading relies on visible articulators to ease speech understanding. However, lips and face alone provide very incomplete phonetic information: the tongue, which is generally not entirely seen, carries an important part of the articulatory information not accessible through lip reading. The question is thus whether the direct and full vision of...
We introduce here an emerging technological and scientific field. Augmented speech communication (ASC) aims at supplementing human-human communication with enhanced or additional modalities. ASC improves human-human communication by exploiting a priori knowledge on multimodal coherence of speech signals, user/listener voice characteristics or more...
We describe here the control, shape and appearance models that are built using an original photogrammetric method to capture characteristics of speaker-specific facial articulation, anatomy, and texture. Two original contributions are put forward here: the trainable trajectory formation model that predicts articulatory trajectories of a talking fac...
This article investigates a blossoming research field: face-to-face communication. The performance and the robustness of the technological components that are necessary to the implementation of face-to-face interaction systems between a human being and a conversational agent - vocal technologies, computer vision, image synthesis, dialogue comprehen...
Visible speech movements were motion captured and parameterized. Coarticulated targets were extracted from VCVs and modeled to generate arbitrary German utterances by target interpolation. The system was extended to synthesize English utterances by a mapping to German phonemes. An evaluation by means of a modified rhyme test reveals that the synthe...
This article explores a rapidly growing research field: face-to-face communication. The performance and robustness of the technological components needed to implement face-to-face interaction systems between humans and a conversational agent - voice technologies, computer vision, image synthesis, comprehension and g...
Lip reading relies on visible articulators to ease audiovisual speech understanding. However, lips and face alone provide very incomplete phonetic information: the tongue, which is generally not entirely seen, carries an important part of the articulatory information not accessible through lip reading. The question was thus whether the direct and fu...
In this paper we present an overview of LIPS2008: Visual Speech Synthesis Challenge. The aim of this challenge is to bring together researchers in the field of visual speech synthesis to firstly evaluate their systems within a common framework, and secondly to identify the needs of the wider community in terms of evaluation. In doing so we hope to...
This paper presents preliminary analysis and modelling of facial motion capture data recorded on a speaker uttering nonsense syllables and sentences with various acted facial expressions. We analyze here the impact of facial expressions on articulation and determine prediction errors of simple models trained to map neutral articulation to the vario...
We describe here the trainable trajectory formation model that will be used for the LIPS'2008 challenge organized at InterSpeech'2008. It predicts articulatory trajectories of a talking face from phonetic input. It basically uses HMM-based synthesis but asynchrony between acoustic and gestural boundaries – taking for example into account non audibl...
Cued Speech is a communication system that complements lip-reading with a small set of possible handshapes placed in different positions near the face. Developing a Cued Speech capable system is a time-consuming and difficult challenge. This paper focuses on how an existing bank of reference Cued Speech gestures, exhibiting natural dynamics for han...
We present a methodology developed to derive three-dimensional models of speech articulators from volume MRI and multiple view video images acquired on one speaker. Linear component analysis is used to model these highly deformable articulators as the weighted sum of a small number of basic shapes corresponding to the articulators’ degrees of freed...
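The methodology above models articulators as a weighted sum of a few basic shapes via linear component analysis. Here is a minimal PCA sketch of that idea on placeholder mesh data; the number of components and the data layout are illustrative assumptions.

```python
# Minimal sketch of the "weighted sum of basic shapes" idea: PCA over flattened
# 3D vertex coordinates of an articulator mesh. The meshes below are random
# placeholders; the paper's data come from MRI and multi-view video.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_meshes, n_vertices = 40, 500
shapes = rng.normal(size=(n_meshes, n_vertices * 3))     # each row: one articulator mesh

pca = PCA(n_components=5).fit(shapes)                    # a few "basic shapes" / degrees of freedom
weights = pca.transform(shapes[:1])                      # articulatory control parameters
reconstruction = pca.mean_ + weights @ pca.components_   # weighted sum of basic shapes

print(np.allclose(reconstruction, pca.inverse_transform(weights)))  # True
```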
We present here the analysis of multimodal data gathered during realistic face-to-face interaction of a target speaker with a number of interlocutors. Videos and gaze of both interlocutors were monitored with an experimental setup using coupled cameras and screens equipped with eye trackers. With the aim to understand the functions of gaze in socia...
Numerous studies have established that seeing the typically visible articulators (lips, jaw, face, front part of the tongue, teeth) facilitates speech comprehension by humans and significantly increases recognition rates. However, not everything can be "read" unambiguously from the sight of the face alone. In pa...
We present here a system for controlling the eye gaze of a virtual embodied conversational agent able to perceive the physical environment in which it interacts. This system is inspired by known components of human visual attention system and reproduces its limitations in terms of visual acuity, sensitivity to movement, limitations of short-memory...
We present here the analysis of multimodal data gathered during realistic face-to-face interaction of a target speaker with a number of interlocutors. Videos and gaze have been monitored with an experimental setup using coupled cameras and screens with integrated eye trackers. With the aim to understand the functions of gaze in social interaction a...
Eye gaze plays many important roles in audiovisual speech, especially in face-to-face interactions. Eyelid shapes are known to correlate with gaze direction. This correlation is perceived and should be restored when animating 3D talking heads. This paper presents a data-based construction method that models the user’s eyelid geometric deformations...
We investigate the intelligibility of natural visual and audiovisual speech compared to re-synthesized speech movements rendered by a talking head. This talking head is created using the speaker cloning methodology of the Institut de la Communication Parlée in Grenoble (now the Department for Speech and Cognition in GIPSA-Lab). A German speaker with c...
We present here the analysis of multimodal data gathered during realistic face-to-face interaction of a target speaker with a number of interlocutors. During several dyadic dialogs videos and gaze have been monitored with an original experimental setup using coupled cameras and screens equipped with eye tracking capabilities. For a detailed analysi...
In the framework of experimental phonetics, our approach to the study of speech production is based on the measurement, the analysis and the modeling of orofacial articulators such as the jaw, the face and the lips, the tongue or the velum. Therefore, we present in this article experimental techniques that allow characterising the shape and movemen...
We describe here our first effort for developing a virtual talking head able to engage a situated face-to-face interaction with a human partner. This paper concentrates on the low-level components of this interaction loop and the cognitive impact of the implementation of mutual attention and multimodal deixis on the communication task.