Article

Tracking talking faces with shape and appearance models


Abstract

This paper presents a system that can recover and track the 3D speech movements of a speaker’s face for each image of a monocular sequence. To handle both the individual specificities of the speaker’s articulation and the complexity of the facial deformations during speech, speaker-specific articulated models of the face geometry and appearance are first built from real data. These face models are used for tracking: articulatory parameters are extracted for each image by an analysis-by-synthesis loop. The geometric model is linearly controlled by only seven articulatory parameters. Appearance is seen either as a classical texture map or through local appearance of a relevant subset of 3D points. We compare several appearance models: they are either constant or depend linearly on the articulatory parameters. We compare tracking results using these different appearance models with ground truth data not only in terms of recovery errors of the 3D geometry but also in terms of intelligibility enhancement provided by the movements.
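For readers who want a concrete picture of the pipeline described in the abstract, the sketch below shows a shape model that is a linear function of a small articulatory parameter vector, fitted per frame by a toy analysis-by-synthesis loop. The basis matrix, the render callback and the Gauss-Newton style update are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class LinearShapeModel:
    """Face geometry as a linear function of articulatory parameters:
    vertices(p) = mean + basis @ p, with basis of shape (3*N_vertices, n_params)."""
    def __init__(self, mean, basis):
        self.mean = mean          # (3N,) mean vertex coordinates
        self.basis = basis        # (3N, P) articulatory deformation modes

    def synthesize(self, params):
        return self.mean + self.basis @ params   # (3N,) deformed geometry

def analysis_by_synthesis(model, observed, render, n_iter=50, step=0.1):
    """Toy analysis-by-synthesis loop: adjust the parameters to reduce the
    discrepancy between a rendered model and the observed image features.
    `render` maps a geometry vector to the observation space (an assumption)."""
    p = np.zeros(model.basis.shape[1])
    for _ in range(n_iter):
        current = render(model.synthesize(p))
        residual = observed - current
        # Gauss-Newton style update with a numerically estimated Jacobian
        J = np.column_stack([
            (render(model.synthesize(p + 1e-3 * e)) - current) / 1e-3
            for e in np.eye(len(p))])
        p += step * np.linalg.lstsq(J, residual, rcond=None)[0]
    return p
```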

... In [4], a dynamic extension to the FACS system is presented. In [5] [6] results of 3D speech movement analysis using facial deformation states are given and used for animation and tracking purposes. These results show promising gains and lead to higher degrees of acceptance in facial animation. ...
... In [7], a mirror construction is described to capture the movements of the lips from two views. A double mirror construction together with two cameras is described in [6] and used to determine speech movements from a total of four views, where the real camera views are nearly identical. We have constructed a system with two mirrors which is shown in Fig. 1. ...
Conference Paper
Full-text available
We present our system for the capturing and analysis of 3D facial motion. A high speed camera is used as the capture unit in combination with two surface mirrors. The mirrors provide two additional virtual views of the face without the need for multiple cameras and avoid synchronization problems. We use this system to capture the motion of a person's face while speaking. Investigations of these facial motions are presented and rigid and non-rigid motion are analyzed. In order to extract only facial deformation independent from head pose, we use a new and simple approach for separating rigid and non-rigid motion named weight-compensated motion estimation (WCME). This approach weights the data points according to their influence on the desired motion model. We also present first results of our model-based facial deformation analysis. Such results can be used for facial animations in order to achieve a higher degree of quality.
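As a rough illustration of the weighting idea behind WCME, the snippet below estimates a rigid head transform while iteratively down-weighting points whose residuals suggest non-rigid (speech-related) deformation. This is a generic iteratively re-weighted Kabsch/Procrustes scheme with assumed Gaussian weights, not the authors' exact formulation.

```python
import numpy as np

def weighted_rigid_fit(src, dst, weights):
    """Weighted least-squares rigid transform (R, t) mapping src points to dst points."""
    w = weights / weights.sum()
    mu_s, mu_d = w @ src, w @ dst
    H = (src - mu_s).T @ np.diag(w) @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def estimate_rigid_motion(src, dst, n_iter=10, sigma=2.0):
    """Iteratively down-weight points that move non-rigidly (e.g. lips)
    so that the recovered motion reflects head pose only."""
    weights = np.ones(len(src))
    for _ in range(n_iter):
        R, t = weighted_rigid_fit(src, dst, weights)
        residuals = np.linalg.norm(dst - (src @ R.T + t), axis=1)
        weights = np.exp(-(residuals / sigma) ** 2)   # soft outlier weights (assumption)
    return R, t, weights
```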
... One reason is that the face improves intelligibility, particularly when the auditory signal is degraded by the presence of noise or distracting prose (see Sumby and Pollack [1]; Benoît et al. [2]; Jesse et al. [3]; Summerfield [4]). Given this observation, there is value in developing applications with virtual 3D animated talking heads that are aligned with the auditory speech (see Bailly et al. [5]; Beskow [6]; Massaro [7]; Odisio et al. [8]; Pelachaud et al. [9]). These animated agents have the potential to improve communication between humans and machines. ...
... Across a range of studies comparing specific mathematical predictions (see Chen and Massaro [28]; Massaro [7,27,29]), the FLMP has been more successful than other competitor models in accounting for the experimental data. Previous tests of the FLMP did not include both a synthetic and a natural talker, nor did previous tests of intelligibility ([7]). The present three experiments include these additional conditions, which allow us to use the FLMP parameter values to assess differences between test and reference conditions of the visual channel. ...
Article
Full-text available
Animated agents are becoming increasingly frequent in research and applications in speech science. An important challenge is to evaluate the effectiveness of the agent in terms of the intelligibility of its visible speech. In three experiments, we extend and test the Sumby and Pollack (1954) metric to allow the comparison of an agent relative to a standard or reference, and also propose a new metric based on the fuzzy logical model of perception (FLMP) to describe the benefit provided by a synthetic animated face relative to the benefit provided by a natural face. A valid metric would allow direct comparisons across different experiments and would give measures of the benefit of a synthetic animated face relative to a natural face (or indeed any two conditions) and how this benefit varies as a function of the type of synthetic face, the test items (e.g., syllables versus sentences), different individuals, and applications.
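The FLMP combination rule referred to above is, for a two-alternative judgement, a multiplicative integration of the auditory and visual degrees of support. A minimal sketch (variable names and the example values are illustrative):

```python
def flmp(auditory, visual):
    """Fuzzy Logical Model of Perception, two-alternative case: auditory and
    visual degrees of support (in [0, 1]) are combined multiplicatively and
    renormalized over the two alternatives."""
    num = auditory * visual
    return num / (num + (1.0 - auditory) * (1.0 - visual))

# Weak auditory evidence (0.6) combined with strong visual evidence (0.9)
print(flmp(0.6, 0.9))   # ~0.93: the face compensates for the degraded audio
```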
... Then, tied-state triphones were created using decision tree clustering. The contexts considered for clustering are based on the hierarchical cluster trees of phonemes mentioned in [8]. The complete speech corpus has been used for the estimation of HMM parameters. ...
... The analysis of geometric targets of the 5690 allophones produced by the speaker (see Figure 2) reveals confusion trees similar to previous findings (Odisio and Bailly 2004 ...
... Overall, these studies suggest that it is important to develop applications with virtual 3D animated talking heads that are aligned with the auditory speech (Bailly et al., 2003; Beskow, 2003; Massaro, 1998; Ouni et al., 2007; Odisio et al., 2004). ...
Article
Full-text available
In recent years, advances in wireless communication technology have led to the widespread use of cellular phones. Because of noisy environmental conditions and competing surrounding conversations, users tend to speak loudly. As a consequence, private policies and public legislation tend to restrain the use of cellular phones in public places. Silent speech, which can only be heard by a limited set of listeners close to the speaker, is an attractive solution to this problem if it can effectively be used for quiet and private communication. The motivation of this research thesis was to investigate ways of improving the naturalness and the intelligibility of synthetic speech obtained from the conversion of silent or whispered speech. A Non-audible murmur (NAM) condenser microphone, together with signal-based Gaussian Mixture Model (GMM) mapping, were chosen because promising results had already been obtained with this sensor and this approach, and because the size of the NAM sensor is well adapted to mobile communication technology. Several improvements to the speech conversion obtained with this sensor were considered. A first set of improvements concerns characteristics of the voiced source. One of the features missing in whispered or silent speech with respect to loud or modal speech is F0, which is crucial in conveying linguistic (question vs. statement, syntactic grouping, etc.) as well as paralinguistic (attitudes, emotions) information. The proposed estimation of voicing and F0 for converted speech by separate predictors improves both predictions. The naturalness of the converted speech was then further improved by extending the context window of the input feature from phoneme size to syllable size and using a Linear Discriminant Analysis (LDA) instead of a Principal Component Analysis (PCA) for the dimension reduction of the input feature vector. The objectively positive influence of this new approach on the quality of the output converted speech was confirmed by perceptual tests. Another approach investigated in this thesis consisted in integrating visual information as a complement to the acoustic information in both input and output data. Lip movements, which significantly contribute to the intelligibility of visual speech in face-to-face human interaction, were explored using an accurate lip motion capture system based on the 3D positions of coloured beads glued on the speaker's face. The visual parameters are represented by 5 components related to the rotation of the jaw, to lip rounding, to upper and lower lip vertical movements, and to movements of the throat, which are associated with the underlying movements of the larynx and hyoid bone. Including these visual features in the input data significantly improved the quality of the output converted speech, in terms of F0 and spectral features. In addition, the audio output was replaced by an audio-visual output. Subjective perceptual tests confirmed that including the visual modality in the input data, the output data, or both, improves the intelligibility of the whispered speech conversion. Both of these improvements were confirmed by subjective tests. Finally, we investigated a technique using a phonetic pivot, combining Hidden Markov Model (HMM)-based speech recognition and HMM-based speech synthesis techniques to convert whispered speech data to audible speech, in order to compare the performance of the two state-of-the-art approaches. Audiovisual features were used in the input data and audiovisual speech was produced as an output.
The objective performance of the HMM-based system was inferior to the direct signal-to-signal system based on a GMM. A few interpretations of this result were proposed together with future lines of research.
... These models are built from Magnetic Resonance Imaging (MRI), Computer Tomography (CT) and video data acquired from this speaker and aligned on a common reference coordinate system related to the skull. The jaw, lips and face model, described in Odisio et al. (2004), is controlled by two jaw parameters (jaw height, jaw advance), and three lip parameters (lip protrusion LP common to both lips, upper lip height UL, lower lip height LL). The velum model presented in Serrurier and Badin (2008) is essentially controlled by one parameter that drives the opening/closing movements of the nasopharyngeal port. ...
Article
Lip reading relies on visible articulators to ease speech understanding. However, lips and face alone provide very incomplete phonetic information: the tongue, that is generally not entirely seen, carries an important part of the articulatory information not accessible through lip reading. The question is thus whether the direct and full vision of the tongue allows tongue reading. We have therefore generated a set of audiovisual VCV stimuli with an audiovisual talking head that can display all speech articulators, including tongue, in an augmented speech mode. The talking head is a virtual clone of a human speaker and the articulatory movements have also been captured on this speaker using ElectroMagnetic Articulography (EMA). These stimuli have been played to subjects in audiovisual perception tests in various presentation conditions (audio signal alone, audiovisual signal with profile cutaway display with or without tongue, complete face), at various Signal-to-Noise Ratios. The results indicate: (1) the possibility of implicit learning of tongue reading, (2) better consonant identification with the cutaway presentation with the tongue than without the tongue, (3) no significant difference between the cutaway presentation with the tongue and the more ecological rendering of the complete face, (4) a predominance of lip reading over tongue reading, but (5) a certain natural human capability for tongue reading when the audio signal is strongly degraded or absent. We conclude that these tongue reading capabilities could be used for applications in the domains of speech therapy for speech retarded children, of perception and production rehabilitation of hearing impaired children, and of pronunciation training for second language learners.
... Our virtual talking head is made of the assemblage of individual three-dimensional models of diverse speech organs (tongue, jaw, lips, velum, face, etc) built from MRI, CT and video data acquired from a single subject and aligned on a common reference coordinate system related to the skull. The jaw, lips and face model described in [8] is controlled by two jaw parameters (jaw height, jaw advance), three lip parameters (lip protrusion common to both lips, upper lip height, lower lip height). The three-dimensional jaw and tongue model developed by Badin et al. [9] is driven mostly by five parameters: jaw height (common with the lips / face model), tongue body, tongue dorsum, tongue tip vertical and tongue tip horizontal. ...
Conference Paper
Full-text available
Lip reading relies on visible articulators to ease audiovisual speech understanding. However, lips and face alone provide very incomplete phonetic information: the tongue, which is generally not entirely seen, carries an important part of the articulatory information not accessible through lip reading. The question was thus whether the direct and full vision of the tongue allows tongue reading. We have therefore generated a set of audiovisual VCV stimuli by controlling an audiovisual talking head that can display all speech articulators, including the tongue, in an augmented speech mode, from articulator movements tracked on a speaker. These stimuli have been played to subjects in a series of audiovisual perception tests in various presentation conditions (audio signal alone, audiovisual signal with profile cutaway display with or without the tongue, complete face), at various Signal-to-Noise Ratios. The results show a certain implicit learning effect for tongue reading, a preference for the more ecological rendering of the complete face in comparison with the cutaway presentation, a predominance of lip reading over tongue reading, but also the capability of tongue reading to take over when the audio signal is strongly degraded or absent. We conclude that these tongue reading capabilities could be used for applications in the domains of speech therapy for speech retarded children, perception and production rehabilitation of hearing impaired children, and pronunciation training for second language learners.
... With the publication of Essa and Pentland [4], a dynamic extension to the FACS system was presented. In the conference publication of Kalberer and Van Gool as well as in the journal publication of Odisio et al. [5, 6], results of 3D speech movement analysis using facial deformation states are given and used for animation and tracking purposes. In this paper, we analyze the dynamics of facial expressions. ...
Conference Paper
Investigation of the motion performed by a person's face while speaking is the target of this paper. Methods and results of the studied facial motions are presented and rigid and non-rigid motion are analyzed. In order to extract only facial deformation independent from head pose, we use a new and simple approach for separating rigid and non-rigid motion called Weight Compensated Motion Estimation (WCME). This approach weights the data points according to their influence on the desired motion model. A synthetic test as well as real data are used to demonstrate the performance of this approach. We also present results in the field of facial deformation analysis and use basis shapes as the description form. These results can be used for recognition purposes by adding temporal changes to the overall process or by adding natural deformations beyond those in the given database.
... If we use an exhaustive search approach, we need to look at all the possible positions and, at each location, at all the possible poses of our model in the zone where the model could be present in the image under test. Recent applications of Hausdorff distance for shape matching include face detection and tracking [8]. Medical image registration also uses Hausdorff distance techniques for kidney images acquired using CT devices [9]. ...
Conference Paper
In this work, we present a Monte Carlo approach to compute Hausdorff distance for locating objects in real images. Objects are considered to be only under translation motion. We use edge points as the features of the model. Using a different interpretation of the Hausdorff distance, we show how image similarity can be measured by using a randomly sub-sampled set of feature points. As a result of computing the Hausdorff distance on smaller sets of features, our approach is faster than the classical one. We have found that our method converges toward the actual Hausdorff distance by using less than 20% of the feature points. We show the behavior of our method for several fractions of feature points used to compute Hausdorff distance. These tests let us conclude that performance is only critically degraded when the sub-sampled set has a cardinality under 15% of the total feature points in real images.
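A generic version of the sub-sampling idea is sketched below: the directed Hausdorff distance is computed on a random fraction of the model's edge points. The brute-force nearest-neighbour search and the 20% default are illustrative choices, not the paper's implementation.

```python
import numpy as np

def directed_hausdorff(A, B):
    """Max over points a in A of the distance from a to its nearest neighbour in B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # (|A|, |B|) distances
    return d.min(axis=1).max()

def monte_carlo_hausdorff(model_pts, image_pts, fraction=0.2, rng=None):
    """Approximate the Hausdorff distance using only a random fraction of the
    model's edge points, trading a little accuracy for speed."""
    rng = np.random.default_rng(rng)
    k = max(1, int(fraction * len(model_pts)))
    sample = model_pts[rng.choice(len(model_pts), size=k, replace=False)]
    return directed_hausdorff(sample, image_pts)
```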
... Then, tied-state triphones were created using decision tree clustering. The contexts considered for clustering are based on the hierarchical cluster trees of phonemes mentioned in [8]. The complete speech corpus has been used for the estimation of HMM parameters. ...
Article
Full-text available
We describe automatic visual speech segmentation using facial data captured by a stereo-vision technique. The segmentation is performed using an HMM-based forced alignment mechanism widely used in automatic speech recognition. The idea is based on the assumption that using visual speech data alone for the training might capture the uniqueness in the facial component of speech articulation, asynchrony (time lags) in visual and acoustic speech segments and significant coarticulation effects. This should provide valuable information showing the extent to which a phoneme may affect surrounding phonemes visually, and thus help in labeling the visual speech segments based on dominant coarticulatory contexts.
... Initial models are often trained off-line using limited hand-labeled data. Most models consider either facial expressions [16] or speech-related facial gestures [15], with few attempts that treat the global problem [3]. The work below presents our first effort in characterizing the DoF of the facial deformation of one speaker when involved in face-to-face conversation. ...
Article
Full-text available
In this paper we analyze the degrees of freedom (DoF) of facial movements in face-to-face conversation. We propose here a method for automatically selecting expressive frames in a large fine-grained motion capture corpus that best complement an initial shape model built using neutral speech. Using conversational data from one speaker, we extract 11 DoF that reconstruct facial deformations with an average precision of less than a millimeter. Gestural scores are then built that gather movements and discursive labels. This modeling framework offers a productive analysis of conversational speech that seeks, in the multimodal signals, the rendering of given communicative functions and linguistic events.
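One generic way to realize the frame-selection idea described above is to greedily add the corpus frames that the current shape model reconstructs worst and then refit. The sketch below assumes a plain PCA shape model and a neutral-speech seed set with more frames than retained modes; it is not the authors' exact procedure.

```python
import numpy as np

def fit_pca(frames, n_modes):
    """Fit a mean + orthonormal basis to (n_frames, dim) data."""
    mean = frames.mean(axis=0)
    _, _, Vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, Vt[:n_modes]

def reconstruction_error(frames, mean, basis):
    """Per-frame error after projecting onto the model subspace."""
    coded = (frames - mean) @ basis.T @ basis + mean
    return np.linalg.norm(frames - coded, axis=1)

def select_expressive_frames(neutral, corpus, n_modes=11, n_select=50):
    """Greedily add the corpus frames the current model explains worst,
    so the final model also covers expressive deformations."""
    selected = list(neutral)          # assumes len(neutral) > n_modes
    for _ in range(n_select):
        mean, basis = fit_pca(np.asarray(selected), n_modes)
        worst = np.argmax(reconstruction_error(corpus, mean, basis))
        selected.append(corpus[worst])
    return np.asarray(selected)
```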
... First, we recover both the six head movement parameters and the six articulatory speech parameters using the simpler model (no eyelids). As the rendered 3D model perfectly matches the recorded images, analysis by synthesis could have been used [28]. We actually used a pattern matching algorithm to track the bead displacements first. ...
Article
Full-text available
Eye gaze plays many important roles in audiovisual speech, especially in face-to-face interactions. Eyelid shapes are known to correlate with gaze direction. This correlation is perceived and should be restored when animating 3D talking heads. This paper presents a data-based construction method that models the user’s eyelid geometric deformations caused by gazing and blinking during conversation. This 3D eyelid and gaze model has been used to analyze and automatically reconstruct our German speaker’s gaze. This can potentially complement or replace infra-red based eye tracking when it is important to collect not only where the user looks but also how (ocular expressions...). This method may be used as a tool to study expressive speech and gaze patterns related to cognitive activities (speaking, listening, thinking...).
... These parameters explain 46.2, 4.6, 18.7, 3.8, 3.2, 1.6 and 1.3% of the movement variance. The analysis of geometric targets of the 5690 allophones produced by the speaker (see Figure 2) reveals confusion trees similar to previous findings (Odisio and Bailly 2004). Consequently 3 visemes are considered for vowels (grouping respectively rounded [uy ], mid-open [ieøoa ] and open vowels [aoeoe  ]) and 4 visemes for consonants (distinguishing respectively bilabials [pbm], labiodentals [fv], rounded fricatives [] from the others). ...
Article
Full-text available
A new trainable trajectory formation system, named TDA, for facial animation is proposed here that dissociates parametric spaces and methods for movement planning and execution. Movement planning is achieved by HMM-based trajectory formation. This module essentially plans configurations of lip geometry (aperture, spreading and protrusion). Movement execution is performed by concatenation of multi-represented diphones. This module is responsible for selecting and concatenating detailed facial movements that best obey the target kinematics of the geometry previously planned. Movement planning ensures that the essential visual characteristics of visemes are reached (lip closing for bilabials, rounding and opening for palatal fricatives, etc) and that appropriate coarticulation is planned. Movement execution grafts phonetic details and idiosyncratic articulatory strategies (dissymmetries, importance of jaw movements, etc) onto the planned gestural score. This planning scheme is compared to alternative planning strategies using articulatory modeling and motion capture data.
Article
The work performed during this thesis concerns visual speech synthesis in the context of humanoid animation. Our study proposes and implements control models for facial animation that generate articulatory trajectories from text. We used two audiovisual corpora in our work. First of all, we compared objectively and subjectively the main state-of-the-art models. Then, we studied the spatial aspect of the articulatory targets generated by HMM-based synthesis and by concatenation-based synthesis. We proposed a new synthesis model, named TDA (Task Dynamics for Animation), that combines the advantages of these two methods. The TDA system plans the geometric targets by HMM synthesis and executes the computed targets by concatenation of articulatory segments. Then, we studied the temporal aspect of speech synthesis and proposed a model named PHMM (Phased Hidden Markov Model). The PHMM manages the temporal relations between the different modalities related to speech. This model calculates articulatory gesture boundaries as a function of the corresponding acoustic boundaries between allophones. It has also been applied to the automatic synthesis of French Cued Speech. Finally, a subjective evaluation of the different proposed systems (concatenation, HMM, PHMM and TDA) is presented.
Article
At the MPI, multimodal research has a long history. An increasing amount of resources is created to test scientific hypotheses. This requires proper methods and technologies to manage these resources. During the last five years mature tools were developed for these purposes that guide the resources during their whole life-cycle; ELAN can be used to create accurate and complex annotations; IMDI helps the user to create useful metadata descriptions, to model the underlying relations between the resources and to search for suitable resources; LAMUS is used to upload and manage large language resource repositories and finally ANNEX and LEXUS can be used to access multimodal resources via the web.
Article
Full-text available
We present a set of acquisition studies, for a single speaker, of acoustic, aerodynamic and articulatory data, using various complementary devices. The speech production models developed from these data exploit the characteristics of these devices in a coherent way and reflect our knowledge of the peripheral mechanisms of speech production. The orofacial clone developed in this way can then be controlled from motion capture signals with high temporal resolution. The articulatory synthesis of fricative consonants, the production of calibrated data, and three-dimensional models of the speech articulators are among the major results of this work.
Book
When we speak, we configure the vocal tract which shapes the visible motions of the face and the patterning of the audible speech acoustics. Similarly, we use these visible and audible behaviors to perceive speech. This book showcases a broad range of research investigating how these two types of signals are used in spoken communication, how they interact, and how they can be used to enhance the realistic synthesis and recognition of audible and visible speech. The volume begins by addressing two important questions about human audiovisual performance: how auditory and visual signals combine to access the mental lexicon and where in the brain this and related processes take place. It then turns to the production and perception of multimodal speech and how structures are coordinated within and across the two modalities. Finally, the book presents overviews and recent developments in machine-based speech recognition and synthesis of AV speech.
Conference Paper
3D face models, thanks to the accuracy and effectiveness of recent devices and techniques for 3D object reconstruction, are extending and reinforcing traditional 2D face recognition engines. Using 3D face models allows, in particular, improving recognition robustness with respect to, e.g., non-frontal or partially occluded acquisitions or variations in the lighting conditions. We further discuss some possible applicative scenarios in the conclusions. In this paper we will describe how a set-up with a single hi-resolution camera together with one, two or more planar mirrors can be implemented to provide accurate 3D models of faces: in particular we will tackle the calibration phase of a multi-mirror environment, showing advantages with respect to a multi-camera arrangement, and we will also propose a possible reconstruction algorithm which uses a global energy-minimization approach to provide an accurate depth-map accounting for surface smoothness, non-convex regions and holes. Examples carried out with both synthetic and real data show that the proposed approach can fruitfully improve the accuracy of 2D recognition engines.
Conference Paper
While facial expressions and phoneme states are analyzed and published very well, the dynamic deformation of a face is rarely described or modeled. Only recently have dynamic facial expressions begun to be analyzed. We describe a capture system, processing steps and analysis results useful for modeling facial deformations while speaking. The capture system consists of a double mirror construction and a high speed camera, in order to capture fluid motion. Such analysis requires not only the major face features but also high accuracy of the tracked facial points. The dynamic analysis results demonstrate the potential of a reduced phoneme alphabet, because of similar 3D shape deformations. The separation of asymmetric facial motion allows a personalized deformation model to be set up, in addition to the common symmetric deformation.
Article
Full-text available
Cued Speech is a communication system that complements lip-reading with a small set of possible handshapes placed in different positions near the face. Developing a Cued Speech capable system is a time-consuming and difficult challenge. This paper focuses on how an existing bank of reference Cued Speech gestures, exhibiting natural dynamics for hand articulation and movements, could be reused for another speaker (augmenting some video or 3D talking heads). Any Cued Speech hand gesture should be recorded or considered with the concomitant facial locations that Cued Speech specifies to leverage the lip reading ambiguities (such as lip corner, chin, cheek and throat for French). These facial target points are moving along with head movements and because of speech articulation. The post-treatment algorithm proposed here will retarget synthesized hand gestures to another face, by slightly modifying the sequence of translations and rotations of the 3D hand. This algorithm preserves the co-articulation of the reference signal (including undershooting of the trajectories, as observed in fast Cued Speech) while adapting the gestures to the geometry, articulation and movements of the target face. We will illustrate how our Cued Speech capable audiovisual synthesizer – built using simultaneously recorded hand trajectories and facial articulation of a single French Cued Speech user – can be used as a reference signal for this retargeting algorithm. For the ongoing evaluation of our algorithm, an intelligibility paradigm has been retained, using natural videos for the face. The intelligibility of some video VCV sequences with composited hand gestures for Cued Speech is being measured using a panel of Cued Speech users.
Article
Full-text available
This thesis deals with the implementation of a 3D audiovisual speech synthesis system able, from a simple phonetic string, to generate a synthetic audio signal, the corresponding facial movements, and the hand movements reproducing the gestures of French Cued Speech (Langue française Parlée Complétée, LPC). We recorded, using a motion capture technique, the facial and manual movements of an LPC cuer, together with the corresponding audio signal, during the production of a corpus of 238 sentences covering all the diphones of French. After processing and analysing the data, we implemented a two-step unit-concatenation synthesis system able to generate cued speech. Finally, we evaluated our system both in terms of segmental intelligibility and in terms of comprehension. The results are promising and clearly show that the synthesized cues contribute additional information.
Article
The work carried out during this thesis concerns visual speech synthesis for the animation of a synthetic humanoid. The main objective of our study is to propose and implement control models for facial animation that can generate articulatory trajectories from text. To this end, we worked on two audiovisual corpora. First, we compared objectively and subjectively the main existing state-of-the-art models. We then studied the spatial aspect of the realized articulatory targets for HMM (Hidden Markov Model) based synthesis and for simple concatenation-based synthesis. We combined the advantages of the two methods by proposing a new synthesis model called TDA (Task Dynamics for Animation). This model plans the geometric targets through HMM synthesis and executes the articulatory targets thus generated through concatenation synthesis. We then studied the temporal aspect of speech synthesis and proposed a second synthesis model called PHMM (Phased Hidden Markov Model) to manage the different modalities related to speech. The PHMM model computes the offsets of the articulatory gesture boundaries with respect to the acoustic boundaries of the allophones. This model was also applied to the automatic synthesis of French Cued Speech (Langage Parlé Complété, LPC). Finally, we carried out a subjective evaluation of the different visual synthesis methods studied (concatenation, HMM, PHMM and TDA).
Article
Full-text available
In this paper we present efforts for characterizing the three dimensional (3-D) movements of the right hand and the face of a French female speaker during the audiovisual production of cued speech. The 3-D trajectories of 50 hand and 63 facial flesh points during the production of 238 utterances were analyzed. These utterances were carefully designed to cover all possible diphones of the French language. Linear and nonlinear statistical models of the articulations and the postures of the hand and the face have been developed using separate and joint corpora. Automatic recognition of hand and face postures at targets was performed to verify a posteriori that key hand movements and postures imposed by cued speech had been well realized by the subject. Recognition results were further exploited in order to study the phonetic structure of cued speech, notably the phasing relations between hand gestures and sound production. The hand and face gestural scores are studied in reference with the acoustic segmentation. A first implementation of a concatenative audiovisual text-to-cued speech synthesis system is finally described that employs this unique and extensive data on cued speech in action.
Conference Paper
Full-text available
We present a system that can recover and track the 3D speech movements of a speaker's face for each image of a monocular sequence. A speaker-specific face model is used for tracking: model parameters are extracted from each image by an analysis-by-synthesis loop. To handle both the individual specificities of the speaker's articulation and the complexity of the facial deformations during speech, speaker-specific models of the face 3D geometry and appearance are built from real data. The geometric model is linearly controlled by only six articulatory parameters. Appearance is seen either as a classical texture map or through local appearance of a relevant subset of 3D points. We compare several appearance models: they are either constant or depend linearly on the articulatory parameters. We evaluate these different appearance models with ground truth data.
Article
Full-text available
This paper presents the main approaches used to synthesize talking faces, and provides greater detail on a handful of these approaches. An attempt is made to distinguish between facial synthesis itself (i.e. the manner in which facial movements are rendered on a computer screen), and the way these movements may be controlled and predicted using phonetic input. The two main synthesis techniques (model-based vs. image-based) are contrasted and presented by a brief description of the most illustrative existing systems. The challenging issues—evaluation, data acquisition and modeling—that may drive future models are also discussed and illustrated by our current work at ICP.
Article
Full-text available
We present here our efforts for implementing a system able to synthesize French Manual Cued Speech (FMCS). We recorded and analyzed the 3D trajectories of 50 hand and 63 facial flesh points during the production of 238 utterances carefully designed for covering all possible diphones of the French language. Linear and nonlinear statistical models of the hand and face deformations and postures have been developed using both separate and joint corpora. We create 2 separate dictionaries, one containing diphones and another one containing "dikeys". Using these 2 dictionaries, we implement a complete text-to-cued speech synthesis system by concatenation of diphones and dikeys.
Article
Full-text available
We present a linear three-dimensional modeling paradigm for lips and face, that captures the audiovisual speech activity of a given speaker by only six parameters. Our articulatory models are constructed from real data (front and profile images), using a linear component analysis of about 200 3D coordinates of fleshpoints on the subject's face and lips. Compared to a raw component analysis, our construction approach leads to somewhat more comparable relations across subjects: by construction, the six parameters have a clear phonetic/articulatory interpretation. We use such a speaker's specific articulatory model to regularize MPEG-4 facial articulation parameters (FAP) and show that this regularization process can drastically reduce bandwidth, noise and quantization artifacts. We then present how analysis-by-synthesis techniques using the speaker-specific model allows the tracking of facial movements. Finally, the results of this tracking scheme have been used to develop a text-to-audiovisual speech system.
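The regularization of noisy or quantized FAP streams by the speaker-specific model, as described above, can be pictured as a projection onto the model's low-dimensional subspace followed by re-synthesis. A sketch under that assumption (the six-column basis and the variable names are illustrative, not the paper's code):

```python
import numpy as np

def regularize_with_model(measured, mean, basis):
    """Project noisy fleshpoint coordinates onto the span of the speaker's
    articulatory model (mean + basis @ p) and re-synthesize them.

    measured : (3N,) noisy coordinates (e.g. decoded from quantized FAPs)
    mean     : (3N,) model mean shape
    basis    : (3N, 6) articulatory modes of the speaker-specific model
    """
    p, *_ = np.linalg.lstsq(basis, measured - mean, rcond=None)
    return mean + basis @ p, p   # cleaned coordinates and the 6 parameters
```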
Article
Full-text available
The Synface project is developing a synthetic talking face to aid the hearing-impaired in telephone conversation. This report investigates the gain in intelligibility from the synthetic talking head when controlled by hand-annotated speech in both 12 normal hearing (NH) and 13 hearing-impaired (HI) listeners (average hearing loss 86 dB). For NH listeners, audio from everyday sentences was degraded to simulate the information losses that arise in severe-to-profound hearing impairment. For the HI group, audio was filtered to simulate telephone speech. Auditory signals were presented alone, with the synthetic face, and with a video of the original talker. Purely auditory intelligibility was low for the NH group. With the addition of the synthetic face, average intelligibility increased by 22%. The HI group had a large variation in intelligibility in the purely auditory condition. They showed a 22% improvement with the addition of the synthetic face. For both groups, intelligibility with the synthetic face was significantly lower than with the natural face. However, the improvement with the synthetic face is sufficient to be useful in everyday communication. Questionnaire responses from the HI group indicated strong interest in the Synface system.
Conference Paper
Full-text available
We have already presented a system that can track the 3D speech movements of a speaker's face in a monocular video sequence. For that purpose, speaker-specific models of the face have been built, including a 3D shape model and several appearance models. In this paper, speech movements estimated using this system are perceptually evaluated. These movements are re-synthesised using a Point-Light (PL) rendering. They are paired with original audio signals degraded with white noise at several SNRs. We study how much such PL movements enhance the identification of logatoms, and also to what extent they influence the perception of incongruent audio-visual logatoms. In a first experiment, the PL rendering is evaluated per se. Results seem to confirm other previous studies: though less efficient than actual video, PL speech enhances intelligibility and can reproduce the McGurk effect. In the second experiment, the movements have been estimated with our tracking framework with various appearance models. No salient differences are revealed between the performances of the appearance models.
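The audio degradation used in such tests, white noise added at a prescribed SNR, can be reproduced with the standard recipe below (a generic sketch, not the authors' exact processing):

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Return signal + white Gaussian noise scaled to the requested SNR in dB."""
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(len(signal))
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))   # target noise power
    return signal + noise * np.sqrt(p_noise / np.mean(noise ** 2))
```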
Conference Paper
Full-text available
This paper describes an extension of a technique for the recognition and tracking of every day objects in cluttered scenes. The goal is to build a system in which ordinary desktop objects serve as physical icons in a vision based system for man-machine interaction. In such a system, the manipulation of objects replaces user commands. A view-variant recognition technique, developed by the second author, has been adapted by the first author for a problem of recognising and tracking objects on a cluttered background in the presence of occlusions. This method is based on sampling a local appearance function at discrete viewpoints by projecting it onto a vector of receptive fields which have been normalised to local scale and orientation. This paper reports on the experimental validation of the approach, and of its extension to the use of receptive fields based on colour. The experimental results indicate that the second author’s technique does indeed provide a method for building a fast and robust recognition technique. Furthermore, the extension to coloured receptive fields provides a greater degree of local discrimination and an enhanced robustness to variable background conditions. The approach is suitable for the recognition of general objects as physical icons in an augmented reality.
Conference Paper
Full-text available
We have created a system for capturing both the three-dimensional geometry and color and shading information for human facial expressions. We use this data to reconstruct photorealistic, 3D animations of the captured expressions. The system uses a large set of sampling points on the face to accurately track the three dimensional deformations of the face. Simultaneously with the tracking of the geometric data, we capture multiple high resolution, registered video images of the face. These images are used to create a texture map sequence for a three dimensional polygonal face model which can then be rendered on standard 3D graphics hardware. The resulting facial animation is surprisingly life-like and looks very much like the original live performance. Separating the capture of the geometry from the texture images eliminates much of the variance in the image data due to motion, which increases compression ratios. Although the primary emphasis of our work is not compression we have investigated the use of a novel method to compress the geometric data based on principal components analysis. The texture sequence is compressed using an MPEG4 video codec. Animations reconstructed from 512x512 pixel textures look good at data rates as low as 240 Kbits per second.
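The principal-components compression of the geometric data mentioned above can be sketched as follows: per-frame vertex vectors are projected onto a truncated basis and only the coefficients are stored. This is a generic PCA codec, not the paper's pipeline.

```python
import numpy as np

def pca_codec(frames, n_components):
    """frames: (T, 3N) vertex coordinates per frame. Returns an encoder and a
    decoder that keep only n_components coefficients per frame."""
    mean = frames.mean(axis=0)
    _, _, Vt = np.linalg.svd(frames - mean, full_matrices=False)
    basis = Vt[:n_components]                       # (k, 3N) retained modes
    encode = lambda X: (X - mean) @ basis.T         # (T, k) coefficients
    decode = lambda C: C @ basis + mean             # (T, 3N) reconstruction
    return encode, decode
```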
Article
Full-text available
With many visual speech animation techniques now available, there is a clear need for systematic perceptual evaluation schemes. We describe here our scheme and its application to a new video-realistic (potentially indistinguishable from real recorded video) visual-speech animation system, called Mary 101. Two types of experiments were performed: a) distinguishing visually between real and synthetic image-sequences of the same utterances ("Turing tests") and b) gauging visual speech recognition by comparing lip-reading performance of the real and synthetic image-sequences of the same utterances ("Intelligibility tests"). Subjects that were presented randomly with either real or synthetic image-sequences could not tell the synthetic from the real sequences above chance level. The same subjects, when asked to lip-read the utterances from the same image-sequences, recognized speech from real image-sequences significantly better than from synthetic ones. However, performance for both real and synthetic sequences was at levels suggested in the literature on lip-reading. We conclude from the two experiments that the animation of Mary 101 is adequate for providing a percept of a talking head. However, additional effort is required to improve the animation for lip-reading purposes like rehabilitation and language learning. In addition, these two tasks could be considered as explicit and implicit perceptual discrimination tasks. In the explicit task (a), each stimulus is classified directly as a synthetic or real image-sequence by detecting a possible difference between the synthetic and the real image-sequences. The implicit perceptual discrimination task (b) consists of a comparison between visual recognition of speech of real and synthetic image-sequences. Our results suggest that implicit perceptual discrimination is a more sensitive method for discrimination between synthetic and real image-sequences than explicit perceptual discrimination.
Article
Full-text available
Seeing a talker's face can improve the perception of speech in noise. There is little known about which characteristics of the face are useful for enhancing the degraded signal. In this study, a point-light technique was employed to help isolate the salient kinematic aspects of a visible articulating face. In this technique, fluorescent dots were arranged on the lips, teeth, tongue, cheeks, and jaw of an actor. The actor was videotaped speaking in the dark, so that when shown to observers, only the moving dots were seen. To test whether these reduced images could contribute to the perception of degraded speech, noise-embedded sentences were dubbed with the point-light images at various signal-to-noise ratios. It was found that these images could significantly improve comprehension for adults with normal hearing and that the images became more effective as participants gained experience with the stimuli. These results have implications for uncovering salient visual speech information as well as in the development of telecommunication systems for listeners who are hearing impaired.
Article
Full-text available
Neuropsychological research suggests that the neural system underlying visible speech on the basis of kinematics is distinct from the system underlying visible speech of static images of the face and identifying whole-body actions from kinematics alone. Functional magnetic resonance imaging was used to identify the neural systems underlying point-light visible speech, as well as perception of a walking/jumping point-light body, to determine if they are independent. Although both point-light stimuli produced overlapping activation in the right middle occipital gyrus encompassing area KO and the right inferior temporal gyrus, they also activated distinct areas. Perception of walking biological motion activated a medial occipital area along the lingual gyrus close to the cuneus border, and the ventromedial frontal cortex, neither of which was activated by visible speech biological motion. In contrast, perception of visible speech biological motion activated right V5 and a network of motor-related areas (Broca's area, PM, M1, and supplementary motor area (SMA)), none of which were activated by walking biological motion. Many of the areas activated by seeing visible speech biological motion are similar to those activated while speech-reading from an actual face, with the exception of M1 and medial SMA. The motor-related areas found to be active during point-light visible speech are consistent with recent work characterizing the human "mirror" system (Rizzolatti, Fadiga, Gallese, & Fogassi, 1996).
Conference Paper
Full-text available
We describe a comparative evaluation of different movement generation systems capable of computing articulatory trajectories from phonetic input. The articulatory trajectories here pilot the facial deformation of a 3D clone of a human female speaker. We test the adequacy of the predicted trajectories in accompanying the production of natural utterances. The performance of these predictions are compared to the ones of natural articulatory trajectories produced by the speaker and estimated by an original video-based motion capture technique. The test uses the point-light technique (Rosenblum, L.D. and Saldana, H.M., 1996; 1998).
Conference Paper
Full-text available
This paper describes a new approach for speaker identification based on lipreading. Visual features are extracted from image sequences of the talking face and consist of shape parameters which describe the lip boundary and intensity parameters which describe the grey-level distribution of the mouth area. Intensity information is based on principal component analysis using eigenspaces which deform with the shape model. The extracted parameters account for both speech-dependent and speaker-dependent information. We built spatio-temporal speaker models based on these features, using HMMs with mixtures of Gaussians. Promising results were obtained for text-dependent and text-independent speaker identification tests performed on a small video database.
Article
Full-text available
In this paper we present a method for the estimation of three-dimensional motion from 2-D image sequences showing head and shoulder scenes typical for video telephone and tele-conferencing applications. We use a 3-D model that specifies the color and shape of the person in the video. Additionally, the model constrains the motion and deformation in the face to a set of facial expressions which are represented by the facial animation parameters (FAPs) defined by the MPEG-4 standard. Using this model, we obtain a description of both global and local 3-D head motion as a function of the unknown facial parameters. Combining the 3-D information with the optical flow constraint leads to a robust and linear algorithm that estimates the facial animation parameters from two successive frames with low computational complexity. To overcome the restriction of small object motion, which is common to optical flow based approaches, we use a multi-resolution framework. Experimental results on synthetic and real...
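The key step described above, combining the optical-flow constraint with a model that is linear in the facial animation parameters, leads to an ordinary least-squares problem. A schematic sketch, assuming the image-plane motion of each pixel is already expressed as a linear function of the parameters (an assumption about the interface, not the paper's code):

```python
import numpy as np

def estimate_faps(grad_ix, grad_iy, i_t, B):
    """Least-squares FAP-style estimate from the linearized optical-flow
    constraint  Ix*u + Iy*v + It = 0,  with per-pixel motion (u, v) = B_i @ p.

    grad_ix, grad_iy, i_t : (M,) spatial and temporal image derivatives
    B                     : (M, 2, P) linear map from parameters to pixel motion
    """
    # Row i of A is  Ix_i * B_i[0] + Iy_i * B_i[1];  right-hand side is -It
    A = grad_ix[:, None] * B[:, 0, :] + grad_iy[:, None] * B[:, 1, :]
    p, *_ = np.linalg.lstsq(A, -i_t, rcond=None)
    return p
```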
Article
Full-text available
A technique for 3D head tracking under varying illumination is proposed. The head is modeled as a texture mapped cylinder. Tracking is formulated as an image registration problem in the cylinder's texture map image. The resulting dynamic texture map provides a stabilized view of the face that can be used as input to many existing 2D techniques for face recognition, facial expressions analysis, lip reading, and eye tracking. To solve the registration problem with lighting variation and head motion, the residual registration error is modeled as a linear combination of texture warping templates and orthogonal illumination templates. Fast stable online tracking is achieved via regularized weighted least-squares error minimization. The regularization tends to limit potential ambiguities that arise in the warping and illumination templates. It enables stable tracking over extended sequences. Tracking does not require a precise initial model fit; the system is initialized automatically using a simple 2D face detector. It is assumed that the target is facing the camera in the first frame. The formulation uses texture mapping hardware. The nonoptimized implementation runs at about 15 frames per second on a SGI O2 graphic workstation. Extensive experiments evaluating the effectiveness of the formulation are reported. The sensitivity of the technique to illumination, regularization parameters, errors in the initial positioning, and internal camera parameters are analyzed. Examples and applications of tracking are reported
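The core numerical step described above, a regularized weighted least-squares fit of warping and illumination templates to the residual registration error, can be sketched generically as a Tikhonov-regularized solve. Template construction and the weighting scheme are outside this snippet, and the names are illustrative.

```python
import numpy as np

def regularized_wls(templates, residual, weights, lam):
    """Solve  min_c || W^(1/2) (templates @ c - residual) ||^2 + lam * ||c||^2.

    templates : (M, K) warping and illumination templates stacked column-wise
    residual  : (M,)   flattened registration error image
    weights   : (M,)   per-pixel confidence weights
    lam       : scalar Tikhonov regularization strength
    """
    A = templates.T @ (weights[:, None] * templates) + lam * np.eye(templates.shape[1])
    b = templates.T @ (weights * residual)
    return np.linalg.solve(A, b)   # coefficients of the templates
```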
Article
Full-text available
An approach to estimating the motion of the head and facial expressions in model-based facial image coding is presented. An affine nonrigid motion model is set up. The specific knowledge about facial shape and facial expression is formulated in this model in the form of parameters. A direct method of estimating the two-view motion parameters that is based on the affine method is discussed. Based on the reasonable assumption that the 3-D motion of the face is almost smooth in the time domain, several approaches to predicting the motion of the next frame are proposed. Using a 3-D model, the approach is characterized by a feedback loop connecting computer vision and computer graphics. Embedding the synthesis techniques into the analysis phase greatly improves the performance of motion estimation. Simulations with long image sequences of real-world scenes indicate that the method not only greatly reduces computational complexity but also substantially improves estimation accuracy
Article
Full-text available
Optical flow provides a constraint on the motion of a deformable model. We derive and solve a dynamic system incorporating flow as a hard constraint, producing a model-based least-squares optical flow solution. Our solution also ensures the constraint remains satisfied when combined with edge information, which helps combat tracking error accumulation. Constraint enforcement can be relaxed using a Kalman filter, which permits controlled constraint violations based on the noise present in the optical flow information, and enables optical flow and edge information to be combined more robustly and efficiently. We apply this framework to the estimation of face shape and motion using a 3D deformable face model. This model uses a small number of parameters to describe a rich variety of face shapes and facial expressions. We present experiments in extracting the shape and motion of a face from image sequences which validate the accuracy of the method. They also demonstrate that our t...
Conference Paper
Full-text available
In many face recognition tasks the pose and illumination conditions of the probe and gallery images are different. In other cases multiple gallery or probe images may be available, each captured from a different pose and under a different illumination. We propose a face recognition algorithm which can use any number of gallery images per subject captured at arbitrary poses and under arbitrary illumination, and any number of probe images, again captured at arbitrary poses and under arbitrary illumination. The algorithm operates by estimating the Fisher light-field of the subject's head from the input gallery or probe images. Matching between the probe and gallery is then performed using the Fisher light-fields.
Article
Full-text available
This paper describes an extension of a technique for the recognition and tracking of every day objects in cluttered scenes. The goal is to build a system in which ordinary desktop objects serve as physical icons in a vision based system for man-machine interaction. In such a system, the manipulation of objects replaces user commands.
Article
Full-text available
The fact that objects in the world appear in different ways depending on the scale of observation has important implications if one aims at describing them. It shows that the notion of scale is of utmost importance when processing unknown measurement data by automatic methods. In their seminal works, Witkin (1983) and Koenderink (1984) proposed to approach this problem by representing image structures at different scales in a so-called scale-space representation. Traditional scale-space theory building on this work, however, does not address the problem of how to select local appropriate scales for further analysis. This article proposes a systematic approach for dealing with this problem---a heuristic principle is presented stating that local extrema over scales of different combinations of gamma-normalized derivatives are likely candidates to correspond to interesting structures. Specifically, it is proposed that this idea can be used as a major mechanism in algorithms for automatic scale selection, which adapt the local scales of processing to the local image structure. Support is given in terms of a general theoretical investigation of the behaviour of the scale selection method under rescalings of the input pattern and by experiments on real-world and synthetic data. Support is also given by a detailed analysis of how different types of feature detectors perform when integrated with a scale selection mechanism and then applied to characteristic model patterns. Specifically, it is described in detail how the proposed methodology applies to the problems of blob detection, junction detection, edge detection, ridge detection and local frequency estimation.
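For blob detection, the scale-selection principle described above reduces to finding extrema over space and scale of the scale-normalized Laplacian t*(Lxx + Lyy) with t = sigma^2. A minimal sketch (assuming SciPy's Gaussian-Laplace filter is available; the threshold and scale list are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def scale_space_blobs(image, sigmas, threshold=0.05):
    """Detect blobs as local extrema (over space and scale) of the
    scale-normalized Laplacian  sigma^2 * (Lxx + Lyy)."""
    stack = np.stack([(s ** 2) * gaussian_laplace(image.astype(float), s)
                      for s in sigmas])
    blobs = []
    for k in range(1, len(sigmas) - 1):
        for y in range(1, image.shape[0] - 1):
            for x in range(1, image.shape[1] - 1):
                patch = stack[k - 1:k + 2, y - 1:y + 2, x - 1:x + 2]
                v = stack[k, y, x]
                if abs(v) > threshold and (v == patch.max() or v == patch.min()):
                    blobs.append((y, x, sigmas[k]))   # position and selected scale
    return blobs
```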
Article
We present a new set of techniques for modeling and animating realistic faces from photographs and videos. Given a set of face photographs taken simultaneously, our modeling technique allows the interactive recovery of a textured 3D face model. By repeating this process for several facial expressions, we acquire a set of face models that can be linearly combined to express a wide range of expressions. Given a video sequence, this linear face model can be used to estimate the face position, orientation, and facial expression at each frame. We illustrate these techniques on several datasets and demonstrate robust estimations of detailed face geometry and motion.
Article
We report here on an experiment comparing visual recognition of monosyllabic words produced either by our computer-animated talker or a human talker. Recognition of the synthetic talker is reasonably close to that of the human talker, but a significant distance remains to be covered and we discuss improvements to the synthetic phoneme specifications. In an additional experiment using the same paradigm, we compare perception of our animated talker with a similarly generated point-light display, finding significantly worse performance for the latter for a number of viseme classes. We conclude with some ideas for future progress and briefly describe our new animated tongue.
Article
"Oral speech intelligibility tests were conducted with, and without, supplementary visual observation of the speaker's facial and lip movements. The difference between these two conditions was examined as a function of the speech-to-noise ratio and of the size of the vocabulary under test. The visual contribution to oral speech intelligibility (relative to its possible contribution) is, to a first approximation, independent of the speech-to-noise ratio under test. However, since there is a much greater opportunity for the visual contribution at low speech-to-noise ratios, its absolute contribution can be exploited most profitably under these conditions." (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
In this paper we describe how 2D appearance models can be applied to the problem of creating a near-videorealistic talking head. A speech corpus of a talker uttering a set of phonetically balanced training sentences is analysed using a generative model of the human face. Segments of original parameter trajectories corresponding to the synthesis unit are extracted from a codebook, normalised, blended, concatenated and smoothed before being applied to the model to give natural, realistic animations of novel utterances. We also present some early results of subjective tests conducted to determine the realism of the synthesiser.
Chapter
Contents: Introduction; Motion Units – The Visual Representation; MUPs and MPEG-4 FAPs; Real-Time Audio-to-MUP Mapping; Experimental Results; The iFACE System; Conclusion; References
Article
This paper reports the first phase of a research program on visual perception of motion patterns characteristic of living organisms in locomotion. Such motion patterns in animals and men are termed here as biological motion. They are characterized by a far higher degree of complexity than the patterns of simple mechanical motions usually studied in our laboratories. In everyday perceptions, the visual information from biological motion and from the corresponding figurative contour patterns (the shape of the body) are intermingled. A method for studying information from the motion pattern per se without interference with the form aspect was devised. In short, the motion of the living body was represented by a few bright spots describing the motions of the main joints. It is found that 10–12 such elements in adequate motion combinations in proximal stimulus evoke a compelling impression of human walking, running, dancing, etc. The kinetic-geometric model for visual vector analysis originally developed in the study of perception of motion combinations of the mechanical type was applied to these biological motion patterns. The validity of this model in the present context was experimentally tested and the results turned out to be highly positive.
Article
We address the problem of automatically placing landmarks across an image sequence to define correspondences between frames. The marked-up sequence is then used to build a statistical model of the appearance of the object within the sequence. We locate the most salient object features from within the first frame and attempt to track them throughout the sequence. Salient features are those which have a low probability of being misclassified as any other feature, and are therefore more likely to be robustly tracked throughout the sequence. The method automatically builds statistical models of the object's shape and the salient features' appearance as the sequence is tracked. These models are used in subsequent frames to further improve the probability of finding accurate matches. Results are shown for several face image sequences. The quality of the model is comparable with that generated from hand-labelled images.
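The saliency criterion can be caricatured as follows: this toy fragment simply keeps the candidate patches that are farthest from their most similar competitor, which is only a stand-in for the misclassification-probability measure used in the article.

```python
# Toy saliency sketch: prefer candidate patches that are least likely to be
# confused with any other patch in the first frame (approximated here by a
# large distance to the most similar other patch).
import numpy as np

def select_salient_patches(patches, keep=20):
    X = np.stack([p.ravel().astype(float) for p in patches])
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    distinctiveness = d.min(axis=1)              # distance to the nearest other patch
    return np.argsort(distinctiveness)[-keep:]   # indices of the most distinctive patches
```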
Article
In this study, previous articulatory midsagittal models of tongue and lips are extended to full three-dimensional models. The geometry of these vocal organs is measured on one subject uttering a corpus of sustained articulations in French. The 3D data are obtained from magnetic resonance imaging of the tongue, and from front and profile video images of the subject's face marked with small beads. The degrees of freedom of the articulators, i.e., the uncorrelated linear components needed to represent the 3D coordinates of these articulators, are extracted by linear component analysis from these data. In addition to a common jaw height parameter, the tongue is controlled by four parameters while the lips and face are also driven by four parameters. These parameters are for the most part extracted from the midsagittal contours, and are clearly interpretable in phonetic/biomechanical terms. This implies that most 3D features such as tongue groove or lateral channels can be controlled by articulatory parameters defined for the midsagittal model. Similarly, the 3D geometry of the lips is determined by parameters such as lip protrusion or aperture, that can be measured from a profile view of the face.
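The extraction of uncorrelated linear components can be sketched with a plain PCA, as below; the guided analysis used in the article (for example, removing the common jaw contribution first) is omitted, and the variable names are illustrative.

```python
# Simplified sketch: plain PCA via SVD on stacked, flattened 3D coordinates.
import numpy as np

def linear_components(coords, n_components=4):
    """coords: (n_articulations, 3 * n_points) flattened 3D coordinates."""
    mean = coords.mean(axis=0)
    U, S, Vt = np.linalg.svd(coords - mean, full_matrices=False)
    basis = Vt[:n_components]            # uncorrelated linear components (rows)
    params = (coords - mean) @ basis.T   # articulatory-like control parameters
    return mean, basis, params
```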
Article
Tracking human lips in video is an important but notoriously difficult task. To accurately recover their motions in 3D from any head pose is an even more challenging task, though still necessary for natural interactions. Our approach is to build and train 3D models of lip motion to make up for the information we cannot always observe when tracking. We use physical models as a prior and combine them with statistical models, showing how the two can be smoothly and naturally integrated into a synthesis method and a MAP estimation framework for tracking. We have found that this approach allows us to accurately and robustly track and synthesize the 3D shape of the lips from arbitrary head poses in a 2D video stream. We demonstrate this with numerical results on reconstruction accuracy, examples of static fits, and audio-visual sequences.
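In generic terms, and under a Gaussian assumption that the paper's exact formulation need not share, the MAP tracking objective combines an image likelihood with the learned motion prior:

```latex
\hat{\theta}_t
= \arg\max_{\theta}\; p(I_t \mid \theta)\, p(\theta \mid \theta_{t-1})
= \arg\min_{\theta}\; \frac{1}{\sigma^{2}} \bigl\lVert I_t - \mathcal{S}(\theta) \bigr\rVert^{2}
  + (\theta - \theta_{t-1})^{\top} \Sigma^{-1} (\theta - \theta_{t-1})
```

where \(\mathcal{S}(\theta)\) denotes the synthesized lip appearance for shape parameters \(\theta\), and \(\Sigma\) is the covariance of the statistical motion prior.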
Article
The main aim of this article is the description of an original approach which allows the efficient tracking of two-dimensional (2D) patterns in image sequences. This approach consists of two stages. An off-line stage is devoted to the computation of an interaction matrix linking the aspect variation of a pattern with its displacement in the image. In a second stage, this matrix is used to track the pattern by the following process. First, the position, the scale and the orientation of the pattern are predicted. Then, the difference between the pattern observed at the predicted position and the reference pattern (which is to be tracked) is computed. Finally, this difference multiplied by the interaction matrix gives a correction to apply to the prediction. Computing this correction has a very low computational cost, allowing a real-time implementation of the algorithm. From these principles, we have developed a real-time three-dimensional object tracker; objects are modeled by sets of 2D appearances. We show that these algorithms are resistant to occlusions and changes of illumination.
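The two stages can be sketched as follows. The `render` function, which samples the image with a window at given position/scale/orientation parameters, is hypothetical, and the sign convention assumes the learning perturbations displace the sampling window around a fixed reference pattern.

```python
# Sketch of interaction-matrix tracking under simplifying assumptions.
import numpy as np

def learn_interaction_matrix(render, p_ref, perturbations):
    """Off-line stage: relate image differences to parameter displacements.
    `render(p)` samples the reference frame with window parameters p."""
    i_ref = render(p_ref)
    dI = np.stack([render(p_ref + dp) - i_ref for dp in perturbations])  # (K, M)
    dP = np.stack(perturbations)                                         # (K, D)
    # solve dI @ X = dP in the least-squares sense; A maps an image difference
    # to a parameter correction
    X, *_ = np.linalg.lstsq(dI, dP, rcond=None)
    return i_ref, X.T                                                    # A: (D, M)

def track_step(render, i_ref, A, p_pred):
    """On-line stage: one cheap correction of the predicted parameters.
    Here `render(p)` samples the current frame with window parameters p."""
    diff = render(p_pred) - i_ref
    return p_pred - A @ diff
```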
Conference Paper
This paper presents a novel algorithm aiming at analysis and identification of faces viewed from different poses and illumination conditions. Face analysis from a single image is performed by recovering the shape and texture parameters of a 3D Morphable Model in an analysis-by-synthesis fashion. The shape parameters are computed from a shape error estimated by optical flow and the texture parameters are obtained from a texture error. The algorithm uses linear equations to recover the shape and texture parameters irrespective of pose and lighting conditions of the face image. Identification experiments are reported on more than 5000 images from the publicly available CMU-PIE database, which includes faces viewed from 13 different poses and under 22 different illuminations. Extensive identification results are available on our web page for future comparison with novel algorithms.
Article
The MPEG4 standard supports the transmission and composition of facial animation with natural video by including a facial animation parameter (FAP) set that is defined based on the study of minimal facial actions and is closely related to muscle actions. The FAP set enables model-based representation of natural or synthetic talking head sequences and allows intelligible visual reproduction of facial expressions, emotions, and speech pronunciations at the receiver. This paper describes two key components we have developed for building a model-based video coding system: (1) a method for estimating FAP parameters based on our previously proposed piecewise Bézier volume deformation model (PBVD), and (2) various methods for encoding FAP parameters. PBVD is a linear deformation model suitable for both the synthesis and the analysis of facial images. Each FAP parameter is a basis function in this model. Experimental results on PBVD-based animation, model-based tracking, and spatial-temporal compression of FAP parameters are demonstrated in this paper.
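The linear character of the deformation model can be illustrated as below; the displacement bases of the real system come from the piecewise Bezier volumes and are not reconstructed here, so `bases` is a placeholder.

```python
# Illustrative linear-deformation sketch: each FAP acts as one basis function.
import numpy as np

def deform_mesh(rest_vertices, bases, fap_values):
    """rest_vertices: (N, 3) neutral mesh; bases: (K, N, 3) displacement field per FAP;
    fap_values: (K,) FAP parameter values for the current frame."""
    displacement = np.tensordot(fap_values, bases, axes=1)  # weighted sum -> (N, 3)
    return rest_vertices + displacement
```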
Article
We have undertaken experiments on 190 subjects in order to explore the benefits of facial animation (FA). Part of the experiment was aimed at the objective benefits, i.e., whether FA can help users perform certain tasks better; the other part was aimed at the subjective benefits. At the same time, a comparison of different FA techniques was undertaken. We present the experiment design and the results. The results show that FA aids users in understanding spoken text in noisy conditions; that it can effectively make waiting times more acceptable to the user; and that it makes services more attractive to users, particularly when they directly compare the same service with and without FA.
Article
Thesis (Ph.D.), University of Washington, 1999. This dissertation addresses the problems of modeling and animating realistic faces. The general approach followed is to extract information from face photographs and videos by applying image-based modeling and rendering techniques. Given a set of face photographs taken simultaneously, our modeling technique permits us to interactively recover a 3D face model. We present animation techniques based on morphing between face models corresponding to different expressions. We demonstrate that a wide range of expressions can be generated by forming linear combinations of a small set of initial expressions. Given a video sequence, this linear face model can be used to estimate the face position, orientation, and facial expression at each frame. The thesis also explores different applications of face tracking, such as performance-driven animation and alterations of the original footage.
Article
This work presents a methodology for 3D modeling of lip motion in speech production and its application to lip tracking and visual speech animation. First, geometric modeling allows a 3D lip model to be created from 30 control points for any lip shape. Second, a statistical analysis, performed on a set of 10 key shapes, generates a lip gesture coding with three articulatory-oriented parameters, specific to one speaker. The choice of the key shapes is based on general phonetic observations. Finally, the application of the 3D model, controlled by the three parameters, to lip tracking is presented and evaluated.
Article
Visual recognition of consonants was studied in 31 hearing-impaired adults before and after 14 hours of concentrated, individualized speechreading training. Confusions were analyzed via a hierarchical clustering technique to derive categories of visual contrast among the consonants. Pretraining and posttraining results were compared to reveal the effects of the training program. Training caused an increase in the number of visemes consistently recognized and an increase in the percentage of within-viseme responses. Analysis of the responses made revealed that most changes in consonant recognition occurred during the first few hours of training.
Article
Isolated kinematic properties of visible speech can provide information for lip reading. Kinematic facial information is isolated by darkening an actor's face and attaching dots to various articulators so that only moving dots can be seen with no facial features present. To test the salience of these images, the authors conducted experiments to determine whether the images could visually influence the perception of discrepant auditory syllables. Results showed that these images can influence auditory speech independently of the participants' knowledge of the stimuli. In other experiments, single frozen frames of visible syllables were presented with discrepant auditory syllables to test the salience of static facial features. Although the influence of the kinematic stimuli was perceptual, any influence of the static featural stimuli was likely based on participants' misunderstanding or postperceptual response bias.
Article
The efficacy of audio-visual interactions in speech perception comes from two kinds of factors. First, at the information level, there is some "complementarity" of audition and vision: It seems that some speech features, mainly concerned with manner of articulation, are best transmitted by the audio channel, while some other features, mostly describing place of articulation, are best transmitted by the video channel. Second, at the information processing level, there is some "synergy" between audition and vision: The audio-visual global identification scores in a number of different tasks involving acoustic noise are generally greater than both the auditory-alone and the visual-alone scores. However, these two properties have been generally demonstrated until now in rather global terms. In the present work, audio-visual interactions at the feature level are studied for French oral vowels which contrast three series, namely front unrounded, front rounded, and back rounded vowels. A set of experiments on the auditory, visual, and audio-visual identification of vowels embedded in various amounts of noise demonstrate that complementarity and synergy in bimodal speech appear to hold for a bundle of individual phonetic features describing place contrasts in oral vowels. At the information level (complementarity), in the audio channel the height feature is the most robust, backness the second most robust one, and rounding the least, while in the video channel rounding is better than height, and backness is almost invisible. At the information processing (synergy) level, transmitted information scores show that all individual features are better transmitted with the ear and the eye together than with each sensor individually.
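The transmitted-information scores referred to here are, in the usual Miller and Nicely sense, the mutual information computed from the stimulus-response confusion matrix, often reported relative to the stimulus entropy; a generic form (not tied to this study's exact normalisation) is:

```latex
T(S;R) = \sum_{s,r} p(s,r)\,\log_{2} \frac{p(s,r)}{p(s)\,p(r)},
\qquad
T_{\mathrm{rel}} = \frac{T(S;R)}{H(S)},
\qquad
H(S) = -\sum_{s} p(s)\,\log_{2} p(s)
```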
Article
Modeling the peripheral speech motor system can advance the understanding of speech motor control and audiovisual speech perception. A 3-D physical model of the human face is presented. The model represents the soft tissue biomechanics with a multilayer deformable mesh. The mesh is controlled by a set of modeled facial muscles that use a standard Hill-type representation of muscle dynamics. In a test of the model, recorded intramuscular electromyography (EMG) was used to activate the modeled muscles and the kinematics of the mesh was compared with 3-D kinematics recorded with OPTOTRAK. Overall, there was a good match between the recorded data and the model's movements. Animations of the model are provided as MPEG movies.
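A standard Hill-type formulation, given here only as a generic reference point (the model's exact parameterisation may differ), drives each muscle with an activation signal derived from the EMG recordings:

```latex
F(t) = a(t)\, F_{\max}\, f_{L}\!\bigl(l(t)\bigr)\, f_{V}\!\bigl(\dot{l}(t)\bigr) + F_{\mathrm{pass}}\!\bigl(l(t)\bigr)
```

where \(a(t)\in[0,1]\) is the activation, \(f_{L}\) and \(f_{V}\) are the force-length and force-velocity scaling functions, and \(F_{\mathrm{pass}}\) is the passive elastic contribution of the tissue.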
Conference Paper
We propose a novel method for extracting natural hand parameters from monocular image sequences. The purpose is to improve a vision-based sign language recognition system by providing detail information about the finger constellation and the 3D hand posture. Therefore, the hand is modelled by a set of 2D appearance models, each representing a limited variation range of 3D hand shape and posture. The single models are linked to each other according to the natural neighbourhood of the corresponding hand status. During an image sequence, necessary model transitions are executed towards one of the current neighbour models. The natural hand parameters are calculated from the shape and texture parameters of the current model, using a relation estimated by linear regression. The method is robust against large differences between subsequent frames and also against poor image quality. It can be implemented in real-time and offers good properties to handle occlusion and partly missing image information.
Conference Paper
We address the 3D tracking of pose and animation of the human face in monocular image sequences using active appearance models. Classical appearance-based tracking suffers from two disadvantages: (i) the estimated out-of-plane motions are not very accurate, and (ii) the convergence of the optimization process to desired minima is not guaranteed. We aim at designing an efficient active appearance model which is able to cope with the above disadvantages by retaining the strengths of feature-based and featureless tracking methodologies. For each frame, the adaptation is split into two consecutive stages. In the first stage, the 3D head pose is recovered using robust statistics and a measure of consistency with a statistical model of a face texture. In the second stage, the local motion associated with some facial features is recovered using the concept of the active appearance model search. Tracking experiments and method comparison demonstrate the robustness and superior performance of the developed framework.
Conference Paper
In this paper, we present an approach for real-time speech-driven 3D face animation using neural networks. We first analyze a 3D facial movement sequence of a talking subject and learn a quantitative representation of the facial deformations, called the 3D motion units (MUs). A 3D facial deformation can be approximated by a linear combination of the MUs weighted by the MU parameters (MUPs) - the visual features of the facial deformation. The facial movement sequence is synchronized with an audio track. The audio track is digitized and the audio features of each frame are calculated. A real-time audio-to-MUP mapping is constructed by training a set of neural networks using the calculated audio-visual features. The audio-visual features are divided into several groups based on the audio features. One neural network is trained per group to map the audio features to the corresponding MUPs. Given a new audio feature vector, we first classify it into one of the groups and select the corresponding neural network to map the audio feature vector to MUPs, which are used for face animation. The quantitative evaluation shows the effectiveness of the proposed approach.
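A minimal sketch of the grouped mapping follows, assuming precomputed per-frame audio features and MUP targets; the clustering used to form the groups, the network sizes, and the feature definitions are placeholders rather than the authors' choices.

```python
# Sketch: cluster audio features into groups, train one regressor per group,
# then map a new audio frame via its group's network.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

def train_grouped_mapping(audio_feats, mups, n_groups=8):
    km = KMeans(n_clusters=n_groups, n_init=10).fit(audio_feats)
    nets = []
    for g in range(n_groups):
        idx = km.labels_ == g
        net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000)
        nets.append(net.fit(audio_feats[idx], mups[idx]))
    return km, nets

def audio_to_mup(km, nets, frame_feat):
    g = km.predict(frame_feat[None, :])[0]           # classify into a group first
    return nets[g].predict(frame_feat[None, :])[0]   # then map to MUPs
```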
Conference Paper
A real-time system for tracking and modeling of faces using an analysis-by-synthesis approach is presented. A 3D face model is texture-mapped with a head-on view of the face. Feature points in the face texture are then selected based on image Hessians. The selected points of the rendered image are tracked in the incoming video using normalized correlation. The result is fed into an extended Kalman filter to recover camera geometry, head pose, and structure from motion. This information is used to rigidly move the face model to render the next image needed for tracking. Every point is tracked from the Kalman filter's estimated position. The variance of each measurement is estimated using a number of factors, including the residual error and the angle between the surface normal and the camera. The estimated head pose can be used to warp the face in the incoming video back to frontal position, and parts of the image can then be subject to eigenspace coding for efficient transmission. The mouth texture is transmitted in this way using 50 bits per frame plus overhead from the person-specific eigenspace. The face tracking system runs at 30 Hz; coding the mouth texture slows it to 12 Hz.
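The normalised-correlation matching step alone can be sketched with OpenCV as below; the Kalman filtering, rendering, and variance estimation that surround it in the described system are omitted.

```python
# Sketch of normalised-correlation feature matching with OpenCV.
import cv2

def match_feature(search_window, template):
    """Return the (x, y) offset of the best match and its correlation score.
    Both inputs are expected to be single-channel images of the same dtype,
    with the template smaller than the search window."""
    score = cv2.matchTemplate(search_window, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(score)
    return max_loc, max_val
```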
Article
Lip segmentation is an essential stage in many multimedia systems such as videoconferencing, lip reading, or low-bit-rate coding communication systems. In this paper, we propose an accurate and robust quasi-automatic lip segmentation algorithm. First, the upper mouth boundary and several characteristic points are detected in the first frame by using a new kind of active contour: the "jumping snake." Unlike classic snakes, it can be initialized far from the final edge and the adjustment of its parameters is easy and intuitive. Then, to achieve the segmentation, we propose a parametric model composed of several cubic curves. Its high flexibility enables accurate lip contour extraction even in the challenging case of a very asymmetric mouth. Compared to existing models, it brings a significant improvement in accuracy and realism. The segmentation in the following frames is achieved by using an interframe tracking of the keypoints and the model parameters. However, we show that, with a usual tracking algorithm, the keypoints' positions become unreliable after a few frames. We therefore propose an adjustment process that enables an accurate tracking even after hundreds of frames. Finally, we show that the mean keypoints' tracking errors of our algorithm are comparable to manual points' selection errors.
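As a toy illustration of the parametric part of the model only, a single cubic segment can be least-squares fitted to boundary points between two detected key points; the full model chains several such curves around the detected key points and is not reproduced here.

```python
# Toy sketch: fit one cubic segment to upper-lip boundary points.
import numpy as np

def fit_cubic_segment(xs, ys):
    coeffs = np.polyfit(xs, ys, deg=3)   # least-squares cubic fit
    return np.poly1d(coeffs)             # callable curve y(x)

# usage: curve = fit_cubic_segment(boundary_x, boundary_y); y_mid = curve(x_mid)
```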
Article
Parameterized models can produce realistic, manipulable images of human faces with a surprisingly small number of parameters.
Article
An approach to the analysis of dynamic facial images for the purposes of estimating and resynthesizing dynamic facial expressions is presented. The approach exploits a sophisticated generative model of the human face originally developed for realistic facial animation. The face model, which may be simulated and rendered at interactive rates on a graphics workstation, incorporates a physics-based synthetic facial tissue and a set of anatomically motivated facial muscle actuators. The estimation of dynamical facial muscle contractions from video sequences of expressive human faces is considered. An estimation technique that uses deformable contour models (snakes) to track the nonrigid motions of facial features in video images is developed. The technique estimates muscle actuator controls with sufficient accuracy to permit the face model to resynthesize transient expressions.
Article
In this paper, a new technique for modeling textured 3D faces is introduced. 3D faces can either be generated automatically from one or more photographs, or modeled directly through an intuitive user interface. Users are assisted in two key problems of computer aided face modeling. First, new face images or new 3D face models can be registered automatically by computing dense one-to-one correspondence to an internal face model. Second, the approach regulates the naturalness of modeled faces avoiding faces with an "unlikely" appearance.