LUCIA: An open source 3D expressive avatar for
multimodal h.m.i.
G.Riccardo Leone - Giulio Paci - Piero Cosi
Institute of Cognitive Sciences and Technologies – National Research Council
Via Martiri della libertà 2, 35137 Padova, Italy
{piero.cosi,riccardo.leone}@pd.istc.cnr.it
Abstract. LUCIA is an MPEG-4 facial animation system developed at ISTC-CNR¹. It works on standard Facial Animation Parameters and speaks with the Italian version of FESTIVAL TTS. To achieve an emotive/expressive talking head, LUCIA was built from real human data physically extracted by the ELITE optic-tracking movement analyzer. LUCIA can copy a real human being by reproducing the movements of passive markers positioned on his face and recorded by the ELITE device, or can be driven by an emotional XML tagged input text, thus realizing true audio/visual emotive/expressive synthesis. Synchronization between visual and audio data is very important in order to create the correct WAV and FAP files needed for the animation. LUCIA's voice is based on the ISTC Italian version of the FESTIVAL-MBROLA packages, modified by means of an appropriate APML/VSML tagged language. LUCIA is available in two different versions: an open source framework and the "work in progress" WebGL.
Keywords: talking head, TTS, facial animation, mpeg4, 3D avatar, virtual
agent, affective computing, LUCIA, FESTIVAL
1 Introduction
There are many ways to control a synthetic talking face. Among them, geometric parameterization [1, 2], morphing between target speech shapes [3], and muscle and pseudo-muscle models [4, 5] appear to be the most attractive.
Text-to-audiovisual systems [6, 7], in which the acoustic signal is generated by a text-to-speech engine and the phoneme information extracted from the input text is used to define the articulatory movements, have attracted growing interest.
To generate realistic facial animation it is necessary to reproduce the contextual variability due to the reciprocal influence of articulatory movements in the production of neighboring phonemes. This phenomenon, known as co-articulation [8], is extremely
¹ With the collaboration of many students and researchers working at ISTC during these last years, among them: G.Tisato, F.Tesser, C.Drioli, V.Ferrari, G.Perin, A.Fusaro, D.Grigoletto, M.Nicolao, G.Sommavilla, E.Marchetto.
complex and difficult to model. A variety of co-articulation strategies are possible and
even different strategies may be needed for different languages [9].
A modified version of the Cohen-Massaro co-articulation model [10] has been adopted for LUCIA [11], and a semi-automatic minimization technique, working on real kinematic data acquired by the ELITE opto-electronic system [12], was used to train the dynamic characteristics of the model, in order to reproduce the true human lip movements more accurately.
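As a rough illustration of the underlying idea (not the actual trained model or the exact modification used in LUCIA), the sketch below blends per-phoneme targets for one articulatory parameter using negative-exponential dominance functions, the core mechanism of Cohen-Massaro style co-articulation. All segment values (centers, targets, decay rates) are invented.

```c
/* Minimal sketch of Cohen-Massaro style co-articulation blending.
 * Targets, dominance magnitudes and decay rates are invented
 * illustrative values, not the trained LUCIA parameters. */
#include <math.h>
#include <stdio.h>

typedef struct {
    double center;   /* time (s) at which the phoneme target is reached */
    double target;   /* target value of one articulatory parameter      */
    double alpha;    /* dominance magnitude                             */
    double theta;    /* exponential decay rate of the dominance         */
} Segment;

/* Negative-exponential dominance of a segment at time t. */
static double dominance(const Segment *s, double t)
{
    return s->alpha * exp(-s->theta * fabs(t - s->center));
}

/* Parameter value at time t: dominance-weighted average of the targets. */
static double blend(const Segment *seg, int n, double t)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = dominance(&seg[i], t);
        num += d * seg[i].target;
        den += d;
    }
    return den > 0.0 ? num / den : 0.0;
}

int main(void)
{
    /* Hypothetical lip-opening targets for a three-phone sequence. */
    Segment seq[] = {
        { 0.10, 8.0, 1.0, 20.0 },
        { 0.25, 1.5, 1.0, 30.0 },
        { 0.40, 9.0, 1.0, 20.0 },
    };
    for (double t = 0.0; t <= 0.5; t += 0.05)
        printf("t=%.2f  lip opening=%.2f mm\n", t, blend(seq, 3, t));
    return 0;
}
```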
Moreover, emotions are quite important in human interpersonal relations and individual development. Linguistic, paralinguistic and emotional transmission are inherently multimodal, and different types of information in the acoustic channel integrate with information from various other channels, facilitating the communicative processes. The transmission of emotions in speech communication is a topic that has recently received considerable attention, and automatic speech recognition (ASR) and multimodal or audio-visual (AV) speech synthesis are examples of fields in which the processing of emotions can have a great impact and can improve the effectiveness of human-machine interaction.
Viewing the face significantly improves the intelligibility of both natural and synthetic speech, especially under degraded acoustic conditions. Facial expressions signal emotions, add emphasis to the speech and facilitate the interaction in a dialogue situation. From these considerations, it is evident that, in order to create more natural talking heads, it is essential that they include emotional behavior.
Fig. 1. LUCIA’s functional block diagram
In our TTS (text-to-speech) framework, AV speech synthesis, that is, the automatic generation of voice and facial animation from arbitrary text, is based on parametric descriptions of both the acoustic and visual speech modalities. The visual speech synthesis uses 3D polygon models, which are parametrically articulated and deformed, while the acoustic speech synthesis uses an Italian version of the FESTIVAL diphone TTS synthesizer [13], now modified with emotive/expressive capabilities. The block diagram of our framework is depicted in Fig. 1.
Various applications can be conceived using animated characters, spanning from research on human communication and perception, via tools for the hearing impaired, to spoken and multimodal agent-based user interfaces. The recent introduction of WebGL [14], which brings 3D graphics to web browsers, opens the possibility of delivering all these applications via the internet. This software version of LUCIA is currently in an early stage of development.
2 Data acquisition environment
LUCIA is totally based on real human data collected over the last decade using ELITE [15, 16, 17], a fully automatic movement analyzer for 3D kinematic data acquisition [12], which provides 3D coordinate reconstruction, starting from 2D perspective projections, by means of a stereo-photogrammetric procedure that allows free positioning of the TV cameras. The dynamic 3D coordinates of the passive markers (see Fig. 2) are then used to create our lip articulatory model and to drive our talking face directly, copying human facial movements.
Fig. 2. Position of reflecting markers and reference planes for the articulatory movement data collection (on the left), and the MPEG-4 standard facial reference points (on the right)
Two different configurations have been adopted for articulatory data collection: the first, specifically designed for the analysis of labial movements, considers a simple scheme with only 8 reflecting markers (the bigger markers in Fig. 2), while the second, adapted to the analysis of expressive and emotive speech, utilizes the full set of 28 markers. All the movements of the 8 or 28 markers, depending on the adopted acquisition pattern, are recorded and collected, together with their velocity and acceleration, simultaneously with the co-produced speech, which is usually segmented and analyzed by means of PRAAT [18], which also computes intensity, duration, spectrograms, formants, pitch-synchronous F0, and various voice quality parameters in the case of emotive and expressive speech [19, 20].
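As an illustration of the kind of post-processing involved, the sketch below derives velocity and acceleration from a sampled 3D marker trajectory by central finite differences; the 100 Hz sampling rate and the trajectory values are assumptions, not the actual ELITE output format.

```c
/* Sketch: central finite differences to derive velocity and acceleration
 * from a sampled 3D marker trajectory. Sampling rate and trajectory values
 * are illustrative assumptions. */
#include <stdio.h>

typedef struct { double x, y, z; } Point3;

static void derive(const Point3 *p, int n, double fs,
                   Point3 *vel, Point3 *acc)
{
    double dt = 1.0 / fs;
    for (int i = 1; i < n - 1; ++i) {
        vel[i].x = (p[i + 1].x - p[i - 1].x) / (2.0 * dt);
        vel[i].y = (p[i + 1].y - p[i - 1].y) / (2.0 * dt);
        vel[i].z = (p[i + 1].z - p[i - 1].z) / (2.0 * dt);
        acc[i].x = (p[i + 1].x - 2.0 * p[i].x + p[i - 1].x) / (dt * dt);
        acc[i].y = (p[i + 1].y - 2.0 * p[i].y + p[i - 1].y) / (dt * dt);
        acc[i].z = (p[i + 1].z - 2.0 * p[i].z + p[i - 1].z) / (dt * dt);
    }
    /* Endpoints copied from their neighbours for simplicity. */
    vel[0] = vel[1];         acc[0] = acc[1];
    vel[n - 1] = vel[n - 2]; acc[n - 1] = acc[n - 2];
}

int main(void)
{
    Point3 traj[5] = { {0,0,0}, {1,0.5,0}, {2,2,0}, {3,4.5,0}, {4,8,0} };
    Point3 vel[5], acc[5];
    derive(traj, 5, 100.0, vel, acc);     /* assume 100 Hz acquisition */
    for (int i = 0; i < 5; ++i)
        printf("frame %d: v=(%.1f,%.1f,%.1f)  a=(%.1f,%.1f,%.1f)\n",
               i, vel[i].x, vel[i].y, vel[i].z, acc[i].x, acc[i].y, acc[i].z);
    return 0;
}
```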
In order to simplify and automate many of the operations needed to build up the 3D avatar from the motion-captured data, we developed INTERFACE [21], an integrated software tool designed and implemented in Matlab©.
3 Architecture and implementations
LUCIA is an MPEG-4 standard facial animation engine implementing a decoder compatible with the "Predictable Facial Animation Object Profile" [22]. LUCIA speaks with the Italian version of FESTIVAL TTS [13]; the overall system is illustrated in Fig. 1. The homepage of the project is [23].
MPEG-4 specifies a set of Facial Animation Parameters (FAPs), each corresponding to a particular facial action deforming a face model in its neutral state. A particular facial action sequence is generated by deforming the face model according to the specified FAP values, which indicate the magnitude of the corresponding action at the corresponding time instant. Then the model is rendered onto the screen.
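A minimal sketch of such a playback loop is given below. The face-model type, the stub functions and the 25 fps frame rate are hypothetical placeholders used only to show the control flow; they are not the actual LUCIA engine interfaces.

```c
/* Sketch of an MPEG-4 style playback loop: each decoded FAP frame deforms
 * the model from its neutral state, then the mesh is rendered. Model,
 * stub functions and frame rate are illustrative assumptions. */
#include <stdio.h>

#define N_FAPS 68                        /* MPEG-4 defines 68 FAPs */

typedef struct { double value[N_FAPS]; } FapFrame;
typedef struct { const char *name; } FaceModel;

static void face_reset_to_neutral(FaceModel *m) { (void)m; }
static void face_apply_fap(FaceModel *m, int id, double amp)
{
    printf("%s: FAP %d -> %.1f\n", m->name, id, amp);
}
static void face_render(const FaceModel *m) { printf("%s rendered\n", m->name); }

static void play(FaceModel *m, const FapFrame *frames, int n, double fps)
{
    for (int k = 0; k < n; ++k) {
        face_reset_to_neutral(m);                  /* back to the neutral state */
        for (int i = 0; i < N_FAPS; ++i)
            if (frames[k].value[i] != 0.0)
                face_apply_fap(m, i, frames[k].value[i]);
        face_render(m);
        printf("t = %.3f s\n", k / fps);           /* timestamp of the frame */
    }
}

int main(void)
{
    FaceModel lucia = { "lucia" };
    FapFrame frames[2] = { {{0}}, {{0}} };
    frames[1].value[3] = 120.0;  /* activate one FAP on the second frame
                                    (index chosen only for illustration) */
    play(&lucia, frames, 2, 25.0);
    return 0;
}
```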
LUCIA is able to generate a 3D mesh polygonal model by directly importing its
structure from a VRML file [24] and to build its animation in real time.
At the current stage of development, LUCIA is a textured young female 3D face
model built with 25423 polygons: 14116 belong to the skin, 4616 to the hair, 2688x2
to the eyes, 236 to the tongue and 1029 to the teeth respectively.
Fig. 3. LUCIA’s wireframe, textures and renderings.
Currently the model is divided into two subsets of fundamental polygons: on one hand the skin, and on the other the inner articulators, such as the tongue and the teeth, and the facial elements such as the eyes and the hair. This subdivision is quite useful when the animation is running, because only the polygon mesh corresponding to the skin is directly driven by the pseudo-muscles and constitutes a continuous and unitary element, while the other anatomical components move independently, following translations and rotations (for example, the eyes rotate around their centers). According to this strategy, the polygons are distributed in such a way that the resulting visual effect is quite smooth, with no rigid "jumps" over the whole 3D model.
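The following sketch contrasts the two kinds of motion: a rigid component such as an eye is moved as a whole by rotating its vertices around the eye center, whereas skin vertices are instead displaced individually by the pseudo-muscle functions (see the deformation sketch later in this section). All geometry here is invented for illustration.

```c
/* Sketch of the skin/rigid-part split: a rigid component such as an eye is
 * moved as a whole by rotating its vertices around the eye centre. The
 * coordinates below are invented illustrative values. */
#include <math.h>
#include <stdio.h>

typedef struct { double x, y, z; } Vec3;

/* Rotate a rigid component about the vertical axis through its own centre. */
static void rotate_rigid(Vec3 *v, int n, Vec3 centre, double yaw_rad)
{
    double c = cos(yaw_rad), s = sin(yaw_rad);
    for (int i = 0; i < n; ++i) {
        double dx = v[i].x - centre.x, dz = v[i].z - centre.z;
        v[i].x = centre.x + c * dx + s * dz;
        v[i].z = centre.z - s * dx + c * dz;
    }
}

int main(void)
{
    Vec3 eye_centre = { 3.0, 5.0, 2.0 };
    Vec3 eye[2] = { { 3.5, 5.0, 2.0 }, { 2.5, 5.0, 2.0 } };
    rotate_rigid(eye, 2, eye_centre, 0.2);   /* look ~11 degrees to one side */
    for (int i = 0; i < 2; ++i)
        printf("eye vertex %d: (%.2f, %.2f, %.2f)\n",
               i, eye[i].x, eye[i].y, eye[i].z);
    return 0;
}
```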
LUCIA emulates the functionality of the mimic muscles through specific "displacement functions" and their consequent action on the skin of the face. The activation of these functions is determined by specific parameters that encode small muscular actions acting on the face; these actions can be modified over time in order to generate the desired animation. In MPEG-4 such parameters are called Facial Animation Parameters, and their role is fundamental for achieving natural movement. The muscular action is made explicit by means of the deformation of a polygonal mesh built around some particular key points, called Facial Definition Parameters (FDPs), which correspond to the points where the mimic muscles attach to the skin.
Moving only the FDPs is not sufficient to move the whole 3D model smoothly; thus, each "feature point" is related to a particular "influence zone", an ellipse delimiting the region of the mesh whose vertex movements are strictly connected to it. Finally, after having established the relationship for the whole set of FDPs and the whole set of vertices, all the points of the 3D model can be moved simultaneously with a graded strength, following a raised-cosine function rule associated to each FDP.
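A minimal sketch of this deformation rule, assuming an elliptical influence zone in a local 2D parameterization and invented coordinates, could look as follows; it is not the actual LUCIA implementation.

```c
/* Minimal sketch of the FDP "influence zone" idea: vertices within an
 * elliptical zone around a feature point follow its displacement with a
 * raised-cosine fall-off. Zone sizes and coordinates are illustrative. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { double x, y, z; } Vec3;

/* Raised-cosine weight: 1 at the feature point, 0 on the ellipse border. */
static double weight(Vec3 v, Vec3 fdp, double rx, double ry)
{
    double nx = (v.x - fdp.x) / rx;
    double ny = (v.y - fdp.y) / ry;
    double d = sqrt(nx * nx + ny * ny);        /* normalized elliptic distance */
    if (d >= 1.0) return 0.0;
    return 0.5 * (1.0 + cos(M_PI * d));
}

/* Move every vertex by the feature-point displacement scaled by its weight. */
static void deform(Vec3 *verts, int n, Vec3 fdp, Vec3 disp, double rx, double ry)
{
    for (int i = 0; i < n; ++i) {
        double w = weight(verts[i], fdp, rx, ry);
        verts[i].x += w * disp.x;
        verts[i].y += w * disp.y;
        verts[i].z += w * disp.z;
    }
}

int main(void)
{
    Vec3 fdp  = { 0.0, 0.0, 0.0 };             /* e.g. a lip-corner feature point */
    Vec3 disp = { 0.0, 1.0, 0.0 };             /* upward displacement of the FDP  */
    Vec3 verts[3] = { {0.0,0.0,0.0}, {0.5,0.0,0.0}, {2.0,0.0,0.0} };
    deform(verts, 3, fdp, disp, 1.0, 0.8);
    for (int i = 0; i < 3; ++i)
        printf("vertex %d -> (%.2f, %.2f, %.2f)\n",
               i, verts[i].x, verts[i].y, verts[i].z);
    return 0;
}
```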
There are currently two versions of LUCIA: an open source 3D facial animation framework written in the C programming language [25] and a new WebGL implementation [26]. The C framework allows efficient rendering of a 3D face model on OpenGL-enabled systems (it has been tested on Windows and Linux using several architectures). It has a modular design: each module provides one of the several common facilities needed to create a real-time facial animation application. It includes an interface to play audio files, a robust and extendable FAP parser, sample-based audio/video synchronization and an object-oriented interface to the components of a face model. Finally, the framework includes a ready-to-use female face model and some sample applications to play simple animations and to test movement behavior. The framework's core is built on top of OpenGL and does not rely on any specific context provider (it has been tested using GLUT [27], FreeGLUT [28] and GtkGLExt [29]). In order to guarantee portability, the modules' interfaces are designed so that their implementation details are hidden from the application and it is possible to provide multiple implementations of the same module (e.g., three implementations of the audio module are included, using respectively OpenAL [30], GStreamer [31] and Microsoft MCI [32]).
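The kind of back-end abstraction described above can be sketched in C with a table of function pointers; the names and the dummy back-end below are hypothetical and do not reproduce the actual LUCIA framework headers.

```c
/* Sketch of a module abstraction: the application sees one audio interface,
 * while concrete back-ends (e.g. OpenAL, GStreamer, MCI) would supply the
 * function pointers. Names and values are illustrative assumptions. */
#include <stdio.h>

typedef struct {
    const char *name;
    int    (*open)(const char *wav_path);  /* load an audio file        */
    int    (*play)(void);                  /* start playback            */
    double (*position)(void);              /* current position, seconds */
} AudioModule;

/* A dummy back-end standing in for a real implementation. */
static int    dummy_open(const char *p) { printf("open %s\n", p); return 0; }
static int    dummy_play(void)          { printf("play\n");      return 0; }
static double dummy_position(void)      { return 0.04; }

static const AudioModule dummy_backend = {
    "dummy", dummy_open, dummy_play, dummy_position
};

int main(void)
{
    const AudioModule *audio = &dummy_backend;   /* chosen at build/run time */
    audio->open("utterance.wav");
    audio->play();
    /* Sample-based A/V sync: pick the FAP frame matching the audio clock. */
    int frame = (int)(audio->position() * 25.0); /* assuming 25 FAP frames/s */
    printf("render FAP frame %d\n", frame);
    return 0;
}
```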
The very recent introduction of 3D graphics in web browsers (known as WebGL [14]) opens new possibilities for our 3D avatar. The power of this new technology is that no additional software or driver needs to be downloaded to access the content of the 3D world the user is interacting with. We are currently developing this new software version in order to easily integrate LUCIA into a website; there are many promising functionalities for web applications: a virtual guide (which we are exploiting in the Wikimemo.it project, the portal of Italian Language and Culture); a storyteller for e-book reading; a digital tutor for the hearing impaired; a personal assistant for smart-phones and mobile devices. The early results can be seen in [26].
4 Emotional synthesis
Audio-visual emotional rendering was developed working on real emotional audio and visual databases, whose content was used to automatically train emotion-specific intonation and voice quality models to be included in FESTIVAL, our Italian TTS system [33, 34, 35, 36], and also to define specific emotional visual rendering to be implemented in LUCIA [37, 38, 39].
Fig. 4. APML/VSML mark-up language extensions for emotive audio/visual synthesis.
An emotion-specific XML editor explicitly designed for emotionally tagged texts was developed. The APML mark-up language [40] for behavior specification makes it possible to specify how to mark up the verbal part of a dialogue move so as to add to it the "meanings" that the graphical and the speech generation components of an animated agent need to produce the required expressions (Fig. 4). So far, the language defines the components that may be useful to drive a face animation through the facial description language (FAP) and facial display functions. The extension of this language is intended to support voice-specific controls. An extended version of the APML language has been included in the FESTIVAL speech synthesis environment, allowing the automatic generation of the extended phonation file from an APML text tagged with emotive tags. This module implements a three-level hierarchy in which the affective high-level attributes (e.g. <anger>, <joy>, <fear>) are described in terms of medium-level voice quality attributes defining the phonation type (e.g. <modal>, <soft>, <pressed>, <breathy>, <whispery>, <creaky>). These medium-level attributes are in turn described by a set of low-level acoustic attributes defining the perceptual correlates of the sound (e.g. <spectral tilt>, <shimmer>, <jitter>). The low-level acoustic attributes correspond to the acoustic controls that the extended MBROLA synthesizer can render through the sound processing procedure described above. This descriptive scheme has been implemented within FESTIVAL as a set of mappings between high-level and low-level descriptors. The implementation includes the use of envelope generators to produce the time curves of each parameter.
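The three-level mapping can be pictured as a pair of lookup tables, as in the sketch below; the emotion/voice-quality associations and all numeric values are invented for illustration and are not the FESTIVAL mapping tables.

```c
/* Sketch of the three-level hierarchy: a high-level emotion tag maps to a
 * voice-quality label, which maps to low-level acoustic controls. All
 * labels and numeric values are invented illustrative assumptions. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *voice_quality;   /* medium level, e.g. "breathy"            */
    double spectral_tilt;        /* low-level acoustic controls (arbitrary  */
    double jitter;               /* illustrative units)                     */
    double shimmer;
} VoiceQuality;

typedef struct {
    const char *emotion;         /* high level, e.g. "fear"                 */
    const char *voice_quality;
} EmotionMap;

static const VoiceQuality qualities[] = {
    { "breathy", 0.8, 0.02, 0.05 },
    { "pressed", 0.2, 0.01, 0.02 },
    { "modal",   0.5, 0.01, 0.03 },
};

static const EmotionMap emotions[] = {
    { "fear",  "breathy" },
    { "anger", "pressed" },
    { "joy",   "modal"   },
};

static const VoiceQuality *resolve(const char *emotion)
{
    for (size_t i = 0; i < sizeof emotions / sizeof *emotions; ++i)
        if (strcmp(emotions[i].emotion, emotion) == 0)
            for (size_t j = 0; j < sizeof qualities / sizeof *qualities; ++j)
                if (strcmp(qualities[j].voice_quality,
                           emotions[i].voice_quality) == 0)
                    return &qualities[j];
    return NULL;
}

int main(void)
{
    const VoiceQuality *q = resolve("fear");
    if (q)
        printf("fear -> %s: tilt=%.2f jitter=%.2f shimmer=%.2f\n",
               q->voice_quality, q->spectral_tilt, q->jitter, q->shimmer);
    return 0;
}
```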
In order to check and evaluate, by direct low-level manual/graphic instructions, various multi-level emotional facial configurations, we developed "EmotionPlayer", which was strongly inspired by the EmotionDisc of Zsofia Ruttkay [41]. It is designed to give useful immediate feedback, as exemplified in Fig. 5.
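A sketch of the underlying selection logic, assuming six equal angular sectors for the basic emotions and three concentric intensity rings (the actual disc layout may differ), is given below.

```c
/* Sketch of an EmotionDisc-style selection: the click angle selects one of
 * six basic emotions and the distance from the centre selects one of three
 * intensities. The sector order and ring sizes are assumptions. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static const char *emotions[6]    = { "joy", "surprise", "fear",
                                      "sadness", "disgust", "anger" };
static const char *intensities[3] = { "low", "mid", "high" };

static void pick(double x, double y, const char **emo, const char **level)
{
    double angle = atan2(y, x);                   /* -pi .. pi              */
    if (angle < 0.0) angle += 2.0 * M_PI;         /* 0 .. 2*pi              */
    double radius = sqrt(x * x + y * y);          /* 0 = centre, 1 = border */
    int e = (int)(angle / (2.0 * M_PI / 6.0)) % 6;
    int l = radius < 1.0 / 3.0 ? 0 : radius < 2.0 / 3.0 ? 1 : 2;
    *emo = emotions[e];
    *level = intensities[l];
}

int main(void)
{
    const char *e, *l;
    pick(-0.5, 0.8, &e, &l);                      /* a click near the border */
    printf("selected: %s-%s\n", l, e);            /* prints "high-fear"      */
    return 0;
}
```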
Fig. 5. EmotionPlayer: clicking on the three-level intensity (low, mid, high) emotion disc activates an emotional configuration (e.g. high-fear).
5 Conclusions
LUCIA is an MPEG-4 standard, FAP-driven OpenGL framework which provides several common facilities needed to create a real-time facial animation application. It has a high-quality 3D model and a fine co-articulatory model, automatically trained on real data, which is used to animate the face.
The modified co-articulatory model is able to reproduce quite precisely the true kinematic movements of the articulatory parameters. The mean error between real and simulated trajectories for the whole set of parameters is, in fact, lower than 0.3 mm. Labial movements implemented with the new modified model are quite natural and convincing, especially in the production of bilabials and labiodentals, and remain coherent and robust to speech rate variations.
The overall quality and user acceptability of the LUCIA talking head still have to be perceptually evaluated [42, 43] by a complete set of test experiments, and the new model has to be trained and validated in asymmetric contexts (V1CV2) as well. Moreover, emotions and the behavior of other articulators, such as the tongue, have to be analyzed and modeled for a more realistic implementation.
A new WebGL implementation of the avatar is currently in progress, to exploit the new possibilities that arise from integrating LUCIA into websites.
Acknowledgments. Part of this work has been sponsored by PF-STAR (Preparing
Future multiSensorial inTerAction Research, European Project IST- 2001-37599,
http://pfstar.itc.it ), TICCA (Tecnologie cognitive per l'Interazione e la Cooperazione
Con Agenti artificiali, joint "CNR - Provincia Autonoma Trentina" Project) and
WIKIMEMO.IT (The Portal of Italian Language and Culture, FIRB Project,
RBNE078K93, Italian Ministry of University and Scientific Research).
References
1. Massaro, D.W., Cohen, M.M., Beskow, J., Cole, R.A.: Developing and Evaluating Conversational Agents. In: Cassell, J., Sullivan, J., Prevost, S., Churchill, E. (eds.) Embodied Conversational Agents, pp. 287--318. MIT Press, Cambridge, U.S.A. (2000)
2. Le Goff, B.: Synthèse à partir du texte de visages 3D parlant français. PhD thesis, Grenoble, France (1997)
3. Bregler, C., Covell, M., Slaney, M.: Video Rewrite: Driving Visual Speech with
Audio. In: SIGGRAPH ’97, pp. 353--360. (1997)
4. Lee, Y., Terzopoulos, D., Waters, K.: Realistic Face Modeling for Animation.
In: SIGGRAPH ’95, pp. 55--62. (1995)
5. Vatikiotis-Bateson, E., Munhall, K.G., Hirayama, M., Kasahara, Y., Yehia, H.:
Physiology-Based Synthesis of Audiovisual Speech. In: 4th Speech Production
Seminar: Models and Data, pp. 241--244. (1996)
6. Beskow, J.: Rule-Based Visual Speech Synthesis. In: Eurospeech ’95, pp.299--
302. Madrid, Spain (1995)
7. Le Goff, B., Benoit, C.: A text-to-audiovisual-speech synthesizer for French. In: ICSLP '96, pp. 2163--2166. Philadelphia, U.S.A. (1996)
8. Farnetani, E., Recasens, D.: Coarticulation Models in Recent Speech Production Theories. In: Hardcastle, W.J. (eds.) Coarticulation in Speech Production. Cambridge University Press, Cambridge, U.K. (1999)
9. Bladon, R.A., Al-Bamerni, A.: Coarticulation resistance in English. J. Phonetics.
4, pp. 135--150 (1976)
10. Cosi, P., Perin, G.: Labial Coarticulation Modeling for Realistic Facial Animation. In: ICMI '02, pp. 505--510. Pittsburgh, U.S.A. (2002)
11. Cosi, P., Fusaro, A., Tisato, G.: LUCIA a New Italian Talking-Head Based on a
Modified Cohen-Massaro’s Labial Coarticulation Model. In: Eurospeech 2003,
vol. III, pp. 2269--2272. Geneva, Switzerland (2003)
12. Ferrigno, G., Pedotti, A.: ELITE: A Digital Dedicated Hardware System for
Movement Analysis via Real-Time TV Signal Processing. In: IEEE Transactions
on Biomedical Engineering, BME-32, pp. 943--950 (1985)
13. Cosi, P., Tesser, F., Gretter, R., Avesani, C.: Festival Speaks Italian! In: Eurospeech 2001, pp. 509--512. Aalborg, Denmark (2001)
14. WebGL - OpenGL for the web, http://www.khronos.org/webgl/
15. Cosi, P., Magno Caldognetto, E.: Lip and Jaw Movements for Vowels and Consonants: Spatio-Temporal Characteristics and Bimodal Recognition Applications. In: Stork, D.G., Hennecke, M.E. (eds.) Speechreading by Humans and Machines: Models, Systems and Applications. NATO ASI Series, Series F: Computer and Systems Sciences, vol. 150, pp. 291--313. Springer-Verlag (1996)
16. Magno Caldognetto, E., Zmarich, C., Cosi, P., Ferrero, F.: Italian Consonantal Visemes: Relationships Between Spatial/Temporal Articulatory Characteristics and Coproduced Acoustic Signal. In: AVSP '97, Tutorial & Research Workshop on Audio-Visual Speech Processing: Computational & Cognitive Science Approaches, pp. 5--8. Rhodes, Greece (1997)
17. Magno Caldognetto, E., Zmarich, C., Cosi, P.: Statistical Definition of Visual Information for Italian Vowels and Consonants. In: Burnham, D., Robert-Ribes, J., Vatikiotis-Bateson, E. (eds.) Proceedings of AVSP '98, pp. 135--140. Terrigal, Australia (1998)
18. Boersma, P.: PRAAT, a system for doing phonetics by computer. Glot International 5(9/10), 341--345 (1996)
19. Magno Caldognetto, E., Cosi, P., Drioli, C., Tisato, G., Cavicchio, F.: Coproduction of Speech and Emotions: Visual and Acoustic Modifications of Some Phonetic Labial Targets. In: AVSP 2003, ISCA Workshop, pp. 209--214. St Jorioz, France (2003)
20. Drioli, C., Tisato, G., Cosi, P., Tesser, F.: Emotions and Voice Quality: Experiments with Sinusoidal Modeling. In: Proceedings of Voqual 2003, Voice Quality: Functions, Analysis and Synthesis, ISCA Workshop, pp. 127--132. Geneva, Switzerland (2003)
21. Tisato, G., Cosi, P., Drioli, C., Tesser, F.: INTERFACE: a New Tool for Building Emotive/Expressive Talking Heads. In: INTERSPEECH 2005, pp. 781--784. Lisbon, Portugal (2005)
22. MPEG-4 standard, http://mpeg.chiariglione.org/standards/mpeg-4/mpeg-4.htm
23. LUCIA - homepage, http://www2.pd.istc.cnr.it/LUCIA/
24. Hartman, J., Wernecke, J.: The VRML Handbook. Addison Wessley, (1996)
25. LUCIA Open source project, http://sourceforge.net/projects/lucia/
26. LUCIA WebGL version, http://www2.pd.istc.cnr.it/LUCIA/webgl/
27. The OpenGL Utility Toolkit, http://www.opengl.org/resources/libraries/glut/
28. The Free OpenGL Utility Toolkit, http://freeglut.sourceforge.net
29. GTK+ OpenGL Extension, http://projects.gnome.org/gtkglext/
30. OpenAL: a cross platform 3D audio API, http://connect.creativelabs.com/openal/
31. GStreamer: Open source multimedia framework, http://gstreamer.freedesktop.org/
32. Windows MCI, http://en.wikipedia.org/wiki/Media_Control_Interface
33. Tesser, F., Cosi, P., Drioli, C., Tisato, G.: Prosodic Data-Driven Modelling of
Narrative Style in FESTIVAL TTS. In: 5th ISCA Speech Synthesis Workshop,
Pittsburgh, U.S.A. (2004)
34. Tesser, F., Cosi, P., Drioli, C., Tisato, G.: Emotional Festival-Mbrola TTS
Synthesis. In: INTERSPEECH 2005, pp. 505--508. Lisbon, Portugal (2005)
35. Drioli, C., Tesser, F., Tisato, G., Cosi, P.: Control of Voice Quality for Emotional Speech Synthesis. In: AISV 2004, 1st Conference of Associazione Italiana di Scienze della Voce, pp. 789--798. EDK Editore s.r.l., Padova, Italy (2005)
36. Nicolao, M., Drioli, C., Cosi, P.: GMM modelling of voice quality for
FESTIVAL-MBROLA emotive TTS synthesis. In: INTERSPEECH 2006, pp.
1794--1797. Pittsburgh, U.S.A. (2006)
37. Cosi, P., Fusaro, A., Grigoletto, D., Tisato, G.: Data-Driven Tools for Designing Talking Heads Exploiting Emotional Attitudes. In: Tutorial and Research Workshop "Affective Dialogue Systems", pp. 101--112. Kloster Irsee, Germany (2004)
38. Magno Caldognetto, E., Cosi, P., Drioli, C., Tisato, G., Cavicchio, F.: Visual and
acoustic modifications of phonetic labial targets in emotive speech: Effects of the
co-production of speech and emotions. J. Speech Communication vol. 44, pp.
173--185 (2004)
39. Magno Caldognetto, E., Cosi, P., Cavicchio, F.: Modification of the Speech Articulatory Characteristics in the Emotive Speech. In: Tutorial and Research Workshop "Affective Dialogue Systems", pp. 233--239. Kloster Irsee, Germany (2004)
40. De Carolis, B., Pelachaud, C., Poggi, I., Steedman, M.: APML, a Mark-up Language for Believable Behavior Generation. In: Prendinger, H., Ishizuka, M. (eds.) Life-Like Characters, pp. 65--85. Springer (2004)
41. Ruttkay, Z., Noot, H., Hagen, P.: Emotion Disc and Emotion Squares: tools to
explore the facial expression space. Computer Graphics Forum, 22(1), pp. 49--53
(2003)
42. Massaro, D.W.: Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press, Cambridge, U.S.A. (1997)
43. Costantini, E., Pianesi, F., Cosi, P.: Evaluation of Synthetic Faces: Human Recognition of Emotional Facial Displays. In: Tutorial and Research Workshop "Affective Dialogue Systems", pp. 276--287. Kloster Irsee, Germany (2004)