Data-Driven Tools for Designing Talking Heads Exploiting Emotional Attitudes

Piero Cosi, Andrea Fusaro, Daniele Grigoletto, Graziano Tisato
Istituto di Scienze e Tecnologie della Cognizione - C.N.R.
Sezione di Padova “Fonetica e Dialettologia”
Via G. Anghinoni, 10 - 35121 Padova ITALY
{cosi, fusaro, grigoletto, tisato}@csrf.pd.cnr.it
http://www.csrf.pd.cnr.it
Abstract. Audio/visual speech, in the form of labial movement and facial expression data, was used to semi-automatically build a new Italian expressive and emotive talking head capable of believable emotional behavior. The methodology, the procedures and the specific software tools developed for this purpose are described, together with some implementation examples.
1 Introduction
Specific workshops and conferences (AVSP, LREC), European (FP6, Multimodal/Multisensorial Communication, R&D) and international (COCOSDA, ISLE, LDC, MITRE) framework activities, and various questionnaires (see ISLE and NIMM, ELRA, COCOSDA, LDC, TalkBank, Dagstuhl Seminar [1]) clearly document that data-driven procedures for building more natural and expressive talking heads are becoming popular and successful.
The knowledge that the acoustic and visual signals simultaneously convey linguistic, extra-linguistic and paralinguistic information is widespread in the speech communication community, and it constitutes the basis for this work. The procedure used to build the new Italian talking head described here is, in fact, directly driven by audio/visual data, in the form of labial movement and facial expression data, physically extracted with ELITE [2], an automatic optotracking movement analyzer for 3D kinematic data acquisition.
1 Part of this work has been sponsored by COMMEDIA (COMunicazione Multimodale di Emozioni e Discorso in Italiano con Agente animato virtuale, CNR Project C00AA71), PF-STAR (Preparing Future multiSensorial inTerAction Research, European Project IST-2001-37599, http://pfstar.itc.it) and TICCA (Tecnologie cognitive per l'Interazione e la Cooperazione Con Agenti artificiali, joint “CNR - Provincia Autonoma Trentina” Project).
1.1 Audio/Visual Acquisition Environment
ELITE is a fully automatic movement analyzer for 3D kinematic data acquisition. It reconstructs 3D coordinates from the 2D perspective projections by means of a stereophotogrammetric procedure that allows free positioning of the TV cameras. The 3D coordinates are then used to calculate and evaluate the parameters described hereinafter. Two configurations have been adopted for articulatory data collection: the first, specifically designed for the analysis of labial movements, uses a simple scheme with only 8 reflecting markers (the bigger grey markers in Figure 1a), while the second, adapted to the analysis of expressive and emotive speech, uses the complete set of 28 markers.
[Figure 1b panels: LIP OPENING (LO), LIP ROUNDING (LR), UPPER LIP PROTRUSION (ULP), LOWER LIP PROTRUSION (LLP), UPPER LIP VERTICAL DISPLACEMENT (UL), LOWER LIP VERTICAL DISPLACEMENT (LL), LIP CORNER ASYMMETRICAL HORIZONTAL DISPLACEMENT (ASYMX), LIP CORNER ASYMMETRICAL VERTICAL DISPLACEMENT (ASYMY); displacements in mm plotted against time in s, together with the speech signal for /'aba/.]
Fig. 1. Position of reflecting markers and reference planes for the articulatory movement data collection (a); speech signal and time evolution of some labial kinematic parameters (LO, LR, ULP, LLP, UL, LL, ASYMX and ASYMY, see text) for the sequence /'aba/ (b).
The movements of the 8 or 28 markers, depending on the adopted acquisition pattern, are recorded together with their velocity and acceleration, simultaneously with the co-produced speech, which is usually segmented and analyzed with PRAAT [3]; PRAAT also computes intensity, duration, spectrograms, formants, pitch-synchronous F0 and, in the case of emotive and expressive speech, various voice quality parameters [4-5]. As for the analysis of labial movements, the parameters most commonly selected to quantify changes in labial configuration, some of which are illustrated in Figure 1b, are introduced in Table 1 below:
- Lip Opening (LO): calculated as the distance between the markers placed on the central points of the upper and lower lip vermilion borders [d(m2,m3)]; this parameter correlates with the HIGH-LOW phonetic dimension.
- Lip Rounding (LR): corresponding to the distance between the left and right corners of the lips [d(m4,m5)]; it correlates with the ROUNDED-UNROUNDED phonetic dimension, and negative values correspond to lip spreading.
- Upper and Lower Lip Protrusion (ULP, LLP): anterior/posterior movements, calculated as the distance between the marker placed on the central point of either the upper or the lower lip and the frontal plane containing the line crossing the markers placed on the ear lobes and perpendicular to the plane [d(m2,), d(m3,)]. These parameters correlate with the PROTRUDED-RETRACTED feature; negative values quantify lip retraction.
- Upper and Lower Lip vertical displacements (UL, LL): calculated as the distance between the marker placed on the central point of either the upper or the lower lip and the transversal plane passing through the tip of the nose and the markers on the ear lobes [d(m2,), d(m3,)]. Positive values correspond to a reduction of the distance of the markers from the plane. As noted above, these parameters are normalized with respect to the lip resting position.
Table 1. Meaning of some of the most commonly chosen articulatory parameters.
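As an illustration of the definitions in Table 1, the following Python fragment (a minimal sketch, not part of the INTERFACE code) computes LO, LR and the vertical lip displacements from one frame of 3D marker coordinates. Marker names follow the d(m2,m3) notation of Table 1; the vertical displacements are simplified here to z-axis differences with respect to a rest frame, rather than the full plane-distance construction.

import numpy as np

def distance(a, b):
    # Euclidean distance between two 3D points
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def lip_parameters(markers, rest):
    # markers, rest: dicts mapping marker names ('m2', 'm3', ...) to (x, y, z) in mm;
    # 'rest' holds the same markers in the lip resting position
    lo = distance(markers['m2'], markers['m3'])   # Lip Opening, d(m2,m3)
    lr = distance(markers['m4'], markers['m5'])   # Lip Rounding, d(m4,m5)
    ul = rest['m2'][2] - markers['m2'][2]         # Upper Lip displacement (simplified)
    ll = rest['m3'][2] - markers['m3'][2]         # Lower Lip displacement (simplified)
    return {'LO': lo, 'LR': lr, 'UL': ul, 'LL': ll}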
2 Data-Driven Methodology and Tools
As explained in [6-8], several audio/visual corpora were used to train our MPEG-4 [9] compliant talking head LUCIA [10], which speaks with the Italian version of the FESTIVAL TTS system [11].
2.1 Model Estimation
The parameter estimation procedure for LUCIA's model is based on a phoneme-oriented least-squares error minimization scheme, with strong convergence properties, between the real articulatory data Y_r(n) and the modeled curves F_r(n), over the whole set of R stimuli belonging to the same phoneme set:
e = \sum_{r=1}^{R} \sum_{n=1}^{N} \left[ Y_r(n) - F_r(n) \right]^2
where F_r(n) is generated by a modified version of the Cohen-Massaro coarticulation model [13], as introduced in [6-7]. Although the number of parameters to be optimized is rather high, the size of the data corpus is large enough to allow a meaningful estimation; however, due to the presence of several local minima, the optimization process has to be manually supervised in order to assist the algorithm's convergence. The mean total error between real and simulated trajectories for the whole set of parameters is lower than 0.3 mm in the case of bilabial and labiodental consonants in the /a/ and /i/ contexts [14, p. 63].
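A minimal sketch of this kind of fit is given below, using scipy's generic least-squares routine; the model function is a placeholder standing in for the modified Cohen-Massaro model, so the parameter vector and the initial guess are illustrative assumptions only.

import numpy as np
from scipy.optimize import least_squares

def model_F(params, n):
    # Placeholder for the modified Cohen-Massaro coarticulation model:
    # returns a simulated trajectory at the time samples n (NOT the real model).
    a, b, c = params
    return a * np.exp(-b * n) + c

def residuals(params, Y):
    # Stack the residuals Y_r(n) - F_r(n) over all R stimuli of a phoneme set
    res = []
    for y_r in Y:
        n = np.arange(len(y_r))
        res.append(y_r - model_F(params, n))
    return np.concatenate(res)

def fit_phoneme_set(Y, p0):
    # least_squares minimizes the sum of squared residuals, i.e. the error e above
    return least_squares(residuals, p0, args=(Y,))

# usage with synthetic data:
# Y = [np.random.rand(50) for _ in range(10)]
# sol = fit_phoneme_set(Y, p0=np.array([1.0, 0.1, 0.0]))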
2.2 MPEG4 Animation
In MPEG-4 [9], FDPs (Facial Definition Parameters) define the shape of the model, while FAPs (Facial Animation Parameters) define the facial actions. Given the shape of the model, the animation is obtained by specifying a FAP-stream, that is, the values of the FAPs for each frame (see Figure 2). In a FAP-stream, each frame consists of two lines of parameters: the first line indicates whether each parameter is active (0 or 1), while the second stores the target values, expressed as differences from the previous frame.
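For clarity, the fragment below is a minimal reader for a stream laid out as just described (an optional header, then a mask line followed by a value line for every frame); it is an illustrative sketch of that layout, not a complete MPEG-4 FAP parser.

def read_fap_stream(path):
    # Read pairs of lines per frame: an activation mask (0/1 flags) and the
    # corresponding FAP values; lines starting with '#' (header/comments) are skipped.
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip() and not ln.startswith('#')]
    frames = []
    for mask_line, value_line in zip(lines[0::2], lines[1::2]):
        mask = [int(x) for x in mask_line.split()]
        values = [float(x) for x in value_line.split()]
        frames.append((mask, values))
    return frames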
In our case, the model uses a pseudo-muscular approach, in which muscle contractions are obtained through the deformation of the polygonal mesh around feature points that correspond to skin muscle attachments.
Fig. 2. The FAP stream: an info header (frame rate, number of frames) followed, frame by frame, by the values of the 68 FAPs.
Each feature point follows the MPEG-4 specifications, where a FAP corresponds to a minimal facial action. When a FAP is activated (i.e. when its intensity is not null), the feature point on which the FAP acts is moved in the direction signalled by the FAP itself (up, down, left, right, etc.). With the pseudo-muscular approach, the points of the facial model within the region of influence of that feature point are deformed accordingly. A facial expression is characterised not only by the muscular contraction that gives rise to it, but also by an intensity and a duration. The intensity factor is rendered by specifying an intensity for every FAP; the temporal factor is modelled by three parameters: onset, apex and offset [15].
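A minimal sketch of such a pseudo-muscular deformation is shown below; the raised-cosine falloff and the influence radius are illustrative assumptions, not the actual deformation functions used in GRETA or LUCIA.

import numpy as np

def deform_region(vertices, feature_point, displacement, radius):
    # Move the mesh vertices lying within 'radius' of a feature point,
    # attenuating the FAP displacement with a raised-cosine falloff.
    # vertices: (N, 3) array; feature_point, displacement: 3-vectors.
    d = np.linalg.norm(vertices - np.asarray(feature_point), axis=1)
    w = np.where(d < radius, 0.5 * (1.0 + np.cos(np.pi * d / radius)), 0.0)
    return vertices + w[:, None] * np.asarray(displacement)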
The FAP-stream needed to animate a FAE (Facial Animation Engine) can be completely synthesized by a specific animation model, such as the coarticulation model used in LUCIA, or it can be reconstructed from real data captured by optotracking hardware such as ELITE.
2.3 Tools: “INTERFACE”
In order to speed up the procedure for building our talking head, an integrated software environment called INTERFACE, whose block diagram is illustrated in Figure 3, was designed and implemented in Matlab©. INTERFACE simplifies and automates many of the operations needed for that purpose.
The whole processing block is designed to prepare the WAV and FAP files needed by the animation engines, both in the sense of training the engines and of actually creating the WAV and FAP files used for the final animation. The final animation, in fact, can be completely synthesized starting from an emotionally tagged input text, using our animation engine [13], or it can be reproduced from the ELITE data describing the movements of the markers positioned on human subjects.
Fig. 3. INTERFACE block diagram (see text for details).
INTERFACE handles three types of input data, from which the corresponding MPEG-4 compliant FAP-stream can be created:
- low-level data, represented by the marker trajectories captured by ELITE; these data are processed by four programs:
  - “TRACK”, which defines the pattern used for acquisition and implements the 3D trajectory reconstruction procedure;
  - “OPTIMIZE”, which trains the modified coarticulation model [13] used to move the lips of GRETA [6] and LUCIA [10], our two talking heads currently under development;
  - “IFDCIN”, which allows the definition of the articulatory parameters in relation to the marker positions, and which also acts as a DB manager for all the files used in the optimization stages;
  - “MAVIS” (Multiple Articulator VISualizer, written by Mark Tiede of ATR Research Laboratories [16]), which allows different visualizations of the articulatory signals;
- symbolic high-level XML text data, processed by:
  - “XML-EDITING”, an emotion-specific XML editor for emotion-tagged text to be used in TTS and facial animation output;
  - “EXPML2FAP”, the main core animation tool, which transforms the tagged input text into the corresponding WAV and FAP files; the former are synthesized by FESTIVAL, while the latter, needed to animate the MPEG-4 engines GRETA and LUCIA [11], are generated by the optimized animation model (designed by means of OPTIMIZE);
- single low-level FAPs, created by “XML-EDITING” (see above) and edited with “FACEPLAYER”, a direct low-level manual control of a single FAP or group of FAPs; in other words, FACEPLAYER renders what happens in GRETA and LUCIA while acting on MPEG-4 FAP points, providing immediate and useful feedback.
The TrackLab software originally supplied by BTS© [17] for ELITE is not reliable in reconstructing 3D trajectories when many rapidly moving markers lie close to each other, as usually happens in the articulatory study of facial expressions. The TRACK Matlab© software was therefore developed with the aim of avoiding marker tracking errors, which would otherwise force a long manual post-processing stage as well as a compulsory marker identification stage in the initial frame for each camera used. TRACK is quite effective in terms of trajectory reconstruction and processing speed, achieving a very high score in marker identification and reconstruction by means of reliable adaptive processing. Moreover, only a single manual intervention, the creation of the reference tracking model (pattern of markers), is needed for all the files acquired in the same working session. TRACK, in fact, tries to guess the most plausible target pattern of markers, as illustrated in Figure 4, and the user only has to accept the proposed association or correct a wrong one if needed; TRACK then runs automatically on all the files acquired in the same session.
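The fragment below is a highly simplified sketch of one way such adaptive tracking can work, associating each marker with the nearest detected point around a constant-velocity prediction; it only illustrates the idea and is not TRACK's actual algorithm.

import numpy as np

def track_step(prev, velocity, detections, max_jump=5.0):
    # prev, velocity: (M, 3) tracked marker positions and velocities (mm, mm/frame);
    # detections: (K, 3) reconstructed points in the new frame.
    predicted = prev + velocity
    if detections.size == 0:
        return predicted, velocity          # keep predictions if nothing was detected
    new_pos = predicted.copy()
    for i, p in enumerate(predicted):
        d = np.linalg.norm(detections - p, axis=1)
        j = int(np.argmin(d))
        if d[j] < max_jump:                 # accept only plausible displacements
            new_pos[i] = detections[j]
    new_vel = new_pos - prev
    return new_pos, new_vel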
Moreover, the user is free to independently configure both the marker set and the standard MPEG FAP set. The current FAP configuration is described in an initialization file and can be easily changed. The assignment of markers to the MPEG standard points is realized through a context menu, as illustrated in Figure 5.
Fig. 4. Definition of the reference model. TRACK's marker positions and names are associated with those corresponding to the real case.
Fig. 5. Marker-to-MPEG-FAP association with TRACK's reference model. The MPEG reference points (on the left) are associated with TRACK's marker positions (on the right).
In other words, as illustrated by the LUCIA examples shown in Figure 6, TRACK allows real-data-driven 3D animation of a talking face, converting the ELITE trajectories into standard MPEG-4 data and, if necessary, allowing easy editing of bad trajectories. Different MPEG-4 FAEs can obviously be animated with the same FAP-stream, allowing an interesting comparison among their different renderings.
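A minimal sketch of the core conversion step is the following: a marker displacement from its neutral position is turned into a FAP value expressed in MPEG-4 FAP units (FAPU), i.e. 1/1024 fractions of a reference facial distance; the choice of FAPU and the sign conventions are illustrative assumptions, not the exact mapping implemented in TRACK.

def displacement_to_fap(current_mm, neutral_mm, fapu_mm):
    # Convert a marker displacement (mm) along one axis into an MPEG-4 FAP value,
    # expressed as a multiple of 1/1024 of the reference distance fapu_mm
    # (e.g. the mouth-nose separation measured on the neutral face).
    displacement = current_mm - neutral_mm
    return int(round(1024.0 * displacement / fapu_mm))

# e.g. a 3 mm displacement with a 30 mm reference distance gives
# displacement_to_fap(3.0, 0.0, 30.0) == 102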
Fig. 6. Examples of single-frame emotive expressions of LUCIA. They were obtained by acquiring real movements with ELITE, automatically tracking and reconstructing them with “TRACK”, and reproducing them with LUCIA.
3 Visual Emotions
At the present time, emotional visual configurations are designed and refined, by
means of visual inspection of real data, with a software called
EMOTIONALPLAYER (EP) (see Figure 7), designed and implemented in Matlab©
on the basis of FACIALPLAYER, introduced above in 2.3, and greatly inspired by the
Emotion Disc software [18]. In the future, a strategy similar to that introduced in 2.1
will be adopted. EMOTIONAL PLAYER manages single facial movements of a syn-
thetic face in a standard MPEG-4 framework in order to create emotional and expres-
sive visual renderings in GRETA and LUCIA.
As already underlined in 2.2, in MPEG-4 animations FDPs define the shape of the model while FAPs define the facial actions. The intensity and the duration of an emotive expression are driven by an intensity factor, rendered by specifying an intensity for every FAP, and by a temporal factor, modelled by the onset, apex and offset parameters, as explained in [15].
The onset and offset represent, respectively, the time the expression takes to appear and to disappear; the apex corresponds to the duration for which the facial expression is at its peak intensity. These parameters are fundamental to convey the proper meaning of a facial expression. In our system, every facial expression is characterised by a set of FAPs. Such sets allow, for example, the creation of the 6 facial expressions corresponding to the 6 basic primary emotions of Ekman's set (Table 2), chosen here for the sake of simplicity; for every expression, only 3 levels of intensity (low, medium, high) have been simulated.
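A minimal sketch of such a temporal factor is shown below: a trapezoidal intensity envelope parameterized by the onset, apex and offset durations and scaled by one of the three discrete intensity levels; the trapezoid shape and the level values are illustrative assumptions.

INTENSITY_LEVELS = {'low': 0.33, 'medium': 0.66, 'high': 1.0}   # illustrative values

def expression_envelope(t, onset, apex, offset, level='medium'):
    # Intensity of an expression starting at t = 0: linear rise during the onset,
    # constant peak during the apex, linear decay during the offset, zero elsewhere.
    peak = INTENSITY_LEVELS[level]
    if t < 0 or t > onset + apex + offset:
        return 0.0
    if t < onset:
        return peak * t / onset
    if t < onset + apex:
        return peak
    return peak * (onset + apex + offset - t) / offset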
Fig. 7. EMOTIONALPLAYER. Controls are grouped by facial region: jaw (FAPs 3, 14, 15), eyebrows (FAPs 32, 33, 34, 35, 36), eyes (FAPs 19, ..., 28), bottom lips (FAPs 5, 10, 11, 52, 58, 57) and top lips (FAPs 4, 12, 13, 8, 9).
Anger: The inner eyebrows are pulled downward and together. The eyes are wide open. The lips are pressed against each other or opened to expose the teeth.
Fear: The eyebrows are raised and pulled together. The inner eyebrows are bent upward. The eyes are tense and alert.
Disgust: The eyebrows and eyelids are relaxed. The upper lip is raised and curled, often asymmetrically.
Happiness: The eyebrows are relaxed. The mouth is open and the mouth corners are pulled back toward the ears.
Sadness: The inner eyebrows are bent upward. The eyes are slightly closed. The mouth is relaxed.
Surprise: The eyebrows are raised. The upper eyelids are wide open, the lower ones relaxed. The jaw is open.
Table 2. The 6 basic primary emotions of Ekman's set with the corresponding facial expressions.
In our system we distinguish an “emotion basis” EB(t) from an “emotion display” ED(t); both are functions of time t. An EB(t) involves a specific zone of the face, such as the eyebrows, mouth, jaw, eyelids and so on. EB(t) also includes facial movements such as nodding, shaking or turning the head, and movements of the eyes. Each EB(t) is defined as a set of MPEG-4 compliant FAP parameters:

EB(t) = { fap3 = v1(t); ...; fap68 = vk(t) }

where v1(t), ..., vk(t) specify the FAP intensity functions created by the user. An EB(t) can also be defined as a combination of other emotion bases by using the '+' operator:

EB'(t) = EB1'(t) + EB2'(t)

The emotion display is finally obtained by a linear scaling:

ED(t) = EB(t) * c = { fap3 = v1(t) * c; ...; fap68 = vk(t) * c }

where EB is an emotion basis and c a constant; the '*' operator multiplies each of the FAPs constituting the EB by the constant c. The onset, offset and apex (i.e. the duration of the expression) of the emotion are determined by the weighted sum of the functions vk(t) (k = 3, ..., 68) created by mouse actions. In Figure 8, two simple emotional examples, for fear and happiness, are illustrated.
Fig. 8. Fear (top) and happiness (bottom) emotional examples.
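The fragment below is a minimal sketch of the EB(t)/ED(t) formalism above: an emotion basis is represented as a mapping from FAP numbers to intensity functions of time, the '+' operator combines two bases, and the '*' operator scales a basis into an emotion display. The FAP numbers and functions in the usage example are purely illustrative.

def combine(eb1, eb2):
    # The '+' operator: merge two emotion bases, summing the intensity
    # functions where the same FAP appears in both.
    result = dict(eb1)
    for fap, v2 in eb2.items():
        if fap in result:
            v1 = result[fap]
            result[fap] = lambda t, v1=v1, v2=v2: v1(t) + v2(t)
        else:
            result[fap] = v2
    return result

def display(eb, c):
    # The '*' operator: scale every FAP function by the constant c,
    # yielding the emotion display ED(t) = EB(t) * c.
    return {fap: (lambda t, v=v, c=c: c * v(t)) for fap, v in eb.items()}

# Illustrative usage: an eyebrow basis combined with a lip-corner basis,
# displayed at medium intensity (FAP numbers and time courses are made up).
# raise_brows = {31: lambda t: min(t, 1.0), 32: lambda t: min(t, 1.0)}
# lip_corners = {12: lambda t: min(t, 1.0), 13: lambda t: min(t, 1.0)}
# ed = display(combine(raise_brows, lip_corners), c=0.66)
# ed[12](0.5)   # -> 0.33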
4 Concluding Remarks
An integrated software environment designed and developed for the acquisition, crea-
tion, management, access, and use of audio/visual (AV) articulatory data, captured by
an automatic optotracking movement analyzer, has been introduced and described in
its general characteristics. These methods, tools, and procedures can surely accelerate
the development of Facial Animation Engines and in general of expressive and emo-
tive Talking Agents.
5 Future Trends
Evaluation will have to be carried out extensively in the future, and evaluation tools will be added to this environment. Perceptual tests, for example, comparing the original videos/signals with the talking head can give us useful insights about where and how the animation engine could be improved.
Results from a preliminary experiment on the adequacy of facial displays in the expression of some basic emotional states, based on a recognition task, are presented in a paper in this volume, where the potential of the adopted evaluation methodology is also discussed [19].
References
1. Working Group at the Dagstuhl Seminar on Multimodality, 2001, questionnaire on Multimodality, http://www.dfki.de/~wahlster/Dagstuhl_Multi_Modality
2. Ferrigno G., Pedotti A., “ELITE: A Digital Dedicated Hardware System for Movement
Analysis via Real-Time TV Signal Processing”, IEEE Trans. on Biomedical Engineering,
BME-32, 1985, 943-950.
3. Boersma P., “PRAAT, a system for doing phonetics by computer”, Glot International, 5
(9/10), 1996, 341-345.
4. Magno Caldognetto E., Cosi P., Drioli C., Tisato G., Cavicchio F., “Coproduction of Speech and Emotions: Visual and Acoustic Modifications of Some Phonetic Labial Targets”, Proc. AVSP 2003, Audio Visual Speech Processing, ISCA Workshop, St Jorioz, France, September 4-7, 2003, 209-214.
5. Drioli C., Tisato G., Cosi P., Tesser F., “Emotions and Voice Quality: Experiments with
Sinusoidal Modeling”, Proceedings of Voqual 2003, Voice Quality: Functions, Analysis
and Synthesis, ISCA Workshop, Geneva, Switzerland, August 27-29, 2003, 127-132.
6. Pelachaud C., Magno Caldognetto E., Zmarich C., Cosi P., “Modelling an Italian Talking
Head”, Proc. AVSP 2001, Aalborg, Denmark, September 7-9, 2001, 72-77.
7. Cosi P., Magno Caldognetto E., Perin G., Zmarich C., “Labial Coarticulation Modeling for Realistic Facial Animation”, Proc. ICMI 2002, 4th IEEE International Conference on Multimodal Interfaces, October 14-16, 2002, Pittsburgh, PA, USA, pp. 505-510.
8. Cosi P., Magno Caldognetto E., Tisato G., Zmarich C., “Biometric Data Collection For
Bimodal Applications”, Proceedings of COST 275 Workshop, The Advent of Biometric on
the Internet, November 7-8, 2002, Rome, pp. 127-130.
9. MPEG-4 standard. Home page: http://www.chiariglione.org/mpeg/index.htm.
10. Cosi P., Fusaro A., Tisato G., “LUCIA a New Italian Talking-Head Based on a Modified Cohen-Massaro's Labial Coarticulation Model”, Proc. Eurospeech 2003, Geneva, Switzerland, September 1-4, 2003, 127-132.
11. Cosi P., Tesser F., Gretter R., Avesani, C., “Festival Speaks Italian!”, Proc. Eurospeech
2001, Aalborg, Denmark, September 3-7, 2001, 509-512.
12. FACEGEN web page: http://www.facegen.com/index.htm.
13. Cohen M., Massaro D., “Modeling Coarticulation in Synthetic Visual Speech”, in Magnenat-Thalmann N., Thalmann D. (Eds.), Models and Techniques in Computer Animation, Springer Verlag, Tokyo, 1993, pp. 139-156.
14. Perin G., “Facce parlanti: sviluppo di un modello coarticolatorio labiale per un sistema di sintesi bimodale”, Master's Thesis, Univ. of Padova, Italy, 2000-1.
15. Ekman P. and Friesen W., Facial Action Coding System, Consulting Psychologist Press
Inc., Palo Alto (CA) (USA), 1978.
16. Tiede M.K., Vatikiotis-Bateson E., Hoole P., Yehia H., “Magnetometer data acquisition and analysis software for speech production research”, ATR Technical Report TRH 1999, 1999, ATR Human Information Processing Labs, Japan.
17. BTS home page: http://www.bts.it/index.php
18. Ruttkay Zs., Noot H., ten Hagen P., “Emotion Disc and Emotion Squares: tools to explore
the facial expression space”, Computer Graphics Forum, 22(1) 2003, 49-53.
19. Costantini E., Pianesi F., Cosi P., “Evaluation of Synthetic Faces: Human Recognition of
Emotional Facial Displays”, (in this volume)
In the paper we present two novel interactive tools, Emotion Disc and Emotion Squares, to explore the facial expression space. They map navigation in a 2D circle, by the first tool, or in two 2D squares, by the second tool, to the high-dimensional parameter space of facial expressions, by using a small number of predefined reference expressions. They can be used as exploration tools by researchers, or as control devices by end-users to put expressions on the face of embodied agents or avatars in applications like games, telepresence and education.