EMOTIONAL AUDIO VISUAL ARABIC TEXT TO SPEECH
M. Abou Zliekha*, S. Al-Moubayed*, O. Al-Dakkak** and N. Ghneim**
* Damascus University / Faculty of Information Technology
email: mhd-it@scs-net.org ; kamal@scs-net.org
** Higher Institute of Applied Science and Technology (HIAST)
P.O. Box 31983, Damascus, SYRIA
phone: + (963-11) 5120547, fax: + (963-11) 2237710
email: odakkak@hiast.edu.sy ; n_ghneim@netcourrier.com
ABSTRACT
The goal of this paper is to present an emotional audio-visual text-to-speech system for the Arabic language. The system is based on two entities: an emotional audio text-to-speech system, which generates speech depending on the input text and the desired emotion type, and an emotional visual model, which generates the talking head by forming the corresponding visemes. The phoneme-to-viseme mapping and the emotion shaping use a 3D parametric face model based on the abstract muscle model. We have thirteen viseme models and five emotions as parameters to the face model. The TTS produces the phonemes corresponding to the input text and the speech with the suitable prosody to convey the prescribed emotion. In parallel, the system generates the visemes and sends the controls to the facial model to obtain the animation of the talking head in real time.
1. INTRODUCTION
Speech and facial expressions are the two basic channels through which humans communicate with the external world and with each other. When emotions are expressed through both speech and face, they add considerable value to the meaning of the speech, and the information extracted from the speaker's face is very valuable to the perception of speech.
The visual information of the talking head is very useful in both speech synthesis and speech recognition systems. The collaboration between the audio phonemes and the visual visemes helps to improve speech perception and to remove the ambiguity that may occur on some phonemes (like the ambiguity between /m/ and /n/), especially in noisy environments, in addition to increasing the non-verbal communicative signals like prosody and emotions. These advantages are especially relevant when communicating with people with hearing difficulties. Audio-visual speech synthesis can be used in various domains, such as 1) automatic news and weather broadcasting, 2) educational applications (virtual lectures), and 3) aids for hearing-impaired people.
Our objective is to build an audio-visual Arabic text-to-speech system based on the following components: 1) an emotional Arabic text-to-speech system, and 2) an emotional face model and control to build the visemes.
2. THE AUDIO VISUAL SYSTEM
Several projects have been carried out on audio-visual speech. MIRALab at the University of Geneva worked on the emotional aspects [8], and other audio-visual TTS research was done at the Department of Computer Science of the University of Sheffield for English [9], at the Chinese University of Hong Kong for Cantonese [4], at INPG/ENSERG-Université Stendhal for French [2], and at the Tokyo Institute of Technology for Japanese [6]. Our work is a contribution for the Arabic language.
Figure 1: The system components (blocks: text input, text-to-phonemes-and-duration mapping, phonemes-to-visemes mapping, viseme parameters, emotion parameters, face model, animation process, synchronizer, audio speech output, visual speech output)
Figure 1 shows the components of our system. A text and an emotion choice are entered, and an animated face together with the corresponding speech is produced.
The first block includes the emotional TTS, which generates the phonemes with the appropriate prosody to give the desired speech. The output of this block controls the second block, which generates the animated talking face. The emotion gives the general facial expressions, and the phonemes give the mouth movements. These parameters control the positions of the face muscles during speech production.
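To make the data flow of Figure 1 concrete, here is a minimal sketch of how the two blocks might be chained. The function and type names (PhonemeEvent, run_emotional_tts, etc.) are illustrative assumptions, and their bodies are stand-in stubs, not our actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PhonemeEvent:
    symbol: str       # MBROLA phoneme label, e.g. "b"
    duration_ms: int  # duration assigned by the prosody/emotion rules

def run_emotional_tts(text: str, emotion: str) -> List[PhonemeEvent]:
    """Stand-in for the audio block of Section 3: text -> phonemes with emotional prosody."""
    return [PhonemeEvent("b", 80), PhonemeEvent("aa", 150), PhonemeEvent("#", 200)]

def phoneme_to_viseme(symbol: str) -> int:
    """Stand-in for the phoneme-to-viseme mapping of Table 1."""
    return {"b": 2, "aa": 1, "#": 13}.get(symbol, 13)

def animate(text: str, emotion: str) -> List[Tuple[int, int]]:
    """Drive the visual block with the TTS output; the shared phoneme
    durations are what keep the audio and the face animation in sync."""
    events = run_emotional_tts(text, emotion)
    return [(phoneme_to_viseme(e.symbol), e.duration_ms) for e in events]

print(animate("<Arabic input text>", "joy"))  # [(2, 80), (1, 150), (13, 200)]
```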
3. THE EMOTIONAL ARABIC TEXT TO SPEECH
In a previous paper [11], we presented the inclusion of emotions in an Arabic text-to-speech system. In this section we recall our TTS system at HIAST, the inclusion of emotions in it, and our results regarding this issue.
3.1 The Arabic TTS in HIAST
With the objective of building a complete text-to-speech system for standard spoken Arabic with high speech quality, we defined the following steps: (1) the definition of the phoneme set used in standard Arabic: we have 38 phonemes, namely 28 consonants, 5 vowels (/a/, /u/, /i/ and the open vowels /o/ and /e/) and 5 emphatic vowels [3]; (2) the establishment of the Arabic text-to-phonemes rules using TOPH (Orthographic-PHonetic Transliteration) [5] after its adaptation to the Arabic language; and (3) the definition of the acoustic units, the semi-syllables, and of the corpus from which these units are to be extracted.
As the Arabic syllables have only 4 forms (V, CV, CVC, CVCC), the semi-syllables have 5 forms: #CV, VC#, VCC# (# is silence), plus VCV and VCCV in continuous speech. Hence the logatoms from which those semi-syllables are extracted are, respectively: CVsasa, satVC, satVC1C2, tV1CV2sa and tV1C1C2V2sa, where the lower-case letters are pronounced as they are, V, V1, V2 scan all the vowels and C, C1, C2 scan all the consonants. Combinations that never occur in the language are excluded. The remaining steps are (4) the recording of this corpus (not finished yet), followed by its segmentation and analysis using PSOLA techniques [10], and, in parallel, (5) the incorporation of prosodic features in the synthetic speech.
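For illustration, the sketch below enumerates carrier logatoms from the five templates above. The consonant and vowel inventories are small illustrative subsets of the full phoneme set, and the filtering of combinations that do not occur in the language is not shown.

```python
from itertools import product

# Illustrative subsets only; the full system scans all 28 consonants and all vowels.
CONSONANTS = ["b", "t", "d", "s", "k", "m", "n", "r", "l", "f"]
VOWELS = ["a", "u", "i", "o", "e"]

def logatoms():
    """Enumerate carrier words (logatoms) for the five semi-syllable forms.
    Lower-case carrier letters are fixed; the C/V slots scan the inventories."""
    words = []
    for c, v in product(CONSONANTS, VOWELS):
        words.append(f"{c}{v}sasa")                      # #CV   -> CVsasa
        words.append(f"sat{v}{c}")                       # VC#   -> satVC
    for v, c1, c2 in product(VOWELS, CONSONANTS, CONSONANTS):
        words.append(f"sat{v}{c1}{c2}")                  # VCC#  -> satVC1C2
    for v1, c, v2 in product(VOWELS, CONSONANTS, VOWELS):
        words.append(f"t{v1}{c}{v2}sa")                  # VCV   -> tV1CV2sa
    for v1, c1, c2, v2 in product(VOWELS, CONSONANTS, CONSONANTS, VOWELS):
        words.append(f"t{v1}{c1}{c2}{v2}sa")             # VCCV  -> tV1C1C2V2sa
    return words

print(len(logatoms()))  # size of the (unfiltered) recording list for this subset
```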
3.2 Prosody and emotion generation
The output of the third step of our TTS is converted into the MBROLA transcription. The MBROLA system allows control of the pitch contour and the duration of each phoneme; to control the amplitude, we built an additional module. We were therefore able to test our prosody and emotion synthesis.
The automatic prosody generation in our TTS enables the hearer to distinguish assertions, interrogations, exclamations and negations. Our methodology was as follows: we began by recording groups of sentences (assertions, interrogations, ...) of different lengths (short, medium and long sentences). We extracted the prosodic features and then modelled the different prosody curves (F0, intensity, durations). For each group, we have different models according to sentence length, and a fuzzy-logic module controls the application of the different models. We apply a set of rules to produce the prosodic parameters automatically, depending on punctuation. Subjective tests showed the improvement in speech quality brought by the generated prosody.
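As an illustration of this per-phoneme control, the sketch below writes an MBROLA .pho file, in which each line carries a phoneme, its duration in milliseconds and optional (position %, pitch Hz) targets. The phoneme data and the duration/pitch scaling factors are toy values of our own, not our actual prosody models or the emotion rules of [11].

```python
# Toy input: (MBROLA phoneme, duration in ms, target F0 in Hz) - illustrative values only.
PHONEMES = [("m", 80, 120), ("a", 140, 130), ("r", 70, 125), ("a", 160, 115)]

# Hypothetical global (duration, F0) scaling factors standing in for the emotion rules.
EMOTION_SCALE = {"neutral": (1.0, 1.0), "joy": (0.9, 1.2), "sadness": (1.3, 0.85)}

def write_pho(phonemes, emotion, path="out.pho"):
    """Write an MBROLA .pho file: 'phoneme duration (position% pitchHz)*' per line."""
    dur_scale, f0_scale = EMOTION_SCALE[emotion]
    with open(path, "w") as f:
        for symbol, dur_ms, f0_hz in phonemes:
            dur = round(dur_ms * dur_scale)
            f0 = round(f0_hz * f0_scale)
            # a single pitch target at 50% of the phoneme; the real models shape full contours
            f.write(f"{symbol} {dur} 50 {f0}\n")

write_pho(PHONEMES, "joy")
```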
In another study [11], we developed rules to modify the prosodic parameters in order to synthesize emotions (joy, sadness, fear, surprise and anger). The automated tool we developed for emotional Arabic synthesis proved successful, especially in conversational contexts: the emotion recognition rates ranged from 67% to 80%.
To improve these recognition rates, we developed a multi-modal synthesis, incorporating face animation with the emotional speech.
4. THE AUDIO-VISUAL ARABIC TEXT TO SPEECH
In the present section, we develop our approach to building the emotional audio-visual TTS for Arabic. First, we define a set of visual models corresponding to phoneme clusters; then we present the face model and the control methods used to produce the animated faces.
4.1. The phonemes-to-visemes mapping
Since many phonemes cannot be differentiated on the basis of the audio signal alone (such as voiced, voiceless or nasal ones), Fischer (1968) introduced the notion of visual phonemes (visemes), where phonemes are clustered based on their visual similarity [Pelachaud]. Visemes play an important role in the selection of the suitable video segments of animated faces.
A viseme can correspond to more than one phoneme, because some phonemes are visually alike (e.g. /s/ and /z/).
For our Arabic audio-visual speech synthesis system, we have built an inventory of Arabic viseme classes and developed a phoneme-viseme mapping (see Table 1).
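This mapping can be stored as a simple lookup table. A minimal sketch follows: the groupings come from Table 1, while the dictionary layout and the helper function are ours.

```python
# Viseme class per MBROLA phoneme, following Table 1.
PHONEME_TO_VISEME = {
    "aa": 1, "q": 1,
    "b": 2, "m": 2,
    "t": 3, "d": 3, "z": 3, "s": 3, "s.": 3, "d.": 3, "t.": 3, "k": 3, "n": 3,
    "T": 4, "D": 4, "z.": 4,
    "Z": 5, "S": 5,
    "X": 6, "x": 6, "h": 6,
    "r": 7,
    "H": 8, "G": 8,
    "f": 9,
    "l": 10,
    "w": 11, "O": 11,
    "j": 12, "E": 12,
    "#": 13,  # pause
}

def phonemes_to_visemes(phonemes):
    """Map an MBROLA phoneme sequence to its sequence of viseme classes."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

print(phonemes_to_visemes(["b", "aa", "b", "#"]))  # [2, 1, 2, 13]
```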
4.2. Face modelling and control
There are several methods to model faces, such as: 1) Constructive Solid Geometry (CSG), which describes the geometry of complex scenes by applying a set of operations to primitive objects; 2) voxels, i.e. volumetric pixels (obtained by adding depth to an image); 3) parametric surfaces, which represent the surfaces using functions and ranges; and 4) polygon meshes, which are the most widely used method, as they take less time to render and can be efficiently manipulated by graphics hardware.
The polygon mesh is the model we adopt for our system; in fact, it is easy to convert any other designed model into this one. Figure 2 shows a face model as a polygon mesh (right), and the same model covered with skin (left).
Table 1: Phonemes-visemes mapping (MBROLA notation)

Arabic letter   Phoneme   Viseme number
ا               aa        1
ق               q         1
ب               b         2
م               m         2
ت               t         3
د               d         3
ز               z         3
س               s         3
ص               s.        3
ض               d.        3
ط               t.        3
ك               k         3
ن               n         3
ث               T         4
ذ               D         4
ظ               z.        4
ج               Z         5
ش               S         5
ح               X         6
خ               x         6
ـه              h         6
ر               r         7
ع               H         8
غ               G         8
ف               f         9
ل               l         10
و               w         11
                O         11
ي               j         12
ةآ              E         12
Pause           #         13
Face control is a very complex process. A real face consists of bones, muscles and skin, and any small change in any of these components may cause the perception of a different emotion and may generate a different viseme.

Figure 2: Polygon mesh face model
There are several methods to control a face model: 1) the key-frame morphing method, which animates a graphical object by creating smooth transitions between various models; this method is simple and fast, but requires the design of all the visemes for each desired emotion (n visemes and m emotions require n*m face designs); 2) the parametric control method [7], which divides the face into small controlled areas (like jaw rotation, mouth opening, etc.); the parameters of this model (the amount of area movement) are not simply related to the mesh, and any change in the mesh will influence the area boundaries and therefore requires a redefinition of the control areas; 3) the abstract muscle-based model [1], in which the face is driven by two categories of muscles: linear muscles and sphincter muscles. Linear muscles pull on the mesh and are represented by an attachment point and a vector; sphincter muscles represent muscle squeezing by the change of a corresponding ellipse. Neither of these muscle kinds is connected to the polygon mesh: their action is limited to an influence zone.
Our approach depends on two face control models: the abstract muscle-based model and key framing. The abstract muscle-based model is used to produce the shape of the face model for each desired emotion, with the corresponding visemes; the interpolation between two visemes is done by key framing. The big advantage of the abstract muscle-based model is that the mesh-modification process is independent of the topology of the face. Furthermore, if we have n visemes and m emotions, the model requires only n+m parameter sets, but this comes at the expense of more processing time than the other models.
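To give an idea of what topology-independent muscle control means, here is a much-simplified sketch of a linear-muscle pull in the spirit of the abstract muscle model [1]. The fall-off functions and parameter values are ours and do not reproduce Waters' exact formulation.

```python
import math

def apply_linear_muscle(vertices, head, tail, contraction,
                        max_angle=math.radians(40.0), max_dist=1.0):
    """Pull mesh vertices toward the muscle head with a cosine fall-off in
    angle and distance; vertices outside the influence zone are untouched."""
    axis = [t - h for t, h in zip(tail, head)]
    axis_len = math.sqrt(sum(a * a for a in axis)) or 1.0
    out = []
    for v in vertices:
        d = [c - h for c, h in zip(v, head)]
        dist = math.sqrt(sum(c * c for c in d))
        if dist == 0.0 or dist > max_dist:
            out.append(list(v))               # outside the radial influence zone
            continue
        cos_a = sum(a * c for a, c in zip(axis, d)) / (axis_len * dist)
        angle = math.acos(max(-1.0, min(1.0, cos_a)))
        if angle > max_angle:
            out.append(list(v))               # outside the angular influence zone
            continue
        k = contraction * math.cos(angle / max_angle * math.pi / 2) \
                        * math.cos(dist / max_dist * math.pi / 2)
        out.append([c - k * dc / dist for c, dc in zip(v, d)])
    return out

# usage: one muscle pulling the vertices of a toy mesh toward its attachment point
mesh = [[0.2, 0.1, 0.0], [0.5, 0.5, 0.0], [2.0, 2.0, 0.0]]
print(apply_linear_muscle(mesh, head=[0.0, 0.0, 0.0], tail=[1.0, 0.0, 0.0], contraction=0.1))
```

The point of the sketch is that the muscle acts on whatever vertices happen to fall in its influence zone, so the same muscle definition works for any mesh topology.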
4.3. Emotional Visemes animation
The production and animation of the different visemes is done as follows: 1) the emotion morphing is applied to the mesh to obtain the desired emotion; this mesh becomes the starting mesh of the animation process; 2) the viseme morphing is applied to each pair of successive visemes, using key-frame models.
Applying the viseme morphing to the mesh after applying the emotion morphing produces the correct viseme shapes, because the abstract muscle-based model is independent of the topology of the face. This fact justifies taking the emotional face as the basic face mesh for the following visemes. This process produces the shapes of the successive visemes corresponding to the output phonemes of the TTS for the desired emotion.
The animation between two consecutive visemes could be produced with the same approach (the abstract muscle-based model), by extracting the difference between the parameter values and applying the morphing step by step. However, this takes a lot of execution time and does not yield real-time animation.
We therefore chose a key-frame animation process to produce the animation between two visemes, considering the emotional viseme shape as temporarily static. To pass from one viseme shape to another, we use linear interpolation functions. The animation timing is taken from the phoneme durations given by the emotional TTS, and the frame update time is adaptive, depending on the phoneme durations.
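A minimal sketch of this key-frame step is given below: the parameter vectors of two successive emotional visemes are interpolated linearly over the phoneme duration supplied by the TTS, and the number of frames adapts to that duration. The parameter vectors, the 30 ms frame period and the helper name are illustrative assumptions, not our exact implementation.

```python
def interpolate_visemes(params_a, params_b, duration_ms, frame_ms=30):
    """Yield (time_ms, parameter_vector) key frames from viseme A to viseme B,
    linearly interpolated over the phoneme duration given by the TTS."""
    n_frames = max(1, round(duration_ms / frame_ms))
    for i in range(n_frames + 1):
        t = i / n_frames  # 0 .. 1, linear transition
        frame = [a + t * (b - a) for a, b in zip(params_a, params_b)]
        yield i * duration_ms / n_frames, frame

# usage with toy parameter vectors (e.g. jaw rotation, mouth opening, lip rounding)
viseme_b  = [0.1, 0.2, 0.6]   # closed-lip viseme (class 2: /b/, /m/) under some emotion
viseme_aa = [0.6, 0.8, 0.2]   # open viseme (class 1: /aa/, /q/) under the same emotion
for t_ms, frame in interpolate_visemes(viseme_b, viseme_aa, duration_ms=150):
    print(round(t_ms), [round(x, 2) for x in frame])  # would be sent to the face model
```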
5. RESULTS
To study the influence of the visual component on the intelligibility of emotional speech, we produced five audio-visual sentences for each emotion. These sentences were presented to 10 people, and each individual was asked to give the perceived emotion for each sentence. Table 2 shows the results of this test.
Table 2: Audio-visual emotion recognition rates

Synthesized \ Identified   Anger   Joy   Sadness   Fear   Surprise   Others
Anger                        92%    0%        0%     0%         0%       8%
Joy                           0%   87%        0%     0%         8%       5%
Sadness                       0%    0%       90%     3%         0%       7%
Fear                          0%    0%        2%    92%         0%       6%
Surprise                      0%    5%        0%     0%        92%       3%
The Others column represents the percentage of cases where the emotion perceived from the voice did not match the one given by the face, or where the perception was not deterministic in either the audio or the video; in these cases the visual channel was more expressive than the audio. In Table 3, we recall the audio-only emotion recognition results obtained in [11], to show how much the visual channel improves the recognition rates.
Table 3: Audio emotion recognition rates

Synthesized \ Identified   Anger   Joy   Sadness   Fear   Surprise   Others
Anger                        75%    0%        2%     7%         0%       6%
Joy                           0%   67%        0%     2%        13%      18%
Sadness                       5%    0%       70%     5%         0%      20%
Fear                          3%    0%        5%    80%         0%      12%
Surprise                      0%   10%        0%     2%        73%      15%
Compared with the emotional TTS alone, we found that the recognition rates increased to as high as 92%, whereas they were no more than 80% before; the minimum recognition rate improved from 67% to 87%.

Figure 3: A screenshot of the execution of our system

Figure 3 shows a screenshot of the execution of our system. The spoken sentence is "I have passed the exam", with the Joy emotion.
The system runs under Windows on a 2.3 GHz CPU with 64 MB of graphics card memory; the average frame animation duration was 30 ms.
6. CONCLUSION
In this paper, we presented the first version of an emotional audio-visual speech synthesizer for Arabic texts.
The encouraging results shown above give us the motivation to go further, to reach a perfect coherence between the sound and the image, and perhaps to develop audio-visual synthesis with various speakers.
REFERENCES
[1] K. Waters, "A muscle model for animating three-dimensional facial expression," in Proc. SIGGRAPH '87, vol. 21, no. 4, pp. 17-24, 1987.
[2] B. Le Goff and C. Benoît, "A text-to-audiovisual-speech synthesizer for French," in Proc. ICSLP '96, 1996.
[3] O. Al-Dakkak and N. Ghneim, "Towards Man-Machine Communication in Arabic," in Proc. Syrian-Lebanese Conference, Damascus, Syria, October 12-13, 1999.
[4] J. Q. Wang, K. H. Wong, P. A. Heng, H. M. Meng and T. T. Wong, "A real-time Cantonese text-to-audiovisual speech synthesizer," in Proc. ICASSP 2004, Montreal, Quebec, Canada, 17-21 May 2004.
[5] V. Aubergé, "La Synthèse de la parole : des règles aux lexiques," PhD thesis, Université Pierre Mendès France, Grenoble 2, 1991.
[6] M. Tamura, S. Kondo, T. Masuko and T. Kobayashi, "Text-to-audio-visual speech synthesis based on parameter generation from HMM," in Proc. EUROSPEECH '99, Budapest, Hungary, pp. 959-962, 1999.
[7] E. Tanguy, 3D Facial Animation, http://membres.lycos.fr/maybeweb/projet/dissertation.doc.
[8] N. Magnenat-Thalmann, publication list, http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/m/Magnenat=Thalmann:Nadia.html
[9] J. Edge, "Expressive visual speech using geometric muscle functions," in Proc. Eurographics UK, pp. 11-18, April 2001.
[10] E. Moulines and J. Laroche, "Non-parametric techniques for pitch-scale and time-scale modification of speech," Speech Communication, vol. 16, pp. 175-205, 1995.
[11] O. Al-Dakkak, N. Ghneim, M. Abou Zleikha and S. Al-Moubayed, "Emotion inclusion in an Arabic text-to-speech," in Proc. EUSIPCO 2005, Turkey, 4-8 September 2005.
14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006, copyright by EURASIP