EMOTION INCLUSION IN AN ARABIC TEXT-TO-SPEECH
O. Al-Dakkak*, N. Ghneim*, M. Abou Zliekha** and S. Al-Moubayed**
* HIAST
P.O. Box 31983, Damascus, SYRIA
phone: + (963-11) 5120547, fax: + (963-11) 2237710.
email: odakkak@hiast.edu.sy ; email: n_ghneim@netcourrier.com
**Damascus University/Faculty of Information Technology
email: mhd-it@scs-net.org ; email: kamal@scs-net.org
ABSTRACT
Many attempts have been made to add emotions to synthesized speech [1]. Few have addressed the Arabic language. In the present paper, we describe work done to incorporate the emotions anger, joy, sadness, fear and surprise into an educational Arabic text-to-speech system. After an introduction about emotions, we give a brief description of our text-to-speech system, then discuss our methodology for extracting emotion-generation rules, and finally present the results obtained and draw conclusions.
1. INTRODUCTION
Compared with human speech, synthetic speech is in general less intelligible and less expressive [2]. These are drawbacks for conversational computer systems and for reading machines. The role of emotions in speech is to provide the context in which the speech should be interpreted and to signal speaker intentions, and this is essential in synthesized speech as well.
A synthesis system must simulate emotions in order to produce them. Emotions can be modelled in two ways: (1) a generative (speaker) model, which depends on the mental and physical state of the speaker and on the syntax and semantics of the utterance, and (2) an acoustic (listener) model, which describes the acoustic signal parameters as perceived by the listener [2], [3]. We have adopted the latter in our work.
In the present article, we are mainly concerned with the production of emotions in Arabic and with the incorporation of these emotions in the synthetic speech produced by an Arabic TTS system.
2. ARABIC TTS SYSTEM
We intend to build a complete text-to-speech system for standard spoken Arabic with high speech quality. The steps towards this goal were (1) the definition of the phoneme set used in standard Arabic, including the open /E/ and /O/ [4], (2) the establishment of the Arabic text-to-phoneme rules using the TOPH (Orthographic-PHonetic transliteration) system [5] after its adaptation to the Arabic language [6], (3) the definition of the acoustic units, the semi-syllables, and of the corpus from which these units are to be extracted, and in parallel, (4) the recording of the corpus and the extraction of the acoustic units, which are then analyzed using PSOLA techniques [7], and (5) the incorporation of prosodic features into the synthetic speech.
The first three steps are already done. Although we intend to use a richer phoneme set than the one offered by the MBROLA voices [4], [8], we chose the MBROLA system to perform preliminary text-to-speech synthesis. Arabic is rather a syllabic language, and semi-syllables are more appropriate units for its synthesis [9], [10]. Our corpus has already been defined and is in the recording phase.
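As a purely illustrative sketch of step (2), the following Python fragment maps a fully diacritized word to phonemes with a hand-written mini rule table; the table and the rules are invented for the illustration and are far simpler than the TOPH rules actually used.

    G2P = {"ب": "b", "ت": "t", "َ": "a", "ُ": "u", "ِ": "i"}   # sample consonants and short vowels (toy set)
    LONG = {"ا": "a", "و": "u", "ي": "i"}                     # letters that lengthen the preceding short vowel

    def to_phonemes(word):
        # Left-to-right scan over a fully diacritized word.
        phones = []
        for ch in word:
            if ch in G2P:
                phones.append(G2P[ch])
            elif ch in LONG and phones and phones[-1] == LONG[ch]:
                phones[-1] = phones[-1] * 2                    # "a" followed by 'alif becomes "aa"
        return phones

    print(to_phonemes("بَاتَ"))   # -> ['b', 'aa', 't', 'a']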
The output of our third step is converted to the MBROLA transcription. The MBROLA system allows control of the pitch contour and duration of each phoneme, which enabled us to test our prosody and emotion synthesis. We recall works previously done on general prosody generation for Arabic TTS, such as those in [11], [12].
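For illustration, the sketch below (in Python) writes an utterance in the MBROLA phoneme-file format, in which each line carries a phoneme label, a duration in milliseconds, and optional (position-in-%, F0-in-Hz) pairs; the labels, durations and pitch values are invented placeholders, not output of our system.

    # Each entry: (phoneme, duration in ms, list of (position %, F0 Hz) pitch points).
    utterance = [
        ("_", 100, []),                      # leading silence
        ("b", 60,  [(0, 120)]),
        ("a", 140, [(20, 130), (80, 150)]),  # rising pitch over the vowel
        ("t", 70,  []),
        ("_", 100, []),                      # trailing silence
    ]

    with open("test.pho", "w") as f:
        for phoneme, dur, pitch in utterance:
            points = " ".join("%d %d" % (pos, f0) for pos, f0 in pitch)
            f.write(("%s %d %s" % (phoneme, dur, points)).rstrip() + "\n")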
In the present paper we focus on the incorporation of
emotions in the system.
3. RULE EXTRACTION FOR VARIOUS EMOTIONS
3.1 Methodology
The most crucial acoustic parameters to consider for emotion synthesis are the prosodic parameters: pitch, duration and intensity [2], [3]. The variations of each of these parameters are described through the following sub-parameters [13], [2], [14], [15]:
F0 parameter:
- F0 range (difference between F0max and F0min)
- Variability (degree of variability: high, low, ...)
- Average F0
- Contour slope (shape of the contour slope)
- Jitter (irregularities between successive glottal pulses)
- Pitch variation according to phoneme class
Duration parameter:
- Speech rate
- Silence rate
- Duration variation according to phoneme class
- Duration variation according to pitch
Intensity parameter:
- Intensity variation according to pitch
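To make the F0 sub-parameters concrete, the sketch below computes several of them from a voiced-frame pitch contour, assumed to be available as an array of F0 values in Hz sampled at a fixed step (for example exported from PRAAT); it is an approximation for illustration, not the exact procedure of our study.

    import numpy as np

    def f0_statistics(f0, frame_step=0.01):
        # f0: voiced-only pitch contour in Hz; frame_step: analysis step in seconds.
        f0 = np.asarray(f0, dtype=float)
        t = np.arange(len(f0)) * frame_step
        slope = np.polyfit(t, f0, 1)[0]               # overall contour slope (Hz/s)
        periods = 1.0 / f0                            # approximate glottal periods (s)
        jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
        return {
            "f0_mean": f0.mean(),
            "f0_range": f0.max() - f0.min(),
            "f0_variability": f0.std(),               # proxy for "high"/"low" variability
            "contour_slope": slope,
            "jitter": jitter,                         # relative period irregularity
        }

    print(f0_statistics([118, 121, 119, 124, 127, 125, 131, 134]))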
Our methodology was to (1) record a corpus of sentences, both without emotion and with the different emotions, (2) analyze these sentences to extract the various parameters and sub-parameters and derive rules, (3) synthesize emotions according to these rules, and (4) test the results and tune the rules where necessary.
3.2 Recording, analysis and rule extraction
Twenty sentences were chosen for each emotion. Each sentence was recorded twice, once without emotion and once with the intended emotion. All these sentences were analyzed with the PRAAT system to obtain the prosodic parameters. A statistical study then identified the relevant changes between the pairs of sentences for each emotion. The results are given in Table 1.
Emotion    Prosodic rules
Anger      F0 mean: +40% to +75%
           F0 range: +50% to +100%
           F0 at vowels and semi-vowels: +30%
           F0 slope: +
           Speech rate: +
           Silence rate: -
           Duration of vowels and semi-vowels: +
           Intensity mean: +
           Intensity monotonous with F0
           Others: F0 variability: +, F0 jitter: +
Joy        F0 mean: +30% to +50%
           F0 range: +50% to +100%
           F0 at vowels and semi-vowels: +30%
           F0 slope: -
           Speech rate: -
           Duration of vowels and semi-vowels: +
           Intensity mean: +
           Intensity monotonous with F0
           Others: F0 variability: +, F0 jitter: +
Sadness    F0 mean: +40% to +70%
           F0 range: +180% to +220%
           F0 at vowels and semi-vowels: +
           Speech rate: -
           Silence rate: +
           Duration of vowels and semi-vowels: +
           Intensity mean: +
Fear       F0 mean: +50% to +100%
           F0 range: +100% to +150%
           F0 at vowels, semi-vowels, nasals and fricatives: +
           Speech rate: +
           Silence rate: -
           Duration of vowels and semi-vowels: +
           Intensity mean: +
           Intensity monotonous with F0
           Others: F0 variability: +, F0 jitter: +
Surprise   F0 mean: +50% to +80%
           F0 range: +150% to +200%
           F0 at vowels and semi-vowels: +
           Speech rate: +
           Silence rate: -
           Duration of vowels and semi-vowels: +
           Others: F0 variability: +
Table 1: Results on natural speech
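As a sketch of the statistical step that produced Table 1, the fragment below averages the relative change of each measured parameter over all neutral/emotional sentence pairs of one emotion; the parameter names and measurements are invented placeholders, and the actual study also examined per-phoneme-class effects.

    import numpy as np

    def relative_changes(neutral, emotional):
        # neutral, emotional: lists of per-sentence parameter dictionaries, pair by pair.
        return {
            k: 100.0 * np.mean([(e[k] - n[k]) / n[k]
                                for n, e in zip(neutral, emotional)])
            for k in neutral[0]
        }

    # Invented measurements for two "anger" sentence pairs:
    neutral   = [{"f0_mean": 120, "f0_range": 60, "speech_rate": 5.0},
                 {"f0_mean": 115, "f0_range": 55, "speech_rate": 4.8}]
    emotional = [{"f0_mean": 175, "f0_range": 105, "speech_rate": 5.6},
                 {"f0_mean": 168, "f0_range": 98,  "speech_rate": 5.5}]

    print(relative_changes(neutral, emotional))   # f0_mean about +46%, f0_range about +77%, ...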
3.3 Emotion synthesis
To test the above rules, we developed a tool linked to our TTS system that controls the emotional parameters over the Arabic text automatically. The inherent synthetic (emotionless) prosody built into the system is rather coarse, so applying the above rules did not always give the desired emotion perception. We therefore had to tune the rules to cope with the synthesizer. The final experimental emotional rules are given in Table 2.
Emotion    Prosodic rules
Anger      F0 mean: +30%
           F0 range: +30%
           F0 at vowels and semi-vowels: +100%
           Speech rate: +75% to +80%
           Duration of vowels and semi-vowels: +30%
           Duration of fricatives: +20%
Joy        F0 mean: +50%
           F0 range: +50%
           F0 at vowels and semi-vowels: +30%
           F0 at fricatives: +30%
           Speech rate: +75% to +80%
           Duration of vowels and semi-vowels: +30%
           Duration of last vowel phonemes: +20%
           Others: F0 variability: +40%
Sadness    F0 range: +130%
           F0 at vowels and semi-vowels: +120%
           F0 at fricatives: +120%
           Speech rate: -130%
Fear       F0 mean: +40%
           F0 range: +40%
           F0 at vowels, semi-vowels, nasals and fricatives: +30%
           Speech rate: -75% to -80%
           Others: F0 variability: +60%, F0 jitter: +3%
Surprise   F0 mean: +220%
           F0 at vowels and semi-vowels: +150%
           Speech rate: -110%
           Duration of vowels: +200%
           Duration of semi-vowels: +150%
           Others: F0 variability: +60%
Table 2: Results on synthetic speech
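To illustrate how rules of this kind can be applied on top of the neutral prosody, the sketch below rescales the durations and pitch points of an MBROLA-style phoneme list; the factors loosely follow the anger row of Table 2, while the phoneme classes, the interaction of the rules and the input data are simplified placeholders rather than our actual tool.

    VOWELS = {"a", "aa", "i", "ii", "u", "uu"}
    SEMI_VOWELS = {"w", "y"}

    # Simplified multiplicative factors in the spirit of the "anger" row of Table 2.
    ANGER = {"f0_global": 1.30, "f0_vowel": 2.00, "dur_vowel": 1.30, "rate": 1.775}

    def apply_anger(utterance):
        # utterance: list of (phoneme, duration in ms, [(position %, F0 Hz), ...]).
        out = []
        for ph, dur, pitch in utterance:
            sonorant = ph in VOWELS or ph in SEMI_VOWELS
            f0_scale = ANGER["f0_vowel"] if sonorant else ANGER["f0_global"]
            dur_scale = ANGER["dur_vowel"] if sonorant else 1.0
            new_dur = round(dur * dur_scale / ANGER["rate"])     # faster overall speech rate
            new_pitch = [(pos, round(f0 * f0_scale)) for pos, f0 in pitch]
            out.append((ph, new_dur, new_pitch))
        return out

    neutral = [("b", 60, [(0, 120)]), ("aa", 140, [(20, 130), (80, 150)])]
    print(apply_anger(neutral))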
Figures 1 to 5 show the F0 contours of one sentence per emotion together with its emotionless counterpart.
Figure 1: Anger and emotionless F0 contours, /من تظن نفسك؟/ "Who do you think you are?"
Figure 2: Joy and emotionless F0 contours, /زالت الغيوم من السماء/ "No more clouds in the sky."
Figure 3: Sadness and emotionless F0 contours, /أنا حزين جداً اليوم/ "I am so sad today!"
Figure 4: Fear and emotionless F0 contours, /يا إلهي ما هذا المنظر المخيف/ "God! What a scary scene!"
Figure 5: Surprise and emotionless F0 contours, /يا له من منظر جميل!/ "What a beautiful scene!"
4. RESULTS
Using the experimental rules, five sentences per emotion were synthesized and played to 10 listeners. Each listener was asked to name the perceived emotion of each sentence. Table 3 shows the results of this test.
Synthesized   Perceived emotion
emotion       Anger    Joy    Sadness    Fear    Surprise    Others
Anger          75%      0%       2%       7%        0%          6%
Joy             0%     67%       0%       2%       13%         18%
Sadness         5%      0%      70%       5%        0%         20%
Fear            3%      0%       5%      80%        0%         12%
Surprise        0%     10%       0%       2%       73%         15%
Table 3: Emotion recognition rates (rows: synthesized emotion; columns: perceived emotion)
Some listeners felt that certain test sentences conveyed more than one emotion.
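For completeness, a minimal sketch of how the recognition rates in Table 3 can be tallied from the individual listener answers; the sample answers below are invented.

    from collections import Counter

    EMOTIONS = ["Anger", "Joy", "Sadness", "Fear", "Surprise", "Others"]

    def recognition_rates(responses):
        # responses: list of (synthesized emotion, perceived emotion) pairs.
        rates = {}
        for target in EMOTIONS[:-1]:
            answers = [p for t, p in responses if t == target]
            counts = Counter(answers)
            total = len(answers) or 1
            rates[target] = {e: 100.0 * counts[e] / total for e in EMOTIONS}
        return rates

    # Invented sample: four listener judgements of anger sentences.
    sample = [("Anger", "Anger"), ("Anger", "Anger"), ("Anger", "Fear"), ("Anger", "Anger")]
    print(recognition_rates(sample)["Anger"])   # Anger: 75.0, Fear: 25.0, others: 0.0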
5. CONCLUSION
An automated tool has been developed for emotional Arabic synthesis. The prosodic model proposed and tested in this work proved successful, especially when applied in conversational contexts.
Further work will incorporate other emotions such as disgust and annoyance.
The quality of the TTS system and of its prosody plays a crucial role in emotion synthesis. We intend to refine our prosodic model; the emotional rules will then have to be revalidated to match it.
REFERENCES
[1] M. Schröder, "Emotional speech synthesis: A review", in Proc. Eurospeech 2001, Aalborg, Denmark, vol. 1, pp. 561-564, 2001. URL: http://www.dfki.de/~schroed/publications.html
[2] J. Cahn, "The generation of affect in synthesized speech", Journal of the American Voice I/O Society, vol. 8, pp. 1-19, Jul. 1990.
[3] O. Pierre-Yves, "The production and recognition of emotions in speech: features and algorithms", International Journal of Human-Computer Studies, vol. 59, no. 1-2, pp. 157-183, Jul. 2003.
[4] O. Al-Dakkak and N. Ghneim, "Towards Man-Machine Communication in Arabic", in Proc. Syrian-Lebanese Conference, Damascus, Syria, October 12-13, 1999.
[5] V. Aubergé, "La synthèse de la parole : des règles aux lexiques", Thèse de l'Université Pierre Mendès France, Grenoble 2, 1991.
[6] N. Ghneim and H. Habash, "Text-to-Phonemes in Arabic", Damascus University Journal for the Basic Sciences, vol. 19, no. 1, 2003.
[7] E. Moulines and J. Laroche, "Non-parametric techniques for pitch-scale and time-scale modification of speech", Speech Communication, vol. 16, pp. 175-205, 1995.
[8] T. Dutoit, V. Pagel, N. Pierret, F. Bataille and O. van der Vrecken, "The MBROLA project: towards a set of high quality speech synthesizers free of use for non-commercial purposes", in Proc. ICSLP'96, pp. 1393-1396, 1996.
[9] N. Chenfour, A. Benabbou and A. Mouradi, "Etude et évaluation de la di-syllabe comme unité acoustique pour le système de synthèse arabe PARADIS", Second International Conference on Language Resources and Evaluation, Athens, Greece, 31 May-2 June 2000.
[10] N. Chenfour, A. Benabbou and A. Mouradi, "Synthèse de la parole arabe TD-PSOLA : génération et codage automatiques du dictionnaire", Second International Conference on Language Resources and Evaluation, Athens, Greece, 31 May-2 June 2000.
[11] S. Nasser Eldin, H. Abdel Nour and A. Rajouani, "Enhancement of a TTS System for Arabic Concatenative Synthesis by Introducing a Prosodic Model", ACL/EACL 2001 Workshop, Toulouse, France, 2001.
[12] S. Baloul, M. Alissali, M. Baudry and P. Boula de Mareuil, "Interface syntaxe-prosodie dans un système de synthèse de la parole à partir du texte en arabe", XXIVèmes Journées d'Etudes sur la Parole, CNRS/Université Nancy 2, Nancy, France, 24-27 June 2002.
[13] F. Zotter, "Emotional speech", URL: http://spsc.inw.tugraz.at/courses/asp/ws03/talks/zotter.pdf
[14] I. R. Murray, M. D. Edgington, D. Campion and J. Lynn, "Rule-based emotion synthesis using concatenated speech", ISCA Workshop on Speech & Emotion, Northern Ireland, 2000, pp. 173-177.
[15] J. M. Montero, J. Gutiérrez-Arriola, J. Colás, E. Enríquez and J. M. Pardo, "Analysis and modelling of emotional speech in Spanish", URL: http://lorien.die.upm.es/~juancho/conferences/0237.pdf