Context in source publication

Context 1
... synthesizer only generates pitch values for vowels because Malay speech utterances rely on vowels for the proper articulation of emotions. All recorded sentences for each type of emotion were subjected to the same process of extraction, comparison and formulation of differential rates, and an average dpmr and ppmr value was calculated for each type of emotion. The emotional synthesizer concentrates its prosody manipulation on the beginning and end of the sentence, because most of the F0 movement occurs there.

The emotional synthesizer reported on in this paper combines two approaches. Prosody templates are used for text input made up of four words, each with a combination of two or three syllables; any other form of input text is synthesized by parametric prosody manipulation. When users enter text into the emotion synthesizer and select the type of emotion to be synthesized, the synthesizer chooses the approach by counting the number of words and examining the syllable structure of the input text. Figure 1 below shows the flowchart of the emotional synthesizer.

The template approach replaces the entire standard pitch and duration parameter set with the emotional parameters available in the prosody database. Each type of syllable combination has its own parameters. Altogether, there are sixteen possible combinations of two and three syllables across the four words, which means there are sixteen templates for each type of emotion. The overall function of this approach is illustrated in figure 2 below. To synthesize emotional speech using the template approach, the first step is to determine the syllable structure so that the appropriate prosody parameters can be selected. Next, the standard prosody parameters are removed and replaced with the corresponding parameters available in the prosody database. The replacement takes place at the individual phoneme level. The syllable format of the prosody templates follows a CVC arrangement; the input text, however, can contain any combination of CVC, V or CV.
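As a concrete illustration, the selection between the two approaches and the phoneme-level template replacement could be sketched roughly as below. This is a minimal sketch under stated assumptions: the paper gives no implementation, so the function names, the syllabified-word input and the slot-based syllable representation are all invented for illustration.

```python
# Hypothetical sketch only: the data shapes (syllabified words,
# slot dicts mapping "C1"/"V"/"C2" to prosody values) are assumptions.

def choose_approach(syllabified_words):
    """Pick the synthesis approach for an input text.

    syllabified_words: one list of syllables per word, e.g.
    [["sa", "ya"], ["su", "ka"], ["ma", "kan"], ["na", "si"]].
    Templates cover exactly four words of two or three syllables each;
    two choices per word over four words gives the 2**4 = 16 templates
    per emotion mentioned in the text.
    """
    if (len(syllabified_words) == 4
            and all(len(word) in (2, 3) for word in syllabified_words)):
        return "template"      # prosody-database template (PDT) approach
    return "parametric"        # parametric prosody manipulation (PPM)

def replace_with_template(syllable, template):
    """Replace standard prosody with template prosody, phoneme by phoneme.

    Templates store a full CVC syllable; the input syllable may be
    CVC, CV or V, so only the slots present in both are replaced and
    unmatched template slots are ignored.
    Both arguments map slots ("C1", "V", "C2") to (duration, pitch).
    """
    out = dict(syllable)
    for slot in ("C1", "V", "C2"):
        if slot in syllable and slot in template:
            out[slot] = template[slot]
    return out
```

Under these assumptions, a V-only syllable would have just its "V" slot overwritten while the template's consonant values are discarded, matching the partial-replacement behaviour described in the text.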
If the input text has a CV or V structure, the emotional synthesizer replaces only the phonemes available in the template and ignores any phoneme that has no match. For example, the word “ayah” (a-yah) has a V structure followed by CVC. The replacement for the first syllable is therefore based only on the vowel, while the second syllable uses the entire CVC prosody template. Since there is no third syllable, the process ends at the second syllable and moves on to the next word. This process is illustrated in figure 3 below.

For any other form of text input, the emotion synthesizer uses parametric prosody manipulation. The first step in this approach is to isolate and identify individual phonemes as consonants or vowels. The emotion synthesizer then manipulates the duration of each phoneme according to the dpmr factor available for each type of emotion. Identified vowels are manipulated not only for duration but also for pitch, using the ppmr factor, as illustrated in diagram 4 below.

The synthesized outputs of the two approaches were tested separately. For the prosody template approach, a listening test identical to the earlier test was carried out. Sixteen new sentences for each type of emotion were constructed with different syllable arrangements and structures, synthesized, and converted to waveform files. All sound files were renamed and arranged randomly. Each participant listened to the voice snippets and identified the intended emotion on forced-choice answer sheets. The perception test revealed that the recognition rate for anger was highest at 82.67%, followed by happiness at 80.33%. The recognition rates for sadness and fear were much lower, at 68.33% and 65.69% respectively. Table 6 below tabulates the results of the perception test for the PDT approach.
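The parametric step described above (duration scaling via dpmr on every phoneme, additional pitch scaling via ppmr on vowels only) could be sketched as follows. The (phoneme, duration, pitch) tuple representation and the example factors are assumptions for illustration, not values from the paper.

```python
# Hypothetical sketch of parametric prosody manipulation (PPM).
# Each phoneme is an invented (symbol, duration_ms, pitch_hz) tuple.

MALAY_VOWELS = set("aeiou")

def apply_parametric_prosody(phonemes, dpmr, ppmr):
    """Scale every phoneme's duration by dpmr; for vowels, also scale
    pitch by ppmr. Consonants keep their standard pitch, mirroring the
    synthesizer's vowel-only pitch generation."""
    result = []
    for phoneme, duration, pitch in phonemes:
        duration *= dpmr
        if phoneme in MALAY_VOWELS:
            pitch *= ppmr
        result.append((phoneme, duration, pitch))
    return result
```

For instance, with an assumed dpmr of 1.2 and ppmr of 1.1, every phoneme is lengthened by 20% but only the vowels are raised in pitch by 10%.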
For the parametric approach, the listening test took the form of an acceptance test in which participants entered their own sentences with a variety of lengths and syllable structures, excluding sentences matching the word count and syllable structure handled by the PDT approach. Each participant entered ten different sentences for each type of emotion and listened to the synthesized output. If a participant judged the synthesized speech to be acceptable emotional output, they marked the ‘agree’ column on the answer sheet; otherwise, they marked the ‘disagree’ column and indicated the emotion they perceived in the synthesized output.

The acceptance test, conducted with 20 participants, shows that the synthesizer can produce recognizable emotional synthesis for any given input text. From a total of 200 input sentences for each emotion, anger had the highest acceptance rate at 81.00%, followed by happiness at 77.00%; the acceptance rate for sadness was 69.00% and for fear 65.00%, as illustrated in figure 5 below. To enable comparison between the two approaches, a confusion matrix was constructed from the interpretations participants gave during the acceptance test. Table 7 below shows the confusion matrix for the PPM approach based on those interpretations. Some participants interpreted the synthesized anger as fear or ecstasy; happiness was sometimes perceived as excitement; sadness was confused with fear, boredom or disgust; and fear was mistaken for sadness, anger, ecstasy or disgust.

Prosody manipulation has been credited with the ability to generate emotional synthesized speech. The most commonly used manipulation technique is the rule-based approach, in which prosody values are generated by numerical processing applied either at the phoneme level or at the sentence level.
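The tallying behind such a confusion matrix can be sketched as below: an ‘agree’ response counts toward the intended emotion’s diagonal, while a ‘disagree’ response is tallied under whichever emotion the participant reported. The response format is an assumption for illustration; the paper does not describe its scoring code.

```python
# Hypothetical sketch of building a confusion matrix from
# acceptance-test answer sheets.
from collections import Counter

def build_confusion(responses):
    """Tally acceptance-test responses into a confusion matrix.

    responses: (intended, perceived) pairs, where perceived equals
    intended when the participant marked 'agree'. Returns the matrix
    as nested counts plus the per-emotion acceptance rate (the share
    of responses on the diagonal).
    """
    matrix = {}
    for intended, perceived in responses:
        matrix.setdefault(intended, Counter())[perceived] += 1
    rates = {emotion: counts[emotion] / sum(counts.values())
             for emotion, counts in matrix.items()}
    return matrix, rates
```

With 200 sentences per emotion, an acceptance rate of 81.00% for anger would correspond to 162 ‘agree’ responses, with the remaining 38 distributed over the perceived-emotion columns.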
Little emphasis has been placed on prosody generation using database templates, in view of the complexity of the input text, even though database templates produce better variation in output than the rule-based approach. While each approach has its own advantages, combining the two should improve the prosodic quality of the synthesized output. The results show that the synthesizer's ability to generate the intended emotion was generally satisfactory, judging by the listening and acceptance tests used in this research. Users were satisfied with the accuracy of the output as well as with the variation in speech ...

Citations

... This research focuses on improving the prosody modelling to address the issue of naturalness in synthesized emotional speech for Malay. Apart from some seminal studies [5], [6] and [7], research on prosody and emotion for Malay TTS is still in its infancy. The objective of this research is to improve the naturalness of synthesized Malay emotional speech by investigating the prosodic properties of Malay speech and establishing an effective mechanism to synthesize a variety of emotional speech prosody using prosody manipulation techniques. ...
Article
This paper discusses an emotional prosody generator for a Malay speech synthesis system that can re-synthesize the selected vocal emotion from neutral synthesized speech output and improve the naturalness by adopting rule-based prosody conversion techniques. The role of prosodic features in emotional expression, particularly fundamental frequency and duration, has been widely investigated in several research projects. This project attempts to improve the naturalness of the synthesized emotional Malay speech by establishing an effective mechanism for the re-synthesis of emotion. Such a mechanism is created by analyzing the variation in the F0 contour of continuous emotional Malay speech against a fixed time period. The emotional prosody generator for Malay developed in the course of this research makes use of principles of parametric prosody manipulation to synthesize four basic emotions, namely happiness, anger, sadness and fear. Subjective evaluation by means of listening tests was conducted to validate the ability of the emotion generator to generate the necessary prosody to synthesize emotional expression. The evaluation results show an overall recognition rate of between 61% and 85%.
... Duration changes are made to the training label files (also known as utterance files). Figure 1 shows the overall working mechanism of HMM-based Malay emotional speech synthesis. In our previous work on Malay speech synthesis, we developed a rule-based prosody conversion method for a Malay diphone text-to-speech synthesis system [5]. Neutral synthesized speech was re-synthesized to emotional speech by applying prosodic factors of the target emotions. ...
Article
This research aims at developing a real-time HMM-based Malay emotional speech synthesis system (EM-HTS) that has the ability to synthesize any form of text input in four different expressions: neutral, anger, sadness and happiness. The quality of the emotional speech synthesis was improved by using a Neutral to Angry, Sad, and Happy (NASH) duration generator, which uses a context-dependent duration generation method to improve the duration information in the label files of target emotions for training purposes. We conducted three forms of evaluation to determine the accuracy, intelligibility and naturalness of the speech generated by EM-HTS. All three tests show that the adopted method (NASH) gives a better reproduction of prosody than the conventional method using the same training speech data.