Evolving Emotional Prosody
Cecilia Ovesdotter Alm and Xavier Llorà
IlliGAL Report No. 2006018
April, 2006
Illinois Genetic Algorithms Laboratory
University of Illinois at Urbana-Champaign
117 Transportation Building
104 S. Mathews Avenue Urbana, IL 61801
Office: (217) 333-2346
Fax: (217) 244-5705
Evolving Emotional Prosody
Cecilia Ovesdotter Alm
Department of Linguistics
University of Illinois at Urbana-Champaign
707 S. Mathews Ave., Urbana, IL 61801, USA
ebbaalm@uiuc.edu

Xavier Llorà
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
1205 W. Clark Street, Urbana, IL 61801, USA
xllora@illigal.ge.uiuc.edu
Abstract
Emotion is expressed by prosodic cues, and this study uses the active interactive Genetic Algorithm (aiGA) to search a wide space of intensity, F0, and duration parameters for sad and angry speech in perceptual resynthesis experiments with users. This method avoids large recorded databases and is flexible for exploring prosodic emotion parameters. Solutions from multiple runs are analyzed graphically and statistically. Average results indicate that parameters evolve by emotion, and appear best for sad speech. Solutions are classified quite successfully by CART, with duration as the main predictor.
1 Introduction
Emotion is expressed by prosodic cues, but their interplay is an open question, complicated by a challenging search space. The common procedure for emotional speech research depends on recording and analyzing large databases. Instead, this work uses the active interactive Genetic Algorithm (aiGA) (Llorà, Alías, Formiga, Sastry, & Goldberg, 2005) to evolve emotional prosody in perceptual resynthesis experiments. Within this framework, finding fundamental parameters of emotional prosody becomes an optimization problem, approximated by searching the perceptual space of a listener via interactive feedback. In contrast to unit or parameter estimation based on emotional speech databases, e.g. (Hofer & Clark, 2005; Hsia, Wu, & Liu, 2005), the method only requires neutral utterances as a starting point, and user preferences guide the direction of the efficient aiGA. Thus, there is no need to record large emotion databases, and parameter findings are not limited to what is found in data; instead models evolve more freely, as permitted by imposed bounds. Results from an initial experiment on 1-word utterances, aimed at evolving angry and sad speech, are analyzed, and indicate that the aiGA evolves prosodic parameters by emotion. Solutions are quite well predicted by a CART model.
2 Related work
Modifications in F0, intensity, and duration are facets of emotional prosody (Scherer, 2003). While anger is often characterized by increased speech rate, F0, and intensity, sadness is generally assumed to be marked by the opposite behavior, e.g. (Murray & Arnott, 1993). Other features have been suggested, but with less evidence, e.g. voice quality (Ní Chasaide & Gobl, 2001). Synthesizing emotional speech has been attempted with various techniques (Schröder, 2001).
Table 1: Words used as resynthesis basis
Monosyllabic sas, bem, face, tan
Bisyllabic barlet, person, cherry, tantan
Trisyllabic bubelos, strawberry, customer, tantantan
EmoSpeak (Schröder, 2004) allows manual manipulation of many parameters with an interesting dimensional interface, but its parameters were fitted to a database and the literature. An interesting study drew on a Spanish emotional speech corpus for Catalan emotional prosody (Iriondo, Alías, Melenchón, & Llorca, 2004).
Despite much previous work, emotional profiles remain unclear (Tatham & Morton, 2005). Fresh work may contribute to an increased understanding of emotional prosody, and the suggested approach rephrases the research question as: on average, how is a particular emotion best rendered in synthetic speech? A step toward structured search has been taken before (Sato, 2005), but it seemed to use a simple iGA, which ignores important considerations in interactive computation such as user fatigue and flexible targets (Takagi, 2001; Llorà, Sastry, Goldberg, Gupta, & Lakshmi, 2005).
GAs (Goldberg, 1989) are iterative search procedures that resemble natural adaptation phenomena. Issues and applications in interactive evolutionary computation have been surveyed (Takagi, 2001), as have recent advances in aiGA theory (Llorà, Sastry, Goldberg, Gupta, & Lakshmi, 2005). AiGA has been successful for speech, by interactively estimating cost functions for unit-selection TTS (Llorà, Alías, Formiga, Sastry, & Goldberg, 2005); aiGA ensured high intra-run consistency in subjective evaluations and decreased the number of user evaluations compared to a simple iGA, i.e. combating user fatigue.
3 Experimental design
Interactive evaluation was used to evolve emotional prosody with the aiGA developed by (Llorà, Sastry, Goldberg, Gupta, & Lakshmi, 2005; Llorà, Alías, Formiga, Sastry, & Goldberg, 2005). A user's feedback guided the process to estimate performance and evolve a synthetic model beyond what was presented to the user (for details, cf. (Llorà, Alías, Formiga, Sastry, & Goldberg, 2005)). The aiGA assumed variable independence and built a probabilistic model, in this case based on a population of normal distributions¹ with the UMDAc algorithm (González, Lozano, & Larrañaga, 2002), such that the output of each run r was an evolved synthetic normal model (µ_r, σ_r) for each prosodic variable.
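To make the model-building step concrete, the sketch below shows a UMDAc-style update for the three prosodic variables, assuming each variable is modeled by an independent normal distribution re-estimated from the selected individuals at each iteration; this is an illustration under stated assumptions, not the authors' implementation, and the selection step is only a placeholder for the user-driven tournaments.

```python
import random
import statistics

VARIABLES = ["Int", "F0", "Dur"]   # proportions in [0, 1]; 0.5 = the neutral original

def sample_individual(model):
    """Draw one individual from the per-variable normal model, truncated to [0, 1]."""
    return {v: min(1.0, max(0.0, random.gauss(*model[v]))) for v in VARIABLES}

def umda_c_update(selected):
    """Re-estimate (mu, sigma) for each variable from the selected individuals."""
    return {v: (statistics.mean(ind[v] for ind in selected),
                statistics.pstdev([ind[v] for ind in selected])) for v in VARIABLES}

# Illustrative loop: start from an uninformative model (an assumption) and refine it.
model = {v: (0.5, 0.25) for v in VARIABLES}
for iteration in range(3):                         # the study used 3 iterations per run
    population = [sample_individual(model) for _ in range(8)]
    selected = population[: len(population) // 2]  # placeholder for user-driven tournament selection
    model = umda_c_update(selected)
```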
In this experiment, users listened to and evaluated pairs of resynthesized utterances. The aiGA operated on parameter vectors, or individuals, with prosodic parameters for resynthesizing 1-word utterances. Each individual had 3 values, for Int (intensity), F0 (mean F0), and Dur (total duration), each in [0, 1], encoded as proportions deviating from the original neutral word at 0.5, with truncation applied if the synthetic model evolved beyond [0, 1]. Each run had 3 iterations, which (Llorà, Alías, Formiga, Sastry, & Goldberg, 2005) found sufficient, and a user evaluated 21 pairs of sounds in a run. Individuals were initialized randomly with a different seed for each day of the experiment, except that one individual's values were set according to trends in the literature for each emotion (see Sec. 2). The search space for each variable was delimited by upper and lower bounds, cf. Table 2, adjusted to the original voice to avoid unnatural speech.
¹ Listeners agree cross-culturally (Abelin & Allwood, 2003) above chance on angry vs. sad speech, which supports normality. The true distribution remains unknown.
[Figure 1: Overall distributions (µ*, σ*) of the evolved synthetic models for Intensity, F0, and Duration (120 runs per emotion) show partly or fully separated curves for the emotions Sad and Angry. Panel values: (a) Intensity: ang µ* = 0.79, σ* = 0.08; sad µ* = 0.5, σ* = 0.09. (b) F0: ang µ* = 0.4, σ* = 0.09; sad µ* = 0.72, σ* = 0.1. (c) Duration: ang µ* = 0.38, σ* = 0.08; sad µ* = 0.92, σ* = 0.07.]
Table 2: Upper and lower limits on word-level prosodic properties
Variable Unit Min Max
Sound Int dB 69 83
Mean F0 mel (Hz) 124 (139) 227 (282)
Total Dur ratio 0.70 (shorter) 1.80 (longer)
Conversion between the actual numbers used for resynthesis and their corresponding proportions in [0, 1], as encoded in the individuals' variables for the aiGA, was done with an approximation in which 0.5 represented the neutral original sound used as resynthesis basis, 0 corresponded to the minimum, and 1 to the maximum allowed (cf. Table 2).
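For illustration only, a piecewise-linear mapping consistent with this description could look as follows; the exact approximation used in the experiments is not spelled out here, so the linear interpolation and the example neutral value of 75 dB are assumptions, while the bounds come from Table 2.

```python
def proportion_to_value(p, neutral, lower, upper):
    """Map a proportion p in [0, 1] to a resynthesis value: 0.5 = the neutral original,
    0 = the lower bound, 1 = the upper bound (assumed piecewise-linear around neutral)."""
    p = min(1.0, max(0.0, p))   # truncation, as applied when the model evolves beyond [0, 1]
    if p >= 0.5:
        return neutral + (p - 0.5) / 0.5 * (upper - neutral)
    return lower + (p / 0.5) * (neutral - lower)

# Example with the intensity bounds from Table 2 (69-83 dB) and a hypothetical neutral of 75 dB:
intensity_db = proportion_to_value(0.79, neutral=75.0, lower=69.0, upper=83.0)
```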
Each user evaluation was a tournament that involved listening to 2 resynthesized 1-word utterances and selecting the one which the user felt best portrayed the target emotion, or indicating a draw. To further reduce user fatigue, the word for a tournament was chosen at random from a set of 4 neutral words, then resynthesized given 2 individuals' parameters, and the resulting pair of sounds was presented to the user. The original words used as resynthesis basis, cf. Table 1, came from neutral declarative utterances recorded from a female US English speaker. Words were controlled for syllable length but not for segmental makeup, since emotional prosody should generalize beyond the micro-level.²
Resynthesis was done with two parallel Praat (Boersma & Weenink, 2005) implementations, and individuals were resynthesized on the fly in a step-wise process before each tournament, with the aiGA running in the background and regulating resynthesis parameters, user tournaments, and computation of performance. Except for the variable input implementation, the model was constructed with future experiments in mind, involving multiple-word utterances with local word-level encoding. Thus, it involved separate resynthesis at the word level with a final concatenation-resynthesis component.
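A rough sketch of one tournament round under this design is given below; resynthesize_with_praat and ask_user_preference are hypothetical stand-ins for the Praat resynthesis step and the web interface, not actual project code.

```python
import random

WORDS = ["sas", "bem", "face", "tan"]   # e.g. the monosyllabic set from Table 1

def tournament_round(individual_a, individual_b, target_emotion,
                     resynthesize_with_praat, ask_user_preference):
    """Present two resynthesized versions of one randomly chosen word and return the
    user's choice for the target emotion ('a', 'b', or 'draw')."""
    word = random.choice(WORDS)                        # word picked at random per tournament
    sound_a = resynthesize_with_praat(word, individual_a)
    sound_b = resynthesize_with_praat(word, individual_b)
    return ask_user_preference(sound_a, sound_b, target_emotion)
```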
Interactive experiments involved 2 males and 2 females, all highly proficient in English, with either Swedish (3) or Spanish (1) as their native language. Over 10 different days (within a 20-day period), they completed, each day, two blocks of 3 sad tasks and 3 angry tasks with an intermediate short break, i.e. sad and angry target emotions in combination with either monosyllabic, bisyllabic, or trisyllabic word types. The 10 daily replicas were used to avoid overloading users and keep them alert, and to reduce effects from random initialization or daily moods. Emotion perception is subjective, so averaged results across runs are of main interest. A web interface was used for the user experiments. Post-experiment feedback indicated contentment, but some users felt 10 days was a bit long, worried slightly about consistency or desensitization, or felt that angry was not as distinct as sad. One person felt sad had a "pleading" quality, and that some sounds reflected a nicer voice.
4 Results and discussion
For each run r of the 4 × 10 × 6 = 240 completed runs, the values (with proportion encoding) for Int, F0, and Dur for its final best individual and final evolved synthetic model (i.e. evolved µ_r and σ_r) were extracted with a Python script, with Matlab 6.1 used for plotting and statistics. The data set of best individuals is henceforth called BI, and that of evolved synthetic models ESM. The analysis intended to clarify whether the emotions' variables yielded distinct prosodic profiles, whether the aiGA was indeed evolving emotional prosody (i.e. not prosody by syllabic type), and what the averaged prosodic models were. The results representing the overall distribution of runs, based on ESM for the individual prosodic variables, are in Figs. 1-2, given proportion encoding in [0, 1] with truncation. The curves can be seen as representing the overall distribution (µ*, σ*) for the variables Int, F0, and Dur, respectively, where $\mu^* = \frac{\sum_r \mu_r}{n}$ and $\sigma^* = \sqrt{\frac{\sum_r \sigma_r^2}{n}}$, with n the number of runs r completed (e.g. for sad runs n = 120, or for monosyllabic runs n = 80). The pooled standard deviation σ* is an estimate for the larger population P (with unseen angry/sad cases). In contrast, the sample standard deviation s is larger, and this difference may be due to the number of runs being quite small. For completeness, histograms over the µ_r sample are also included.³
² Slight inconsistency might also be introduced during resynthesis; this was deemed imperceptible and thus non-invasive for experimental purposes.
³ Some columns may partially hide others.
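The pooled statistics are straightforward to compute from the per-run models; a minimal sketch, assuming the per-run (µ_r, σ_r) values for one variable and emotion are available as lists:

```python
import math

def pooled_model(mus, sigmas):
    """Overall distribution across n runs: mu* is the mean of the run means, and sigma* is
    the square root of the mean of the run variances (the pooled standard deviation)."""
    n = len(mus)
    mu_star = sum(mus) / n
    sigma_star = math.sqrt(sum(s ** 2 for s in sigmas) / n)
    return mu_star, sigma_star

# e.g. pooled_model(sad_dur_mus, sad_dur_sigmas) over the 120 sad runs
```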
[Figure 2: Overall distributions (µ*, σ*) of the evolved synthetic models for Intensity, F0, and Duration (80 runs per syllabic type) show overlapping curves for syllabic type. Panel values: (a) Intensity: mon µ* = 0.63, bis µ* = 0.66, tri µ* = 0.64 (σ* = 0.09 each). (b) F0: mon µ* = 0.56, bis µ* = 0.57, tri µ* = 0.55 (σ* = 0.1 each). (c) Duration: mon µ* = 0.66, σ* = 0.08; bis µ* = 0.65, σ* = 0.07; tri µ* = 0.63, σ* = 0.07.]
[Figure 3: 3D plots (X axis: Intensity, Y axis: F0, Z axis: Duration) of runs for BI (BEST) vs. ESM (SYNT) indicate trends for two clouds by emotion (a: sad = ring, angry = plus; 120 runs each), but not by syllabic type (b: monosyllabic = down triangle, bisyllabic = square, trisyllabic = up triangle; 80 runs each).]
Table 3: ANOVAs showed that emotion was always significant, but syllabic type was not. User (persons A, B, C, D) was significant for F0 and Dur, with interactions. A check (✓) indicates significant p-values (Int = intensity, Dur = duration, syl = syllabic types, em = emotions, BI = final best individual, ESM = final evolved synthetic model, ang = angry).

Tst# | Var. | Fact1    | Fact2      | #Rep. | Model | Sig. Fact1 | Sig. Fact2 | Sig. Interac. | multcompare diff. (sig. main fact.)
1    | Int  | syl (3)  | em (2)     | 40    | BI    |            | ✓          |               | sad vs. ang
1    | Int  | syl (3)  | em (2)     | 40    | ESM   |            | ✓          |               | sad vs. ang
1    | F0   | syl (3)  | em (2)     | 40    | BI    |            | ✓          |               | sad vs. ang
1    | F0   | syl (3)  | em (2)     | 40    | ESM   |            | ✓          |               | sad vs. ang
1    | Dur  | syl (3)  | em (2)     | 40    | BI    |            | ✓          |               | sad vs. ang
1    | Dur  | syl (3)  | em (2)     | 40    | ESM   |            | ✓          |               | sad vs. ang
2    | Int  | user (4) | em (2)     | 30    | BI    |            | ✓          | ✓             | sad vs. ang
2    | Int  | user (4) | em (2)     | 30    | ESM   |            | ✓          | ✓             | sad vs. ang
2    | F0   | user (4) | em (2)     | 30    | BI    | ✓          | ✓          | ✓             | sad vs. ang; A vs. BCD
2    | F0   | user (4) | em (2)     | 30    | ESM   | ✓          | ✓          | ✓             | sad vs. ang; A vs. B
2    | Dur  | user (4) | em (2)     | 30    | BI    | ✓          | ✓          | ✓             | sad vs. ang; AC vs. BD
2    | Dur  | user (4) | em (2)     | 30    | ESM   | ✓          | ✓          | ✓             | sad vs. ang
3    | Int  | user (4) | syl-em (6) | 10    | BI    |            | ✓          | ✓             |
3    | Int  | user (4) | syl-em (6) | 10    | ESM   |            | ✓          |               |
3    | F0   | user (4) | syl-em (6) | 10    | BI    | ✓          | ✓          | ✓             |
3    | F0   | user (4) | syl-em (6) | 10    | ESM   | ✓          | ✓          | ✓             | (same as in test 2)
3    | Dur  | user (4) | syl-em (6) | 10    | BI    | ✓          | ✓          | ✓             |
3    | Dur  | user (4) | syl-em (6) | 10    | ESM   | ✓          | ✓          | ✓             |
Table 4: Users' means by emotion for BI and ESM (sample standard deviation in parentheses; n = 30 replicas).

BI-ANG | Person A    | Person B    | Person C    | Person D
Int    | 0.79 (0.21) | 0.73 (0.27) | 0.8 (0.24)  | 0.82 (0.22)
F0     | 0.55 (0.27) | 0.39 (0.25) | 0.49 (0.25) | 0.25 (0.19)
Dur    | 0.24 (0.15) | 0.4 (0.28)  | 0.23 (0.14) | 0.47 (0.27)

BI-SAD | Person A    | Person B    | Person C    | Person D
Int    | 0.41 (0.25) | 0.58 (0.2)  | 0.48 (0.21) | 0.46 (0.18)
F0     | 0.78 (0.18) | 0.64 (0.18) | 0.63 (0.24) | 0.84 (0.15)
Dur    | 0.92 (0.1)  | 0.97 (0.05) | 0.94 (0.11) | 0.89 (0.13)

ESM-ANG | Person A    | Person B    | Person C    | Person D
Int     | 0.85 (0.2)  | 0.71 (0.3)  | 0.84 (0.21) | 0.76 (0.26)
F0      | 0.43 (0.16) | 0.37 (0.23) | 0.51 (0.25) | 0.29 (0.18)
Dur     | 0.35 (0.24) | 0.39 (0.29) | 0.26 (0.14) | 0.53 (0.28)

ESM-SAD | Person A    | Person B    | Person C    | Person D
Int     | 0.44 (0.23) | 0.56 (0.14) | 0.47 (0.16) | 0.51 (0.21)
F0      | 0.77 (0.16) | 0.65 (0.16) | 0.66 (0.17) | 0.81 (0.17)
Dur     | 0.9 (0.1)   | 0.98 (0.04) | 0.94 (0.17) | 0.85 (0.18)
Fig. 1 shows that the overall distribution separates the emotions, with some overlap for Int and F0, but not for Dur. As expected, the mean of Dur was shorter for angry speech and longer for sad speech. For the mean of Int, the relative position of the emotions to each other was as expected, but sad was at the neutral middle. The mean of F0 showed the opposite behavior to the majority of the literature, with decreased F0 for angry but increased F0 for sad. In contrast, syllabic word types do not show separation, cf. Fig. 2, and thus do not appear to make a difference for average behavior.
When resynthesizing words with µ* values, sad appeared more distinct than angry,⁴ and angry differed only mildly from neutral, although certain words seemed angrier. Better sad synthetic speech has been noted before (Iriondo, Alías, Melenchón, & Llorca, 2004). The angry emotion family may also be more diverse, and thus vary more.
Beyond isolated variables, Fig. 3(a-b) visualizes runs in 3D as points in proportion encoding for 3 dimensions (Int, F0, and Dur) for BI and ESM (truncated µ_r values for ESM).⁵ Despite outliers, and a quite large estimated s for an emotion given its points and dimensions,⁶ Fig. 3(a) indicates a trend of 2 clouds of points by emotion, which again contrasts with the non-separation by syllabic type in 3(b). Although a run's ESM and BI points do not necessarily occur at the same place, the overall clouds seem similar for ESM and BI in 3(a).⁷
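A plot along the lines of Fig. 3(a) can be reproduced with standard tooling; the sketch below uses matplotlib rather than the Matlab 6.1 plotting used in the study, and assumes the run models are available as NumPy arrays of shape (n_runs, 3) ordered as (Int, F0, Dur).

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)

def plot_runs_3d(sad, ang):
    """Scatter sad vs. angry run models in (Intensity, F0, Duration) space.
    sad, ang: NumPy arrays of shape (n_runs, 3), columns ordered as (Int, F0, Dur)."""
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(sad[:, 0], sad[:, 1], sad[:, 2], marker="o", label="sad")
    ax.scatter(ang[:, 0], ang[:, 1], ang[:, 2], marker="+", label="angry")
    ax.set_xlabel("Intensity"); ax.set_ylabel("F0"); ax.set_zlabel("Duration")
    ax.legend()
    plt.show()
```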
Next, for each prosodic variable, 2-way ANOVAs were done at the 95% confidence level for data sets BI and ESM (with truncation), followed by a multiple comparison for significant main factors (using Matlab 6.1's anova2 and multcompare). The multiple comparison did not consider interactions and should be interpreted with caution. Results are in Table 3. The first test considered syllable types and emotions, and only the emotion factor showed a significant difference. Interactions were not significant, and perceptual differences appeared to be due to emotion, not syllable type. The second test covered users (persons A, B, C, and D) and emotions. Again, for all variables, emotion was a significant factor. For F0 and Dur, user was also significant, and the interaction between factors was always significant. The third test regarded users and the emotion-syllable type task. The emotion-syllable type task was a significant factor, and so were interactions (except for Int in ESM), as were users for, again, F0 and Dur. Multiple comparisons showed that all tests grouped by emotion, and for the second and third tests, person A, an expert linguist, was involved whenever user was a significant factor. Feedback indicated that expert A did more analytical decision-making, so perhaps novice users are less "contaminated" by formal knowledge. However, user impact remains a point for further research, since significant interactions were observed which are not yet well understood, and only 4 users were involved. Table 4 shows user behavior by emotion, prosodic variable (truncated µ_r for ESM), and data set, and indicates its complexity. Variation is quite noticeable, but Dur appears less varied for most subjects, in particular for sad.
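For readers who wish to replicate this kind of test outside Matlab, a two-way ANOVA with interaction can be sketched with statsmodels; this is an illustration rather than the original anova2/multcompare analysis, and it assumes a data frame with one row per run and columns value, emotion, and user.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def two_way_anova(df: pd.DataFrame) -> pd.DataFrame:
    """Two-way ANOVA with interaction for one prosodic variable (e.g. Dur):
    value ~ emotion + user + emotion:user."""
    model = ols("value ~ C(emotion) * C(user)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)   # table with sums of squares, F, and p per factor

# Usage with a hypothetical data frame (one row per run):
# df = pd.DataFrame({"value": dur_values, "emotion": emotions, "user": users})
# print(two_way_anova(df))
```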
CART (as implemented by M. Riley) was used on BI and ESM to see to what extent the binary distinction between the sad and angry models obtained from runs could be learned, and to inspect which features supported prediction. Each example, labeled either sad or angry, had the proportions for intensity, F0, and duration as features (non-truncated µ_r for ESM).⁸ Averaged precision, recall, and F-score based on 10-fold cross validation are in Table 5.⁹
⁴ Sounds with µ* values available at: http://www.linguistics.uiuc.edu/ebbaalm/aiGApilot/aiGApilot MU-STAR.zip
⁵ Note as a caution that the 3D plots are merely descriptive, and may be visually misinforming due to dimensionality or point overlap.
⁶ For example, for BI, s_sad = 0.32 and s_ang = 0.43, where $s_{emotion_i} = \sqrt{s^2_{Int,emotion_i} + s^2_{F0,emotion_i} + s^2_{Dur,emotion_i}}$.
⁷ 16% of BI angry equaled the individual set to literature values.
⁸ Only 1 ESM fold had a decision node with a value beyond the [0, 1] range.
⁹ With 9/10 train vs. 1/10 test, a different tenth used for testing in each fold, and the default parameters (except a minor tree cutoff to avoid overfitting).
Table 5: Ten-fold cross validation averages from CART classifying sad and angry evolved synthetic
models (ESM) and best individuals (BI). The data set had 240 instances, i.e. 24 test examples in each
fold. Averaged results are well above the 50% baseline.
Em-model Av. prec Av. recall Av. F Prop. non-unique exs.
ANG-ESM 0.90 0.88 0.88 0.05 (3 types)
ANG-BI 0.95 0.91 0.92 0.28 (7 types)
SAD-ESM 0.90 0.88 0.88 0.05 (3 types)
SAD-BI 0.92 0.94 0.93 0.26 (8 types)
Interestingly, despite the sample variation, on average CART performed well above the 50% naïve baseline at distinguishing sad and angry instances. For ESM, 0.90 average precision, 0.88 average recall, and 0.88 average F-score were obtained for both sad and angry predictions. For BI, performance even increased slightly, which may relate to BI having more repeated feature vectors, cf. col. 5 in Table 5. Inspection of the decision trees showed that duration was mostly used as the sole predictor. Five ESM folds also used F0 for predictions, but intensity was not used. This may indicate a hierarchy of prosodic feature importance, and that some features may, as expected, show more vs. less variability; future work will clarify.
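The classification experiment can be approximated with scikit-learn's decision trees (a CART-style learner, not the M. Riley implementation used above); the sketch assumes a feature matrix X of (Int, F0, Dur) proportions and a label array y with string labels "sad"/"ang".

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support

def ten_fold_cart(X, y):
    """Average per-class precision, recall, and F-score over 10 stratified folds.
    X: array of shape (n_examples, 3) with (Int, F0, Dur); y: array of 'sad'/'ang' labels."""
    scores = []
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X, y):
        tree = DecisionTreeClassifier(min_samples_leaf=5)   # mild pruning against overfitting
        tree.fit(X[train_idx], y[train_idx])
        pred = tree.predict(X[test_idx])
        scores.append(precision_recall_fscore_support(y[test_idx], pred, labels=["sad", "ang"]))
    prec, rec, f1, _ = np.mean(scores, axis=0)
    return prec, rec, f1   # each is an array ordered as (sad, ang)
```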
5 Conclusion
In an initial study of 1-word utterances, the efficient aiGA was used to obtain average models of emotional prosody in interactive resynthesis experiments, with sadness appearing more distinct. At this point, microprosody and syllabic length appear less important, which supports word-level encoding, although some words seem better rendered than others with the averaged solution, and user influence requires more work. Future experiments will include more users, longer utterances, and more emotions. Evaluating solutions for emotion recognition and naturalness could also be interesting. To conclude, the aiGA has potential for evolving emotional prosody. The analysis indicated that average intensity, F0, and duration behaved differently for the 2 emotions, and F0 showed an interesting behavior opposite to what was expected. Moreover, 3D plotting indicated trends by emotion, and CART models showed that emotion solutions across runs were predictable to a quite high degree, with duration being most important for prediction.
6 Acknowledgements
Many thanks to C. Shih, R. Proaño, D. Goldberg, and R. Sproat for valuable comments. The work
was funded by NSF ITR-#0205731, the Air Force Office of Scientific Research, Air Force Materiel
Command, USAF, under FA9550-06-1-0096, and the NSF under IIS-02-09199. Opinions, findings,
conclusions or recommendations expressed are those of the authors and do not necessarily reflect
the views of the funding agencies.
References
Abelin, Å., & Allwood, J. (2003). Cross linguistic interpretation of expression of emotions. In 8th Int. Symposium on Social Communication.
Boersma, P., & Weenink, D. (2005, Summer). Praat: doing phonetics by computer (version 4).
http://www.praat.org.
Goldberg, D. (1989). Genetic algorithms in search, optimization, and machine learning. Reading: Addison-Wesley.
González, C., Lozano, J., & Larrañaga, P. (2002). Mathematical modelling of UMDAc algorithm with tournament selection: Behaviour on linear and quadratic functions. Int. J. of Approx. Reasoning, 31, 313–340.
Hofer, G., Richmond, K., & Clark, R. (2005). Informed blending of databases for emotional speech synthesis. In Interspeech, pp. 501–504.
Hsia, C., Wu, C., & Liu, T. (2005). Duration-embedded bi-HMM for expressive voice conversion. In Interspeech, pp. 1921–1924.
Iriondo, I., Alías, F., Melenchón, J., & Llorca, A. (2004). Modeling and synthesizing emotional speech for Catalan text-to-speech synthesis. In ADS, pp. 197–208.
Llorà, X., Alías, F., Formiga, L., Sastry, K., & Goldberg, D. (2005). Evaluation consistency in iGAs: user contradictions as cycles in partial-ordering graphs (Technical Report IlliGAL No. 2005022). UIUC.
Llorà, X., Sastry, K., Goldberg, D., Gupta, A., & Lakshmi, L. (2005). Combating user fatigue in iGAs: partial ordering, support vector machines, and synthetic fitness (Technical Report IlliGAL No. 2005009). UIUC.
Murray, I. R., & Arnott, J. L. (1993). Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. J. Ac. Soc. of Am., 93(2), 1097–1108.
Ní Chasaide, A., & Gobl, C. (2001). Voice quality and the synthesis of affect. In COST 258, pp. 252–263.
Sato, Y. (2005). Voice quality conversion using interactive evolution of prosodic control. App. Soft Comp., 5, 181–192.
Scherer, K. (2003). Vocal communication of emotion: a review of research paradigms. Speech Comm., 40(1-2), 227–256.
Schröder, M. (2001). Emotional speech synthesis: A review. In EUROSPEECH, pp. 561–564.
Schröder, M. (2004). Dimensional emotion representation as a basis for speech synthesis with non-extreme emotions. In ADS, pp. 209–220.
Takagi, H. (2001). Interactive evolutionary computation: Fusion of the capabilities of EC optimization and human evaluation. Proceedings of the IEEE, 89, 1275–1296.
Tatham, M., & Morton, K. (2005). Developments in speech synthesis. Wiley.