Evolving Emotional Prosody
Cecilia Ovesdotter Alm and Xavier Llorà
IlliGAL Report No. 2006018
April, 2006
Illinois Genetic Algorithms Laboratory
University of Illinois at Urbana-Champaign
117 Transportation Building
104 S. Mathews Avenue Urbana, IL 61801
Office: (217) 333-2346
Fax: (217) 244-5705
Evolving Emotional Prosody
Cecilia Ovesdotter Alm
Department of Linguistics, University of Illinois at Urbana-Champaign
707 S. Mathews Ave., Urbana, IL 61801, USA
ebbaalm@uiuc.edu

Xavier Llorà
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
1205 W. Clark Street, Urbana, IL 61801, USA
xllora@illigal.ge.uiuc.edu
Abstract
Emotion is expressed by prosodic cues, and this study uses the active interactive Genetic Algorithm to search a wide space for sad and angry parameters of intensity, F0, and duration in perceptual resynthesis experiments with users. This method avoids large recorded databases and is flexible for exploring prosodic emotion parameters. Solutions from multiple runs are analyzed graphically and statistically. Average results indicate parameter evolution by emotion, and appear best for sad speech. Solutions are quite successfully classified by CART, with duration as the main predictor.
1 Introduction
Emotion is expressed by prosodic cues, but their interplay is an open question, complicated by a challenging search space. The common procedure for emotional speech research depends on recording and analyzing large databases. Instead, this work uses the active interactive Genetic Algorithm (aiGA) (Llorà, Alías, Formiga, Sastry, & Goldberg, 2005) to evolve emotional prosody in perceptual resynthesis experiments. Within this framework, the fundamental parameters of emotional prosody become an optimization problem, approximated by searching the perceptual space of a listener via interactive feedback. In contrast to unit or parameter estimation based on emotional speech databases, e.g. (Hofer & Clark, 2005; Hsia, Wu, & Liu, 2005), the method only requires neutral utterances as a starting point, and user preferences guide the direction of the efficient aiGA. Thus, there is no need to record large emotion databases, and parameter findings are not limited to what is found in data; instead, models evolve more freely, as permitted by the imposed bounds. Results from an initial experiment on 1-word utterances, with the goal of evolving angry and sad speech, are analyzed, and indicate that the aiGA evolves prosodic parameters by emotion. Solutions are quite well predicted by a CART model.
2 Related work
Modifications in F0, intensity, and duration are facets of emotional prosody (Scherer, 2003). While anger is often characterized by increased speech rate, F0, and intensity, sadness is assumed to be marked by the opposite behavior, e.g. (Murray & Arnott, 1993). Other features have been suggested, but with less evidence, e.g. voice quality (Ní Chasaide & Gobl, 2001). Synthesizing emotional speech has been attempted with various techniques (Schröder, 2001).
Table 1: Words used as resynthesis basis
Monosyllabic sas, bem, face, tan
Bisyllabic barlet, person, cherry, tantan
Trisyllabic bubelos, strawberry, customer, tantantan
EmoSpeak (Schröder, 2004) allows manual manipulation of many parameters with an interesting dimensional interface, but its parameters were fitted to a database and the literature. An interesting study drew on a Spanish emotional speech corpus for Catalan emotional prosody (Iriondo, Alías, Melenchón, & Llorca, 2004).
Despite much previous work, emotional profiles remain unclear (Tatham & Morton, 2005). Fresh work may contribute to an increased understanding of emotional prosody, and the suggested approach rephrases the research question as: on average, how is a particular emotion best rendered in synthetic speech? A step toward structured search has been taken before (Sato, 2005), but it seemed to use a simple iGA, which ignores important considerations in interactive computation such as user fatigue and flexible targets (Takagi, 2001; Llorà, Sastry, Goldberg, Gupta, & Lakshmi, 2005). GAs (Goldberg, 1989) are iterative search procedures that resemble natural adaptation phenomena. Issues and applications in interactive evolutionary computation have been surveyed (Takagi, 2001), as have recent advances in aiGA theory (Llorà, Sastry, Goldberg, Gupta, & Lakshmi, 2005). The aiGA has been successful for speech, by interactively estimating cost functions for unit-selection TTS (Llorà, Alías, Formiga, Sastry, & Goldberg, 2005); the aiGA ensured high intra-run consistency in subjective evaluations and decreased user evaluations compared to a simple iGA, i.e., combating user fatigue.
3 Experimental design
Interactive evaluation was used to evolve emotional prosody with the aiGA developed by (Llorà, Sastry, Goldberg, Gupta, & Lakshmi, 2005; Llorà, Alías, Formiga, Sastry, & Goldberg, 2005). A user's feedback guided the process to estimate performance and evolve a synthetic model beyond what was presented to the user (for details, cf. (Llorà, Alías, Formiga, Sastry, & Goldberg, 2005)). The aiGA assumed variable independence and built a probabilistic model, in this case based on a population of normal distributions1 with the UMDAc algorithm (Gonzáles, Lozano, & Larrañaga, 2002), such that the output of each run r was an evolved synthetic normal model (µ_r, σ_r) for each prosodic variable.
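As an illustration of this model-building step, the Python sketch below fits an independent normal per prosodic variable to the user-preferred individuals and samples a truncated new population, in the spirit of a UMDAc update; the function names and the surrounding selection/synthetic-fitness machinery of the actual aiGA are assumptions, not the authors' code.

    import numpy as np

    def fit_normal_model(selected):
        # Fit an independent normal (mu, sigma) per prosodic variable
        # (Int, F0, Dur) to the preferred individuals, UMDAc-style.
        selected = np.asarray(selected)              # shape: (n_selected, 3)
        return selected.mean(axis=0), selected.std(axis=0)

    def sample_population(mu, sigma, n, rng):
        # Sample a new population from the synthetic normal model and
        # truncate each gene to [0, 1], as in the proportion encoding.
        return np.clip(rng.normal(mu, sigma, size=(n, len(mu))), 0.0, 1.0)

In use, rng could be np.random.default_rng(seed) with a different seed per experimental day, mirroring the initialization scheme described below.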
In this experiment, users listened to and evaluated pairs of resynthesized utterances. The aiGA had parameter vectors, or individuals, with prosodic parameters for resynthesizing 1-word utterances. Each individual had 3 values in [0,1] for Int (intensity), F0 (mean F0), and Dur (total duration), encoded as proportions deviating from the original neutral word at 0.5, with truncation applied if the synthetic model evolved beyond [0,1]. Each run had 3 iterations, which (Llorà, Alías, Formiga, Sastry, & Goldberg, 2005) found sufficient, and a user evaluated 21 pairs of sounds in a run. Individuals were initialized randomly with a different seed for each day of the experiment, except that one individual's values were set according to trends in the literature for each emotion (see sec. 2). The search space for each variable was delimited by upper and lower bounds, cf. Table 2, adjusted to the original voice to avoid unnatural speech.
1 Listeners agree cross-culturally (Abelin & Allwood, 2003) above chance on angry vs. sad speech, which supports normality. The true distribution remains unknown.
[Figure 1 panels: (a) Intensity (120 runs): ang µ*: 0.79, σ*: 0.08; sad µ*: 0.5, σ*: 0.09. (b) F0 (120 runs): ang µ*: 0.4, σ*: 0.09; sad µ*: 0.72, σ*: 0.1. (c) Duration (120 runs): ang µ*: 0.38, σ*: 0.08; sad µ*: 0.92, σ*: 0.07.]
Figure 1: Overall distributions (µ, σ) for intensity, F0, or duration show partly or fully separated
curves by emotions Sad and Angry.
Table 2: Upper and lower limits on word-level prosodic properties
Variable Unit Min Max
Sound Int dB 69 83
Mean F0 mel (Hz) 124 (139) 227 (282)
Total Dur ratio 0.70 (shorter) 1.80 (longer)
Conversion between actual numbers for resynthesis and their corresponding proportion in [0,1], as encoded for the aiGA by individuals' variables, was done with an approximation, where 0.5 represented the neutral original sound used as resynthesis basis, 0 corresponded to the minimum, and 1 to the maximum allowed (cf. Table 2).
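One possible reading of this mapping is the piecewise-linear interpolation sketched below; the paper only states that the conversion was an approximation anchored at 0 (minimum), 0.5 (neutral), and 1 (maximum), so the exact interpolation, the function name, and the example neutral value are assumptions.

    def proportion_to_value(p, vmin, neutral, vmax):
        # Map a gene p in [0, 1] to a resynthesis value for one prosodic
        # variable: 0 -> allowed minimum, 0.5 -> neutral original, 1 -> maximum.
        p = min(max(p, 0.0), 1.0)                    # truncation to [0, 1]
        if p < 0.5:
            return vmin + (p / 0.5) * (neutral - vmin)
        return neutral + ((p - 0.5) / 0.5) * (vmax - neutral)

    # Example with the intensity bounds from Table 2 (69-83 dB) and a
    # hypothetical neutral word at 76 dB: proportion_to_value(0.79, 69, 76, 83)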
Each user evaluation was a tournament that involved listening to 2 resynthesized 1-word utterances and selecting the one that the user felt best portrayed the target emotion, or indicating a draw. To further avoid user fatigue, the word for a tournament was chosen at random from a set of 4 neutral words, then resynthesized given 2 individuals' parameters, and the resulting pair of sounds was presented to the user. The original words used as resynthesis basis, cf. Table 1, came from neutral declarative utterances recorded from a female US English speaker. Words were controlled for syllable length but not for segmental makeup, since emotional prosody should generalize beyond the micro-level.2
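A minimal Python sketch of one such tournament is given below, assuming placeholder callables for the Praat resynthesis step and the web-interface prompt (all names here are hypothetical).

    import random

    def run_tournament(ind_a, ind_b, neutral_words, resynthesize, ask_user):
        # One tournament: pick a neutral word at random, resynthesize it with
        # each individual's (Int, F0, Dur) proportions, and ask the user which
        # sound better portrays the target emotion, or whether it is a draw.
        word = random.choice(neutral_words)
        sound_a = resynthesize(word, ind_a)
        sound_b = resynthesize(word, ind_b)
        return ask_user(sound_a, sound_b)   # expected: 'a', 'b', or 'draw'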
Resynthesis was done with two parallel Praat (Boersma & Weenink, 2005) implementations,
and individuals were resynthesized on the fly in a step-wise process before each tournament, with
the aiGA running in the background and regulating resynthesis parameters, user tournaments, and
computation of performance. Except for variable input implementation, the model was constructed
with future experiments on multiple-word utterances with local word-level encoding in mind. Thus,
it involved separate resynthesis at the word level with a final concatenation-resynthesis component.
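The resynthesis step could be driven from Python roughly as follows; the Praat script name, its argument order, and the command-line invocation are assumptions, since the paper only states that two parallel Praat implementations were controlled by the aiGA.

    import subprocess

    def resynthesize_with_praat(in_wav, out_wav, int_db, f0_mel, dur_ratio,
                                script="modify_prosody.praat"):
        # Run a (hypothetical) Praat script that scales the intensity, mean F0,
        # and total duration of a neutral word before a tournament.
        subprocess.run(["praat", "--run", script, in_wav, out_wav,
                        str(int_db), str(f0_mel), str(dur_ratio)], check=True)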
Interactive experiments involved 2 males and 2 females, all highly proficient in English, with either Swedish (3) or Spanish (1) as their native language. Over 10 different days (within a 20-day period), they completed, each day, two blocks of 3 sad tasks and 3 angry tasks with an intermediate short break, i.e. the sad and angry target emotions in combination with either monosyllabic, bisyllabic, or trisyllabic word types. The 10 daily replicas were done to avoid overloading users and keep them alert, and to reduce effects from random initialization or daily moods. Emotion perception is subjective, so averaged results across runs are of main interest. A web interface was used for the user experiments. Post-experiment feedback indicated contentment, but some users felt 10 days was a bit long, worried slightly about consistency or desensitization, or felt angry was not as distinct as sad. One person felt sad had a "pleading" quality, and that some sounds reflected a nicer voice.
4 Results and discussion
For each run r of the 4 × 10 × 6 = 240 completed runs, the values (with proportion encoding) for Int, F0, and Dur for its final best individual and final evolved synthetic model (i.e. evolved µ_r and σ_r) were extracted with a Python script, with MATLAB 6.1 used for plotting and statistics. The data set of best individuals is henceforth called BI, and that of evolved synthetic models ESM. The analysis intended to clarify that the emotions' variables yielded distinct prosodic profiles, that the aiGA was indeed evolving emotional prosody (i.e. not prosody by syllabic type), and what the averaged prosodic models were. The results representing the overall distribution of runs, based on ESM for the individual prosodic variables, are in Figs. 1-2, given proportion encoding [0,1] with truncation. The curves can be seen as representing the overall distribution (µ, σ) for the variables Int, F0, and Dur, respectively, where µ = (Σ_r µ_r) / n and σ = sqrt( (Σ_r σ_r²) / n ), with n the number of runs r completed (e.g. for sad runs n = 120, or for monosyllabic n = 80). The pooled standard deviation σ is an estimate of the larger population P (with unseen angry/sad cases). In contrast, the sample standard deviation s is larger, and this difference may be due to the number of runs being quite small. For completeness, histograms over the µ_r sample are also included.3
2 Slight inconsistency might also be introduced during resynthesis; this was deemed imperceptible and thus non-invasive for experimental purposes.
3 Some columns may partially hide others.
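The averaging over runs described above is a direct transcription of the formulas for µ and the pooled σ; a minimal Python version is sketched below (array names are assumptions).

    import numpy as np

    def overall_distribution(mu_runs, sigma_runs):
        # mu_runs, sigma_runs: per-run evolved (mu_r, sigma_r) for one prosodic
        # variable and one condition (e.g. the 120 sad runs), as 1-D arrays.
        mu = np.mean(mu_runs)                            # mu = (sum_r mu_r) / n
        sigma = np.sqrt(np.mean(np.square(sigma_runs)))  # pooled standard deviation
        return mu, sigma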
[Figure 2 panels: (a) Intensity (80 runs): mon µ*: 0.63, σ*: 0.09; bis µ*: 0.66, σ*: 0.09; tri µ*: 0.64, σ*: 0.09. (b) F0 (80 runs): mon µ*: 0.56, σ*: 0.1; bis µ*: 0.57, σ*: 0.1; tri µ*: 0.55, σ*: 0.1. (c) Duration (80 runs): mon µ*: 0.66, σ*: 0.08; bis µ*: 0.65, σ*: 0.07; tri µ*: 0.63, σ*: 0.07.]
Figure 2: Overall distribution (µ, σ) for intensity, F0, or duration show overlapping curves for
syllabic type.
[Figure 3 panels: 3D scatter plots with X axis: Intensity, Y axis: F0, Z axis: Duration, for BEST (BI) and SYNT (ESM) runs; (a) by emotion (sad = ring, angry = plus; 120 runs each), (b) by syllabic type (monos. = triangle, bis. = square, tris. = triangle; 80 runs each).]
Figure 3: Runs for BI (BEST) vs. ESM (SYNT) in 3D indicate trends for two clouds by emotion
(a), but not by syllable type (b).
Table 3: ANOVAs showed that emotion was always significant, but syllabic type was not. User (persons A, B, C, D) was significant for F0 and Dur, with interactions. X indicates significant p-values (Int = intensity, Dur = duration, syl = syllabic types, em = emotions, BI = final best individual, ESM = final evolved synthetic model, ang = angry).
Tst# | Var. | Fact1 | Fact2 | #Rep. | Model | Sig. Fact1 | Sig. Fact2 | Sig. Interac. | multcompare diff. (sig. main fact.)
1 | Int | syl (3) | em (2) | 40 | BI | | X | | sad vs. ang
1 | Int | syl (3) | em (2) | 40 | ESM | | X | | sad vs. ang
1 | F0 | syl (3) | em (2) | 40 | BI | | X | | sad vs. ang
1 | F0 | syl (3) | em (2) | 40 | ESM | | X | | sad vs. ang
1 | Dur | syl (3) | em (2) | 40 | BI | | X | | sad vs. ang
1 | Dur | syl (3) | em (2) | 40 | ESM | | X | | sad vs. ang
2 | Int | user (4) | em (2) | 30 | BI | | X | X | sad vs. ang
2 | Int | user (4) | em (2) | 30 | ESM | | X | X | sad vs. ang
2 | F0 | user (4) | em (2) | 30 | BI | X | X | X | sad vs. ang; A vs. BCD
2 | F0 | user (4) | em (2) | 30 | ESM | X | X | X | sad vs. ang; A vs. B
2 | Dur | user (4) | em (2) | 30 | BI | X | X | X | sad vs. ang; AC vs. BD
2 | Dur | user (4) | em (2) | 30 | ESM | X | X | X | sad vs. ang
3 | Int | user (4) | syl-em (6) | 10 | BI | | X | X |
3 | Int | user (4) | syl-em (6) | 10 | ESM | | X | |
3 | F0 | user (4) | syl-em (6) | 10 | BI | X | X | X |
3 | F0 | user (4) | syl-em (6) | 10 | ESM | X | X | X | (same as in test 2)
3 | Dur | user (4) | syl-em (6) | 10 | BI | X | X | X |
3 | Dur | user (4) | syl-em (6) | 10 | ESM | X | X | X |
Table 4: Users' means by emotion for BI and ESM (sample standard deviation in parentheses; n = 30 replicas).

BI-ANG | Person A | Person B | Person C | Person D
Int | 0.79 (0.21) | 0.73 (0.27) | 0.8 (0.24) | 0.82 (0.22)
F0 | 0.55 (0.27) | 0.39 (0.25) | 0.49 (0.25) | 0.25 (0.19)
Dur | 0.24 (0.15) | 0.4 (0.28) | 0.23 (0.14) | 0.47 (0.27)

BI-SAD | Person A | Person B | Person C | Person D
Int | 0.41 (0.25) | 0.58 (0.2) | 0.48 (0.21) | 0.46 (0.18)
F0 | 0.78 (0.18) | 0.64 (0.18) | 0.63 (0.24) | 0.84 (0.15)
Dur | 0.92 (0.1) | 0.97 (0.05) | 0.94 (0.11) | 0.89 (0.13)

ESM-ANG | Person A | Person B | Person C | Person D
Int | 0.85 (0.2) | 0.71 (0.3) | 0.84 (0.21) | 0.76 (0.26)
F0 | 0.43 (0.16) | 0.37 (0.23) | 0.51 (0.25) | 0.29 (0.18)
Dur | 0.35 (0.24) | 0.39 (0.29) | 0.26 (0.14) | 0.53 (0.28)

ESM-SAD | Person A | Person B | Person C | Person D
Int | 0.44 (0.23) | 0.56 (0.14) | 0.47 (0.16) | 0.51 (0.21)
F0 | 0.77 (0.16) | 0.65 (0.16) | 0.66 (0.17) | 0.81 (0.17)
Dur | 0.9 (0.1) | 0.98 (0.04) | 0.94 (0.17) | 0.85 (0.18)
Fig. 1 shows that the overall distribution separates the emotions, with some overlap for Int and F0, but not for Dur. As expected, the mean of Dur was shorter for angry speech and longer for sad speech. For the mean of Int, the relative position of the emotions to each other was as expected, but sad was at the neutral middle. The mean of F0 showed the opposite behavior to the majority of the literature, with decreased F0 for angry but increased F0 for sad. In contrast, syllabic word types do not show separation, cf. Fig. 2, and thus do not appear to make a difference for average behavior.
When resynthesizing words with the µ values, sad appeared more distinct than angry,4 and angry differed only mildly from neutral, although certain words seemed angrier. Better sad synthetic speech has been noted before (Iriondo, Alías, Melenchón, & Llorca, 2004). The angry emotion family may also be more diverse, and thus vary more.
Beyond isolated variables, Fig. 3(a-b) visualizes runs in 3D as points in proportion encoding for the 3 dimensions (Int, F0, and Dur) for BI and ESM (truncated µ_r values for ESM).5 Despite outliers, and a quite large estimated s for an emotion given its points and dimensions,6 Fig. 3(a) indicates a trend of 2 clouds of points by emotion, which again contrasts with the non-separation by syllabic type in 3(b). Although a run's ESM and BI points do not necessarily occur at the same place, the overall clouds seem similar for ESM and BI in 3(a).7
Next, for each prosodic variable, 2-way ANOVAs were done at the 95% confidence level for data sets BI and ESM (with truncation), followed by a multiple comparison for significant main factors (using MATLAB 6.1's anova2 and multcompare). The multiple comparison did not consider interactions and should be interpreted with caution. Results are in Table 3. The first test considered syllable types and emotions, and only the emotion factor showed a significant difference. Interactions were not significant, and perceptual differences appeared to be due to emotion, not to syllable type. The second test covered users (persons A, B, C and D) and emotions. Again, for all variables, emotion was a significant factor. For F0 and Dur, user was also significant, and the interaction between factors was always significant. The third test regarded users and the emotion-syllable type task. The emotion-syllable type task was a significant factor, and so were interactions (except for Int in ESM), as were users for, again, F0 and Dur. Multiple comparisons showed that all tests grouped by emotion, and for the second and third tests, person A, an expert linguist, was involved when user was a significant factor. Feedback indicated that expert A did more analytical decision-making, so perhaps novice users are less "contaminated" by formal knowledge. However, user impact remains a point for further research, since significant interactions were observed that are not yet well understood, and only 4 users were involved. Table 4 shows user behavior by emotion, prosodic variable (truncated µ_r for ESM), and data set, and indicates its complexity. Variation is quite noticeable, but Dur appears less varied for most subjects, in particular for sad.
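For readers without MATLAB, an equivalent 2-way ANOVA with interaction and a post-hoc comparison can be run with statsmodels; this is a hedged stand-in for anova2/multcompare, and the data-frame column names are assumptions.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    def two_way_anova(df: pd.DataFrame):
        # df: one row per run, with columns 'value' (e.g. evolved Dur),
        # 'emotion' ('sad'/'ang'), and 'user' ('A'-'D').
        model = ols("value ~ C(user) * C(emotion)", data=df).fit()
        table = sm.stats.anova_lm(model, typ=2)      # main effects + interaction
        posthoc = pairwise_tukeyhsd(df["value"], df["emotion"], alpha=0.05)
        return table, posthoc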
CART (as implemented by M. Riley) was used on BI and ESM to see how far the binary distinction between sad and angry models obtained from runs could be learned, and to inspect which features supported prediction. Each example, labeled either sad or angry, had the proportions for intensity, F0, and duration as features (non-truncated µ_r for ESM).8 Averaged precision, recall, and F-score based on 10-fold cross-validation are in Table 5.9 Interestingly, despite the sample variation, on average CART performed well above the 50% naïve baseline at distinguishing sad and angry instances.
4 Sounds with µ values available at: http://www.linguistics.uiuc.edu/ebbaalm/aiGApilot/aiGApilot MU-STAR.zip
5 Note as a caution that the 3D plots are merely descriptive, and may be visually misinforming due to dimensionality or point overlap.
6 For example, for BI s_sad = 0.32 and s_ang = 0.43, where s_emotion = sqrt(s_Int² + s_F0² + s_Dur²) for that emotion.
7 16% of BI angry equaled the individual set to literature values.
8 Only 1 ESM fold had a decision node with a value beyond the [0,1] range.
9 With 9/10 train vs. 1/10 test, with a different tenth used for test in each fold, using the default parameters (except a minor tree cutoff to avoid overfitting).
Table 5: Ten-fold cross validation averages from CART classifying sad and angry evolved synthetic
models (ESM) and best individuals (BI). Data set had 240 instances, i.e. 24 test examples in each
fold. Averaged results are well above the 50% baseline.
Em-model Av. prec Av. recall Av. F % non-unique exs.
ANG-ESM 0.90 0.88 0.88 0.05 (3 types)
ANG-BI 0.95 0.91 0.92 0.28 (7 types)
SAD-ESM 0.90 0.88 0.88 0.05 (3 types)
SAD-BI 0.92 0.94 0.93 0.26 (8 types)
For ESM, 0.9 average precision, 0.88 average recall, and a 0.88 F-score were obtained for both sad and angry predictions. For BI, performance even increased slightly, which may relate to BI having more repeated feature vectors, cf. col. 5 in Table 5. Inspection of the decision trees showed that duration was mostly used as the sole predictor. Five ESM folds also used F0 for predictions, but intensity was not used. This may indicate a hierarchy of prosodic feature importance, and that some features may expectedly show more vs. less variability; future work will clarify.
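A comparable experiment can be reproduced with scikit-learn's decision trees as a stand-in for M. Riley's CART implementation; the data loading and the depth cutoff are assumptions.

    import numpy as np
    from sklearn.model_selection import cross_validate
    from sklearn.tree import DecisionTreeClassifier

    def evaluate_cart(X, y):
        # X: (240, 3) array of Int, F0, Dur proportions; y: 'sad'/'ang' labels.
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # mild cutoff against overfitting
        scores = cross_validate(clf, X, y, cv=10,
                                scoring=("precision_macro", "recall_macro", "f1_macro"))
        return {k: v.mean() for k, v in scores.items() if k.startswith("test_")}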
5 Conclusion
Given an initial study of 1-word utterances, the efficient aiGA was used to obtain average models of emotional prosody in interactive resynthesis experiments, with sadness appearing more distinct. At this point, microprosody and syllabic length appear less important, which supports word-level encoding, although some words seem better rendered than others with the averaged solution, and user influence requires more work. Future experiments will include more users, longer utterances, and more emotions. Evaluating solutions for emotion recognition and naturalness could also be interesting. To conclude, the aiGA has potential for evolving emotional prosody. Analysis indicated that average intensity, F0, and duration behaved differently for the 2 emotions, and F0 showed an interesting behavior opposite to what was expected. Moreover, 3D plotting indicated trends by emotion, and CART models showed that emotion solutions across runs were predictable to a quite high degree, with duration being most important for prediction.
6 Acknowledgements
Many thanks to C. Shih, R. Proaño, D. Goldberg, and R. Sproat for valuable comments. The work
was funded by NSF ITR-#0205731, the Air Force Office of Scientific Research, Air Force Materiel
Command, USAF, under FA9550-06-1-0096, and the NSF under IIS-02-09199. Opinions, findings,
conclusions or recommendations expressed are those of the authors and do not necessarily reflect
the views of the funding agencies.
References
Abelin, Å., & Allwood, J. (2003). Cross linguistic interpretation of expression of emotions. In 8th Int. Symposium on Social Communication.
Boersma, P., & Weenink, D. (2005). Praat: doing phonetics by computer (version 4). http://www.praat.org.
Goldberg, D. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.
Gonzáles, C., Lozano, J., & Larrañaga, P. (2002). Mathematical modelling of UMDAc algorithm with tournament selection: Behaviour on linear and quadratic functions. Int. J. of Approx. Reasoning, 31, 313-340.
Hofer, G., Richmond, K., & Clark, R. (2005). Informed blending of databases for emotional speech synthesis. In Interspeech, pp. 501-504.
Hsia, C., Wu, C., & Liu, T. (2005). Duration-embedded bi-HMM for expressive voice conversion. In Interspeech, pp. 1921-1924.
Iriondo, I., Alías, F., Melenchón, J., & Llorca, A. (2004). Modeling and synthesizing emotional speech for Catalan text-to-speech synthesis. In ADS, pp. 197-208.
Llorà, X., Alías, F., Formiga, L., Sastry, K., & Goldberg, D. (2005). Evaluation consistency in iGAs: user contradictions as cycles in partial-ordering graphs (Technical Report IlliGAL No. 2005022). UIUC.
Llorà, X., Sastry, K., Goldberg, D., Gupta, A., & Lakshmi, L. (2005). Combating user fatigue in iGAs: partial ordering, support vector machines, and synthetic fitness (Technical Report IlliGAL No. 2005009). UIUC.
Murray, I. R., & Arnott, J. L. (1993). Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. J. Ac. Soc. of Am., 93(2), 1097-1108.
Ní Chasaide, A., & Gobl, C. (2001). Voice quality and the synthesis of affect. In COST 258, pp. 252-263.
Sato, Y. (2005). Voice quality conversion using interactive evolution of prosodic control. App. Soft Comp., 5, 181-192.
Scherer, K. (2003). Vocal communication of emotion: a review of research paradigms. Speech Comm., 40(1-2), 227-256.
Schröder, M. (2001). Emotional speech synthesis: A review. In EUROSPEECH, pp. 561-564.
Schröder, M. (2004). Dimensional emotion representation as a basis for speech synthesis with non-extreme emotions. In ADS, pp. 209-220.
Takagi, H. (2001). Interactive evolutionary computation: Fusion of the capabilities of EC optimization and human evaluation. Proceedings of the IEEE, 89, 1275-1296.
Tatham, M., & Morton, K. (2005). Developments in speech synthesis. Wiley.