4th International Workshop on Corpora for Research on
EMOTION, SENTIMENT & SOCIAL SIGNALS
ES³ 2012
Editors
Laurence Devillers, Université Paris-Sorbonne 4, France
Björn Schuller, Technische Universität München, Germany
Anton Batliner, Friedrich-Alexander-University, Germany
Paolo Rosso, Universitat Politècnica de Valencia, Spain
Ellen Douglas-Cowie, Queen's University Belfast, UK
Roddy Cowie, Queen's University Belfast, UK
Catherine Pelachaud, CNRS - LTCI, France
Workshop Organizers/Organizing Committee
Laurence Devillers, Université Paris-Sorbonne 4, France
Björn Schuller, Technische Universität München, Germany
Anton Batliner, Friedrich-Alexander-University, Germany
Paolo Rosso, Universitat Politècnica de Valencia, Spain
Ellen Douglas-Cowie, Queen's University Belfast, UK
Roddy Cowie, Queen's University Belfast, UK
Catherine Pelachaud, CNRS - LTCI, France
Workshop Programme Committee
Vered Aharonson, AFEKA, Israel
Alexandra Balahur, EC's Joint Research Centre, Italy
Felix Burkhardt, Deutsche Telekom, Germany
Carlos Busso, University of Texas at Dallas, USA
Rafael Calvo, University of Sydney, Australia
Erik Cambria, National University of Singapore, Singapore
Antonio Camurri, University of Genova, Italy
Mohamed Chetouani, Université Paris 6, France
Thierry Dutoit, University of Mons, Belgium
Julien Epps, University of New South Wales, Australia
Anna Esposito, IIASS, Italy
Hatice Gunes, Queen Mary University, UK
Catherine Havasi, MIT Media Lab, USA
Bing Liu, University of Illinois at Chicago, USA
Florian Metze, Carnegie Mellon University, USA
Shrikanth Narayanan, University of Southern California, USA
Maja Pantic, Imperial College London, UK
Antonio Reyes, Universidad Politècnica de Valencia, Spain
Fabien Ringeval, Université de Fribourg, Switzerland
Peter Robinson, University of Cambridge, UK
Florian Schiel, LMU, Germany
Jianhua Tao, Chinese Academy of Sciences, China
José A. Troyano, Universidad de Sevilla, Spain
Tony Veale, University College Dublin, Ireland
Alessandro Vinciarelli, University of Glasgow, UK
Haixun Wang, Microsoft Research Asia, China
Estonian Emotional Speech Corpus: theoretical base and implementation
Rene Altrov, Hille Pajupuu
Institute of the Estonian Language
Roosikrantsi 6, 10119 Tallinn, Estonia
E-mail: rene.altrov@eki.ee, hille.pajupuu@eki.ee
Abstract
The establishment of the Estonian Emotional Speech Corpus (EESC) began in 2006 within the framework of the National Programme
for Estonian Language Technology at the Institute of the Estonian Language. The corpus contains 1,234 Estonian sentences that
express anger, joy and sadness, or are neutral. The sentences come from text passages read out by non-professionals who were not
given any explicit indication of the target emotion. It was assumed that the content of the text would elicit an emotion in the reader and
that this would be expressed in their voice. This avoids the exaggerations of acted speech. The emotion of each sentence in the corpus
was then determined by listening tests. The corpus is publicly available at http://peeter.eki.ee:5000/.
This article gives an overview of the theoretical starting-points of the corpus and their usefulness for its implementation.
Keywords: emotional speech corpus, elicited emotions, non-acted speech, perception of emotions
1. Introduction
The Estonian Emotional Speech Corpus (EESC) is the
only publicly available corpus containing samples of
Estonian emotional speech. The main purpose of the
corpus is to serve emotion research and language
technology applications (see http://peeter.eki.ee:5000/).
The creation of the corpus began by formulating
theoretical starting-points (Altrov, 2008), based on
overviews of existing emotion corpora and previous
emotion research (Campbell, 2000; Cowie & Cornelius,
2003; Douglas-Cowie et al., 2003; Scherer et al., 2001;
Ververidis & Kotropoulos, 2006). Several questions
concerning the scope of the corpus and data selection had
to be answered: 1) Which emotions should the corpus
cover? 2) Should the corpus contain spontaneous, elicited,
or acted emotions? 3) Should the texts in the corpus be
spoken, or read? 4) Which texts should be selected and of
what length, content and context? 5) Should the texts be
presented by professional, or trained speakers (actors,
announcers), or non-professionals (ordinary people)?
6) What size should the corpus be? 7) How many
readers/speakers should be used? 8) Who should serve as
emotion evaluators in the perception tests, and how many?
2. Theoretical starting-points and creation
of the corpus
The main decisions taken concerning the establishment of
the corpus were (Figure 1):
1) Initially three emotions: sadness, anger and joy, plus
neutral speech were included in the corpus as being the
most useful emotions for language technology
applications (Campbell, 2000; Iida et al., 2003). In this
corpus, each of these three emotions also covers related
emotions. Thus, joy includes gratitude, happiness,
pleasantness and exhilaration present in the reader's
voice; sadness includes loneliness, disconsolateness,
concern and hopelessness; and anger includes resentment,
irony, reluctance, contempt, malice and rage. Neutral
speech in the corpus is normal speech without any
significant emotion.
2) Simulated emotions and actors were not used due to
concerns that actors might overact and use emotions that
are too intense and prototypical, and therefore differ from
speech that would be produced by a speaker experiencing
a genuine emotion (Campbell, 2000; Iida et al., 2003;
Scherer, 2003).
Authentic and moderately expressed emotions were to be
gathered from text passages read out by
non-professionals. The presumption was that the context
of the text would stimulate the reader to express the
emotion contained therein without them being told which
emotion to use (Iida et al., 2003; Navas et al., 2004).
The text passages chosen were journalistic texts,
unanimously recognised by readers in a special test to
contain the emotions of joy, anger or sadness. The reason
for choosing journalistic texts was that when the corpus
was created, it was primarily seen as being a tool for the
text-to-speech synthesis of journalistic texts.
The person to read out the texts was chosen very
carefully: they had to have good articulation, a pleasant
voice and a sense of empathy. Experts were asked to
evaluate their articulation. As empathic readers are better
at rendering the emotions contained in a text, the
candidates were asked to take the empathy test by
Baron-Cohen & Wheelwright (2004). Another test was
carried out to evaluate the pleasantness of the candidates'
voices and listeners were asked to pick the speaker with
the most pleasant voice (Altrov & Pajupuu, 2008).
Finally, a female voice was chosen and 130 text passages
were recorded for the corpus. The passages were
segmented into sentences, which were then available to be
used in the tests to determine the emotion of each sentence.
3) The emotional sense of each corpus sentence was
determined by listening tests. The creators of the corpus
were not completely sure how well listeners would do
trying to identify the emotions contained in non-acted
speech without actually seeing the speaker. Therefore, the
participants in the listening tests were carefully chosen to
increase the success rate in the identification of the
emotion.
Earlier research implies that more mature listeners may
recognise emotions from vocal cues better than younger
ones (e.g., students), because emotion recognition is a
culture-specific skill that can be acquired only with time
(Toivanen et al., 2004). Thus the creators of EESC
decided to use Estonians who were over 30 and had spent
their lives in Estonia.
Figure 1: Creation of the EESC. [Flowchart: choice of emotions (joy, anger, sadness, neutral) → choice of reading material (journalistic text passages; emotion identified solely from the written text, min. 10 testers) → choice of readers (good articulation, pleasant voice, empathy); reading and recording the passages, segmenting them into sentences → listening test (identification of sentence emotion by audition; listeners: adult Estonians with good empathic abilities, min. 30 testers) → reading test (identification of sentence emotion by reading) → classifying sentences by comparing the results of listening and reading tests (see Table 1): content affects vs. does not affect the identification of sentence emotion.]
Previous studies also show that in addition to age,
empathy may play a great role in the recognition of
emotion (Baron-Cohen & Wheelwright, 2004;
Chakrabarti et al., 2006). Relying on the presumption that
empathic people are more capable of recognising
emotions in voice than non-empathic people (Keen,
2006), candidates were asked to take the empathy test by
Baron-Cohen & Wheelwright (2004).
Candidates were also asked, on a voluntary basis, to
answer the EPIP-NEO questionnaire (for the Estonian
version of the questionnaire see Mõttus et al., 2006) to
study links between a person's personality traits and their
ability to identify emotions.
The corpus contains 190 registered testers. Collected user
data includes: sex, age, education, nationality, mother
tongue, language of education, work experience, empathy
quotient, and personality profile.
4) The 1,234 sentences in the corpus were used for 14
web-based tests. The underlying principle of the tests was
that the content of two successive sentences must not
form a logical sequence. Listening test subjects heard
isolated sentences without seeing the text and then had to
decide which emotion the sentences contained. The
available choices were the three emotions: sadness, anger,
or joy, or neutral speech.
At least 30 Estonians listened to each sentence.
For 908 sentences, more than 50% of listeners identified
one and the same emotion, or neutrality.
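As a toy illustration of the ordering principle above (the exact EESC procedure is not documented here), one can reshuffle the test items until no two adjacent sentences come from the same source passage; the data layout and field names below are invented for illustration.

```python
import random

def shuffled_no_adjacent(sentences, key=lambda s: s["passage_id"], max_tries=1000):
    """Shuffle until no two adjacent items share a source passage."""
    for _ in range(max_tries):
        order = random.sample(sentences, len(sentences))
        if all(key(a) != key(b) for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid ordering found")

# Hypothetical items: 12 sentences from 4 passages of 3 sentences each.
items = [{"passage_id": i // 3, "text": f"sentence {i}"} for i in range(12)]
for s in shuffled_no_adjacent(items):
    print(s["passage_id"], s["text"])
```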
One issue with the listening tests that needed to be
addressed was the role of the content in identifying the
emotion of the sentence. Thus, the same sentences were
used in 14 reading tests and subjects were asked to decide
on the emotion or neutrality of the sentences by reading
them (without audio). These subjects had not participated
in the listening tests.
The emotions identified by the listeners and readers did
not always coincide. This led to the establishment of two
categories (Table 1):
– sentences where content did not affect emotion identification (the results of reading tests differ from the results of listening tests);
– sentences where content might have affected emotion identification (the results of reading tests coincide with the results of listening tests).
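To make the grouping rule concrete, the following Python sketch applies it to the vote shares of sentence 1 in Table 1; the majority threshold and data layout are illustrative assumptions, not the corpus's actual pipeline.

```python
# Toy implementation of the classification rule described above.
def majority_label(votes):
    """Return the label chosen by more than 50% of testers, or None."""
    total = sum(votes.values())
    for label, share in votes.items():
        if share / total > 0.5:
            return label
    return None

def classify(listening_votes, reading_votes):
    """Assign a corpus group by comparing listening and reading majorities."""
    heard = majority_label(listening_votes)
    read = majority_label(reading_votes)
    if heard is None:
        return None, "unable to identify"
    group = "content influence" if heard == read else "no content influence"
    return heard, group

# Sentence 1 from Table 1 (test results in %):
print(classify({"joy": 87.5, "anger": 0.0, "neutral": 12.5},
               {"joy": 4.0, "anger": 0.0, "neutral": 32.0, "not sure": 32.0}))
# -> ('joy', 'no content influence')
```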
Tests        | Joy  | Anger | Neutral | Not sure | Sentence type in corpus
1. Ehkki Ott minu olemasolust midagi ei teadnud. [Although Ott knew nothing of my existence.]
By listening | 87.5 | 0.0   | 12.5    | –        | Joy, no content influence
By reading   | 4.0  | 0.0   | 32.0    | 32.0     |
2. Ükskõik, mida ma teen, ikka pole ta rahul! [Whatever I do, he is never satisfied!]
By listening | 0.0  | 14.3  | 5.7     | –        | Sadness, no content influence
By reading   | 0.0  | 64.3  | 0.0     | 0.0      |
3. Täiesti mõistetamatu! [Completely incomprehensible!]
By listening | 0.0  | 100.0 | 0.0     | –        | Anger, content influence
By reading   | 0.0  | 83.0  | 11.0    | 5.6      |

Table 1: Classification of emotions in the corpus by emotion identification in reading and listening tests (test results in %).
In Table 2 the number of corpus sentences is given by groups.

Emotion            | Sentences | Content influence on identification | No content influence on identification
joy                | 232       | 163                                 | 69
anger              | 277       | 177                                 | 100
sadness            | 191       | 88                                  | 103
neutral            | 208       | 87                                  | 121
unable to identify | 326       | –                                   | –
Total              | 1234      | –                                   | –

Table 2: Number of sentences in emotion corpus.
Although such double testing of each corpus sentence is
rather time-consuming, it works as a validator for the
corpus. Corpus users can be sure that corpus sentences
contain emotions that can be identified during listening.
Users can select sentences where emotion is rendered by
voice only or sentences where emotion is also rendered by
content.
5) The corpus was designed so that it could be used for
multiple purposes and extended by adding readers,
sentences and emotions.
3. Options for corpus users
Users can search for sentences expressing anger, joy, or
sadness, or neutral sentences from the corpus
(http://peeter.eki.ee:5000/reports/list/).
Sentences are displayed as text and can be listened to by
clicking on them. The identification rate of emotion in
each sentence is also displayed.
Queries can be narrowed down to include only sentences
in which:
– content did not affect the identification of emotion;
– content might have affected the identification of emotion.
The audio recordings and text of sentences can be
downloaded and saved (WAV, TextGrid). There are three
labelling levels: phonemes; words and pauses; sentences.
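As an illustration, here is a minimal sketch of reading those three labelling levels from a downloaded sentence, assuming the third-party Python `textgrid` package; the file name is a placeholder and the tier layout is an assumption, not a documented EESC interface.

```python
# Minimal sketch: print the annotation tiers of a downloaded sentence.
# Assumes the third-party `textgrid` package (pip install textgrid);
# "sentence_0001.TextGrid" is a hypothetical file name.
import textgrid

tg = textgrid.TextGrid.fromFile("sentence_0001.TextGrid")
for tier in tg.tiers:  # e.g. phonemes; words and pauses; sentences
    print(f"Tier: {tier.name}")
    for interval in tier:
        if interval.mark:  # skip unlabelled stretches
            print(f"  {interval.minTime:.3f}-{interval.maxTime:.3f}  {interval.mark}")
```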
4. Implementation details
The corpus is a web-based application built on free
software: Linux, PostgreSQL, Python and Praat. All data
except the audio files are stored in a PostgreSQL database.
The web interface was created, and all data processing is
carried out, using the Python programming language
and the Pylons web framework.
installed in Windows and Linux systems. The web
interface is available for Estonian, English, Finnish,
Latvian, Russian and Italian, and can be easily adapted for
other languages. For the technical description of the
corpus, see http://peeter.eki.ee:5000/docs/.
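As a rough illustration of the data side, the following is a hypothetical sketch of how sentence metadata might be stored and queried in PostgreSQL from Python; the schema, table and column names are invented for illustration and are not the actual EESC database design.

```python
# Hypothetical schema and query for a corpus like the one described above.
# None of these table or column names come from the EESC documentation.
import psycopg2

SCHEMA = """
CREATE TABLE IF NOT EXISTS sentences (
    id                SERIAL PRIMARY KEY,
    text              TEXT NOT NULL,
    emotion           TEXT CHECK (emotion IN ('joy', 'anger', 'sadness', 'neutral')),
    identification    REAL,     -- % of listeners who agreed on the emotion
    content_influence BOOLEAN   -- did verbal content affect identification?
);
"""

def voice_only_sentences(conn, emotion):
    """Sentences of a given emotion where the voice alone carried it."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT text, identification FROM sentences "
            "WHERE emotion = %s AND NOT content_influence "
            "ORDER BY identification DESC",
            (emotion,),
        )
        return cur.fetchall()

if __name__ == "__main__":
    conn = psycopg2.connect(dbname="eesc_demo")  # placeholder connection
    with conn, conn.cursor() as cur:
        cur.execute(SCHEMA)
    for text, rate in voice_only_sentences(conn, "anger"):
        print(f"{rate:5.1f}%  {text}")
```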
5. Preliminary results
Currently the corpus is in a stage where the validity of the
theoretical starting-points can be verified and, if
necessary, corrections can be made.
1. It has been confirmed that listeners can easily identify
moderately expressed emotions from the voice of a
non-professional reader. For 73.5% of corpus sentences
(908 of 1,234), over 50% of listeners identified one and
the same emotion, or decided that the sentence was neutral
(Altrov & Pajupuu, forthcoming); see Table 3.
Listening response                                            | Joy         | Anger       | Sadness     | Neutral
Emotional sentences identified by more than 50% of listeners | 232         | 277         | 191         | 208
Mean percentage of identification (std)                      | 75.4 (14.5) | 73.3 (14.6) | 72.1 (14.7) | 68.3 (11.9)

Table 3: Statistics of the emotional and neutral sentences identified by the listening test.
2. In the early stages of creating the EESC the decision
was made to use people older than 30 as emotion
identifiers. This decision relied on the assumption that
people who have lived longer in a certain culture are more
likely to have acquired the skills of culture-specific
expression of emotions. In order to find out if the decision
to use older people as corpus testers was justified, Altrov
and Pajupuu (2010) compared the results of emotion
identification by people older than 30 and younger than
28 and found that the two groups differed significantly.
Younger people identified more sentences as neutral. Both
groups were also compared with Latvians. The latter
identified emotions quite differently from Estonians.
From these results it can be said that the identification of
emotions really is culture-specific and accurate emotion
identification requires spending a longer period in a
particular culture. It is therefore wise to use people who
have lived in Estonia for a long time to identify emotions
from vocal expression.
3. Currently a study is being undertaken on how much
listeners' empathic abilities affect their ability to identify
emotions from vocal expression.
4. Another issue that needs to be addressed is whether
classifying corpus sentences according to the influence of
sentence content on emotion identification is justified,
i.e., if any significant differences can be found between
the acoustic parameters of the two groups – “content
affects identification” and “content does not affect
identification”. So far, the corpus material has only been
used for studying the difference in intensity of sentence
emotions in the two groups. ANOVA has shown that the
intensity of sentences expressing anger and joy, and of
neutral sentences, differs significantly between the two
groups. However, there is no such difference in
intensity in sentences expressing sadness (Table 4).
Although it is just one acoustic characteristic, it may mean
that the content of text affects how an emotion is
acoustically expressed, which also means that dividing
corpus sentences into two groups is justified.
Pairs: content influences vs. no content influences

          | Df   | Sum Sq    | Mean Sq | F value | Pr(>F)
joy       | 1    | 103.00    | 103.00  | 5.62    | 0.0178
Residuals | 4189 | 76707.60  | 18.31   |         |
anger     | 1    | 271.38    | 271.38  | 11.92   | 0.0006
Residuals | 5053 | 114992.73 | 22.76   |         |
sadness   | 1    | 2.80      | 2.80    | 0.13    | 0.7166
Residuals | 3467 | 73757.47  | 21.27   |         |
neutral   | 1    | 591.40    | 591.40  | 31.66   | 0.0000
Residuals | 3949 | 73755.52  | 18.68   |         |

Table 4: ANOVA results on emotional intensity of sentences in two groups: “content affects identification” and “content does not affect identification”.
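For readers who want to reproduce this kind of comparison, a minimal sketch under stated assumptions follows: it computes one mean intensity value per sentence WAV and runs a one-way ANOVA between the two groups with SciPy. The file lists are placeholders, and this simplified intensity measure need not match the corpus's own measurements.

```python
# A minimal reproduction sketch (not the authors' pipeline): per-sentence
# mean intensity in dB, then a one-way ANOVA between the two groups.
# File names are placeholders; mono WAV files are assumed.
import numpy as np
from scipy.io import wavfile
from scipy.stats import f_oneway

def mean_intensity_db(path, frame_s=0.02):
    """Mean 20 ms frame intensity in dB relative to the file's peak."""
    rate, samples = wavfile.read(path)  # assumes mono audio
    x = samples.astype(np.float64)
    x /= max(np.abs(x).max(), 1e-10)    # normalise to [-1, 1]
    frame = int(frame_s * rate)
    n = len(x) // frame
    rms = np.sqrt(np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1))
    return float(np.mean(20 * np.log10(rms + 1e-10)))

# Hypothetical file lists for one emotion, split by corpus group.
content_influence = [mean_intensity_db(p) for p in ["joy_c_001.wav", "joy_c_002.wav"]]
no_content_influence = [mean_intensity_db(p) for p in ["joy_n_001.wav", "joy_n_002.wav"]]

f_value, p_value = f_oneway(content_influence, no_content_influence)
print(f"F = {f_value:.2f}, p = {p_value:.4f}")  # compare with Table 4
```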
6. Conclusion
This paper gives an overview of the theoretical base,
creation and content of the Estonian Emotional Speech
Corpus. The EESC contains 1,234 Estonian sentences that
have passed both reading and listening tests. Test takers
identified 908 sentences that expressed anger, joy,
sadness, or were neutral. The sentences were divided into
two groups: sentences in which content affected the
identification of the emotion and sentences in which it did
not. Development of the corpus continues. Corpus
sentences have also been categorised as positive, negative
and neutral. Preparations for extending the corpus by
adding video clips with spontaneous speech and their
testing are under way. The corpus is freely available and
is used in language technology projects for emotional
speech synthesis as well as for the recognition of emotions.
7. Acknowledgements
The study was supported by the National Programme for
Estonian Language Technology and the project
SF0050023s09 “Modelling intermodular phenomena in
Estonian”.
8. References
Altrov, R. (2008). Eesti emotsionaalse kõne korpus:
teoreetilised toetuspunktid [The Estonian emotional speech
corpus: theoretical starting points]. Keel ja Kirjandus, 4,
pp. 261–271.
Altrov, R., Pajupuu, H. (2008). The Estonian Emotional
Speech Corpus: release 1. In F. Čermák, R.
Marcinkevičienė, E. Rimkutė & J. Zabarskaitė (Eds.), The
Third Baltic Conference on Human Language
Technologies. Vilnius: Vytauto Didžiojo Universitetas,
Lietuviu kalbos institutas, pp. 9–15.
Altrov, R., Pajupuu, H. (2010). Estonian Emotional
Speech Corpus: Culture and age in selecting corpus
testers. In I. Skadiņa, A. Vasiļjevs (Eds.), Human
Language Technologies – The Baltic Perspective.
Proceedings of the Fourth International Conference
Baltic HLT 2010. Amsterdam: IOS Press, pp. 25–32.
Altrov, R., Pajupuu, H. (forthcoming). Estonian
Emotional Speech Corpus: Content and options. In G.
Diani, J. Bamford, S. Cavalieri (Eds.). Variation and
Change in Spoken and Written Discourse: Perspectives
from Corpus Linguistics. Amsterdam: John Benjamins.
Baron-Cohen, S., Wheelwright, S. (2004). The Empathy
Quotient: An investigation of adults with Asperger
syndrome or high-functioning autism, and normal sex
differences. Journal of Autism and Developmental
Disorders, 34(2), pp. 163–175.
Campbell, N. (2000). Databases of emotional speech. In
R. Cowie, E. Douglas-Cowie, & M. Schröder (Eds.),
ISCA Workshop on Speech and Emotion. Newcastle,
Northern Ireland, pp. 34–38.
Chakrabarti, B., Bullmore, E., Baron-Cohen, S. (2006).
Empathizing with basic emotions: common and
discrete neural substrates. Social Neuroscience, 1(3–4),
pp. 364–384.
Cowie, R., Cornelius, R.R. (2003). Describing the
emotional states that are expressed in speech. Speech
Communication, 40(1-2), pp. 5–32.
Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P.
(2003). Emotional speech: Towards a new generation
of databases. Speech Communication, 40, pp. 33–60.
Iida, A., Campbell, N., Higuchi, F., Yasumura, M. (2003).
A corpus-based speech synthesis system with emotion.
Speech Communication, 40(1-2), pp. 161–187.
Keen, S. (2006). A theory of narrative empathy.
Narrative, 14(3), pp. 207–236.
Mõttus, R., Pullmann, H., Allik, J. (2006). Toward more
readable Big Five personality inventories. European
Journal of Psychological Assessment, 22(3), pp. 149–
157.
Navas, E., Castelruiz, A., Luengo, I., Sanchez, J.,
Hernaez, I. (2004). Design and recording of an
audiovisual database of emotional speech in Basque. In
Proceedings of the International Conference on Language
Resources and Evaluation (LREC), Lisbon, Portugal, pp. 1387–1390.
Scherer, K.R., Banse, R., Wallbott, H.G. (2001). Emotion
inferences from vocal expression correlate across
languages and cultures. Journal of Cross-Cultural
Psychology, 32(1), pp. 76–92.
Toivanen, J., Väyrynen, E., Seppänen, T. (2004).
Automatic discrimination of emotion from spoken
Finnish. Language & Speech, 47(4), pp. 383–412.
Ververidis, D., Kotropoulos, C. (2006). Emotional speech
recognition: Resources, features, and methods. Speech
Communication, 48(9), pp. 1162–1181.