Simply the B.E.S.T.!
Creation and Validation of the Bremen Emotional Sounds Toolkit
Arvid Kappas, Dennis Küster, Pasquale Dente, & Christina Basedow
Jacobs University Bremen
References

Bradley, M. M., & Lang, P. J. (2007). The International Affective Digitized Sounds (2nd ed.; IADS-2): Affective ratings of sounds and instruction manual (Technical Report B-3). Gainesville, FL: University of Florida.
Calvo, R. A., D'Mello, S., Gratch, J., & Kappas, A. (Eds.). (2014). The Oxford Handbook of Affective Computing. New York, NY: Oxford University Press.
Izard, C. E. (1991). The psychology of emotions. New York, NY: Plenum Press.
Kappas, A. (2010). Smile when you read this, whether you like it or not: Conceptual challenges to affect detection. IEEE Transactions on Affective Computing, 1, 38-41. doi:10.1109/T-AFFC.2010.
Komatsu, T., & Yamada, S. (2011). How does the agents' appearance affect users' interpretation of the agents' attitudes: Experimental investigation on expressing the same artificial sounds from agents with different appearances. International Journal of Human-Computer Interaction, 27, 260-279. doi:10.1080/10447318.2011.537209
Read, R., & Belpaeme, T. (2012). How to use non-linguistic utterances to convey emotion in child-robot interaction. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction (pp. 219-220). doi:10.1145/2157689.2157764
Read, R., & Belpaeme, T. (2014). Situational context directs how people affectively interpret robotic non-linguistic utterances. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction (pp. 41-48). doi:10.1145/2559636.2559680
Russell, J. A., Weiss, A., & Mendelsohn, G. A. (1989). Affect Grid: A single-item scale of pleasure and arousal. Journal of Personality and Social Psychology, 57, 493.
Vrana, S. R., Spence, E. L., & Lang, P. J. (1988). The startle probe response: A new measure of emotion? Journal of Abnormal Psychology, 97, 487.
Acknowledgments
This research was funded by EU FP7 EMOTE (ICT-2011-8 317923).
Contact & Download
Prof. Arvid Kappas
a.kappas@jacobs-university.de
Jacobs University Bremen
Campus Ring 1
28759 Bremen, Germany
Abstract

The creation of a set of short synthetic sounds is described; the sounds were rated for valence, arousal, discrete emotions, and how well they fit specific speech acts. The sounds were designed to provide nonverbal signals, for example in the context of human-robot interaction, or to serve as emotional qualifiers for synthetic speech.
Supporting Summary
Communication between humans and machines, such as robots, is no longer a topic of science fiction. While some issues, such as understanding spoken language or the recognition of faces, are to a certain degree engineering challenges that are increasingly being met, other issues, such as the expression of emotions, are initially in the psychological realm and remain conceptually challenging. In fact, one of the great obstacles to facilitating the interaction between humans and machines is that, while emotions and nonverbal cues play an important role in human-human interaction, it is a challenge to create machines that express and sense user states reliably (Kappas, 2010). Yet there is an increasing belief that emotion-savvy machines are crucial to improve the interaction between intelligent artificial systems and humans. In this context, a new discipline, Affective Computing (Calvo, D'Mello, Gratch, & Kappas, 2014), has emerged in recent years that bridges several fields, including engineering, computer science, and psychology. While considerable effort has been placed on the synthesis of expressive displays, vocal/acoustic cues are much less studied. In fact, text-to-speech synthesis systems typically have only rudimentary capacities for including affective paralinguistic information.

Partially inspired by movie portrayals of robots (e.g., R2-D2; WALL-E), nonverbal sounds have recently been created by several authors as a means to replace or augment synthetic speech that lacks paralinguistic cues (e.g., Komatsu & Yamada, 2011; Read & Belpaeme, 2012, 2014). We describe the creation of a set of 408 synthetic sounds that were evaluated by a total of 518 judges according to valence and arousal, discrete emotions, as well as how well they would fit specific speech acts. The Bremen Emotional Sounds Toolkit, freely available to researchers, provides many possibilities to communicate acoustically without depending on any given language, including use in multilingual environments, with children, psychiatric patients, or patients with brain damage. In addition, the sounds provide a toolkit for the study of affective sounds in related applications. Unlike the International Affective Digitized Sounds (Bradley & Lang, 2007), no natural sounds were used (e.g., screams), which may significantly reduce the degree to which their perception is shaped by context-specific phenomena. For example, one would not expect one's car to give an alert message with the voice of a crying baby, but a synthetic sound fits the notion that a machine is trying to communicate. Furthermore, these sounds combine well with text-to-speech synthesis to create hybrid sound-speech acts that, on the one hand, allow the emotional qualification of individual utterances and, on the other hand, help to render an artificial system more like a sentient entity by adding more "biological" features, perhaps resembling features of prototypical animal or child-like vocalizations, without being mistaken for them.

We have also included ambiguously judged stimuli in the toolkit, because these might easily be qualified by situational context. Here, the function of the sound-utterance combination, or even of individual sounds, is fully social: creating the impression of a plausible (artificial) interaction partner.
Methods: Design

Participants (sound evaluation)
N = 518 completed across all studies (137 female); M age = 33.7 (SD = 8.65)
194 unique participants from 41 different countries

Design
A small set of participants (N = 9) initially produced (encoded) the sounds (Figure 1)
56 sounds each; 8 affective triplets (e.g., "Surprise-Amazed-Astonished")
6 speech acts (e.g., Greeting), 2 intensity levels (high vs. low)
A large set of participants evaluated the sounds (decoding)
Recruited online via Crowdflower (www.crowdflower.com)
$3 compensation per set (about 30 min duration)
Only top-performing Crowdflower workers
All sounds were presented individually in a total of 13 sub-studies:
7 using the Differential Emotion Scale (DES; Izard, 1991),
4 using a valence + arousal Affect Grid (AG; Figure 2),
2 to evaluate speech acts (SPA).
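As a rough illustration of how per-sound summary ratings from these sub-studies could be compiled, the following minimal sketch averages Affect Grid ratings per sound. The file name and column names (sound_id, valence, arousal) are hypothetical assumptions, not the toolkit's actual data format.

```python
import csv
from collections import defaultdict

def mean_ratings(path):
    """Average valence and arousal ratings per sound across all judges.

    Assumes a CSV with the hypothetical columns sound_id, valence, arousal
    (one row per individual Affect Grid judgment, ratings on 1-9 scales).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # sound_id -> [sum_valence, sum_arousal, n]
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            entry = sums[row["sound_id"]]
            entry[0] += float(row["valence"])
            entry[1] += float(row["arousal"])
            entry[2] += 1
    return {sid: (v / n, a / n) for sid, (v, a, n) in sums.items()}

# Example (hypothetical file): ratings = mean_ratings("affect_grid_ratings.csv")
```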
Figure 1. Encoding environment using an iPad & Bebot Robot Synth app (Normalware, Version 2.0)
Results
[Results figures: ratings of individual sounds plotted in the 9-point valence (pleasure) x arousal space, comparing the Bremen Emotional Sounds Toolkit (B.E.S.T., 2014) with the International Affective Digitized Sounds (IADS-2, 2007).]
Next Steps

Embodied Voice Evaluation Study (EVES): creation of sentences → synthesis → implementation on Nao → lab study with children (N = 23) → analysis

Feedback and Emblems Study (FEaST): design decisions → selection of statements → synthesis & implementation → online study with 60 stimuli → analysis
Methods: Materials & Procedure

Sounds from the B.E.S.T. were rated on emotion dimensions (a 9-point pleasure x arousal Affect Grid), discrete emotions (DES categories: Anger, Disgust, Enjoyment, Fear, Interest, Sadness, Shame, Surprise, Contempt, Guilt), and speech acts.

Figure 2. Digital Affect Grid adapted from Russell, Weiss, and Mendelsohn (1989).
Figure 3. Sound evaluations performed by participants using DES triplets as defined by Izard (1991).
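For illustration only, a small sketch of how a selection on a 9 x 9 digital Affect Grid could be translated into 1-9 pleasure and arousal scores; the function and its coordinate convention are assumptions, not the actual implementation used in these studies.

```python
def grid_to_scores(col, row, size=9):
    """Map a cell on a size x size Affect Grid to (pleasure, arousal) scores.

    Assumed convention: col runs left (displeasure) to right (pleasure),
    row runs top (high arousal) to bottom (sleepiness); both are 0-based.
    """
    if not (0 <= col < size and 0 <= row < size):
        raise ValueError("cell lies outside the grid")
    pleasure = col + 1       # 1 = extreme displeasure ... 9 = extreme pleasure
    arousal = size - row     # 1 = sleepiness ... 9 = high arousal
    return pleasure, arousal

# Example: the centre cell maps to the neutral midpoint of both scales.
# grid_to_scores(4, 4) -> (5, 5)
```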
Download: www.faculty.jacobs-university.de/dkuester/ICPS2015/BEST_ICPS_2015.pdf
Discussion

Overall, the B.E.S.T. aims to fill a niche in research and applications requiring emotional sounds that are very homogeneous and contextually neutral. While existing sets of affective sounds such as the IADS can cover a significantly larger proportion of the valence-arousal space, IADS sounds are very heterogeneous both with respect to their physical properties and psychologically in terms of their associated social contexts (e.g., the sound of a dentist's drill vs. a giggling baby). In contrast, sound emblems from the B.E.S.T. are likely to elicit changes in emotional states in a less association-driven manner. At the same time, our emblems can be a means to manipulate mood in a subtle way without being physically discomforting (cf., e.g., startle probes; Vrana, Spence, & Lang, 1988).

Sounds from the B.E.S.T. are nevertheless limited in the extent to which they can be used to convey different discrete emotional states. While the emblems cover a relatively large portion of the space in terms of perceived arousal, our validation studies suggest that the possible range on the valence (pleasure) dimension is much more limited. Further, while different discrete positive and negative states overall tend to be localized as expected in the valence-arousal space, agreement between encoders and decoders is rather low. This suggests that the B.E.S.T. may be most useful in psychological research that aims to use an auditory variable to elicit, or assess, relatively subtle changes in mood when experimental hypotheses need to be concealed, or when the nature of the required stimuli has to be non-verbal, non-musical, or otherwise removed from context-dependent confounds. For example, in the context of the ongoing EU FP7 EMOTE project, the B.E.S.T. is used to represent the emotional states of a robotic tutor that helps children to learn geography-related skills. Here, the application requires short emotional sounds that can quickly suggest an emotional tendency but that are still malleable enough to be re-interpreted if the robot should happen to misjudge the emotional state of the child.
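As a hedged sketch of the kind of lookup such an application might use, the snippet below picks the sound whose mean ratings lie closest to a target valence-arousal state; the nearest_sound helper and the ratings dictionary format are hypothetical and simply match the aggregation sketch shown earlier.

```python
import math

def nearest_sound(target_valence, target_arousal, ratings):
    """Return the id of the sound whose mean ratings are closest to the target.

    ratings: dict mapping sound_id -> (mean_valence, mean_arousal), e.g. the
    output of the hypothetical mean_ratings() sketch above (1-9 scales).
    """
    return min(
        ratings,
        key=lambda sid: math.hypot(
            ratings[sid][0] - target_valence,
            ratings[sid][1] - target_arousal,
        ),
    )

# Example: a positively valenced, moderately arousing emblem for an encouraging
# tutor response might be chosen via nearest_sound(7.0, 6.0, ratings).
```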