Modeling Mutual Influence of Interlocutor Emotion States in Dyadic Spoken
Interactions
Chi-Chun Lee, Carlos Busso, Sungbok Lee, Shrikanth Narayanan
Signal Analysis and Interpretation Laboratory (SAIL)
Electrical Engineering Department
University of Southern California, Los Angeles, CA 90089, USA
chiclee@usc.edu, busso@usc.edu, sungbokl@usc.edu, shri@sipi.usc.edu
Abstract
In dyadic human interactions, mutual influence - a person's influence on the interacting partner's behaviors - has been shown to be important and can be incorporated into the modeling framework for characterizing, and automatically recognizing, the participants' states. We propose a Dynamic Bayesian Network
(DBN) to explicitly model the conditional dependency between
two interacting partners’ emotion states in a dialog using data
from the IEMOCAP corpus of expressive dyadic spoken in-
teractions. Also, we focus on automatically computing the
Valence-Activation emotion attributes to obtain a continuous
characterization of the participants’ emotion flow. Our pro-
posed DBN models the temporal dynamics of the emotion states
as well as the mutual influence between speakers in a dialog.
With speech based features, the proposed network improves
classification accuracy by 3.67% absolute and 7.12% relative
over the Gaussian Mixture Model (GMM) baseline on isolated
turn-by-turn emotion classification.
Index Terms: emotion recognition, mutual influence, Dynamic
Bayesian Network, dyadic interaction
1. Introduction
In dyadic (two person) human-human conversation, the interactions between the two participants have been shown to exhibit varying degrees and patterns of mutual influence along several as-
pects such as talking style/prosody, gestural behavior, engage-
ment level, emotion, and many other types of user states [1].
This mutual influence guides the dynamic flow of the conver-
sation and often plays an important role in shaping the overall
tone of the interaction. In fact, we can view a dyadic conver-
sation as two interacting dynamical state systems such that the
evolution of a speaker’s user state depends not only on its own
history but also the interacting partner’s history. This modeling
will not only allow us to capture interactants’ user states more
reliably, but it could also provide a higher level description of
the interaction details, such as talking in-sync, avoidance, or
arguing.
With the increasing sophistication of automatic meeting and dialog analysis enabled by advances in audio-visual technologies, modeling emotion evolution has become an important aspect of dialog modeling. Emotion evolution is related to people's perception of the overall tone of an interaction, and it can
also be used to identify salient portions in a conversation. Fur-
ther, if we can better model the mutual influence during inter-
action, we could bring insights into designing communication
strategy for a human-machine interaction agent to promote effi-
cient communication. In this paper, we propose and implement
a model describing the evolution of emotion states of the two
participants engaged in dyadic dialogs by incorporating the idea
of mutual influence during interaction.
Emotion can be represented by three-dimensional at-
tributes as presented in [2]: (V)Valence: positive - negative,
(A)Activation: aroused - calm, (D)Dominance: strong - weak,
with each attribute associated with a numerical value indicating
the level of expression. In our model, we focus on Activation
and Valence dimensions only. This dimensional representation
offers a general description of the emotion, and it provides a
natural way for describing dynamic emotion evolution in a dia-
log since not all utterances in a dialog can be easily labeled as a
specific categorical emotion. Our approach contrasts with most
of the previous emotion classification schemes that have primar-
ily focused on utterance level recognition of categorical labels
[3] or emotion attributes [4]. Others, such as the work proposed in [5], have used features that encode contextual information to perform emotion recognition. However, most of these works have
neither considered decoding dynamic emotions through the di-
alog, nor have they incorporated the mutual influence exhibited
between interactants in their models.
Because of its ability to model conditional dependency be-
tween variables within and across time, we utilize the Dynamic
Bayesian Network (DBN) framework to model the mutual in-
fluence and temporal dependency of speakers’ emotional states
in a conversation. The experiments in this paper use the IEMOCAP database [6], since it provides a rich corpus of expressive dyadic spoken interaction. Also, detailed annotation of emotion
is available for every utterance in the corpus. We hypothesize
that by including cross speaker dependency and modeling the
temporal dynamics of the emotion states in a dialog, we can
obtain better emotion recognition performance and bring im-
proved insights into mutual influence behaviors in dyadic inter-
action.
The paper is organized as follows. Our research method-
ology is described in Section 2. The experimental results and
discussion are presented in Section 3. Conclusion and future
work are given in Section 4.
2. Research Methodology
2.1. Database and Annotation
2.1.1. IEMOCAP Database
We use the IEMOCAP database [6] for the present study. The
database was collected for the purpose of studying expressive
dyadic interaction from a multimodal perspective. The design-
ing of the database assumed that by exploiting dyadic interac-
tions between actors, a more natural and richer emotional display would be elicited than in speech read by a single subject [7]. This data allows us to investigate our hypothesis about the mutual influence between speakers during spoken interaction.

Table 1: Emotion Label Clustering (k=5)
Emotion Cluster    Number of Turns    Cluster Centroid (V, A)
Class 1            1254               (2.19, 3.29)
Class 2            1954               (3.15, 3.14)
Class 3            2027               (4.06, 2.21)
Class 4            1092               (1.89, 2.25)
Class 5            2016               (3.84, 3.55)
The database was motion captured and audio recorded in five
dyadic sessions with 10 subjects, where each session consists
of a different pair of male-female actors both acting out scripted
plays and engaging in spontaneous dialogs. The analysis in this
paper utilizes the recorded speech data from both subjects in
every dialog available with speech transcriptions and emotional
annotations. Three human annotations of categorical emotion labels, such as happy, sad, neutral, and angry, and two human evaluations of the three emotion attributes (Valence, Activation, Dominance) are available for every utterance in the database.
Each dimension is labeled on a scale of 1 to 5 indicating differ-
ent levels of expressiveness.
The database was originally manually segmented into ut-
terances. But, to ensure that we have both speakers’ acoustic
information for a given analysis window in our dynamic mod-
eling, we define a turn change, T, as one analysis window. Each T consists of two turns. Each turn is defined as the portion of speech belonging to a single speaker before he/she finishes speaking, and may consist of multiple originally segmented utterances. Figure 1 shows an example that explains our definition.
Figure 1: Example of Analysis Windows (speakers A and B; analysis windows T-1 and T; turns Turn_A1, Turn_B1, Turn_A2, Turn_B2).
The example has two speakers, A and B, and a total of two
analysis windows, T-1 and T, segmented. Speaker A is defined
as the first person to speak in a dialog, and is always the starting
point of any analysis window. A speaker can speak multiple
utterances in a given turn, as shown for Turn_A2 in Figure 1. Two turns - one from each speaker - denote a turn change, which is defined as our one analysis window. Annotators were asked to
provide a label for every utterance in the database. Since our
basic unit is a turn, an emotional label is given to every turn as
described in the following section.
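To illustrate this segmentation concretely, the following is a minimal Python sketch (not the authors' code) that groups a time-ordered list of (speaker, utterance) pairs into turns and then pairs consecutive turns into turn-change analysis windows; the input data structure and names are hypothetical.

```python
from itertools import groupby

def segment_turn_changes(utterances):
    """Group (speaker, utterance) pairs into turns, then pair consecutive turns
    into turn-change analysis windows T = (turn of speaker A, turn of speaker B).

    `utterances` is a time-ordered list of (speaker_id, utterance) tuples.
    Speaker A is whoever speaks first in the dialog.
    """
    # A turn is a maximal run of consecutive utterances by the same speaker.
    turns = [(spk, [utt for _, utt in group])
             for spk, group in groupby(utterances, key=lambda x: x[0])]

    # A turn change (one analysis window) is two consecutive turns,
    # one from each speaker, always starting with speaker A.
    windows = [(turns[i], turns[i + 1]) for i in range(0, len(turns) - 1, 2)]
    return windows

# Example corresponding to Figure 1: two analysis windows, with the second
# turn of speaker A consisting of two original utterances.
dialog = [("A", "Turn_A1"), ("B", "Turn_B1"),
          ("A", "Turn_A2a"), ("A", "Turn_A2b"), ("B", "Turn_B2")]
print(segment_turn_changes(dialog))
```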
2.1.2. Emotion Annotation
In this work, we focused on the Valence-Activation dimensions
of emotion representation, since the combination of these two
dimensions can be intuitively thought of as covering most of the conventional categorical emotions [4]. The dimension values for each turn are obtained by averaging the two annotated values. In order to reduce the number of possible emotion state values (5^2 = 25), we cluster these two dimensions'
values. Based on our empirical observation, we decided to
group these two dimension values into five clusters using the
K-Means clustering algorithm. Figure 2 shows our clustering
output. Although this averaging may create quantization noise, from Figure 2 we can see that this process does provide reasonably interpretable clusters.

Figure 2: K-Means Clustering Output of Valence-Activation (scatter of turns in the Valence (x-axis) versus Activation (y-axis) plane; both axes range from 1 to 5).

For example, cluster 3, represented by diamond-shaped markers, can be thought of as corresponding to angry because of its concentration at lower valence and higher activation values; in fact, about 70% of all angry utterances in the database on which at least two annotators agree reside in cluster 3. Cluster 2, represented by point-shaped markers and centered at about the mid-range of the Valence-Activation levels, can be thought of as neutral, and about 51% of the neutral utterances in the database reside in this cluster.
In total, 151 dialogs from the 5 pairs of subjects, comprising 8343 turns, are used in this paper. The distribution of turns for each emotion cluster and its cluster centroid are given in Table 1.
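As an illustration of how such turn-level labels could be constructed, here is a minimal Python sketch using scikit-learn's KMeans; the annotation array below is a random placeholder standing in for the real per-turn (Valence, Activation) ratings, and the paper does not state which clustering implementation was actually used.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-turn annotations: two annotators' (Valence, Activation)
# ratings on a 1-5 scale, shape (num_turns, 2 annotators, 2 dimensions).
rng = np.random.default_rng(0)
annotations = rng.integers(1, 6, size=(8343, 2, 2)).astype(float)

# Average the two annotators' values for each turn, as described above.
va_per_turn = annotations.mean(axis=1)           # shape (num_turns, 2)

# Cluster the averaged (Valence, Activation) values into 5 emotion classes.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(va_per_turn)
emotion_class = kmeans.labels_                   # one class label per turn
print(kmeans.cluster_centers_)                   # compare with Table 1 centroids
```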
2.2. Dynamic Bayesian Network Model
A Dynamic Bayesian Network (DBN) is a statistical graphical
modeling framework, where each node in a network is a random
variable and the connecting arrows represent the conditional de-
pendency between random variables. Since we want to capture
the time dependency and mutual influence between speakers’
emotion states, we propose to use the DBN structure shown in
Figure 3.
Figure 3: Proposed Dynamic Bayesian Network Structure (emotion nodes EMO_A and EMO_B and acoustic observation nodes F_A and F_B, unrolled over consecutive analysis windows T-1 and T).
In Figure 3, the EMO_A and EMO_B nodes represent the emotional class labels for speakers A and B in the dialog, and the F_A and F_B nodes represent the respective observed acoustic information, modeled by a mixture of Gaussian distributions;
the black rectangle represents the hidden mixture weights for
the GMM. The proposed network tries to model two aspects of
emotion evolution in an interaction. One is the time dependency
of the emotion evolution, where a person’s emotion state is con-
ditionally dependent on his/her previous emotion state modeled
as a first order Markov process. Second, the model incorporates
the mutual influence between the two speakers in the dyadic in-
teraction, where one speaker’s emotion state is affected by the
interacting partner’s emotion. The joint probability of emotion
states E_{B_t} and E_{A_t} and feature vectors Y_{B_t} and Y_{A_t} for a dialog under this model can be factored as shown in Equation 1:

P(\{E_{A_t}, Y_{A_t}\}, \{E_{B_t}, Y_{B_t}\}) =
    P(E_{A_1}) P(Y_{A_1}|E_{A_1}) P(E_{B_1}|E_{A_1}) P(Y_{B_1}|E_{B_1})
    \times \prod_{t=2}^{T} P(E_{B_t}|E_{B_{t-1}}) P(E_{B_t}|E_{A_t}) P(Y_{B_t}|E_{B_t})
    \times \prod_{t=2}^{T} P(E_{A_t}|E_{A_{t-1}}) P(E_{A_t}|E_{B_{t-1}}) P(Y_{A_t}|E_{A_t})        (1)
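To make the factorization concrete, the following is a minimal Python/NumPy sketch that evaluates the log of the expression in Equation 1 for given emotion label sequences. The conditional probability tables and the per-turn GMM log-likelihoods are hypothetical random placeholders, not parameters learned in the paper; the authors' actual implementation used the Bayes Net Toolbox described in Section 3.2.

```python
import numpy as np

K = 5  # number of emotion classes (clusters from Table 1)
rng = np.random.default_rng(0)

def random_cpt(shape):
    """Random conditional probability table, normalized over the last axis."""
    cpt = rng.random(shape)
    return cpt / cpt.sum(axis=-1, keepdims=True)

# Hypothetical (placeholder) model parameters.
p_A1          = random_cpt((K,))     # P(E_A1)
p_B1_given_A1 = random_cpt((K, K))   # P(E_B1 | E_A1), indexed [a1, b1]
p_A_given_A   = random_cpt((K, K))   # P(E_At | E_A(t-1))
p_A_given_B   = random_cpt((K, K))   # P(E_At | E_B(t-1))
p_B_given_B   = random_cpt((K, K))   # P(E_Bt | E_B(t-1))
p_B_given_A   = random_cpt((K, K))   # P(E_Bt | E_At)

def log_joint(e_A, e_B, loglik_A, loglik_B):
    """Log of Equation 1 for one dialog of T analysis windows.

    e_A, e_B:           emotion class indices per window for speakers A and B
    loglik_A, loglik_B: (T, K) arrays of observation log-likelihoods log P(Y_t | E_t)
    """
    lp = np.log(p_A1[e_A[0]]) + loglik_A[0, e_A[0]]
    lp += np.log(p_B1_given_A1[e_A[0], e_B[0]]) + loglik_B[0, e_B[0]]
    for t in range(1, len(e_A)):
        lp += (np.log(p_B_given_B[e_B[t - 1], e_B[t]])
               + np.log(p_B_given_A[e_A[t], e_B[t]])
               + loglik_B[t, e_B[t]])
        lp += (np.log(p_A_given_A[e_A[t - 1], e_A[t]])
               + np.log(p_A_given_B[e_B[t - 1], e_A[t]])
               + loglik_A[t, e_A[t]])
    return lp

# Tiny usage example with random labels and observations for a 3-window dialog.
T = 3
e_A = rng.integers(0, K, size=T)
e_B = rng.integers(0, K, size=T)
loglik_A = np.log(random_cpt((T, K)))
loglik_B = np.log(random_cpt((T, K)))
print(log_joint(e_A, e_B, loglik_A, loglik_B))
```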
3. Experimental Results and Discussion
3.1. Feature Extraction
We focused on acoustic cues for the modeling study in this pa-
per. All features except speech rate were extracted using the
Praat Toolkit [8], while speech rate was estimated as the number
of phonemes per second obtained from ASR forced alignment
output detailed in [6]. The following is the list of extracted features at the turn level as previously defined:

F0 Frequency: Mean, Standard Deviation, Minimum, Maximum, 25% Quantile, 75% Quantile, Range, InterQuantile Range, Median, Kurtosis, Skewness
Harmonic to Noise Ratio (HNR): Mean, Standard Deviation, Minimum, Maximum, 25% Quantile, 75% Quantile, Range, InterQuantile Range, Median, Kurtosis, Skewness
Intensity/Energy: Mean, Standard Deviation, Minimum, Maximum, 25% Quantile, 75% Quantile, Range, InterQuantile Range, Median, Kurtosis, Skewness
Speech Rate: Mean, Maximum, Minimum
13 MFCC Coefficients: Mean, Standard Deviation
27 Mel Frequency Bank Filter Output: Mean, Standard Deviation
This resulted in a 116-dimension feature vector. Further-
more, feature normalization was obtained by performing z-
normalization on the feature vectors with respect to each in-
dividual speaker’s neutral utterances. The rationale behind this
normalization was that while individuals may express emotions
differently, by normalizing with respect to neutral utterances,
speaker-dependent emotional modulation should be more com-
parable across speakers.
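The speaker-level normalization can be sketched as follows, assuming hypothetical arrays of turn-level feature vectors together with speaker IDs and a neutral-turn flag; this is an illustration of the z-normalization described above, not the authors' code.

```python
import numpy as np

def neutral_znorm(features, speaker_ids, is_neutral):
    """Z-normalize each speaker's 116-dim turn-level feature vectors using
    the mean and standard deviation of that speaker's neutral turns.

    features:    (num_turns, 116) array of raw turn-level features
    speaker_ids: (num_turns,) array of speaker identifiers
    is_neutral:  (num_turns,) boolean array marking neutral turns
    """
    normalized = np.empty_like(features, dtype=float)
    for spk in np.unique(speaker_ids):
        spk_mask = speaker_ids == spk
        neutral = features[spk_mask & is_neutral]
        mu = neutral.mean(axis=0)
        sigma = neutral.std(axis=0) + 1e-8   # guard against zero variance
        normalized[spk_mask] = (features[spk_mask] - mu) / sigma
    return normalized
```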
3.2. Experiment Setup
Experiment I: Recognize the 5 emotion classes described in Section 2.1.2.
Experiment II: Recognize only the Activation and Valence dimensions (each with 3 classes) separately, using the same proposed structure.
Experiment II was performed to help us identify which of the emotion dimensions is more likely to be affected by mutual influence in an interaction. Here, each dimension was again clustered into 3 classes (High, Medium, Low) using the K-Means algorithm.
Table 2 shows a summary of data distribution and centroid of
emotion classes for Experiment II.
Table 2: Valence & Activation Clustering (k=3)
          Valence                       Activation
          No. of Turns    Centroid      No. of Turns    Centroid
Low       2355            2.05          3096            2.21
Medium    3271            3.29          2525            2.97
High      2717            4.18          2722            3.69

For both experiments, forward feature selection was performed with accuracy percentage as the stopping criterion to reduce the number of features. We then analyzed four different structures representing different aspects of emotional state evolution in a dialog. The four different structures considered are shown in Figure 4. The first structure (1) is our baseline
model that does not incorporate any time or mutual influence
dependency. Therefore, it recognizes each turn separately with
trained GMM model using just the acoustic cues. Structure (2)
incorporates time dependency of individual speaker’s emotion
without mutual influence from the interacting partner. Struc-
ture (3) models only the mutual influence between speakers,
and Structure (4) is our proposed complete model that combined
both time and cross-speaker dependencies.
We tied the GMM parameters of both speakers' observation feature vectors in both experiments to maximize the use of training data. The parameters of each trained baseline GMM were passed on to the three other structures to ensure that any change in classification accuracy is due to the change in emotion dependency structure. The model was implemented and tested using the Bayes Net Toolbox [9]. All experiments were done with 15-fold cross validation, where about 140 dialogs were used for training and about 10 dialogs for testing. The number of mixtures for the GMM was determined empirically to be four. At
training, emotional labels and feature vectors were provided to
learn the mixture weights and conditional dependency between
emotional states using the EM Algorithm with Junction Tree In-
ference. At testing, the trained network decoded both speakers’
emotion labels by computing the most likely path of emotion
state evolution throughout the dialog given the sequence of ob-
servations.
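A rough sketch of this evaluation protocol, assuming a dialog-level split with scikit-learn's KFold, is shown below; the split and the commented training steps illustrate the described setup and are not the authors' actual scripts (which used the Bayes Net Toolbox in MATLAB).

```python
import numpy as np
from sklearn.model_selection import KFold

dialog_ids = np.arange(151)            # one ID per dialog in the corpus
kf = KFold(n_splits=15, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kf.split(dialog_ids)):
    train_dialogs = dialog_ids[train_idx]   # roughly 140 dialogs for training
    test_dialogs = dialog_ids[test_idx]     # roughly 10 dialogs for testing
    # 1) Train the baseline GMM observation models (4 mixtures, tied across
    #    speakers) on turns from train_dialogs.
    # 2) Copy the trained GMM parameters into structures (2)-(4) and learn the
    #    emotion-state dependencies with EM and junction tree inference.
    # 3) Decode the most likely emotion state sequence for each test dialog.
    print(f"fold {fold}: {len(train_dialogs)} train / {len(test_dialogs)} test dialogs")
```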
Figure 4: Structures of Emotion States Evolution: (1) Baseline; (2) Individual Time-Dependency; (3) Cross-Speaker Dependency; (4) Proposed Structure (both dependencies).
3.3. Results and Discussion
The results of both experiments are summarized in Table 3. The
performance measure used is the number of accurately classi-
fied turns divided by the total numbers of turns tested. Two
different results are shown for the Experiment II. The same col-
umn in Table 3 means that the experiment was carried out us-
ing the same set features obtained from feature selection output
in Experiment I, and the optimized column means that forward
feature selection was performed on each of the Activation and
Valence experiments separately.
In Experiment I, the results show that it is beneficial to incorporate both time dependency and mutual influence on the emotional state, since both Structures (2) and (3) improve the
classification performance. Our proposed DBN model which
combined both dependencies obtained an absolute 3.67% in-
crease in accuracy (relative 7.12% improvement) over our base-
line model. To see where the improvement comes from, we can
examine the results from Experiment II where the classification
was performed on Valence-Activation dimension separately.
Table 3: Summary of Experiment Accuracy Percentage
DBN Structure            I: 5 Emotion Classes    II: Activation-Only (3-Class)    II: Valence-Only (3-Class)
                                                 Same        Optimized            Same        Optimized
Chance                   24.29%                  37.11%      37.11%               39.21%      39.21%
Baseline - GMM (1)       51.53%                  62.30%      63.45%               56.59%      59.89%
Time Dependency (2)      52.68%                  62.02%      61.92%               59.78%      63.40%
Mutual Influence (3)     53.37%                  62.52%      62.30%               59.60%      62.67%
Proposed Model (4)       55.20%                  62.35%      62.49%               61.26%      65.02%
In Experiment II, the first observation is that, using speech related features exclusively, the baseline GMM classification accuracy is higher for the Activation dimension than for the Valence dimension by 5.71% absolute (10.01% relative). This agrees with our knowledge
about the discriminative power of acoustic features [10] in each
of these dimensions. The second observation is that we im-
proved classification accuracy in the Valence dimension by ap-
proximately 5% absolute (relative 8%) over baseline. However,
the effect is not as observable with the Activation dimension.
It appears that the advantage of this modeling comes primarily
in the Valence dimension instead of the Activation dimension.
We hypothesize that the mutual influence on interacting partners
may be more significant in the Valence dimension. However,
further analysis is necessary to verify this claim.
In summary, our proposed model, which captures both time
dependency and mutual influence between speakers, was able to
improve the overall classification accuracy. In spite of the lim-
ited amount of interaction data (151 dialogs with 10 subjects)
with potentially noisy emotion classes, it is still encouraging to
see that our model is able to capture these effects and improve
the recognition results.
4. Conclusions and Future Work
Interpersonal interactions often exhibit mutual influence along
different elements of the interlocutor behavior. In this paper,
we utilized the Dynamic Bayesian Network (DBN) to model
this effect to better capture the flow of emotion in dialogs. In
turn, we use the model for performing emotion recognition in
the Valence-Activation dimension. As shown in Section 3, it is
advantageous to model the dynamics and mutual influence of
emotion states in dialog for improving emotion classification.
There are two main limitations with this paper. The first
arises because we only had two human annotations on emo-
tion attributes for each utterance. In order to incorporate both
annotations to serve as our ground truth, we took the average
of the two annotation values for every turn; this created noise in the emotion labels. We plan on acquiring more annotations in the future to alleviate this problem. The other limitation is that we relied only on speech based features for our modeling; fortunately, the IEMOCAP database also provides detailed facial and rigid head/hand gesture information as well as transcriptions providing the language information, all of which have been shown to be useful for emotion modeling and could be incorporated into the model in the future.
Several other future directions can be pursued. One im-
mediate extension is to provide a mapping from the decoded Valence-Activation states to more human-interpretable emotion categories, and to extend this framework as a first stage
processing for inferring higher-level dialog attributes. Further,
mutual influence between speakers can happen at multiple lev-
els. In this paper, we examined this effect through recognizing
emotion states at the turn level. Prior works have shown mutual
influence on lexical structure [11] and on predicting task suc-
cess [12] at the dialog level. We can analyze this effect along
such levels using hierarchical structures. Furthermore, we are
in the process of obtaining other forms of interaction databases
with both natural human interaction and acted interaction. Once
we acquire better insights into mutual influence in human inter-
actions, we not only will be able to improve dialog modeling,
but may also be able to incorporate such information in the de-
sign of robust machine spoken dialog interfaces.
5. Acknowledgements
The paper was supported in part by funds from NSF, Army, and
USC Annenberg Fellowship.
6. References
[1] J. K. Burgoon, L. A. Stern, and L. Dillman, Interpersonal Adapta-
tion: Dyadic Interaction Patterns. Cambridge University Press,
1995.
[2] R. Kehrein, "The prosody of authentic emotions," in Speech Prosody Conference, 2002, pp. 423-426.
[3] A. Metallinou, S. Lee, and S. Narayanan, "Audio-visual emotion recognition using Gaussian mixture models for face and voice," in Proceedings of the IEEE International Symposium on Multimedia, Berkeley, CA, December 2008.
[4] M. Grimm, E. Mower, K. Kroschel, and S. Narayanan, "Primitives-based estimation and evaluation of emotions in speech," Speech Communication, vol. 49, pp. 787-800, November 2007.
[5] J. Liscombe, G. Riccardi, and D. Hakkani-Tur, “Using context to
improve emotion detection in spoken dialog systems,” in Inter-
speech, 2005, pp. 1845–1848.
[6] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, vol. 42, pp. 335-359, 2008.
[7] C. Busso and S. Narayanan, "Recording audio-visual emotional database from actors: a closer look," in Second Intl. Workshop on Emotion: Corpora for Research on Emotion and Affect, Int'l Conference on Language Resources and Evaluation, May 2008, pp. 17-22.
[8] P. Boersma and D. Weenink, “Praat: doing phonetics by computer
(version 5.1.03) [Computer program],” March 2009. [Online].
Available: http://www.praat.org/
[9] K. P. Murphy, "The Bayes Net Toolbox for MATLAB," Computing Science and Statistics, 2001.
[10] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, and S. Narayanan, "Analysis of emotion recognition using facial expressions, speech and multimodal information," in Proc. of the Int'l Conf. on Multimodal Interfaces, October 2004.
[11] A. Nenkova, A. Gravano, and J. Hirschberg, "High frequency word entrainment in spoken dialogue," in Proceedings of ACL-08: HLT, Companion Volume, 2008, pp. 169-172.
[12] D. Reitter and J. D. Moore, “Predicting success in dialogue,” in
Proceedings of the 45th Annual Meeting of the Association of
Computational Linguistics, 2007, pp. 808–815.