Audio-visual emotion recognition using FCBF feature selection method and particle swarm optimization for fuzzy ARTMAP neural networks

Davood Gharavian (1,2), Mehdi Bejani (3), Mansour Sheikhan (1)

Received: 18 January 2015 / Revised: 28 October 2015 / Accepted: 18 December 2015
© Springer Science+Business Media New York 2016
Abstract Humans use many modalities such as face, speech and body gesture to express their feelings. So, to make emotional computers and make human-computer interaction (HCI) more natural and friendly, computers should be able to understand human feelings using speech and visual information. In this paper, we recognize emotions from audio and visual information using a fuzzy ARTMAP neural network (FAMNN). The audio and visual systems are fused at the decision and feature levels. Finally, particle swarm optimization (PSO) is employed to determine the optimum values of the choice parameter (α), the vigilance parameters (ρ), and the learning rate (β) of the FAMNN. Experimental results showed that the feature-level and decision-level fusions improve the outcome of the unimodal systems. PSO also improved the recognition rate. By using the PSO-optimized FAMNN at feature-level fusion, the recognition rate was improved by about 57 % with respect to the audio system and by about 4.5 % with respect to the visual system. The final emotion recognition rate on the SAVEE database reached 98.25 % using audio and visual features with the optimized FAMNN.
Keywords Audio-visual emotion recognition · Particle swarm optimization · Fuzzy ARTMAP neural network
Multimed Tools Appl, DOI 10.1007/s11042-015-3180-6
* Davood Gharavian
  dgharavian@gmail.com

Mehdi Bejani
  St_m_bejani@azad.ac.ir

Mansour Sheikhan
  msheikhn@azad.ac.ir

1 Department of Electrical Engineering, Islamic Azad University, South Tehran Branch, Tehran, Iran
2 Department of Electrical Engineering, Shahid Beheshti University, Tehran, Iran
3 Islamic Azad University, South Tehran Branch, Tehran, Iran
1 Introduction
Humans communicate with each other far more naturally than they do with computers. One of the main problems in human-computer interaction (HCI) systems is the transmission of implicit information. To make HCI more natural and friendly, computers must be able to understand humans' emotional states in the same way that humans do.
In recent years, emotion recognition has found many applications, such as the medical-emergency domain to detect stress and pain [15], interactions with robots [27,41], computer games [26], and the development of man-machine interfaces for helping frail and elderly people [36].
There are many modalities, such as the face, body gesture and speech, that people use to express their feelings. The combination of these modalities depends on the context in which they occur and on the subjects themselves; therefore, there is a wide variety of combination patterns [30].
Some studies in psychology and linguistics confirm the relation between affective displays
and specific audio and visual signals [2,17].
Mehrabian [33] states that there are basically three elements in any face-to-face communication: facial expression in the visual channel and vocal tone in the audio channel are the most important affective cues (55 % and 38 %, respectively), and words contribute only 7 % of the overall impression.
There are some approaches for quantifying and measuring emotions such as discrete
categories and dimensional description [40]. In this work, we used basic discrete emotion
categories including happiness, fear, sadness, anger, surprise, neutral and disgust that are
rooted in the language of daily life. This description was especially supported by the cross-
cultural studies conducted by Ekman [16]. Most of the existing studies of automatic emotion
recognition focus on recognizing these basic emotions. These seven emotional states are
common and have been used in the majority of previous works [5,7,14,21,30,31,37,38,
46]. Our method is general and can be extended to more emotional states. Using universal
emotion models, it is easy to recognize emotional states [49].
Two main fusion approaches used in the literature are feature-level fusion and
decision-level fusion. The goal of this paper is to simulate human perception of emotions
by combining emotion-related information from facial expression and audio. So, we
used different approaches to fuse audio and facial expression information. The classifier
type also affects emotion recognition rate significantly. Usually different classifiers such
as artificial neural networks (ANNs), support vector machines (SVMs), decision trees, K-
nearest neighbor (KNN), Gaussian mixture models (GMMs), hidden Markov models
(HMMs), and Bayesian networks have been used for emotion recognition. Researchers have also proposed hybrid and multi-classifier methods [49]. Here, we used the
fuzzy adaptive resonance theory mapping (ARTMAP) neural network [9] as our pro-
posed classifier, and particle swarm optimization (PSO) was employed to determine the
optimum values of the choice parameter (α), the vigilance parameters (ρ)andthe
learning rate (β) of the fuzzy ARTMAP neural network (FAMNN).
The remainder of this paper is organized as follows: Section 2 reviews the recent research in this field. Section 3 presents our methodology for this problem. In this section, we first discuss the SAVEE database that has been used in this work, and then how the audio and video features were extracted, as well as the feature reduction and feature selection procedures. The FAMNN is also introduced as the classifier, and finally, the particle swarm optimization method is presented for optimizing the FAMNN and improving the classification accuracy. Section 4 contains the experimental results. In Section 5, the influence of the PSO-optimized FAMNN on the performance of emotion recognition is reported. Finally, conclusions are drawn in Section 6.
2 Background and related works
Recently, audio-visual based emotion recognition methods have attracted the attention of the
research community. In the survey of Pantic and Rothkrantz [39], only four studies were found
to focus on audio-visual affect recognition. Since then, affect recognition using audio and visual information has been the subject of much research. The most up-to-date survey on affect recognition methods for audio, visual and spontaneous expressions belongs to Zeng et al. [49]. Here, some of the main works in this field are mentioned briefly.
De Silva and Pei Chi [14] used a rule based method for decision level fusion of speech and
visual based systems. In speech, pitch was extracted as the feature and used in the nearest-
neighbor classification method. In video, they tracked facial points with optical flow, and
hidden Markov model (HMM) was trained as the classifier. The decision level fusion
improved the result of the individual systems.
Song et al. [46] used a tripled hidden Markov model (THMM) to model joint dynamics of
the three signals perceived from the subject: a) pitch and energy as speech features, b) motion
of eyebrow, eyelid, and cheek as facial expression features, and c) lips and jaw as visual speech
signals. The proposed THMM architecture was tested for seven basic emotions (surprise,
anger, joy, sadness, disgust, fear, and neutral), and its overall performance was 85 %.
Mansoorizadeh and Moghaddam Charkari [30] compared feature level and decision level
fusions of speech and face information. They proposed an asynchronous feature-level fusion
approach that improved the result of combination. For speech analysis, they used the features
related to energy and pitch contour. For face analysis, the features representing the geometric
characteristic of face area were used. The multimodal results showed an improvement over the
individual systems.
Hoch et al. [24] developed an algorithm for bimodal emotion recognition. They used a
weighted linear combination for the decision level fusion of speech and facial expression
systems. They used a database of 840 audio-visual samples with 7 speakers and 3 emotions. Their system classified the 3 emotions (positive, negative and neutral) with an average recognition rate of 90.7 %. By using a fusion model based on a weighted linear combination, the performance improvement was nearly 4 % compared to that of unimodal emotion
recognition.
Paleari [38] presented a semantic affect-enhanced multimedia indexing (SAMMI) to extract
real-time emotion appraisals from non-prototypical person independent facial expressions, and
vocal prosody. Different probabilistic methods for fusion were compared and evaluated with a
novel fusion technique called NNET. The results showed that NNET can improve the
recognition score by about 19 % and the mean average precision by about 30 % with respect
to the best unimodal system.
Haq and Jackson [21] used feature- and decision-level fusion for audio and visual features on the SAVEE database. 106 utterance-level audio features (fundamental frequency, energy, duration and spectral) and 240 visual features (marker locations on the face) were used for this system. A Gaussian classifier was employed to fuse the information at the different levels. They used principal component analysis (PCA) and linear discriminant analysis (LDA) feature selection algorithms. Using PCA and LDA, emotion classification rates of 92.9 % and 97.5 % for audio-visual features, 50 % and 56 % for audio features, and 91 % and 95.4 % for visual features were reported.
Bejani et al. [5] investigated a multi-classifier audio-visual system that combined the speech
features (MFCC, pitch, energy and formants) and facial features (based on ITMI and QIM) on
the eNterface05 database. By using the multi-classifier system, the recognition rate was
increased up to 22.7 % over the speech based system and up to 38 % over the facial expression
based system.
In recent years, emotion recognition has had many applications in more generic mediated
communications. Lopez-de-Ipina et al. [28] identified novel technologies and biomarkers or
features for the early detection of Alzheimer's disease (AD) and its degree of severity. It
concerns the Automatic Analysis of Emotional Response (AAER) in spontaneous speech based
on Emotional Temperature and fractal dimension to validate tests and biomarkers for future
diagnostic use. The AAER shows very promising results for the definition of features useful in
the early diagnosis of AD. Harley et al. [22] presented a novel approach for measuring and
synchronizing emotion data from three modalities (automatic facial expression recognition, self-
report, electrodermal activity) and their consistency regarding learners' emotions. They found a high level of coherence between the facial recognition and self-report data (75.6 %), but low levels of consistency between them and electrodermal activation, suggesting that a tightly coupled
relationship does not always exist between emotional response components. Weisgerber et al. [47]
tested facial, vocal and musical emotion recognition capacities in schizophrenic patients. Dai et al.
[13] proposed a computational method for emotion recognition on vocal social media to estimate
complex emotion as well as its dynamic changes in a three-dimensional PAD (Pleasure-Arousal-
Dominance) space. They analyzed the propagation characteristics of emotions on the vocal social
media site WeChat.
In recent years, researchers have focused on finding reliable, informative features and combining powerful classifiers in order to improve the emotion recognition rate in real-life applications [37,44]. Accordingly, developing optimal design methods for classification is an active research field. Here, we propose a PSO-optimized FAMNN that improves the emotion recognition results compared with the audio, visual and audio-visual systems.
It is clear that emotional states influence audio and visual features of a person. In other
words, audio and visual features maintain information about emotional states that synergisti-
cally influence the recognition process. The use of data fusion methods for audio and visual information, together with a sequential process of feature reduction, feature selection, classification and classifier optimization, forms the well-designed approach of this research. In this work, we examine different fusion approaches for the audio-visual emotion recognition system, report the
results, and finally, propose the most appropriate fusion method for such systems. To reduce
computation cost and use the most effective features, feature reduction and feature selection
algorithms were used for the audio and visual features.
3 Methodology
Various audio and visual information fusions were made with different setups of feature reduction and selection methods and classifiers in the emotion recognition systems. In this setup, the audio features (Mel-frequency cepstral coefficients (MFCC), pitch, energy and formants) and the visual features (marker locations on the face) were extracted, and the features were reduced by the PCA feature reduction algorithm. Next, the FCBF feature selection method was applied to the reduced features. Then the FAMNN was used for various setups of the audio-visual emotion recognition systems. Finally, the PSO was employed to optimize the FAMNN and improve the experimental results.
The main goal of the present work is to quantify the performance of the audio and visual systems, recognize the strengths and weaknesses of these systems' setups, and compare the obtained setups for combining the two modalities to increase the performance of the system.
To combine the visual and audio information, two different approaches were implemented: feature-level fusion, in which a single classifier with features of both modalities is used, and decision-level fusion, which uses a separate classifier for each modality and combines the outputs using the stacked generalization method, where the output of the ensemble serves as a feature vector to a meta-classifier. We used a FAMNN as the meta-classifier to improve the generalization performance. Figure 1 shows an overview of the proposed recognition system.
FAMNN 1 shows the result of audio emotion recognition, and FAMNN 2 classifies the
visual features after feature reduction and selection stages. The audio and visual features are
also mixed together and passed through the PCA and FCBF stages; then the selected features are fed to FAMNN 3.
The PCA-reduced audio and visual features were mixed together and then FCBF feature
selection was applied to the mixed audio-visual reduced features. The selected features were
used in FAMNN 4 for emotion recognition stage. FAMNN 5 used the selected separate audio
and visual features and classified the emotion states accordingly.
The output of FAMNN 1 and FAMNN 2 serves as a feature vector to FAMNN 6. This
experiment is a decision level fusion of audio and visual systems. In the following, the details
are described.
3.1 Database
We used the Surrey Audio-Visual Expressed Emotion (SAVEE) database (http://personal.ee.surrey.ac.uk/Personal/P.Jackson/SAVEE/Database.html), which was recorded from four native male English speakers (aged 27-31 years) with 60 markers painted on their faces, in CVSSP's 3D vision laboratory at the University of Surrey, UK. Figure 2 presents some examples of facial markers placed on the four subjects with various emotions.

Fig. 1 Overview of the emotion recognition system
The sentences were recorded in seven emotional states: anger, disgust, fear, happiness,
neutral, sadness and surprise. The recordings consisted of 15 phonetically-balanced
TIMIT sentences per emotion: 3 common, 2 emotion specific and 10 generic sentences
that were different for each emotion. The 3 common and 2 emotion specific sentences were
recorded in neutral emotion, which resulted in 30 sentences for neutral emotion and 480
utterances in the database.
Emotion and sentence prompts were displayed on a monitor in front of the actor during the recordings. The 3DMD dynamic face capture system [1] provided color video and Beyerdynamic microphone signals over several months during different periods of the year. The sampling rate was 44.1 kHz for the audio and the frame rate was 60 fps for the video. The 2D video of the frontal face of the actor was recorded with one color camera.
3.2 Feature extraction
3.2.1 Audio features
Most of the existing approaches to audio emotion recognition used acoustic features as
classification input. The popular features are prosodic features (e.g., pitch-related and energy-related features) and spectral features (e.g., MFCC and cepstral features). So, pitch, intensity, MFCC and formant features at the frame level were used in this work for audio emotion recognition. These features were used due to their popularity, descriptive power and bibliographical suggestions in related works [49]. Frames of the speech signal, 5 ms long, were analyzed every 10 ms using a Hanning window function in the Praat speech processing software [6]. Because of the large number of features at the frame level, the statistical values of the features over a specified sentence were used for training and testing of this system. Therefore, the mean, standard deviation, maximum and minimum values of the pitch and energy were computed using Praat.
In addition, MFCCs were computed using Praat. MFCCs are a popular and powerful
analytical tool in the field of speech recognition. In this work, we took the first 12
coefficients as the useful features. The mean, standard deviation, maximum, and mini-
mum values of MFCC features were calculated, which produced a total number of 48
MFCC features.
Fig. 2 Facial markers placed on the subjects of the SAVEE database with different emotions (from left): KL (anger), JK (happiness), JE (sadness) and DC (neutral)
Formant frequencies are the properties of the vocal tract system. In this paper, the first
three formant frequencies and their bandwidths were calculated using Praat. The mean,
standard deviation, maximum, and minimum values of formant features were calculated,
which produced a total number of 24 formant features. In total, we extracted 80 features
from speech signal and used them for emotion recognition.
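To make the feature construction concrete, the following is a minimal sketch (not the authors' Praat script) of how frame-level pitch, energy, MFCC and formant tracks are turned into the 80 utterance-level statistics described above; the array names, shapes and random contents are placeholders assumed only for the example.

```python
import numpy as np

def utterance_stats(frames):
    """Mean, standard deviation, maximum and minimum over the time axis
    of a (n_frames, n_dims) array of frame-level features."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0),
                           frames.max(axis=0), frames.min(axis=0)])

# Hypothetical frame-level features for one utterance (e.g., exported from Praat):
pitch = np.random.rand(200, 1)       # F0 per frame
energy = np.random.rand(200, 1)      # intensity per frame
mfcc = np.random.rand(200, 12)       # first 12 MFCCs per frame
formants = np.random.rand(200, 6)    # F1-F3 and their bandwidths per frame

audio_features = np.concatenate([utterance_stats(pitch), utterance_stats(energy),
                                 utterance_stats(mfcc), utterance_stats(formants)])
print(audio_features.shape)          # (80,) = 4 statistics x (1 + 1 + 12 + 6) dimensions
```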
3.2.2 Visual features
Visual features were created by painting 60 frontal markers on the face of the actor. The
markers were painted on the forehead, eyebrows, below the eyes, cheeks, lips and jaw. After data
capture, the markers were manually labeled for the first frame of a sequence and tracked for
the remaining frames using a marker tracker. The tracked markers' x and y coordinates were normalized. Each marker's mean displacement from the bridge of the nose was subtracted.
Finally, 480 visual features were obtained from the 2D marker coordinates, which consisted of
mean, standard deviation, maximum, and minimum values of the adjusted marker coordinates.
In some previous works [7,20,21,27], facial markers on the face were used for facial expression recognition. Due to the excellent performance of these features in facial expression recognition, and because our focus is on other tasks (classification, optimization and fusion), we used them. To detect and extract facial points automatically in real-world applications, techniques such as active appearance models (AAM) have been used [12], and software such as Luxand FaceSDK [29] provides the coordinates of facial feature points and allows tracking and recognizing faces in live video.
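As a rough illustration of this step, the sketch below computes such utterance-level marker features, under the assumption that the tracked trajectories are available as a (frames x 60 markers x 2 coordinates) array and that index 0 is the nose-bridge marker; the exact normalization used when preparing SAVEE may differ.

```python
import numpy as np

def visual_features(markers, nose_idx=0):
    """markers: (n_frames, 60, 2) tracked x/y coordinates for one utterance."""
    # Express every marker relative to the bridge of the nose in each frame,
    # then remove each marker's own mean displacement over the utterance.
    rel = markers - markers[:, nose_idx:nose_idx + 1, :]
    rel = rel - rel.mean(axis=0, keepdims=True)
    coords = rel.reshape(rel.shape[0], -1)                  # (n_frames, 120)
    return np.concatenate([coords.mean(0), coords.std(0),   # 120 x 4 = 480 features
                           coords.max(0), coords.min(0)])

markers = np.random.rand(300, 60, 2)                        # hypothetical tracked sequence
print(visual_features(markers).shape)                       # (480,)
```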
3.3 Feature reduction
For dimension reduction and construction of a lower-size feature space, a statistical method
was used to maximize the relevant information preserved. This can be done by applying a
linear transformation, y = Tx, where y is a feature vector in the reduced feature space, x is the original feature vector, and T is the transformation matrix. PCA [45] is widely used to extract the essential characteristics from high-dimensional data sets and discard noise. PCA involves feature centering and whitening, covariance computation, and eigendecomposition. We applied PCA as the linear transformation technique for feature reduction.
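For illustration, a minimal sketch of this reduction step with scikit-learn's PCA follows; the matrix sizes (384 training utterances, 80 audio features reduced to 20 components) mirror the audio setup described later, and the random data is a placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(384, 80)               # hypothetical training matrix (utterances x features)
pca = PCA(n_components=20, whiten=True)   # centering + whitening + eigendecomposition
Y = pca.fit_transform(X)                  # applies y = Tx to every row
print(Y.shape)                            # (384, 20)
```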
3.4 Feature selection
In this study, the fast correlation-based filter (FCBF) [18] method was used for feature
selection. This method selects features that are individually informative and pairwise weakly dependent. The Mutual Information (MI) of two vectors X and Y, I(X,Y), measures their statistical dependency in the following way:
I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(X = x, Y = y) \log \frac{p(X = x, Y = y)}{p(X = x)\, p(Y = y)}    (1)

where p is the probability function. Obviously, I(X,Y) is equal to 0 when X and Y are independent (p(X = x, Y = y) = p(X = x) p(Y = y)), and increases as their dependency increases.
In the FCBF method, Y is the vector of data labels, and X_i is the vector of the ith feature values for all data. That is, when the number of features is N, there are N + 1 vectors. The FCBF selects features in two steps:
1. Removing the features X_i that are not dependent on the label vector Y, i.e., keeping only those with I(X_i, Y) > ε, where ε is a positive threshold between 0 and 1. In this way, the FCBF selects the features that are individually informative. In this work, ε was set to 0.01.
2. Removing a remaining feature X_i whose dependency on another remaining feature X_j, I(X_i, X_j), is greater than I(X_i, Y), while I(X_j, Y) ≥ I(X_i, Y).
In this way, the FCBF selects those individually informative features that are also pairwise weakly dependent.
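A minimal sketch of this two-step selection is shown below. It is not the reference FCBF implementation: mutual information is estimated on quantile-discretized features with scikit-learn's mutual_info_score, and the data is random placeholder input.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(x, bins=10):
    """Quantile-discretize a continuous feature so mutual information can be estimated."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(x, edges)

def fcbf_like(X, y, eps=0.01):
    """Two-step filter following the description above."""
    n = X.shape[1]
    Xd = np.column_stack([discretize(X[:, i]) for i in range(n)])
    relevance = np.array([mutual_info_score(Xd[:, i], y) for i in range(n)])
    # Step 1: keep features that are individually informative (I(Xi, Y) > eps).
    candidates = [i for i in range(n) if relevance[i] > eps]
    # Step 2: drop Xi if an already-kept, more relevant Xj is even more strongly related to it.
    selected = []
    for i in sorted(candidates, key=lambda i: -relevance[i]):
        redundant = any(mutual_info_score(Xd[:, i], Xd[:, j]) >= relevance[i] for j in selected)
        if not redundant:
            selected.append(i)
    return selected

X = np.random.rand(480, 50)                # hypothetical reduced feature matrix
y = np.random.randint(0, 7, size=480)      # seven emotion labels
print(fcbf_like(X, y, eps=0.01))
```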
3.5 Classification
In the present study, the FAMNN was used as the emotion classifier. The theoretical foundation of the fuzzy ARTMAP network was introduced by Carpenter et al. [9]. The network has a structural design for incremental supervised learning of recognition categories and multidimensional maps in response to arbitrarily ordered binary or analog input vectors. It achieves a synthesis of fuzzy logic and Adaptive Resonance Theory (ART) neural networks by exploiting a close formal resemblance between the computations of fuzzy set theory and the ART category choice, resonance, and learning.
Fig. 3 Structure of the FAMNN
The FAMNN has been successfully used in many tasks, e.g., remote sensing, data mining, and pattern recognition. The FAMNN is believed to be fast among the members of the ARTMAP family due to its inexpensive mapping between inputs and outputs.
The FAMNN has two fuzzy ART networks, ART_a and ART_b, interconnected via an inter-ART associative memory module (Fig. 3). The inter-ART module consists of a match-tracking, self-regulatory mechanism whose purpose is to minimize the network error and maximize the generalization.
The input patterns of ART_a and ART_b are represented by the vectors a = [a_1, ..., a_Ma] and b = [b_1, ..., b_Mb].
For ART_a, x^a denotes the F_1^a output vector, y^a denotes the F_2^a output vector, and w_j^a denotes the jth ART_a weight vector. Similarly, for ART_b, x^b denotes the F_1^b output vector, y^b denotes the F_2^b output vector, and w_k^b denotes the kth ART_b weight vector. For the map field, x^ab denotes the F^ab output vector, and w_j^ab denotes the weight vector from the jth F_2^a node to F^ab.
After resonance is confirmed in the networks, J is the active category for the ART_a network, and K is the active category for the ART_b network. The next step is match tracking, to verify whether the active category on ART_a corresponds to the desired output vector presented to ART_b. The vigilance criterion is given by [8]:
\frac{|\mathbf{y}^b \wedge \mathbf{w}_J^{ab}|}{|\mathbf{y}^b|} \geq \rho_{ab}    (2)

Once the vigilance criterion is satisfied and the resonance state is reached, the weight vector is updated according to the following equation [8]:

\mathbf{w}_J^{new} = \beta\,(\mathbf{I} \wedge \mathbf{w}_J^{old}) + (1 - \beta)\,\mathbf{w}_J^{old}    (3)
The performance of the FAMNN is affected by three network parameters:
1- The choice parameter α (α > 0), which acts on the category selection.
2- The baseline vigilance parameter ρ (ρ_a, ρ_b and ρ_ab), with ρ_a ∈ [0,1], which controls the network resonance. The vigilance parameter is responsible for the number of formed categories.
3- The learning rate β ∈ [0,1], which controls the velocity of network adaptation.
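To make the role of these parameters concrete, the following is a minimal sketch of the fuzzy ART dynamics inside each module: complement coding, the choice function governed by α, the vigilance test governed by ρ, and the learning rule of Eq. (3) governed by β. It omits the map field and match tracking of the full FAMNN, and the example values mirror Table 1.

```python
import numpy as np

def complement_code(a):
    """Fuzzy ART input coding: I = [a, 1 - a] for a feature vector scaled to [0, 1]."""
    return np.concatenate([a, 1.0 - a])

def fuzzy_art_step(I, W, alpha=0.001, rho=0.99, beta=1.0):
    """One category-choice / vigilance / learning step of a fuzzy ART module.

    I : complement-coded input, shape (2M,)
    W : list of committed category weight vectors (same shape as I)
    Returns the index of the chosen category and the updated weight list."""
    if not W:                                   # no committed category yet
        W.append(I.copy())
        return 0, W
    T = [np.minimum(I, w).sum() / (alpha + w.sum()) for w in W]    # choice function
    for j in np.argsort(T)[::-1]:                                  # try categories by choice value
        if np.minimum(I, W[j]).sum() / I.sum() >= rho:             # vigilance test
            W[j] = beta * np.minimum(I, W[j]) + (1 - beta) * W[j]  # learning rule, Eq. (3)
            return j, W
    W.append(I.copy())                                             # otherwise commit a new category
    return len(W) - 1, W

W = []
for _ in range(5):
    I = complement_code(np.random.rand(10))
    j, W = fuzzy_art_step(I, W, alpha=1.0, rho=0.99, beta=1.0)     # values from Table 1
print("committed categories:", len(W))
```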
Table 1 shows the specifications of the simulated FAMNN in this work.

Table 1 Specifications of the FAMNN in the base experiments

Specification                Value
Learning rate (β)            1
Vigilance parameter (ρ)      0.99
Choice parameter (α)         1
Number of classes            7
Number of training samples   384
Number of test samples       96
3.6 Optimization
As mentioned before, the optimum values for FAMNN parameters were determined by PSO.
The PSO algorithm was first proposed by Kennedy and Eberhart in 1995 [25]. This
algorithm is an evolutionary technique that was inspired by the social behavior of bird flocking or fish schooling, and simulates the behavior of the particles in a swarm. Figure 4 shows examples of these patterns in nature. The PSO algorithm provides a population-based search procedure in which individuals, called particles, change their position (state) over time. In a PSO system, the particles fly around a multi-dimensional search space. During the flight, each particle adjusts its position according to its own experience and that of neighboring particles, making use of the best positions encountered by itself and its neighbors. In this algorithm, each particle has a velocity and a position that are updated as follows [25]:
v_i(k+1) = v_i(k) + c_1 r_1 (P_i - x_i(k)) + c_2 r_2 (G - x_i(k))    (4)

x_i(k+1) = x_i(k) + v_i(k+1)    (5)

where i is the particle index, k is the discrete time index, v_i is the velocity of the ith particle, x_i is the position of the ith particle, P_i is the best position found by the ith particle (personal best), G is the best position found by the swarm (global best), c_1 and c_2 are two positive constants called the cognitive and social parameters (c_1 = c_2 = 2), and r_1 and r_2 are random numbers in the interval [0,1] applied to the ith particle.
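A compact sketch of this update scheme follows, using the settings that appear later in Table 4 (30 particles, c1 = c2 = 2, 50 iterations, maximum velocity 4). The fitness minimized here is only a stand-in quadratic so that the example runs on its own; in the paper, the fitness is 1/(correct-classification rate) of a FAMNN trained with the candidate parameters.

```python
import numpy as np

def pso(fitness, bounds, n_particles=30, iters=50, c1=2.0, c2=2.0, vmax=4.0, seed=0):
    """Minimize `fitness` over box `bounds` with the updates of Eqs. (4)-(5)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    x = rng.uniform(lo, hi, size=(n_particles, len(bounds)))     # positions
    v = np.zeros_like(x)                                         # velocities
    p, p_val = x.copy(), np.array([fitness(xi) for xi in x])     # personal bests
    g = p[p_val.argmin()].copy()                                 # global best
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = np.clip(v + c1 * r1 * (p - x) + c2 * r2 * (g - x), -vmax, vmax)   # Eq. (4)
        x = np.clip(x + v, lo, hi)                                            # Eq. (5)
        vals = np.array([fitness(xi) for xi in x])
        better = vals < p_val
        p[better], p_val[better] = x[better], vals[better]
        g = p[p_val.argmin()].copy()
    return g, p_val.min()

# Stand-in fitness: distance to an arbitrary target in (alpha, beta, rho_a, rho_ab) space.
target = np.array([0.4, 0.8, 0.9, 0.9])
best, best_val = pso(lambda z: ((z - target) ** 2).sum(),
                     bounds=[(0.001, 1), (0, 1), (0, 1), (0, 1)])
print(best, best_val)
```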
The PSO algorithm is similar to evolutionary computation (EC) techniques such as the Genetic Algorithm (GA). These techniques are population-based stochastic optimization methods and utilize a fitness function to evaluate the population. They all update the population and search for the optimum with random techniques. Unlike EC and GA techniques, PSO does not have genetic operators such as crossover and mutation; particles update themselves with their internal velocity. Also, the information sharing mechanism in PSO is significantly different from that of other EC algorithms. In EC approaches, chromosomes share information with each other; thus, the whole population moves like one group towards an optimal area. In PSO, however, only the "best" particle gives out information to the others. PSO is very effective in solving real-valued global optimization problems, which makes it suitable for large-scale studies. Figure 5 shows the update by PSO of a particle's position from x(k) to x(k+1).

Fig. 4 PSO update of a particle's position x(k) to x(k+1) in a 2-dimensional space
4 Experimental results
The audio-visual emotion recognition system was tested over the SAVEE audio-visual emo-
tional database. All the experiments were person-independent. We used roughly 80 % of the
data to train the classifiers and the remaining 20 % to test them. The emotion recognition was
conducted through unimodal audio, unimodal visual, decision-level fusion, and feature-level fusion (before feature reduction, after feature reduction and after feature selection). The results are presented in Fig. 6.

Fig. 5 Two PSO patterns in nature

Fig. 6 Emotion recognition accuracy of the different systems. Each group of adjacent columns denotes the classification accuracy of a single class. The first group contains the average recognition rate. The vertical axis is the recognition accuracy in percentage: Audio, Visual, FL (feature-level fusion), FL-FR (feature-level fusion after feature reduction), FL-FS (feature-level fusion after feature selection), DL (decision-level fusion). Class labels were abbreviated by their first three letters
4.1 Audio experiments
In these experiments, 80 audio features were applied to PCA for feature reduction; 20 reduced
features were applied to FCBF feature selection in the next stage, and 12 features were selected.
The classification experiments were performed for seven emotional states using FAMNN. Figure 1
illustrates this setup using FAMNN 1. The overall performance of this classifier was 53 %.
To demonstrate the good performance of our audio recognition system, we also evaluated it on the eNterface05 database [32]. The overall performance of this system was 63.1 %. This result was better than our previous work (55 %) [5], which shows the good performance of our method for audio emotion recognition.
4.2 Visual experiments
In these experiments, 480 facial features were applied to PCA for feature reduction; 30 reduced
features were applied to FCBF feature selection in the next stage, and 6 features were selected.
The classification experiments were performed for seven emotional states using FAMNN.
Figure 1 shows this setup using FAMNN 2. The overall performance of this classifier was 93.75 %. The recognition accuracy in some states (e.g., happiness, neutral and sadness) was 100 %. Unfortunately, the SAVEE database is the only freely available public database that uses facial markers, so we could not evaluate the performance of our visual system on other databases.
4.3 Audio-visual experiments
The overall results of the unimodal systems showed that, for accurate and reliable recognition of emotion classes, the modalities should be combined in a way that benefits from the interrelationships between the individual classes and the underlying modalities. In the following paragraphs, we present and compare different combination schemes.
schemes. Two main fusion approaches used in the literature are feature level fusion
and decision level fusion.
4.3.1 Decision level fusion
In this experiment, we used the stacked generalization method for decision-level fusion. The output of the audio and visual ensembles serves as a feature vector to another FAMNN. The overall performance of this method was 95 %. As shown in Fig. 7, the output of FAMNN 1 and FAMNN 2 serves as a feature vector to FAMNN 6.

Fig. 7 Block diagram of the decision-level fusion
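The sketch below illustrates this stacking scheme. The FAMNNs are replaced by scikit-learn MLP classifiers purely as runnable stand-ins, the feature matrices are random placeholders with the selected-feature dimensionalities reported in Sections 4.1 and 4.2, and, for brevity, the meta-classifier is trained on the base classifiers' outputs for the same training split (cross-validated predictions would be preferable in practice).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_audio, X_visual = rng.random((480, 12)), rng.random((480, 6))   # selected features (placeholders)
y = rng.integers(0, 7, size=480)                                  # seven emotion labels
train, test = np.arange(384), np.arange(384, 480)

clf_a = MLPClassifier(max_iter=500).fit(X_audio[train], y[train])    # stands in for FAMNN 1
clf_v = MLPClassifier(max_iter=500).fit(X_visual[train], y[train])   # stands in for FAMNN 2

# Stacked generalization: the class posteriors of the unimodal classifiers become
# the feature vector of the meta-classifier (the role of FAMNN 6).
meta_train = np.hstack([clf_a.predict_proba(X_audio[train]), clf_v.predict_proba(X_visual[train])])
meta_test = np.hstack([clf_a.predict_proba(X_audio[test]), clf_v.predict_proba(X_visual[test])])
meta = MLPClassifier(max_iter=500).fit(meta_train, y[train])
print("decision-level fusion accuracy:", meta.score(meta_test, y[test]))
```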
4.3.2 Feature level fusion
In this experiment, all audio and visual features were combined to get a total of 560 audio-visual features. These features were then applied to PCA for feature reduction; 67 reduced features were applied to FCBF feature selection in the next stage, and 6 features were selected. The classification experiments were performed for the seven emotions with FAMNN. The overall performance of this emotion recognition system, based on the audio-visual information with the feature-level fusion classifier, was 96.88 %. As shown in Fig. 8, this classification is performed by FAMNN 3.
4.3.3 Fusion after feature reduction
The 20 reduced audio and 30 reduced visual features were combined together, and then FCBF was applied to the 50 reduced audio and visual features. In the next stage, the 10 selected features were applied to the FAMNN classifier. The overall performance of this emotion recognition system, based on the audio-visual data at feature-level fusion after feature reduction, was 97.92 %. Figure 9 shows this classifier (FAMNN 4). Table 2 shows the confusion matrix of the emotion recognition system based on the audio-visual data at feature-level fusion after feature reduction. The recognition accuracy in some states (e.g., anger, disgust, fear, neutral and sadness) was 100 %. Some emotions, however, are still confused: happiness is misclassified as surprise by about 11.11 %, and sadness is misclassified as fear by about 9.09 %.
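For illustration, the sketch below wires this pipeline together end to end: per-modality PCA, concatenation of the reduced features, a mutual-information filter standing in for FCBF, and an MLP standing in for the FAMNN. All data are random placeholders; only the dimensionalities match the counts quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X_audio, X_visual = rng.random((480, 80)), rng.random((480, 480))   # raw feature placeholders
y = rng.integers(0, 7, size=480)
train, test = np.arange(384), np.arange(384, 480)

pca_a = PCA(n_components=20).fit(X_audio[train])                    # per-modality reduction
pca_v = PCA(n_components=30).fit(X_visual[train])
Z_train = np.hstack([pca_a.transform(X_audio[train]), pca_v.transform(X_visual[train])])
Z_test = np.hstack([pca_a.transform(X_audio[test]), pca_v.transform(X_visual[test])])

selector = SelectKBest(mutual_info_classif, k=10).fit(Z_train, y[train])      # 10 fused features
clf = MLPClassifier(max_iter=500).fit(selector.transform(Z_train), y[train])  # role of FAMNN 4
print("fusion after feature reduction accuracy:",
      clf.score(selector.transform(Z_test), y[test]))
```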
4.3.4 Fusion after feature selection
The 12 selected audio features and the 6 selected visual features were grouped together and applied to the FAMNN. The overall performance of this emotion recognition system, based on the audio-visual data at feature-level fusion after feature selection, was 85.72 %. This classifier (FAMNN 5) is shown in Fig. 10.
Fig. 8 Block diagram of the feature level fusion
Fig. 9 Block diagram of the fusion after feature reduction
Table 3 shows the emotion recognition results for the unimodal systems and the different combining methods.
Combining audio and visual information in different ways enhances the performance of the unimodal systems. The results showed that feature-level fusion after feature reduction performs best, with a mean accuracy of 97.92 %. Accordingly, this method improves the recognition rate by up to 45 % over the audio-based system, and by up to 4 % over the visual-based system.
5 Influence of FAMNN parameter optimization on emotion recognition accuracy
As mentioned before, PSO was used in this study to determine the optimum values for
FAMNN parameters.
The operation of the FAMNN is affected by three network parameters: the choice parameter α, the baseline vigilance parameter ρ (ρ_a, ρ_b and ρ_ab), and the learning rate β, which has a value between 0 and 1. The choice parameter takes values in the interval (0,1), while the baseline vigilance parameter assumes values in the interval [0,1].
In this study, the optimum values of the mentioned FAMNN parameters, which correspond to the minimum of the fitness function, were determined by PSO. The fitness function in the FAMNN simulation was obtained as:

F = \frac{1}{p_c}

where p_c is the percentage of correct classification. The parameter settings of the PSO algorithm are listed in Table 4.
Table 2 Confusion matrix of the emotion recognition system based on fusion after feature reduction

          ANG    DIS    FEA    HAP     Neutral   SAD     SUR
ANG       100    0      0      0       0         0       0
DIS       0      100    0      0       0         0       0
FEA       0      0      100    0       0         0       0
HAP       0      0      0      88.88   0         0       11.11
Neutral   0      0      0      0       100       0       0
SAD       0      0      9.09   0       0         90.90   0
SUR       0      0      0      0       0         0       100
Fig. 10 Block diagram of the fusion after feature selection
The optimized parameters of the FAMNN for the audio, visual, audio-visual feature-level fusion after feature reduction (the best result of our experiments), and decision-level fusion systems are reported in Table 5. The accuracies of emotion recognition when using the optimized FAMNN parameters are reported in Table 5 for these modalities. Table 5 also reports the accuracies of these modalities when the parameters of the FAMNN are set by the user to the typical values α = 1, β = 1 and ρ_a = ρ_b = ρ_ab = 0.99.
As can be seen in Table 5, by using the optimized audio FAMNN parameters, the average
audio emotion recognition accuracy improves by at least 10.5 %. The average audio emotion
recognition is 63 %. Similar to the audio system, the average visual emotion recognition
accuracy improves by at least 2 %, and the average visual emotion recognition is 95.83 %.
The best result of our experiments is for the audio-visual feature level fusion after feature
reduction. By using the optimized FAMNN parameters for this system, the average emotion
recognition accuracy improves by at least 0.33 %. So, the best result in this work is 98.25 %.
In previous work [19], we optimized an audio emotion recognition FAMNN with GA. To compare the performance of the different optimization methods, the specifications of the GA-optimized FAMNN are reported in Table 6. Experimental results show that the two algorithms give almost the same results, but PSO yields slightly better results and is also faster.
The emotion recognition performance reported for multimodal systems in other works, as well as human evaluation, may be helpful for analyzing the performance of the proposed work.
Table 3 Recognition rate of emotional states for the various proposed systems

           Audio   Visual   FL      FL-FR   FL-FS   DL
ANG        55      80       100     100     88.88   91.9
DIS        49      93.75    87.5    100     90.9    93.75
FEA        47.7    90       100     100     90.9    92
HAP        48.2    100      100     88.88   75      91.9
Neutral    57      100      100     100     92.30   100
SAD        57      100      90.9    90.90   69.23   100
SUR        57      92.30    100     100     90      94.75
AVG        53      93.75    96.88   97.92   85.72   95

Table 4 Specifications of the PSO algorithm for optimizing FAMNN

Specification              Value
Population size            30
C1                         2
C2                         2
Max number of iterations   50
Max particle velocity      4

In the SAVEE database, each actor's data were evaluated by 10 subjects at the utterance level in three ways: audio, visual, and audio-visual. All of the evaluators were students at the University of Surrey, UK, with ages ranging from 21 to 29 years. To avoid gender bias, half of the evaluators were female. Also, 5 of them were native, and the rest had lived in the UK for more than
than a year. The 120 clips from each actor were divided into 10 groups, resulting 12 clips per
group. For each evaluator, a different data set was created, which resulted in 10 different sets
for each of the audio, visual and audio-visual data per actor. The subjects were trained by using
slides containing three facial expression pictures, two audio files, and a short movie clip for
each of the emotions. The subjects were asked to play audio, visual and audio-visual clips, and
select from one of the seven emotions on a paper sheet. The responses were averaged over 10
subjects for each actor. The average human classification accuracy is shown in Table 7.The
mean was averaged over 4 subjects.
The performance of our work was lower than human evaluation for the audio data, but it showed higher classification performance than human evaluation for the visual and audio-visual information. There are a few possible reasons for this. First, there is a difference in training data: the machine was trained on a large part of the data, whereas the humans were trained on a small amount of data; in addition, the task was discrete emotion classification, and the emotions may not have been properly acted. The quality of the human evaluators may also play a role, as they might not have been very good at the task; however, this situation is typical in the literature [10,21,23,35]. The best overall result of our visual system was 95.83 % (88 % for humans), and with the audio-visual system it was 98.25 % (91.8 % for humans). The comparison of human perception and this work is shown in Fig. 11. Table 8 shows the performance of the proposed system and other multimodal emotion recognition systems.
Table 5 Specifications of the optimized FAMNN

                α       β      ρ_a    ρ_ab   Non-optimized FAMNN CR   Optimized FAMNN CR
Audio           0.37    0.77   0.80   0.87   53                       63.5
Visual          0.44    0.99   0.97   0.91   93.75                    95.83
Audio-visual    0.001   1      0.99   0.95   97.92                    98.25
DL              0.46    0.79   0.99   0.96   95                       97.37

Table 6 Specifications of the GA-optimized FAMNN

                α       β      ρ_a    ρ_ab   Non-optimized FAMNN CR   Optimized FAMNN CR
Audio           0.33    0.7    0.90   0.84   53                       60
Visual          0.31    0.94   0.59   0.53   93.75                    95.83
Audio-visual    0.42    0.9    0.95   0.39   97.92                    97.92
DL              0.02    0.99   0.76   0.47   95                       97.37

6 Conclusion

The basics of most existing research on emotion recognition can be summarized in three stages: feature extraction, feature selection and emotion recognition. A number of promising methods for audio, visual, and audio-visual feature extraction and feature selection have already been proposed, so feature extraction was not the goal of this paper. However, a good set of audio features was used, and by using facial point markers on the face, visual emotion recognition was carried out appropriately. By using the FCBF feature selection method,
the efficient features were determined. Developing better methods and classifiers for emotion recognition, and fusing different systems, are two of the most important issues that need sufficient attention. We focused on the fusion of powerful classifiers in order to improve the emotion recognition rate. Different classifiers such as ANNs, SVMs, KNN, and GMMs have been used for emotion recognition. Here, we used a PSO-optimized FAMNN as a powerful classifier. Also, different fusions of the audio and visual systems were tested. Experimental results confirm the excellent performance of our classifier and of the optimization by PSO. They also show that the performances of the audio and visual systems were improved by using the different fusions.
This paper proposed a particle swarm optimization-based FAMNN for audio-visual emotion recognition. The FAMNN combines audio and visual information at the feature and decision levels using the stacked generalization approach. For this purpose, we employed audio features such as MFCC, pitch, energy and formants, as well as marker locations on the face as visual features.
Experimental results showed that the performances of the unimodal systems were improved by using the feature-level and decision-level fusions and the PSO-optimized FAMNN. The PSO algorithm was employed to determine the optimum values of the choice parameter (α), the vigilance parameters (ρ) and the learning rate (β) of the FAMNN. As a result, the recognition rate was improved by about 45.25 % with respect to the non-optimized audio unimodal system, and by about 5 % with respect to the visual system. In this study, we focused on a well-designed multi-modal fusion approach. The final emotion recognition rate on the SAVEE database reached 98.25 % using audio and visual features with the optimized FAMNN.
Table 7 Average human classification accuracy

               KL     JE     JK     DC     Mean
Audio          53.2   67.7   71.2   73.7   66.5
Visual         89     89.8   88.6   84.7   88
Audio-visual   92.1   92.1   91.3   91.7   91.8

The mean is averaged over the data of the 4 subjects (KL, JE, JK, DC).
Fig. 11 Comparison of human perception and this work (recognition accuracy in percent for the audio, visual and audio-visual data)
Table 8 Performance of typical systems for multimodal emotion recognition in the recent decade

Reference                 | Classifier            | Features                                             | Fusion | Database                              | Acc (unimodal)              | Acc (fusion)
Paleari and Huet [38]     | SVM, NN               | Facial points, Formants, Prosody, MFCC, LPC          | F,D    | eNterface05                           | Audio: 25, Video: 33        | 39
Busso et al. [7]          | SVM                   | 102 markers, Prosody                                 | F,D    | an actress                            | Audio: 70.9, Video: 85.1    | 89
Mansoorizadeh et al. [31] | SVM                   | Facial points, Prosody                               | F,D,H  | eNterface05                           | Audio: 33, Video: 37        | 71
De Silva and Pei Chi [14] | HMM, Nearest neighbor | Facial points, Pitch                                 | D      | 2 subjects, 144 clips                 | Audio: 62, Video: 32        | 72
Cheng-Yao et al. [11]     | SVM                   | Facial points, Prosody                               | F      | 2 subjects, 350 clips                 | Audio: 63, Video: 75        | 84
Schuller et al. [43]      | SVM                   | Face model, Formants, Prosody, MFCC                  | F      | ABC database                          | Audio: 74, Video: 61        | 81
Zeng et al. [48]          | SNoW and HMM          | Facial points, Prosody                               | D      | 20 subjects, eleven affect categories | Audio: 58, Video: 44        | 95
Bejani et al. [5]         | Multi-classifier      | ITMI, QIM, MFCC, Prosody, Formants                   | F,D,H  | eNterface05                           | Audio: 54.99, Video: 39.27  | 77.78
Haq et al. [21]           | Gaussian (PCA)        | Facial points, Prosody, Formants, Duration, Spectral | F,D    | SAVEE                                 | Audio: 50, Video: 91        | 92.9
Haq and Jackson [21]      | Gaussian (LDA)        | Facial points, Prosody, Formants, Duration, Spectral | F,D    | SAVEE                                 | Audio: 56, Video: 95.4      | 97.5
Banda et al. [4]          | SVM                   | LBP, Prosody, Energy                                 | D      | SAVEE                                 | Audio: 79, Video: 95        | 98
Haq et al. [20]           | SVM                   | Facial points, 830 audio features                    | D      | SAVEE                                 | Audio: 58, Video: 66        | 83
This work                 | Optimized FAMNN       | Facial points, Prosody, Formants, Duration, Spectral | F,D    | SAVEE                                 | Audio: 63, Video: 95.8      | 98.25
Future work will investigate new combined classification methods, such as mixture of experts, or new ways to optimize the FAMNN, such as the Cuckoo [42], Grey Wolf [34] and Imperialist Competitive Algorithm (ICA) [3] optimizers.

Acknowledgments This work was supported by Islamic Azad University-South Tehran Branch under a research project entitled "Audio-Visual Emotion Modeling to Improve Human-Computer Interaction".
References
1. 3DMD 4D Capture System. Online: http://www.3dmd.com, accessed on 3 May 2009
2. Ambady N, Rosenthal R (1992) Thin slices of expressive behavior as predictors of interpersonal consequences: a meta-analysis. Psychol Bull 111(2):256-274
3. Atashpaz-Gargari E, Lucas C (2007) Imperialist competitive algorithm: an algorithm for optimization inspired by imperialistic competition. IEEE Congress on Evolutionary Computation, pp 4661-4667
4. Banda N, Robinson P (2011) Noise analysis in audio-visual emotion recognition. International Conference on Multimodal Interaction, Alicante, Spain
5. Bejani M, Gharavian D, Moghaddam Charkari N (2012) Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks. Neural Comput & Applic 24(2):399-412
6. Boersma P, Weenink D (2007) Praat: doing phonetics by computer (version 4.6.12) [computer program]
7. Busso C et al (2004) Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proceedings of the Sixth ACM International Conference on Multimodal Interfaces (ICMI '04), pp 205-211
8. Carpenter GA (2003) Default ARTMAP. In: Proceedings of the International Joint Conference on Neural Networks, Portland, Oregon, USA, vol 2, pp 1396-1401
9. Carpenter GA, Grossberg S, Markuzon N, Reynolds JH, Rosen DB (1992) Fuzzy ARTMAP: a neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Trans Neural Netw 3:698-713
10. Chen C, Huang Y, Cook P (2005) Visual/acoustic emotion recognition. ICME 2005:1468-1471
11. Cheng-Yao C, Yue-Kai H, Cook P (2005) Visual/acoustic emotion recognition, pp 1468-1471
12. Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans Pattern Anal Mach Intell 23(6):681-685
13. Dai W, Han D, Dai Y, Xu D (2015) Emotion recognition and affective computing on vocal social media. Inf Manag. doi:10.1016/j.im.2015.02.003
14. De Silva LC, Pei Chi N (2000) Bimodal emotion recognition. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, vol 1, pp 332-335
15. Devillers L, Vidrascu L (2006) Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs. In: The proceedings of Interspeech, pp 801-804
16. Ekman P (1971) Universals and cultural differences in facial expressions of emotion. Proc Nebr Symp Motiv 19:207-283
17. Ekman P, Rosenberg EL (2005) What the face reveals: basic and applied studies of spontaneous expression using the facial action coding system, 2nd edn. Oxford Univ Press
18. Fleuret F (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 5:1531-1555
19. Gharavian D, Sheikhan M, Nazerieh AR, Garoucy S (2011) Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network. Neural Comput & Applic. doi:10.1007/s00521-011-0643-1
20. Haq S, Asif M, Ali A, Jan T, Ahmad N, Khan Y (2015) Audio-visual emotion classification using filter and wrapper feature selection approaches. Sindh Univ Res J (Sci Ser) 47(1):67-72
21. Haq S, Jackson PJB (2009) Speaker-dependent audio-visual emotion recognition. In: Proc. Int'l Conf. on Auditory-Visual Speech Processing, pp 53-58
22. Harley JM et al (2015) A multi-componential analysis of emotions during complex learning with an intelligent multi-agent system. Comput Hum Behav 48:615-625. doi:10.1016/j.chb.2015.02.013
23. Hassan A, Damper R (2010) Multi-class and hierarchical SVMs for emotion recognition. ISCA, INTERSPEECH, pp 2354-2357
24. Hoch S, Althoff F, McGlaun G, Rigoll G (2005) Bimodal fusion of emotional data in an automotive environment. In: The Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol 2, pp 1085-1088
25. Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, vol 4, pp 1942-1948
26. Klein J, Moon Y, Picard RW (2002) This computer responds to user frustration: theory, design and results. Interact Comput 14:119-140
27. Lee C-C, Mower E, Busso C, Lee S, Narayanan S (2009) Emotion recognition using a hierarchical binary decision tree approach. In: The proceedings of Interspeech, pp 320-323
28. López-de-Ipiña K, Alonso-Hernández JB et al (2015) Feature selection for automatic analysis of emotional response based on nonlinear speech modeling suitable for diagnosis of Alzheimer's disease. Neurocomputing 150:392-401. doi:10.1016/j.neucom.2014.05.083
29. Luxand FaceSDK 5.0.1 Face Detection and Recognition Library. Online: https://www.luxand.com/facesdk/index.php
30. Mansoorizadeh M, Moghaddam Charkari N (2009) Hybrid feature and decision level fusion of face and speech information for bimodal emotion recognition. Proceedings of the 14th International CSI Computer Conference
31. Mansoorizadeh M, Moghaddam Charkari N (2009) Multimodal information fusion application to human emotion recognition from face and speech. Multimed Tools Appl
32. Martin O, Kotsia I, Macq B, Pitas I (2006) The eNTERFACE'05 audio-visual emotion database. In: Proc. 22nd Int'l Conf. on Data Engineering Workshops (ICDEW '06)
33. Mehrabian A (1968) Communication without words. In: Psychology Today, vol 2, pp 53-56
34. Mirjalili SA, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46-61 (http://www.mathworks.com/matlabcentral/fileexchange/44974-grey-wolf-optimizer-gwo-)
35. Morrison D, Wang R, Silva D (2007) Ensemble methods for spoken emotion recognition in call-centres. Speech Comm 49(2):98-112
36. Oudeyer P-Y (2003) The production and recognition of emotions in speech: features and algorithms. Int J Hum Comput Interact Stud 59:157-183
37. Paleari M, Benmokhtar R, Huet B (2008) Evidence theory-based multimodal emotion recognition. In: MMM '09, pp 435-446
38. Paleari M, Huet B (2008) Toward emotion indexing of multimedia excerpts. In: CBMI
39. Pantic M, Rothkrantz LJM (2000) Automatic analysis of facial expressions: the state of the art. IEEE Trans Pattern Anal Mach Intell 22:1424-1445
40. Picard RW (1997) Affective computing. MIT Press
41. Polzehl T, Sundaram S, Ketabdar H, Wagner M, Metze F (2009) Emotion classification in children's speech using fusion of acoustic and linguistic features. In: The proceedings of Interspeech, pp 340-343
42. Rajabioun R (2011) Cuckoo optimization algorithm. Appl Soft Comput 11:5508-5518 (http://www.mathworks.com/matlabcentral/fileexchange/35635-cuckoo-optimization-algorithm)
43. Schuller B, Arsic D, Rigoll G, Wimmer M, Radig B (2007) Audiovisual behavior modeling by combined feature spaces. In: ICASSP, pp 733-736
44. Sheikhan M, Bejani M, Gharavian D (2012) Modular neural-SVM scheme for speech emotion recognition using ANOVA feature selection method. Neural Comput Appl J. doi:10.1007/s00521-012-0814-8
45. Shlens J (2005) A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla
46. Song M, You M, Li N, Chen C (2008) A robust multimodal approach for emotion recognition. Neurocomputing
47. Weisgerber A, Vermeulen N et al (2015) Facial, vocal and musical emotion recognition is altered in paranoid schizophrenic patients. Psychiatry Res. doi:10.1016/j.psychres.2015.07.042
48. Zeng Z, Hu Y, Roisman GI, Wen Z, Fu Y, Huang TS (2007) Audio-visual spontaneous emotion recognition. Artif Intell Hum Comput 4451:72-90
49. Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: audio, visual, and spontaneous expressions. PAMI 31:39-58
Davood Gharavian was born in Neyshabour, Iran, in 1973. He received the B.S. degree in electronic
engineering from Amirkabir University, Tehran, Iran, in 1995 and M.S. in communication engineering from
Tarbiat Modares University, Tehran, Iran, in 1998, and the Ph.D. degree in electronic engineering from Amirkabir University, Tehran, Iran, in 2004. He is currently an Assistant Professor in the Electrical Engineering Department of Shahid Beheshti University. His research interests include digital signal processing, speech and image processing, digital signal processors, industrial networks and smart grids.
Dr. Gharavian has published more than 21 journal papers and more than 17 conference papers. He is the
author of two books in the fields of communication systems and power line carrier.
Mehdi Bejani received the B.S. degree in Electrical Engineering from Shahid Rajaee University, Tehran, Iran (2007), and the M.S. degree in Electrical Engineering from Islamic Azad University, South Tehran Branch, in 2010. His research interests include human-computer interaction, multimodal affective computing, machine learning, and image and speech processing.
Mansour Sheikhan was born in Tehran, Iran, in 1966. He received the B.S. degree in electronic engineering from
Ferdowsi University, Meshed, Iran, in 1988 and M.S. and Ph.D. degrees in communication engineering from
Islamic Azad University, Tehran, Iran, in 1991 and 1997, respectively. He is currently an Associate Professor in
Electrical Engineering Department of Islamic Azad University-South Tehran Branch. His research interests
include security in communication networks, intelligent systems, signal processing, and neural networks.
Dr. Sheikhan has published more than 70 journal papers and more than 60 conference papers. He is the author of two books in the fields of optical signal processing and communication systems and has been selected as the outstanding researcher of IAU in 2003, 2008, and 2010-2013.
... Haq et al. (2015) has proposed feature selection algorithm for SAVEE database on audio features in the speaker-dependent scenario. Further, Gharavian et al. (2017) had used Fast Correlation-Based Filter algorithm for feature selection on spectral features like MFCC, Formants, and corresponding statistical features with fuzzy ARTMAP neural networks on SAVEE database. Significant improvement in AER was observed in Bozkurt et al. (2011) when weighted MFCC features were joint with spectral in addition to prosody features. ...
... Most of the works with which the present work is compared as mentioned above belong to the low-level handcrafted features category. Noroozi et al. (2017), Haq et al. (2015) and Gharavian et al. (2017) have considered low level features like MFCC, Prosody, pitch, energy and formants. These low-level features are insufficient in distinguishing the different classes of emotions from speech and thus shows less accuracy of lower than 60%. ...
Article
Emotions are explicit and serious mental activities, which find expression in speech, body gestures and facial features, etc. Speech is a fast, effective and the most convenient mode of human communication. Hence, speech has become the most researched modality in Automatic Emotion Recognition (AER). To extract the most discriminative and robust features from speech for Automatic Emotion Recognition (AER) recognition has yet remained a challenge. This paper, proposes a new algorithm named Shifted Linear Discriminant Analysis (S-LDA) to extract modified features from static low-level features like Mel-Frequency Cepstral Coefficients (MFCC) and Pitch. Further 1-D Convolution Neural Network (CNN) was applied to these modified features for extracting high-level features for AER. The performance evaluation of classification task for the proposed techniques has been carried out on the three standard databases: Berlin EMO-DB emotional speech database, Surrey Audio-Visual Expressed Emotion (SAVEE) database and eNTERFACE database. The proposed technique has shown to outperform the results obtained using state of the art techniques. The results shows that the best accuracy obtained for AER using the eNTERFACE database is 86.41%, on the Berlin database is 99.59% and with SAVEE database is 99.57%.
... Therefore, combining these channels can enhance the amount of information available to determine the emotional state. As a result, a variety of strategies for merging distinct modalities with varying degrees of fusion have been devised [Gharavian, Bejani and Sheikhan, 2016]. Several survey articles on emotion recognition tasks from single or several modalities have been published. ...
... Several survey articles on emotion recognition tasks from single or several modalities have been published. The authors of [Gharavian, Bejani and Sheikhan, 2016] detailed the emotion databases, pre-processing procedures, different feature extraction techniques, feature selection approaches, and classifiers utilized in voice emotion identification as a single modality. ...
Article
Full-text available
Humans are emotional beings. When we express emotions, we frequently use several modalities, whether overtly (e.g., speech, facial expressions) or implicitly (e.g., body language, text). Emotion recognition has lately piqued the interest of many researchers, and various techniques have been studied. A review of emotion recognition is given in this article. The survey covers single and multiple sources of data or information channels that may be utilized to identify emotions, and includes a literature analysis of current studies published for each information channel, as well as the techniques employed and the findings obtained. Finally, some of the present emotion recognition problems and recommendations for future work are mentioned.
... All this is essential to recognize emotions robustly across datasets of different languages, different sizes, and different emotion ranges. Along with the global optimum solutions proposed in the literature [9] [10] [11], there exist optimization solutions with a specific focus on feature optimization in SER [12], [13]. Even though these methods improved the accuracy of SER systems, there is still a great margin for improvement. ...
... In one of the studies [15], Yogesh et al. presented a new particle swarm-based optimization for feature selection in SER. To obtain the optimum values of features for speech emotions, an optimization algorithm was used in [11]. In another study [16], deep convolutional neural networks were used to extract spectrogram features. ...
Article
Full-text available
Speech Emotion Recognition (SER) is a popular topic in academia and industry. Feature engineering plays a pivotal role in building an efficient SER system. Although researchers have done a tremendous amount of work in this field, the issues of speech feature choice and the correct application of feature engineering remain to be solved in the domain of SER. In this research, a feature optimization approach that uses a clustering-based genetic algorithm is proposed. Instead of randomly selecting the new generation, clustering is applied at the fitness evaluation level to detect outliers and exclude them from the next generation. The approach is compared with the standard genetic algorithm in the context of audio emotion recognition using the Berlin Emotional Speech Database (EMO-DB), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset. The results signify that the proposed technique effectively improved emotion classification in speech. Recognition rates of 89.6% for general speakers (both male and female), 86.2% for male speakers, and 88.3% for female speakers on EMO-DB; 82.5% for general speakers, 75.4% for male speakers, and 91.1% for female speakers on RAVDESS; and 77.7% for general speakers on SAVEE were obtained in speaker-dependent experiments. For speaker-independent experiments, recognition rates of 77.5% on EMO-DB, 76.2% on RAVDESS and 69.8% on SAVEE were achieved. All experiments were performed in MATLAB and the Support Vector Machine (SVM) was used for classification. The results confirm that the proposed method is capable of discriminating emotions effectively and performed better than the other approaches used for comparison in terms of performance measures.
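To make the clustering-at-fitness-evaluation idea above concrete, the following Python fragment sketches one generation of a genetic feature-subset search in which fitness values are clustered with k-means and the lower-scoring cluster is excluded from parent selection. It is an illustrative guess at the general scheme, not the cited authors' implementation; `population` is assumed to be a NumPy array of 0/1 feature masks and `rng` a NumPy random generator.

# Hypothetical sketch: exclude fitness outliers (found via k-means on fitness values)
# before parent selection in a genetic algorithm for feature-subset search.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(mask, X, y):
    """Mean cross-validated SVM accuracy of the feature subset encoded by a 0/1 mask."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(kernel="rbf"), X[:, mask.astype(bool)], y, cv=3).mean()

def next_generation(population, X, y, rng):
    scores = np.array([fitness(ind, X, y) for ind in population])
    # Cluster fitness values into two groups; keep only the higher-scoring cluster as parents.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores.reshape(-1, 1))
    keep = labels == np.argmax([scores[labels == c].mean() for c in (0, 1)])
    parents = population[keep]
    # Uniform crossover plus bit-flip mutation to refill the population.
    children = []
    while len(children) < len(population):
        a, b = parents[rng.integers(len(parents), size=2)]
        child = np.where(rng.random(a.size) < 0.5, a, b)
        flip = rng.random(a.size) < 0.02
        child[flip] = 1 - child[flip]
        children.append(child)
    return np.array(children), scores.max()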
... A second technique utilized here increases temporal complexity by applying continuous hidden Markov models with several states, using low-level features rather than global statistics. Supporting event description has shown great potential in everyday life, e.g., earthquake detection [18], [19], where diverse datasets have different characteristics. As part of monitoring user activities in public networks [20], group activities at low latency with the latest big-data analytics engine are considered for our proposed speech detection approach. ...
Conference Paper
Full-text available
Emotion recognition from voice samples is a new research subject in the Human-Computer Interaction (HCI) field. The need for it has arisen from the demand for a simpler communication interface between humans and computers, since computers have become a fundamental part of our lives. To accomplish this objective, a computer would need to be able to assess its current situation and respond differently depending on that particular observation. The proposed human identification involves understanding a user's emotional state and, to make human-computer interaction more natural, the main objective is that the computer should be able to recognize the emotional states of humans the same way a human does. The proposed framework focuses on the identification of basic emotional states such as anger, happiness, neutral and sadness from human voice samples. For classifying the different speech emotions, features such as Mel-frequency cepstral coefficients (MFCC) and energy are utilized. The proposed method describes and compares the performance of multiclass Support Vector Machine (SVM), Random Forest (RF) and their combination for speech emotion recognition. The MFCC and SVM algorithm proves to be an efficient no-regret online algorithm, detecting the emotion in speech with an average classification accuracy of 89%, which is reasonably acceptable.
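A minimal illustration of the kind of MFCC-plus-energy SVM pipeline described above is sketched below, assuming librosa and scikit-learn; `wav_paths` and `labels` are hypothetical placeholders for a labeled emotional-speech corpus, and details such as the frame statistics, kernel and C value are editorial choices rather than the cited authors' settings.

# Minimal sketch (not the authors' code): utterance-level MFCC + energy statistics
# fed to a multiclass SVM, roughly as described in the abstract above.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    energy = librosa.feature.rms(y=y)                        # (1, frames)
    frames = np.vstack([mfcc, energy])
    # Summarize frame-level features with per-coefficient mean and standard deviation.
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

# wav_paths / labels are placeholders for a labeled emotional-speech corpus.
X = np.array([utterance_features(p) for p in wav_paths])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X, labels)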
... Revathy et al. (2015) discuss the effectiveness of the Hidden Markov Model toolkit (HTK) for speaker-independent emotion recognition systems on the EMO-DB database with 68% accuracy. Further, Gharavian et al. (2017) used Fast Correlation-Based Filter (FCBF) feature selection on MFCC, formants, and the corresponding statistical features with fuzzy ARTMAP neural networks, with an average accuracy of 53% on the SAVEE database. The multitaper MFCC feature used for AER showed better results than traditional single-taper Hamming-window MFCC features, as in Besbes and Lachiri (2017). ...
Article
Full-text available
Recently, the recognition of the emotional state or stress level of a user has been accepted as an emerging research topic in the field of human-machine interfaces and psychiatry. Stress causes variation in the produced speech, which can be measured as a negative emotion. If this negative emotion continues for a longer period, it may wreak severe havoc on a person's life, either physically or psychologically. This paper proposes to employ regularization techniques to identify the pertinent features from a high-dimensional feature set to improve the accuracy of Automatic Emotion Recognition (AER) from stressed/emotional speech. Mel-scaled power spectrum (MSPS), Mel-frequency cepstral coefficients (MFCC), and shifted delta coefficients (SDC) features were individually considered to compute AER. Further, combinations of these features were used to improve the performance of AER. However, the expected accuracy in AER was not observed due to overfitting. To overcome this limitation, Least Absolute Shrinkage and Selection Operator (LASSO) and ridge regularization techniques were employed to select the relevant features. The ridge regression technique applied to the combined feature set (MSPS, MFCC, and SDC) yielded the best AER accuracy: 56.41% for the eNTERFACE database and 72.50% for the SAVEE database. A reduction in computational complexity of 43% was achieved.
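As a rough illustration of regularization-based feature pruning of the sort described above, the sketch below uses an L1-penalized logistic regression inside scikit-learn's SelectFromModel as a stand-in for the paper's LASSO/ridge setup; `X_combined` and `y_emotion` are placeholder names for the stacked MSPS/MFCC/SDC feature matrix and the emotion labels.

# Sketch only: L1-regularized selection over a combined feature matrix, followed by
# a classifier on the retained columns. Variable names are illustrative.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X_combined: (n_utterances, n_features) stacked MSPS + MFCC + SDC statistics (placeholder).
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=5000)
)
model = make_pipeline(StandardScaler(), selector, SVC(kernel="rbf"))
model.fit(X_combined, y_emotion)
print("features kept:", model.named_steps["selectfrommodel"].get_support().sum())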
... Unlike VGG, GoogLeNet makes a bolder attempt at the network architecture. Although this model has 22 layers, its size is much smaller than AlexNet and VGG, and it still performs well [28]. The most effective way to obtain a higher-quality model is to increase the number of layers (that is, the depth) of the model or the number of convolution kernels or neurons in each layer of the model (the model width). ...
Article
Full-text available
With the rapid development of the social economy and the extensive, in-depth development of national fitness activities, national physical fitness monitoring and research work has developed rapidly. In recent years, the application of deep learning technology has also achieved research breakthroughs in the field of computer vision. How deep learning technology can effectively capture motion information in sample data and use it to recognize and classify human actions is currently a research hotspot. Today's popularization of various shooting devices such as mobile phones and portable action cameras has contributed to the vigorous growth of image data. Therefore, through computer vision technology, image data is widely used in practical application scenarios of human feature recognition. This paper proposes a deep learning network based on the recognition of human body feature changes in sports, improves the recognition method, and compares the recognition accuracy with the original method. The experimental results show that the accuracy of the proposed method is 1.68% higher than that of the original recognition method, the accuracy of the improved motion history image is increased by 14.8%, and the overall recognition rate is higher. These experimental results show that the method achieves good performance in human action recognition.
... Today, speech emotion recognition (SER) systems assess the emotional state of the speaker by examining his/her speech signal [24][25][26]. Work [27] proposes key technologies for recognizing speech emotions based on neural networks and facial emotions based on SVM, and paper [28] presents an emotion recognition system based on an artificial neural network (ANN) and its comparison with a system based on the Hidden Markov Model (HMM) scheme. Both systems were built on the basis of probabilistic pattern recognition and acoustic-phonetic modeling approaches. ...
Article
Full-text available
The emotional speech recognition method presented in this article was applied to recognize the emotions of students during online exams in distance learning due to COVID-19. The purpose of this method is to recognize emotions in spoken speech through a knowledge base of emotionally charged words, which is stored as a code book. The method analyzes human speech for the presence of emotions. To assess the quality of the method, an experiment was conducted on 420 audio recordings. The accuracy of the proposed method is 79.7% for the Kazakh language. The method can be used for different languages and consists of the following tasks: capturing a signal, detecting speech in it, recognizing spoken words in a simplified transcription, determining word boundaries, comparing the simplified transcription with the code book, and constructing a hypothesis about the degree of emotionality of the speech. If emotions are present, full word recognition and identification of the emotions in speech are performed. The advantage of this method is the possibility of its widespread use, since it is not demanding on computational resources. The described method can be applied when there is a need to recognize positive and negative emotions in a crowd, in public transport, schools, universities, etc. The experiment carried out has shown the effectiveness of this method. The results obtained will make it possible in the future to develop devices that record and recognize a speech signal, for example, when negative emotions are detected in the sounding speech, and, if necessary, transmit a message about potential threats or riots.
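The code-book matching idea can be illustrated with a toy Python fragment: recognized words from a simplified transcription are looked up in a dictionary of emotionally charged words and the dominant emotion is hypothesized. The word lists and scoring below are invented for illustration and are not the authors' Kazakh code book.

# Toy illustration (invented word lists): score a simplified transcription against
# a code book of emotionally charged words and hypothesize the dominant emotion.
from collections import Counter

CODE_BOOK = {                      # hypothetical entries; a real code book would be curated
    "angry": "anger", "hate": "anger", "terrible": "anger",
    "happy": "joy", "great": "joy", "thanks": "joy",
    "afraid": "fear", "worried": "fear",
}

def emotion_hypothesis(transcription_words):
    hits = Counter(CODE_BOOK[w] for w in transcription_words if w in CODE_BOOK)
    if not hits:
        return "neutral", 0.0
    emotion, count = hits.most_common(1)[0]
    return emotion, count / len(transcription_words)   # crude "degree of emotionality"

print(emotion_hypothesis("i am so happy thanks a lot".split()))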
... Based on the SAVEE database, an Informed Segmentation and Labelling Approach (ISLA) was proposed that uses speech signals to alter the dynamics of the lower and upper face regions. Gharavian et al. (2017) recognized emotions on the SAVEE database using a fuzzy ARTMAP neural network (FAMNN), extracting features such as MFCC, pitch and Zero Crossing Rate (ZCR) for the audio channel. Based on the BAUM-1s database, Zhalehpour et al. (2016) extracted MFCC and Relative Spectral (RASTA) features from speech, while for the facial expressions, Local Phase Quantization (LPQ) features and Patterns of Oriented Edge Magnitude (POEM) features were used. ...
Article
Full-text available
This paper focuses on Multimodal Emotion Recognition (MER), which can be conceptually perceived as the superset of Speech Emotion Recognition (SER) and Facial Emotion Recognition (FER). The main challenge in designing an MER system is the extraction of discriminative features. The features are strategically selected from the speech and visual (facial) modalities and subsequently fused together. The base features extracted from the speech segment were Mel-Frequency Cepstral Coefficients (MFCC), while Facial Landmark Distances (FLD) were calculated from every frame of a video sequence as the visual base features. Since these base features are static in nature, they hinder the performance of the emotion recognition process. Hence, a novel algorithm, Shifted Delta Acceleration Linear Discriminant Analysis (SDA-LDA), is proposed here to extract the most discriminative, dynamic and robust features of the speech and video sequence. A Support Vector Machine (SVM) classifier was used for the classification of the emotions from the fused multimodal features. The MER experiment was performed on the eNTERFACE, Surrey Audio-Visual Expressed Emotion (SAVEE), spontaneous BAUM-1s and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) databases, which resulted in accuracies of 95.65%, 100%, 100% and 93.3%, respectively. The accuracies obtained for MER on all four databases outperform those achieved for SER, FER and state-of-the-art techniques.
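The Facial Landmark Distances (FLD) base features mentioned above amount to pairwise distances between detected landmark points in each frame. A small sketch of that computation follows; it assumes landmarks of shape (frames, points, 2) from some external detector (e.g., a 68-point model) and uses a crude bounding-box normalization that is an editorial choice, not necessarily the authors'.

# Sketch: pairwise distances between 2-D facial landmarks as per-frame visual features.
# `landmarks` is assumed to be an array of shape (n_frames, n_points, 2); the landmark
# detector itself is not shown here.
import numpy as np
from scipy.spatial.distance import pdist

def fld_features(landmarks):
    feats = []
    for frame in landmarks:
        # Normalize by the bounding-box diagonal so distances are roughly size-invariant.
        scale = np.linalg.norm(frame.max(axis=0) - frame.min(axis=0)) + 1e-8
        feats.append(pdist(frame) / scale)            # all pairwise point distances
    return np.asarray(feats)                          # (n_frames, n_points*(n_points-1)/2)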
... In recent years, some progress has been made in research on cross-database SER, and some solutions have been put forward, for example, implicit factor analysis [12,13], arousal and valence dimension mapping [14], feature transfer learning based on sparse autoencoding [15] and multi-kernel learning strategies for speaker-independent recognition [16]. All the techniques mentioned above use traditional linguistic features to construct SER systems. ...
Article
Full-text available
As the most natural language and emotional carrier, speech is widely used in intelligent furniture, vehicle navigation and other speech recognition technologies. With the continuous improvement of China's comprehensive national strength, the power industry has also ushered in a new stage of vigorous development. As electricity is the basis of production and life, absorbing voice processing technology is a general trend. In order to better meet the actual needs of power grid dispatching, this paper applies voice processing technology to the field of smart grid dispatching. After testing and evaluating the recognition rate of the existing speech recognition system, a speech emotion recognition technology based on a BiLSTM and CNN dual-attention model is proposed, which is suitable for the human-machine interaction system in the field of intelligent scheduling. First, the mel-spectrum sequence of the speech signal is extracted as the input of the BiLSTM network, and the BiLSTM network then extracts the temporal context features of the speech signal. On this basis, the CNN network is used to extract high-level emotional features from the low-level features and complete the emotion classification of the speech signals. Emotion recognition tests were conducted on three different emotional databases, eNTERFACE05, RML and AFW6.0. The experimental results show that the average recognition rates of this technology on the three databases are 55.82%, 88.23% and 43.70%, respectively. In addition, traditional speech emotion recognition technology is compared with speech emotion recognition based on BiLSTM or CNN alone, which verifies the effectiveness of the technology.
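A rough PyTorch sketch of the BiLSTM-then-CNN idea (mel-spectrogram frames passed through a bidirectional LSTM, then a 1-D convolution and pooling for classification) is given below. The dual-attention mechanism of the cited work is omitted, and all layer sizes are illustrative rather than taken from the paper.

# Rough PyTorch sketch of a BiLSTM-then-CNN classifier over mel-spectrogram frames.
# The dual-attention mechanism of the cited paper is omitted; sizes are illustrative.
import torch
import torch.nn as nn

class BiLSTMCNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=6):
        super().__init__()
        self.bilstm = nn.LSTM(n_mels, 128, batch_first=True, bidirectional=True)
        self.conv = nn.Sequential(
            nn.Conv1d(256, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, mel):                  # mel: (batch, time, n_mels)
        h, _ = self.bilstm(mel)              # (batch, time, 256)
        h = self.conv(h.transpose(1, 2))     # (batch, 128, 1)
        return self.fc(h.squeeze(-1))        # (batch, n_classes) emotion logits

logits = BiLSTMCNN()(torch.randn(8, 300, 64))   # 8 utterances, 300 frames, 64 mel bands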
Article
The evolution of internet resources has led to an increase in the flow of data and, consequently, the need for classification or forecasting models that support online learning. The Fuzzy ARTMAP neural network has been used in most areas of knowledge; however, few works have explored real-time applications that require continuous training. In this work, a Fuzzy ARTMAP neural network with continuous training is proposed. This new network can acquire knowledge via classification or prediction. Modifications made to the architecture and learning algorithm enable online learning from the first sample of data and allow classification or forecasting at any time during training. To validate the proposed model, three experiments were performed, one for forecasting and two for classification. Each experiment used benchmark databases and compared its final results with the results of the original Fuzzy ARTMAP neural network. The results demonstrate the ability of the proposed model to acquire knowledge from the first data samples in a stable and efficient way. Thus, this study contributes to the evolution of the Fuzzy ARTMAP neural network and introduces the continuous training method as an effective alternative for real-time applications.
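Since the choice parameter (α), vigilance (ρ) and learning rate (β) optimized by PSO in the present article are exactly the knobs of the underlying fuzzy ART module, a generic sketch of those operations may be helpful here. This is the standard single-module formulation with complement coding, not the continuous-training Fuzzy ARTMAP proposed in the work above.

# Generic fuzzy ART building blocks (complement coding, choice, match, learning).
# Illustrates the alpha/rho/beta parameters tuned elsewhere in this article, not the
# continuous-training Fuzzy ARTMAP of the cited work.
import numpy as np

def complement_code(x):                      # x in [0, 1]^d  ->  [x, 1 - x]
    return np.concatenate([x, 1.0 - x])

def choice(I, w, alpha):                     # Tj = |I ^ wj| / (alpha + |wj|)
    return np.minimum(I, w).sum() / (alpha + w.sum())

def resonates(I, w, rho):                    # match criterion: |I ^ wj| / |I| >= rho
    return np.minimum(I, w).sum() / I.sum() >= rho

def learn(I, w, beta):                       # wj_new = beta*(I ^ wj) + (1 - beta)*wj
    return beta * np.minimum(I, w) + (1.0 - beta) * w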
Conference Paper
Full-text available
This paper explores the recognition of expressed emotion from speech and facial gestures for the speaker-dependent case. Experiments were performed on an English audiovisual emotional database consisting of 480 utterances from 4 English male actors in 7 emotions. A total of 106 audio and 240 visual features were extracted and features were selected with the Plus l-Take Away r algorithm based on the Bhattacharyya distance criterion. Linear transformation methods, principal component analysis (PCA) and linear discriminant analysis (LDA), were applied to the selected features, and Gaussian classifiers were used for classification. The performance was higher for LDA features compared to PCA features. The visual features performed better than the audio features, and the overall performance improved for the audiovisual features. In the case of 7 emotion classes, an average recognition rate of 56 % was achieved with the audio features, 95 % with the visual features and 98 % with the audiovisual features selected by Bhattacharyya distance and transformed by LDA. Grouping emotions into 4 classes, an average recognition rate of 69 % was achieved with the audio features, 98 % with the visual features and 98 % with the audiovisual features fused at decision level. The results were comparable to the measured human recognition rate with this multimodal data set.
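The Bhattacharyya distance used above as a feature selection criterion has a closed form for two Gaussian class models N(m1, S1) and N(m2, S2); a NumPy version, assuming full covariance estimates for each class, is sketched below.

# Bhattacharyya distance between two Gaussian class models N(m1, S1) and N(m2, S2):
#   D_B = 1/8 (m1-m2)^T S^-1 (m1-m2) + 1/2 ln( det(S) / sqrt(det(S1) det(S2)) ),  S = (S1+S2)/2
import numpy as np

def bhattacharyya(m1, S1, m2, S2):
    S = 0.5 * (S1 + S2)
    diff = m1 - m2
    term1 = 0.125 * diff @ np.linalg.solve(S, diff)
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_S1 = np.linalg.slogdet(S1)
    _, logdet_S2 = np.linalg.slogdet(S2)
    term2 = 0.5 * (logdet_S - 0.5 * (logdet_S1 + logdet_S2))
    return term1 + term2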
Poster
Full-text available
Speaker-dependent audio-visual emotion recognition was performed using an English database (4 male actors, 7 emotions). A total of 106 audio and 240 visual features were extracted, and feature selection was performed with Plus l-Take Away r algorithm based on Bhattacharyya distance criterion. The PCA and LDA were applied to the selected features for feature reduction, and Gaussian classifiers were used for classification. Higher recognition performance was achieved for the visual and audio-visual features compared to the audio features.
Article
Full-text available
A comparative analysis of the filter and wrapper approaches to feature selection is presented for audio, visual and audio-visual human emotion recognition. A large set of audio and visual features was extracted, followed by speaker normalization. In the filter approach, feature selection was performed using the Plus l-Take Away r algorithm based on the Bhattacharyya distance criterion. In the wrapper method, features were selected based on their classification performance using a support vector machine (SVM) classifier. Finally, an SVM classifier was used for human emotion recognition. The filter approach provided slightly better performance in comparison to the wrapper approach for seven emotions on the SAVEE database.
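The wrapper approach described above, in which features are kept according to their SVM classification performance, can be approximated with scikit-learn's sequential selector as in the sketch below. This greedy forward selector is an illustrative stand-in, not a reimplementation of the Plus l-Take Away r procedure (which also removes r features per step); `X_audio_visual` and `y_emotion` are placeholders for the extracted feature matrix and emotion labels.

# Illustrative wrapper-style selection: greedily add the features that most improve
# cross-validated SVM accuracy.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=10)
wrapper = SequentialFeatureSelector(svm, n_features_to_select=40,
                                    direction="forward", scoring="accuracy", cv=5)
wrapper.fit(X_audio_visual, y_emotion)        # placeholder feature matrix and labels
selected = wrapper.get_support(indices=True)  # indices of the retained features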
Article
Disturbed processing of emotional faces and voices is typically observed in schizophrenia. This deficit leads to impaired social cognition and interactions. In this study, we investigated whether impaired processing of emotions also affects musical stimuli, which are widely present in daily life and known for their emotional impact. Thirty schizophrenic patients and 30 matched healthy controls evaluated the emotional content of musical, vocal and facial stimuli. Schizophrenic patients are less accurate than healthy controls in recognizing emotion in music, voices and faces. Our results confirm impaired recognition of emotion in voice and face stimuli in schizophrenic patients and extend this observation to the recognition of emotion in musical stimuli.
Article
This paper presents an evaluation of the synchronization of three emotional measurement methods (automatic facial expression recognition, self-report, electrodermal activity) and their agreement regarding learners' emotions. Data were collected from 67 undergraduates enrolled at a North American university who learned about a complex science topic while interacting with MetaTutor, a multi-agent computerized learning environment. Videos of learners' facial expressions captured with a webcam were analyzed using automatic facial recognition software (FaceReader 5.0). Learners' physiological arousal was recorded using Affectiva's Q-Sensor 2.0 electrodermal activity measurement bracelet. Learners self-reported their experience of 19 different emotional states on five different occasions during the learning session, which were used as markers to synchronize the data from FaceReader and the Q-Sensor. We found high agreement between the facial and self-report data (75.6%), but low levels of agreement between them and the Q-Sensor data, suggesting that a tightly coupled relationship does not always exist between emotional response components.
Article
Vocal media has become a popular method of communication in today's social networks. While conveying semantic information, vocal messages usually also contain abundant emotional information; this emotional information represents a new focus for data mining in social media analytics. This paper proposes a computational method for emotion recognition and affective computing on vocal social media to estimate complex emotion as well as its dynamic changes in a three-dimensional PAD (Pleasure-Arousal-Dominance) space; furthermore, this paper analyzes the propagation characteristics of emotions on the vocal social media site WeChat.
Article
This paper describes the use of a decision-based fusion framework to infer emotion from audiovisual feeds, and investigates the effect of noise on the fusion system. Facial expression features are constructed from local binary patterns and are processed independently of the prosodic features. A linear support vector machine is used for the fusion of the two channels. The results show that the recognition accuracy of the bimodal system improves on that of the individual channels; moreover, the system maintains reasonably good performance in the presence of noise.
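A minimal sketch of decision-level fusion in this style is shown below: each channel's classifier produces class scores, the scores are concatenated, and a linear SVM makes the final decision. The per-channel models (`face_clf`, `audio_clf`) are assumed to be already trained and to expose predict_proba; all names are placeholders rather than the cited authors' implementation.

# Sketch of decision-level fusion: stack per-channel class scores and train a linear SVM.
# `face_clf` / `audio_clf` are assumed to be already-fitted classifiers with predict_proba.
import numpy as np
from sklearn.svm import LinearSVC

def fusion_inputs(face_feats, audio_feats):
    return np.hstack([face_clf.predict_proba(face_feats),
                      audio_clf.predict_proba(audio_feats)])

fusion = LinearSVC(C=1.0)
fusion.fit(fusion_inputs(face_train, audio_train), y_train)   # placeholder training splits
pred = fusion.predict(fusion_inputs(face_test, audio_test))   # fused emotion predictions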