Extraction of Novel Features Based on Histograms of MFCCs Used in Emotion Classification from Generated Original Speech Dataset

Muhammet Pakyurek1, *, Mahir Atmis2, Selman Kulac1, Umut Uludag3
1Department of Electrical - Electronics Engineering, Faculty of Engineering, Duzce University, Duzce, Turkey
2Department of Computer Engineering, Faculty of Engineering, Ozyegin University, Cekmekoy/Istanbul, Turkey
Baris Mah. Dr. Zeki Acar Cad. No:1, 41470 Gebze/Kocaeli, Turkey


Abstract

This paper introduces two significant contributions: the first is a new feature, based on histograms of MFCCs (Mel-Frequency Cepstral Coefficients) extracted from audio files, that can be used in emotion classification from speech signals; the second is our new multi-lingual, multi-speaker speech database covering three emotions. In this study, the Berlin Database (BD) (in German) and our custom PAU database (in English), created from YouTube videos and popular TV shows, are employed for training and for evaluating the test results. Experimental results show that the proposed features lead to better classification results than current state-of-the-art Support Vector Machine (SVM) approaches from the literature. Thanks to the novel feature, this study outperforms a number of studies based on MFCC features and SVM classifiers, including recent ones. Since no prior work uses our novel feature, one of the most common MFCC-and-SVM frameworks is implemented, and one of the most common databases, the Berlin DB, is used to compare our approach against such methods.
Index Terms: Emotion classification; MFCC; SVM; Speech.

I. Introduction
Human-computer interaction systems have been drawing
attention increasingly in recent years. Understanding the
emotions of humans plays a significant role in these
systems, since human feelings provide a better
understanding of human behaviours. Furthermore, in order
to increase the accuracy of recognition of the words spoken
by human, many of the state-of-the-art automatic speech
recognition systems are dedicated to natural language
understanding. Emotion classification has a key role in
performance improvements for natural language
understanding. The other areas, in which an emotion
classification system can be used are as follows: voice
search tagging, word search with specific emotions, and
emotion based advertisement placement [1].
Manuscript received 30 April, 2019; accepted 12 October, 2019.
In this study, MFCCs are calculated for all audio files in
both of the utilized databases. Then, these are classified
based on the type of emotions. In [2], Plutchik claims that
emotions are categorized as the Primary Emotions and
Secondary Emotions. Primary emotions are anger, fear,
sadness, disgust, surprise, anticipation, trust, and joy. In this
study, emotions of sadness, happiness, and neutral can be
recognized by our designed system. We focused only on
these three emotions as the amount of the train data is
generally not large enough for the remaining ones to arrive
at statistically robust conclusions. There are two main contributions in this study: one is our novel feature, a representation of MFCCs based on their histograms; the other is the PAU speech database, whose emotion labels were assigned and cross-checked by PhD students.
Section II covers academic studies related to this paper. In
Section III, experimental framework and its steps are
elaborated. Section IV mentions our novel feature and
classical MFCCs feature of academic literature in detail.
Section V describes speech data and their characteristics.
Finally, Section VI exhibits the experimental results and
Section VII draws conclusions.

II. Related Work
Various types of classifiers have been used for the task of
speech emotion classification: Hidden Markov Model
(HMM), Gaussian Mixture Model (GMM), Support Vector
Machine (SVM), Artificial Neural Networks (ANN), k-
Nearest Neighbors (k-NN), and many others. In fact, there is no agreement on which classifier is the most suitable for emotion classification; each classifier has its own advantages and limitations.
Many recent studies show that DNN-based approaches outperform SVM in many areas, such as image, speech, and text, when abundant data are available [3]. In recent papers [4]-[6], two R&D groups have independently established closely related DNN architectures with multi-task learning capabilities for multilingual speech
recognition. On the other hand, although conventional deep learning-based methods can outperform the SVM classifier, they require plenty of training samples to construct DNN models [7], [8]. Therefore, we cannot employ DNNs due to the limited data.
In [9], the authors leverage MFCC for feature extraction and multiple Support Vector Machines (SVM) as classifiers. Their extensive experiments are based on a sound database covering happiness, anger, sadness, disgust, surprise, and neutral emotions. Performance analysis of multiple SVMs reveals that non-linear kernel SVMs achieve greater accuracy than linear SVMs [10]. As the authors mention, their best performance on the Berlin DB is 75 % accuracy.
Dahake et al. [11] make two main contributions: one is feature extraction using pitch, formants, and MFCC; the other is improving speaker-dependent SER by comparing results across different kernels of the SVM classifier [12]. The highest accuracy is obtained with the feature combination MFCC + Pitch + Energy on both the Malayalam emotional database (95.83 %) and the Berlin emotional database (75 %), tested using SVM with a linear kernel.
In [13], three emotional states are recognized: happiness,
sadness, and neutral. Explored features include: energy,
pitch, Linear Prediction Cepstral Coefficients (LPCC),
MFCC, and Mel-Energy spectrum Dynamic Coefficients
(MEDC). The Berlin Database and a self-built Chinese emotional database are used for training the specified classifiers.
In [14], basic emotions are recognized by comparing speech features. The authors use a methodology similar to the one in this paper to recognize emotions. However, their database and recognition features are quite different from ours.
In order to combine the merits of several classifiers,
aggregating a group of them has also been recently
employed [15], [16]. Based on several studies [17]-[22], we can conclude that SVM is one of the most popular classifiers in emotion classification, probably because it was widely used in almost all speech applications up to 2012. As shown in Table I [23], the average success rate of SVM for speech emotion classification is in the range of 75.45 %-81.29 %.
In [24], Kamruzzaman and Karim report on speaker identification for authentication and verification in security areas. This kind of identification is mainly divided into text-dependent and text-independent approaches. While many studies utilize the text-dependent approach, based on a variety of predefined utterances, that study employs a text-independent methodology. Its implementation is composed of feature generation and classification: MFCCs are calculated as the foundation of informative features, and SVM uses these features to classify the speech data.
In [25], Demircan and Kahramanli extract MFCCs from the speech data of the Berlin Database [26]. Seven statistical values are calculated from the MFCCs: minimum, maximum, mean, variance, median, skewness, and kurtosis. Using those values, the k-Nearest Neighbor algorithm classifies the data. Their contribution is reducing the dimension of the data to 7 values.
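The statistical reduction of [25] can be sketched in a few lines. The following is an illustrative Python/numpy version (the helper name seven_stats is ours, not from [25]), assuming an MFCC matrix of shape frames x coefficients:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def seven_stats(mfcc):
    # Reduce a (frames x coefficients) MFCC matrix to the 7 statistics
    # used in [25]: min, max, mean, variance, median, skewness, kurtosis.
    return np.vstack([
        mfcc.min(axis=0), mfcc.max(axis=0), mfcc.mean(axis=0),
        mfcc.var(axis=0), np.median(mfcc, axis=0),
        skew(mfcc, axis=0), kurtosis(mfcc, axis=0),
    ]).T  # one row of 7 statistics per coefficient
```

For 13 values per frame, this yields a 13 x 7 summary per file.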
TABLE I. Comparison of classifiers [23] along three criteria: average classification accuracy (%), average training time, and sensitivity to model initialization (the numeric entries were not recovered from the source).

III. Experimental Framework
In order to carry out various experiments showing the performance of our novel emotion classification feature, we elaborate the framework in detail. The steps of this emotion classification framework (Fig. 1) are described below.
Fig. 1. Process flow of our emotion classification framework.
A. Collect Speech Data
Collecting speech data plays a significant role in speech
recognition studies due to the lack of comprehensive speech
data. Therefore, speech data collection constitutes a major
part of this study. The details of data properties and how to
generate them are explained in Section V-C.
B. Preprocessing
Since noise corrupts speech data, removing outliers plays a significant role in a state-of-the-art classification system. To filter outliers out, the interquartile range (IQR) method of John Tukey [27] is employed. Furthermore, min-max normalization is applied feature-wise to remove the sensitivity to high variance across features.
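A minimal sketch of this preprocessing in Python/numpy (the function names and the conventional fence factor k = 1.5 are our assumptions; the paper only names Tukey's interquartile range method and feature-wise min-max normalization):

```python
import numpy as np

def iqr_filter(X, k=1.5):
    # Tukey's rule [27]: keep a sample (row) only if every feature lies
    # inside [Q1 - k*IQR, Q3 + k*IQR], computed per feature column.
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    mask = np.all((X >= q1 - k * iqr) & (X <= q3 + k * iqr), axis=1)
    return X[mask]

def min_max_normalize(X):
    # Feature-wise scaling to [0, 1] to remove high-variance sensitivity.
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)
```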
C. Feature Extraction
The extraction of suitable features that efficiently
represent different emotions is one of the most important
issues in the design of a speech emotion classification
system. A proper group of features significantly affects the
classification results, since pattern recognition techniques
are rarely independent of the problem domain. In this study,
MFCCs are selected as the basis of the features. More specifically, the first feature consists of the average of the MFCCs together with the averages of their first and second derivatives. The second feature, which is our novelty, consists of weighted MFCC values that combine the MFCC values with their corresponding Probability Density Function (PDF) values. The third feature, the concatenation of the first two, is leveraged to obtain higher performance.
1. Mel-Frequency Cepstrum Coefficients (MFCC)
MFCCs are calculated based on the known variation of
the human ear’s critical bandwidths with frequency. The
main point to understand speech is that the sounds generated
by a human are filtered by the shape of the vocal tract,
including tongue, teeth, etc. This shape determines what
sound comes out. If the shape is accurately determined, this
should result in an accurate representation of the phoneme
being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the purpose of MFCCs is to represent this envelope accurately.
In order to obtain approximately stationary segments of data, the audio signal is divided into 25 ms frames. If the frame is
too short, it may not be possible to have enough samples to
get a reliable spectral estimate. If it is too long, the signal
changes too much throughout the frame. Each frame can be
converted into 12 MFCCs plus a normalized energy
parameter. The first and second derivatives (Delta and
Delta-Delta, respectively) of MFCCs and energy can be
calculated as extra features resulting in 39 numbers
representing each frame. However, the derivation of the
MFCC parameters is generally implemented when the
original MFCC does not provide the necessary amount of
information that leads to a good classification.
The MFCC algorithm steps are shown in Figure 2.
Fig. 2. Block diagram of the MFCC Algorithm [1].
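The steps of Fig. 2 can be sketched end-to-end. The following is an illustrative Python/numpy version, not the Matlab toolbox of Wojcicki [29] used in the paper; the FFT size (512), hop length (10 ms), and filter count (26) are conventional assumptions not fixed by the text:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale up to Nyquist.
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=26, n_coeffs=12):
    # 1) Framing: 25 ms frames, so the signal is quasi-stationary per frame.
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - flen) // hop
    frames = np.stack([signal[i * hop:i * hop + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)               # 2) windowing
    spec = np.abs(np.fft.rfft(frames, n=512)) ** 2   # 3) power spectrum
    fbank = spec @ mel_filterbank(n_filters, 512, sr).T  # 4) mel filterbank
    logfb = np.log(np.maximum(fbank, 1e-10))         # 5) log energies
    return dct(logfb, type=2, axis=1, norm='ortho')[:, 1:n_coeffs + 1]  # 6) DCT
```

A one-second file at 16 kHz yields 98 frames of 12 coefficients here; the normalized frame energy described above would be appended as a 13th value.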
D. Classification
A speech emotion classification system consists of two
stages: (1) feature extraction from the available (speech)
data and (2) classification of the emotion in the speech
utterance. In fact, most recent research in speech emotion classification has focused on this step. A number
of advanced machine learning algorithms have been
developed for many different research areas. On the other
hand, traditional classifiers have been used in almost all
proposed speech emotion classification systems [23]. In this
study, SVM is used to classify speech utterances by
optimizing and training data set and presenting performance
results on the test sets.
SVM is a supervised machine learning classifier
technique used primarily for large databases to categorize
new samples. The algorithm searches for the optimal
hyperplane, which separates different classes with maximum
margin between them. LibSVM [28], a widely accepted support vector library, is used to train and test the dataset. The data are separated into two parts: 90 % for
training and 10 % for testing. On the training part, the
validation sets for each fold are generated using 10-fold
cross-validation methodology. A Gaussian radial basis function (RBF) kernel is used to classify the data, since it gives better approximations on the data. The best SVM parameters C and
gamma (γ) are obtained using 10-fold cross-validations on
train dataset with validation data. Those parameters are
determined using a mesh-grid search over the values
suggested by [28].
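The training procedure described above can be sketched with scikit-learn in Python (the paper itself uses LibSVM for Matlab). The toy data, the 90/10 split, and the base-2 grid ranges commonly suggested for use with LibSVM [28] are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Toy stand-in for the real feature matrix X and emotion labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = (X[:, 0] > 0).astype(int)

# 90 % of the data for training, 10 % held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

# Mesh-grid search over C and gamma with 10-fold cross-validation,
# using a Gaussian RBF kernel as in the paper.
grid = GridSearchCV(
    SVC(kernel='rbf'),
    {'C': 2.0 ** np.arange(-5, 16, 2), 'gamma': 2.0 ** np.arange(-15, 4, 2)},
    cv=10,
)
grid.fit(X_tr, y_tr)
test_acc = grid.score(X_te, y_te)
```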
E. Software Toolbox
LibSVM [28] library for Matlab is used for SVM
routines. Matlab’s TreeBagger class is utilized for RF
classification. MFCC library of Wojcicki for Matlab [29] is
used to calculate MFCC.
F. Algorithm
In the main part, firstly, all datasets are processed to calculate the MFCCs of each individual file. Then, the first, second, and third features of each file are computed from the MFCC values. More elaborately, each file is divided into 25 ms frames, and MFCCs are calculated for each frame. After calculating the MFCCs, their average value and the corresponding first and second derivatives are computed. Then, a histogram of each MFCC is created by dividing its min-max range into 10 equidistant bins, and the histogram counts are divided by the total count to obtain the PDF of each MFCC. In order to leverage the PDF value together with the corresponding MFCC value, these two values are multiplied for each bin of the MFCC PDF. Finally, all MFCC values, their average, and the first and second derivatives of each MFCC are stored for each frame; at the end of the file, the histogram and PDF are calculated over all frames. A covariance matrix and a label vector of the output emotion classes are generated for the SVM. After the SVM analysis, accuracy and confusion matrices are calculated as mean values over all runs.
In the SVM analysis part, train and test data are randomly selected. Then, 10-fold cross-validation is performed on the train data. Accuracy results of SVM prediction are obtained using the best parameters resulting from the cross-validation.

IV. Features
In this study, 12 MFCC coefficients plus the energy of each frame are calculated for each individual audio file
[29]. The details of the MFCC are explained in Section III-
C1. For the feature extraction, three features are generated
using MFCC. These features are as follows.
1. Feature Set 1
Average of the MFCCs and of their Delta (first-order derivative) and Delta-Delta (second-order derivative): the average of the MFCCs is calculated over all frames of each speech file. Delta and Delta-Delta are calculated by subtracting consecutive frames and consecutive Deltas, respectively.
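A minimal Python/numpy sketch of Feature Set 1; averaging the delta streams into one fixed-length vector per file is our reading of the description, and the name feature_set_1 is ours:

```python
import numpy as np

def feature_set_1(mfcc):
    # mfcc: (frames x coefficients) matrix for one speech file.
    delta = np.diff(mfcc, axis=0)    # first derivative: consecutive frames
    delta2 = np.diff(delta, axis=0)  # second derivative: consecutive Deltas
    # Average each stream over frames and concatenate into one vector.
    return np.concatenate([mfcc.mean(axis=0),
                           delta.mean(axis=0),
                           delta2.mean(axis=0)])
```

For 13 values per frame (12 MFCCs plus energy), this gives a 39-dimensional vector per file.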
2. Feature Set 2
Weighted MFCC values with respect to their probability distribution: the PDF of each coefficient is calculated by building a histogram of that MFCC over all frames. During this calculation, a different value interval for each MFCC is obtained from its min-max values. The second feature is calculated by multiplying the values in this interval by the corresponding PDF values:
c_i ∈ [a_i, b_i], i = 1, 2, ..., 13, (1)
v_i = c_i * PDF(c_i), (2)
where a_i and b_i are the min and max values of each MFCC, and c_i is a value within [a_i, b_i]. As shown in Fig. 3, c_i is a discrete feature value and pdf(c_i) is a non-normalized probability value. In (2), the * operation is element-wise multiplication. In this way, we encode the histogram simply by multiplying these two values, so only the bin products are needed to represent it; otherwise, bin values and the corresponding probability values would have to be stored separately. Thanks to this approach, the number of features is decreased and the computational performance is increased, because the size of the histogram representation is halved.
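Feature Set 2 can be sketched as follows in Python/numpy; taking each bin's center as its representative value c_i is our assumption, since the text does not specify which value within the bin is used:

```python
import numpy as np

def feature_set_2(mfcc, n_bins=10):
    feats = []
    for col in mfcc.T:  # one histogram per MFCC coefficient
        # 10 equidistant bins over the coefficient's [min, max] range.
        counts, edges = np.histogram(col, bins=n_bins)
        pdf = counts / counts.sum()               # counts / total count
        centers = (edges[:-1] + edges[1:]) / 2.0  # c_i values in [a_i, b_i]
        feats.append(centers * pdf)               # v_i = c_i * PDF(c_i)
    return np.concatenate(feats)
```

Each coefficient thus contributes 10 products instead of 10 bin values plus 10 probabilities, halving the histogram representation as noted above.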
Fig. 3. PDF of one of MFCCs without normalization [30].
3. Feature Set 3
Concatenation of Feature Set 1 and Feature Set 2: In this
feature set, Feature Set 1 and Feature Set 2 are assembled
without any modification on both features.

V. Speech Data
The details of the databases utilized in this study are as follows:
1. The Berlin Database: This is a database frequently used
by emotion classification researchers, which contains
speech data in German language [23], [31]. Burkhardt et
al. [26] show the details about the Berlin Database.
2. The PAU Database: We have collected English speech
samples from YouTube video collections and videos of
popular TV shows.
Figure 4 and Figure 5 illustrate histograms of the length
of the audio files for the Berlin Database and our custom
database, respectively. Bins of the histograms represent
audio file length in seconds. The total number of files is 312 for the Berlin Database and 320 for the PAU database. The total duration is 16 minutes for the Berlin database and 10 minutes for PAU.
Fig. 4. Berlin Database file length histogram.
Fig. 5. PAU Database file length histogram.
A. Database Features
In this study, genders (male & female) of the associated
individuals are noted as database metadata. Also, age
categories are classified as “Young” (age between 12 and
30) and “Mature” (age between 31 and 60). Sadness,
happiness, and neutrality are chosen as target emotions to
predict. Audio files are in wav format and their duration
varies from 1 to 9 seconds. Acted and neutral speech types
are also available.
B. Labelling
Labelling the audio file plays a significant role in
categorization of the data. In this study, all speech data are
labelled with gender, emotion, and age data. Table II
compares both databases according to their features.
C. PAU Database Generation
The PAU database is produced from the sources
described in Table III by 4 (male) students, who are doing
their PhDs in computer and electrical engineering
departments. All speech data are inserted into the PAU
database after the independent control steps. In this control
step, each member checks other members’ data sets also,
which must be consistent with their corresponding label. It
took nearly three months to collect and process the data,
which is approximately 102 MB in size (the database files
will be provided free of charge to the academic and research
TABLE II. Comparison of the databases (columns: Emotions; Speakers; Age Group; Speech Type).
Berlin DB: Sad, Happy, Neutral; 5 Male, 5 Female; Young, Mature; Acted.
PAU DB: Sad, Happy, Neutral; 195 Male, 72 Female; Young, Mature; Acted, Natural.
TABLE III. Sources of the PAU database (columns: Source; Emotions; Speakers; Age Group; cells not listed were not recovered from the source).
How I Met Your Mother: Sad, Happy, Neutral; 16 Male, 1 Female; Young, Mature.
Sherlock Holmes: Sad, Happy, Neutral; 2 Male.
Thrones Youtube: 16 Male, 8 Female; Young, Mature.
YouTube Best Cry Videos: 63 Male; Young, Mature.
(source not recovered): 1 Male, 3 Female.
The Man From Uncle: 22 Male, 7 Female; Young, Mature.
Youtube News Compilation: 50 Male, 9 Female; Young, Mature.
Youtube Videos Compilation: 25 Male, 44 Female; Young, Mature.

VI. Experimental Results
The database consists of 632 audio samples in total.
Experiments are conducted for the German Berlin database,
PAU English database, and a combination of both. For each
case, train and test data are selected from their own datasets.
The number of audio files per emotion and per database is shown in Table IV.
TABLE IV. Emotions (number of audio files) for the Berlin DB and the PAU DB (the numeric entries were not recovered from the source).
The accuracy results of SVM, shown in Table V, Table
VI, and Table VII, are the average accuracy results of 60
runs. More specifically, all experiments are repeated 60
times. The peak (non-average) accuracy result obtained
during the tests was 95 %. One of the models used in the
paper [13] by Yixiong et al. consists of MFCC + MEDC +
Energy triple.
That model has the highest accuracy rate (91.3043 %)
among all their models on the Berlin Database, but it is not
clear, whether that is a peak accuracy or a mean accuracy.
In [26], Burkhardt et al. did not mention how to separate
train and test data. Their best neutral, happiness, and sadness
recognition rates are 88.2 %, 83.7 %, and 80.7 %,
respectively, while ours are 84.8 %, 85.29 %, 88.5 % for the
third feature on the Berlin Database (in German). The results reveal that our features result in better performance for identifying the emotions of happiness and sadness.
TABLE V. Average accuracy on the Berlin Database: First Feature 83.78 %; Second Feature 86.00 %; Third Feature 88.33 %.
TABLE VI. Average accuracy on the PAU Database: First Feature 76.35 %; Second Feature 78.27 %; Third Feature 79.81 %.
TABLE VII. Average accuracy on the combined databases: First Feature 76.27 %; Second Feature 81.86 %; Third Feature 83.81 %.

VII. Conclusions
Even though DNNs can perform better than SVM, in this study SVM is used as the classifier because of the lack of a large speech dataset. Better results were obtained because the distributions of all MFCCs carry more information for representing emotion than the average of the MFCCs alone. The novel feature yields a smaller histogram representation and requires less computational power. We can conclude that using this feature has two main advantages: a smaller feature representation and a lower computational cost.
The best results are achieved on the Berlin Database rather than the PAU (English) database, because the sentences in the Berlin data are the same for each individual and are recorded in the same setting (a studio environment). Procedural characteristics of the speech, such as word stress, mood, and mouth gestures, are almost the same.
As shown in Table V and Table VI, we have an
approximately 8.5 % decrease of accuracy for the English
database (Table VI) compared to Berlin Database (Table V)
because the sentences differ considerably from one sample to another in the former database. Furthermore, additional noise from the recording environment has a great impact on the audio files: all Berlin speech data were recorded in an indoor studio, while our database contains utterances from varied environments, so the data generation procedures are quite different. As a consequence, our framework for audio collection is more representative of real-life conditions. Our study achieves better results than the average classification accuracy reported for SVM in speech emotion classification studies. The accuracy results
obtained by SVM on PAU database for the first, second, and
third feature are 70 %, 71 %, and 73 %, respectively. Those
numbers are 75 %, 78 %, and 81 % for Berlin Database. The
results obtained are the average accuracy results of 60 runs.
Those results support that the third feature helps us to obtain
a better classification result.
The authors declare that they have no conflicts of interest.
References
[1] R. Jang, Audio Signal Processing and Recognition, 2011. [Online].
[2] R. Plutchik, "The nature of emotions: Human emotions have deep evolutionary roots", American Scientist, vol. 89, no. 4, pp. 344-350, Jul.-Aug. 2001. DOI: 10.1511/2001.4.344.
[3] L. Deng and D. Yu, "Deep learning: Methods and applications", Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197-387, 2014. DOI: 10.1561/2000000039.
[4] L. Deng, J. Li, J. T. Huang, K. Yao, D. Yu, F. Seide, M. L. Seltzer, G. Zweig, X. He, J. D. Williams et al., "Recent advances in deep learning for speech research at Microsoft", in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 8604-8608.
[5] J. T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers", in Proc. of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7304-7308. DOI: 10.1109/ICASSP.2013.6639081.
[6] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, "Multilingual acoustic models using distributed deep neural networks", in Proc. of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8619-8623. DOI: 10.1109/ICASSP.2013.6639348.
[7] P. Liu, K.-K. R. Choo, L. Wang, and F. Huang, "SVM or deep learning? A comparative study on remote sensing image classification", Soft Computing, vol. 21, no. 23, pp. 7053-7065, 2017. DOI: 10.1007/s00500-016-2247-2.
[8] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, "Deep convolutional neural networks for hyperspectral image classification", Journal of Sensors, vol. 2015, 2015. DOI: 10.1155/2015/258619.
[9] A. Sonawane, M. Inamdar, and K. B. Bhangale, "Sound based human emotion recognition using MFCC & multiple SVM", in Proc. of 2017 International Conference on Information, Communication, Instrumentation and Control (ICICIC), 2017, pp. 1-4.
[10] K. Aida-zade, A. Xocayev, and S. Rustamov, "Speech recognition using Support Vector Machines", in Proc. of 2016 IEEE 10th International Conference on Application of Information and Communication Technologies (AICT), 2016, pp. 1-4.
[11] P. P. Dahake, K. Shaw, and P. Malathi, "Speaker dependent speech emotion recognition using MFCC and Support Vector Machine", in Proc. of International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT), 2016, pp. 1080-1084. DOI: 10.1109/ICACDOT.2016.7877753.
[12] M. Sinith, E. Aswathi, T. Deepa, C. Shameema, and S. Rajan, "Emotion recognition from audio signals using Support Vector Machine", in Proc. of 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2015, pp. 139-144.
[13] Y. Pan, P. Shen, and L. Shen, "Speech emotion recognition using Support Vector Machine", International Journal of Smart Home, vol. 6, no. 2, pp. 101-107, 2012. DOI: 10.5120/431-636.
[14] S. S. Shambhavi, "Emotion speech recognition using MFCC and SVM", International Journal of Engineering Research and Technology, vol. 4, no. 6, pp. 1067-1070, 2015.
[15] B. Schuller, M. Lang, and G. Rigoll, "Robust acoustic speech emotion recognition by ensembles of classifiers", in Proc. of Jahrestagung für Akustik, DAGA, 2005, vol. 31.
[16] M. Lugger, M. Janoir, and B. Yang, "Combining classifiers with diverse feature sets for robust speaker independent emotion recognition", in Proc. of 2009 17th European Signal Processing Conference, 2009, pp. 1225-1229.
[17] A. Shirani and A. R. N. Nilchi, "Speech emotion recognition based on SVM as both feature selector and classifier", International Journal of Image, Graphics & Signal Processing, vol. 8, no. 4, pp. 39-45, 2016. DOI: 10.5815/ijigsp.2016.04.05.
[18] J. Zhou, Y. Yang, P. Chen, and G. Wang, "Speech emotion recognition based on rough set and SVM", in Proc. of 5th IEEE Int. Conf. on Cognitive Informatics (ICCI'06), 2006, pp. 53-61.
[19] B. Schuller, G. Rigoll, and M. Lang, "Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine - belief network architecture", in Proc. of 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, pp. 577-580.
[20] O. Pierre-Yves, "The production and recognition of emotions in speech: Features and algorithms", International Journal of Human-Computer Studies, vol. 59, no. 1-2, pp. 157-183, 2003.
[21] C. M. Lee, S. Yildirim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S. Narayanan, "Emotion recognition based on phoneme classes", in Proc. of INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, 2004, pp. 889-892.
[22] O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, "Emotion recognition by speech signals", in Proc. of 8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, 2003, pp. 125-128.
[23] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases", Pattern Recognition, vol. 44, no. 3, pp. 572-587, 2011.
[24] S. M. Kamruzzaman, A. N. M. Rezaul Karim, M. Saiful Islam, and M. Emdadul Haque, "Speaker identification using MFCC-domain Support Vector Machine", International Journal of Electrical and Power Engineering, vol. 1, no. 3, pp. 274-278, 2007.
[25] S. Demircan and H. Kahramanlı, "Feature extraction from speech data for emotion recognition", Journal of Advances in Computer Networks, vol. 2, no. 2, pp. 28-30, 2014. DOI: 10.7763/JACN.2014.V2.76.
[26] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech", in Proc. of 9th European Conference on Speech Communication and Technology, 2005.
[27] R. McGill, J. W. Tukey, and W. A. Larsen, "Variations of box plots", The American Statistician, vol. 32, no. 1, pp. 12-16, 1978.
[28] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines", ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 27:1-27:27, 2011.
[29] K. Wojcicki, HTK MFCC MATLAB, MATLAB File Exchange.
[30] A. N. Iyer, U. O. Ofoegbu, R. E. Yantorno, and S. J. Wenndt, "Speaker recognition in adverse conditions", in Proc. of 2007 IEEE Aerospace Conference, 2007, pp. 1-8.
[31] Y. Chavhan, M. L. Dhore, and P. Yesaware, "Speech emotion recognition using Support Vector Machine", International Journal of Computer Applications, vol. 1, pp. 8-11, 2010. DOI: 10.5120/431-
... • Definition 4: The following equation can be obtained when the number of Mel filters is M and the number of discrete frequencies is K : where C [l] is the l th feature extracted using MFCCs and L is the total number of extracted features for each examined frame. In practice, L < M. According to the procedures described in [33], [41], [46], and [48], the present study set L and M to 12 and 26, respectively. 7) Delta Cepstrum Coefficient: A voice signal changes over time, similar to the slope of a formant at its transitions. ...
... The addition of acceleration features such as Delta-Delta cepstral coefficients, which are obtained through double partial differentiation with respect to time, generally leads to superior classification performance [41]. 8) Logarithmic Energy: The energy of each frame is a crucial feature that represents the variation in amplitude and provides acoustic information [33], [41], and [48]. ...
... ) Quantified Feature Vector: This study combined 13 features derived through MFCC processing with prosodic features of voice signals which are widely used for speaker recognition [32], [36], [38], [48], and [49]. The prosodic features, namely voice pitch contours and intensity, were extracted using Praat (version 6.0). ...
Full-text available
As customer satisfaction explicitly leads to repurchase behavior, the level of customer satisfaction affects both sales performance and enterprise growth. Traditionally, measuring satisfaction requires customers spending extra time to fill out a post-purchase questionnaire survey. Recently, ASR (Automatic Speech Recognition) is utilized to extract spoken words from conversation to measure customer satisfaction. However, as oriental people tend to use vague words to express emotion, the approach has its limitation. To solve the problem, this study strived to complete following tasks: devising a process to collect customer voice expressing satisfaction and corresponding verifiable ground truth; a dataset of 150 customer voices speaking in Mandarin was collected; MFCCs were extracted from the voice data as features; as the size of dataset was limited, Auto Encoder was utilized to further reduce the features of voices; models based on Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) and Support Vector Machine (SVM) were constructed to predict satisfaction. With nested cross validation, the average accuracy of LSTM and SVM could reach 71.95% and 73.97%, respectively.
... In this work, the BES (Berlin Emotional Speech) dataset [14] was selected for evaluation purposes. The BES dataset is a standard dataset frequently used by emotion classification researchers [18,27,45,58]. It contains 533 emotional speech utterances from 10 professional German actors (5 males and 5 females), covering 7 emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). ...
... Consequently, the performance of the OGA-ELM was very impressive in the four different scenarios, with accuracies of 93.26%, 100.00%, 96.14%, and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. Furthermore, the proposed OGA-ELM approach is compared with some recent works [9,16,18,27,45,55,58,60] in terms of accuracy across the four scenarios (i.e., SI, SD, GD-Male, and GD-Female). All of these methods used the BES dataset in their experiments. ...
Full-text available
Automatic Emotion Speech Recognition (ESR) is an active research field in the Human-Computer Interface (HCI) area. Typically, an ESR system consists of two main parts: the front end (feature extraction) and the back end (classification). However, most previous ESR systems have focused only on the feature extraction part and ignored the classification part, even though classification is an essential component of ESR systems, whose role is to map the features extracted from audio samples to their corresponding emotions. Moreover, most ESR systems have been evaluated under the Subject Independent (SI) scenario only. Therefore, in this paper, we focus on the back end (classification), adopting our recently developed Extreme Learning Machine (ELM), called Optimized Genetic Algorithm-Extreme Learning Machine (OGA-ELM). In addition, we use the Mel Frequency Cepstral Coefficients (MFCC) method to extract features from the speech utterances. This work demonstrates the significance of the classification part in ESR systems, improving ESR performance in terms of accuracy. The performance of the proposed model was evaluated on the Berlin Emotional Speech (BES) dataset, which covers 7 emotions (neutral, happiness, boredom, anxiety, sadness, anger, and disgust). Four evaluation scenarios were conducted: Subject Dependent (SD), SI, Gender Dependent Female (GD-Female), and Gender Dependent Male (GD-Male). The OGA-ELM was very impressive in all four scenarios, achieving accuracies of 93.26%, 100.00%, 96.14%, and 97.10% for the SI, SD, GD-Male, and GD-Female scenarios, respectively. Besides, the proposed ESR system showed fast execution times in all experiments for identifying the emotions.
... Because of people's increasing demand for smart technologies and computers' improving data-processing performance and reliability, emotion detection has become more common in human-computer interaction. Essentially, independent speech emotion detection methods have used a system to imitate human feelings, using spectral properties to match aspects like accentuation, inflection, and pausing to the desired emotions (Krishnan et al. 2021; Bhaskar and Rao 2014; Pakyurek et al. 2020). ...
Full-text available
The challenge of identifying the emotional qualities of voice, regardless of semantic meaning, is known as speech emotion recognition (SER). While people perform this task efficiently as a natural aspect of voice communication, the capacity to do so autonomously through programmed technologies is still a work in progress. As it offers perspective on human mental processes, emotion identification from speech signals is a frequently investigated topic in the construction of human-computer interface (HCI) models. In HCI, it is frequently necessary to determine a person's emotion as mental feedback. This study attempts to distinguish seven different emotions from speech signals: sadness, anger, disgust, pleasure, surprise, enjoyment, and neutral. For emotion identification, the suggested method uses a signal-preprocessing step based on a randomness measure. The signals are first normalized to reduce noise. Owing to the varying length and continuous form of voice signals, emotion identification requires both local and global information. Local features depict dynamic behavior, while feature points reveal statistical factors such as standard error, median, and minimum and maximum values. The SER system includes several features, including spectral characteristics, sound-quality characteristics, and Teager-energy-operator-based characteristics. Prosodic features are those based on human perception, such as rhythm and inflection; these characteristics rest on three factors: power, duration, and frequency response. From the processed signals, a feature vector is generated that evaluates the randomness feature for all of the emotional responses. Then, using mutual information (MI), features are selected from the entire set. The feature vectors are then categorized using the BOAT method and association rule mining.
Experiments were carried out on the TESS dataset using several metrics, and the suggested method outperformed state-of-the-art methods.
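The mutual-information-based feature selection mentioned above can be illustrated with a plug-in (empirical) estimate of discrete MI; a sketch only, since real pipelines would first discretize continuous features, which is assumed away here:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Discrete mutual information I(X;Y) in bits from paired samples,
    using plug-in probability estimates."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) written with counts: c*n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (px[x] * py[y]))
    return mi

# A feature identical to the label carries the label's full entropy:
labels = [0, 1, 0, 1, 0, 1, 0, 1]
mi_same = mutual_information(labels, labels)   # 1.0 bit for a balanced binary label
# An independent feature carries (empirically) zero information:
mi_indep = mutual_information([0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1])
```

Ranking features by such MI scores and keeping the top ones is the usual filter-style selection this estimate supports.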
... First introduced by Davis and Mermelstein (1980), MFCCs proved to be particularly promising and computationally efficient for recognizing patterns in speech (Dave 2013; Fraser et al. 2016; Gupta et al. 2018; Logan 2000). Additionally, MFCCs are increasingly used in speech emotion recognition (SER) as the basis for analyzing an individual's emotions (Kishore and Satish 2013; Lalitha et al. 2015; Pakyurek et al. 2020). ...
Full-text available
Customers' emotions play a vital role in the service industry. The better frontline personnel understand the customer, the better the service they can provide. As human emotions generate certain (unintentional) bodily reactions, such as increases in heart rate, sweating, dilation, blushing, and paling, which are measurable, artificial intelligence (AI) technologies can interpret these signals. Great progress has been made in recent years in automatically detecting basic emotions like joy, anger, etc. Complex emotions, consisting of multiple interdependent basic emotions, are more difficult to identify. One complex emotion of great interest to the service industry is particularly hard to detect: whether a customer is telling the truth or just a story. This research presents an AI method for capturing and sensing emotional data. With an accuracy of around 98%, the best trained model was able to detect whether a participant in a debating challenge was arguing for or against her/his conviction, using speech analysis. The dataset was collected in an experimental setting with 40 participants. The findings are applicable to a wide range of service processes and are specifically useful for all customer interactions that take place via telephone. The algorithm presented can be applied in any situation where it is helpful for the agent to know whether a customer is speaking to her/his conviction. This could, for example, lead to a reduction in doubtful insurance claims or untruthful statements in job interviews. This would not only reduce operational losses for service companies, but also encourage customers to be more truthful.
... It is worth mentioning that the SVM classifier was reimplemented for comparison with the proposed OGA-ELM classifier. More details about SVM can be found in [51,52]. Table 9 provides the experimental results of SVM (linear kernel) and SVM (precomputed kernel). ...
Full-text available
The coronavirus disease (COVID-19) is an ongoing global pandemic caused by severe acute respiratory syndrome. Chest Computed Tomography (CT) is an effective method for detecting lung illnesses, including COVID-19. However, CT scans are expensive and time-consuming. Therefore, this work focuses on detecting COVID-19 using chest X-ray images, which are widely available, faster, and cheaper than CT scans. Many machine learning approaches, such as deep learning, neural networks, and support vector machines, have used X-rays for detecting COVID-19. Although the performance of those approaches is acceptable in terms of accuracy, they require high computational time and more memory space. Therefore, this work employs an Optimised Genetic Algorithm-Extreme Learning Machine (OGA-ELM) with three selection criteria (i.e., random, K-tournament, and roulette wheel) to detect COVID-19 using X-ray images. The most crucial strengths of the Extreme Learning Machine (ELM) are: (i) its high capability of avoiding overfitting; (ii) its usability for binary and multi-class classification; and (iii) its ability to work as a kernel-based support vector machine with the structure of a neural network. These advantages make the ELM efficient in achieving excellent learning performance. ELMs have successfully been applied in many domains, including medical ones such as breast cancer detection, pathological brain detection, and ductal carcinoma in situ detection, but had not yet been tested on detecting COVID-19. Hence, this work aims to identify the effectiveness of employing OGA-ELM in detecting COVID-19 using chest X-ray images. To reduce the dimensionality of the histogram-of-oriented-gradients features, we use principal component analysis. The performance of OGA-ELM is evaluated on a benchmark dataset containing 188 chest X-ray images with two classes: healthy and COVID-19 infected.
The experimental results show that the OGA-ELM achieves 100.00% accuracy with fast computation time. This demonstrates that OGA-ELM is an efficient method for COVID-19 detection using chest X-ray images.
In this study, novel Spectro-Temporal Energy Ratio features based on the formants of vowels, linearly spaced low-frequency, and logarithmically spaced high-frequency parts of the human auditory system are introduced to implement single- and cross-corpus speech emotion recognition experiments. Since the underlying dynamics and characteristics of speech recognition and speech emotion recognition differ greatly, designing an emotion-recognition-specific filter bank is mandatory. The proposed features formulate a novel filter bank strategy to construct 7 trapezoidal filter banks. These novel filter banks differ from the Mel and Bark scales in shape and frequency regions and are targeted at generalizing the feature space. Cross-corpus experimentation is a step forward in speech emotion recognition, but researchers are usually chagrined at its results. Our goal is to create a feature set that is robust and resistant to cross-corpus variations using various feature selection algorithms. We prove this by shrinking the dimension of the feature space from 6984 down to 128 while boosting the accuracy using SVM, RBM, and sVGG (small-VGG) classifiers. Although RBMs are considered no longer fashionable, we show that they can do an outstanding job when tuned properly. This paper discloses a striking 90.65% accuracy rate harnessing STER features on EmoDB.
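A single trapezoidal filter of the kind described (flat top with linear ramps, unlike the triangular Mel filters) might look like this in outline. The band edges below are invented for illustration; they are not the paper's actual 7 bands.

```python
def trapezoidal_filter(num_bins, lo, ramp_lo, ramp_hi, hi):
    """One trapezoidal filter over FFT bin indices: linear rise on
    [lo, ramp_lo], flat top of 1.0 on [ramp_lo, ramp_hi], linear fall on
    [ramp_hi, hi], zero elsewhere. Band edges are hypothetical."""
    w = [0.0] * num_bins
    for k in range(num_bins):
        if lo <= k < ramp_lo:
            w[k] = (k - lo) / (ramp_lo - lo)
        elif ramp_lo <= k <= ramp_hi:
            w[k] = 1.0
        elif ramp_hi < k <= hi:
            w[k] = (hi - k) / (hi - ramp_hi)
    return w

# A bank of 7 such filters would weight the power spectrum band by band;
# here is one filter over 64 bins with an 8-bin ramp on each side:
f = trapezoidal_filter(64, 8, 16, 24, 32)
```

The flat top is the design choice that distinguishes the trapezoid: every bin inside the band contributes with full weight, rather than being tapered toward the band center as with triangular filters.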
Deep convolutional neural networks (CNNs), which are widely applied in image tasks, can also achieve excellent performance in acoustic tasks. However, activation data in convolutional neural networks are usually represented in floating-point format, which is both time-consuming and power-consuming to compute. Quantization methods can turn activation data into fixed-point form, replacing floating-point computation with faster and more energy-efficient fixed-point computation. Based on this method, this article proposes a design-space search method to quantize a binary-weight neural network. A specific accelerator is built on an FPGA platform, which has a layer-by-layer pipeline design and higher throughput and energy efficiency compared with CPUs and other hardware platforms.
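The fixed-point quantization idea can be sketched as scale-round-clamp. The bit widths below are illustrative, not the result of the paper's design-space search:

```python
def quantize(x, frac_bits=8, total_bits=16):
    """Symmetric fixed-point quantization: round x to a signed integer with
    `frac_bits` fractional bits, clamping to the representable range.
    Bit widths are hypothetical."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale)))

def dequantize(q, frac_bits=8):
    """Map the fixed-point integer back to a float."""
    return q / (1 << frac_bits)

# Round-trip error is bounded by half a quantization step (here 1/512):
x = 1.23456
xq = dequantize(quantize(x))
err = abs(x - xq)
```

Searching over (`total_bits`, `frac_bits`) per layer, subject to an accuracy budget, is the essence of a quantization design-space search.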
Full-text available
In this digital world, many applications need to secure and legitimize their data, and they do so by various techniques; there are many algorithms and methods to process such data. One extensively used method is biometric authentication, among which voice recognition stands out, since it is convenient for the user and merely requires acquiring the user's voice. However, background noise is a challenge for the Mel Frequency Cepstrum Coefficient (MFCC) recognition algorithm, which can be overcome by other tools such as smoothing filters. The main focus of this project is to investigate the feature extraction scheme.
Conference Paper
Full-text available
The purpose of a speech emotion recognition system is to classify the speaker's utterances into four emotional states, namely happy, sad, angry, and neutral. Automatic speech emotion recognition is an active research area in the field of human-computer interaction (HCI) with a wide range of applications. The features extracted in our work are mainly related to statistics of pitch and energy as well as spectral features. Selected features are fed as input to a Support Vector Machine (SVM) classifier. Two kernels, linear and Gaussian radial basis function, are tested with binary-tree, one-against-one, and one-versus-the-rest classification strategies. The proposed speaker-independent experimental protocol is tested on the Berlin emotional speech database for each gender separately and for both combined, on the SAVEE database, and on a self-made database in the Malayalam language containing samples from females only. Finally, results for different combinations of the features and on different databases are compared and explained. The highest accuracy is obtained with the feature combination of MFCC+Pitch+Energy on both the Malayalam emotional database (95.83%) and the Berlin emotional database (75%), tested with a binary tree using the linear kernel.
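The one-against-one and one-versus-the-rest strategies mentioned above reduce a multi-class decision to binary ones. A toy sketch with hypothetical decision values (not real SVM outputs):

```python
from collections import Counter

def one_vs_rest_predict(scores):
    """One-versus-the-rest: each binary scorer says how strongly the sample
    looks like its own class versus all others; pick the argmax.
    `scores` maps class name -> decision value."""
    return max(scores, key=scores.get)

def one_vs_one_predict(pairwise):
    """One-against-one voting: `pairwise` maps an unordered class pair
    (a, b) -> the winner of that binary duel; the class with the most
    pairwise wins is returned."""
    votes = Counter(pairwise.values())
    return votes.most_common(1)[0][0]

# Hypothetical decision values for the four emotional states above:
label_ovr = one_vs_rest_predict({"happy": 0.3, "sad": 1.2, "anger": -0.4, "neutral": 0.1})
label_ovo = one_vs_one_predict({
    ("happy", "sad"): "sad", ("happy", "anger"): "happy",
    ("happy", "neutral"): "happy", ("sad", "anger"): "sad",
    ("sad", "neutral"): "sad", ("anger", "neutral"): "neutral",
})
```

One-versus-rest trains K binary classifiers for K classes, while one-against-one trains K(K-1)/2 of them, which is the main cost trade-off between the two strategies.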
Full-text available
With constant advancements in remote sensing technologies resulting in higher image resolution, there is a corresponding need to mine useful data and information from remote sensing images. In this paper, we study the auto-encoder (SAE) and the support vector machine (SVM), and to examine their sensitivity, we include an additional number of training samples using an active learning framework. We then conduct a comparative evaluation. When classifying remote sensing images, SVM can perform better than SAE in some circumstances, and active learning schemes can be used to achieve high classification accuracy with both methods.
Full-text available
The aim of this paper is to utilize the Support Vector Machine (SVM) as a feature selection and classification technique for audio signals to identify human emotional states. One of the major bottlenecks of common speech emotion recognition techniques is the use of a huge number of features per utterance, which can significantly slow down the learning process and cause the problem known as "the curse of dimensionality". Consequently, to ease this challenge, this paper aims to achieve a high-accuracy system with a minimal set of features. The proposed model uses two methods, namely "SVM feature selection" and the common "Correlation-based Feature Subset Selection (CFS)", for the feature-dimension reduction part. In addition, two different classifiers, one a Support Vector Machine and the other a Neural Network, are separately adopted to identify the six emotional states of anger, disgust, fear, happiness, sadness, and neutral. The method has been verified using the Persian (Persian ESD) and German (EMO-DB) emotional speech databases, yielding high recognition rates on both. The results show that the SVM feature selection method provides better emotional speech-recognition performance than CFS and the baseline feature set. Moreover, the new system is able to achieve a recognition rate of 99.44% on the Persian ESD and 87.21% on the Berlin Emotion Database for speaker-dependent classification. Besides, a promising result (76.12%) is obtained for the speaker-independent classification case, which is among the best-known accuracies reported on the mentioned database relative to its small number of features.
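The univariate core of correlation-based selection can be sketched as ranking features by the absolute value of their correlation with the label. This is a simplification: full CFS also penalizes feature-feature redundancy, which is omitted here, and the feature values below are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(features, labels):
    """Rank features by |correlation with the label|, best first."""
    scored = [(abs(pearson(col, labels)), name) for name, col in features.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical feature columns over four utterances:
features = {
    "pitch":  [1.0, 2.0, 3.0, 4.0],   # perfectly tracks the label
    "energy": [1.0, 1.0, 2.0, 1.0],   # weakly related
}
order = rank_features(features, [0.0, 1.0, 2.0, 3.0])
```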
Full-text available
In recent years, work requiring human-machine interaction, such as speech recognition and emotion recognition from speech, has been increasing. Beyond speech recognition itself, features of the conversation such as melody, emotion, pitch, and emphasis are also studied. Research has proven that meaningful results can be reached using prosodic features of speech. In this paper, we perform the preprocessing necessary for emotion recognition from speech data and extract features from the speech signal. To recognize emotion, Mel Frequency Cepstral Coefficients (MFCC) are extracted from the signals, and classification is performed with the k-NN algorithm.
Full-text available
Recently, convolutional neural networks have demonstrated excellent performance on various visual tasks, including the classification of common two-dimensional images. In this paper, deep convolutional neural networks are employed to classify hyperspectral images directly in the spectral domain. More specifically, the architecture of the proposed classifier contains five layers with weights: the input layer, the convolutional layer, the max pooling layer, the fully connected layer, and the output layer. These five layers are applied to each spectral signature to discriminate it from others. Experimental results based on several hyperspectral image datasets demonstrate that the proposed method can achieve better classification performance than some traditional methods, such as support vector machines and conventional deep-learning-based methods.
Conference Paper
Speech emotion recognition plays a pivotal part in human-computer interaction research. The human emotions considered here are angry, happy, sad, disgust, and neutral. In this paper, the features are extracted with a hybrid of pitch, formants, zero crossing, MFCC, and their statistical parameters. Pitch detection is done by a cepstral algorithm, after comparing it with autocorrelation and AMDF. The training and testing of the SVM classifier are compared with different kernel functions: linear, polynomial, quadratic, and RBF. The Polish database is used for the classification. The comparison between the different kernels is obtained for the corresponding feature vector.
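Pitch detection by autocorrelation, one of the methods compared above, can be sketched as picking the lag with the highest autocorrelation inside a plausible pitch range. This is a bare illustration; real detectors add windowing, voicing thresholds, and sub-sample interpolation.

```python
import math

def autocorr_pitch(signal, sr, f_lo=80.0, f_hi=400.0):
    """Estimate f0 as sr / best_lag, where best_lag maximizes the
    autocorrelation over lags corresponding to [f_lo, f_hi] Hz.
    The pitch range limits are assumptions."""
    lag_min = int(sr / f_hi)
    lag_max = int(sr / f_lo)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        r = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if r > best_val:
            best_val, best_lag = r, lag
    return sr / best_lag

# A pure 200 Hz tone at 8 kHz has a period of exactly 40 samples:
sr = 8000
tone = [math.sin(2 * math.pi * 200.0 * t / sr) for t in range(800)]
f0 = autocorr_pitch(tone, sr)   # 200.0 Hz
```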
Speech Emotion Recognition (SER) is a hot research topic in the field of Human-Computer Interaction (HCI). In this paper, we recognize three emotional states: happy, sad, and neutral. The explored features include energy, pitch, linear prediction cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), and mel-energy spectrum dynamic coefficients (MEDC). A German corpus (Berlin Database of Emotional Speech) and our self-built Chinese emotional database are used for training the Support Vector Machine (SVM) classifier. Finally, results for different combinations of the features and on different databases are compared and explained. The overall experimental results reveal that the feature combination of MFCC+MEDC+Energy has the highest accuracy rate on both the Chinese emotional database (91.3%) and the Berlin emotional database (95.1%).