Conference Paper

Emotion recognition in Arabic speech


Abstract

The general objective of this paper is to build a system that automatically recognizes emotion in speech. The linguistic material used is a corpus of phonetically balanced Arabic expressive sentences. Speaker dependence is a recurring problem in this field, and in this work we study its influence on our results. The targeted emotions are joy, sadness, anger and neutral. After an analytical study of a large number of acoustic speech parameters, we chose the cepstral parameters, their first and second derivatives, the shimmer, the jitter and the duration of the sentence. A classifier based on a multilayer perceptron neural network was developed to recognize emotion from the chosen feature vector. The recognition rate reaches more than 98% for intra-speaker classification but only 54.75% for inter-speaker classification, which clearly shows the system's dependence on the speaker.
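For illustration, a minimal sketch of the kind of pipeline the abstract describes is shown below, assuming librosa and scikit-learn: utterance-level statistics of the MFCCs and their first and second derivatives are concatenated with the sentence duration and fed to a multilayer perceptron. The feature dimensions, network size and mean/standard-deviation pooling are our own assumptions, not the authors' implementation; the jitter and shimmer measures would be appended to the same vector.

```python
# Sketch of an utterance-level feature vector (MFCCs + deltas + duration)
# fed to a multilayer perceptron, in the spirit of the abstract.
# Assumes librosa and scikit-learn; all hyperparameters are illustrative.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def utterance_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    d1 = librosa.feature.delta(mfcc)                          # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)                 # second derivative
    duration = len(y) / sr                                    # sentence duration (s)
    # Collapse frame-level coefficients to utterance-level means and stds.
    stats = np.concatenate([m.mean(axis=1) for m in (mfcc, d1, d2)] +
                           [m.std(axis=1) for m in (mfcc, d1, d2)])
    return np.append(stats, duration)

def train(paths, labels):
    # paths: list of wav files, labels: emotions (joy, sadness, anger, neutral)
    X = np.vstack([utterance_features(p) for p in paths])
    clf = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000))
    clf.fit(X, labels)
    return clf
```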


... The variation of the fundamental frequency over an utterance traces its fundamental frequency contour, whose statistical properties can be used as features (Akçay & Oğuz, 2020). Tajalsir et al. (2022), Hadjadji et al. (2019), and Poorna and Nair (2019) used this category of vocal characteristics in their works. ...
... Jitter is a frequency instability metric, whereas shimmer is an amplitude instability metric (Akçay & Oğuz, 2020). Several works exploited voice quality features, such as Poorna and Nair (2019) and Hadjadji et al. (2019). ...
... SMO also achieved the highest improved accuracy of 95.95% when the ensemble models were applied to 19 single classifiers. In addition, Hadjadji et al. (2019) used the shimmer, the jitter, and the sentence duration as features. A classifier based on a multilayer perceptron neural network (MLP) classifies emotions from the resulting feature vector. ...
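The excerpts above mention fundamental-frequency contour statistics, jitter (cycle-to-cycle frequency instability) and shimmer (cycle-to-cycle amplitude instability). A rough sketch of how such measures can be approximated from frame-level estimates with librosa follows; dedicated tools such as Praat compute jitter and shimmer from individual glottal periods, so these frame-based values are only an approximation.

```python
import numpy as np
import librosa

def prosodic_features(y, sr):
    # Frame-level fundamental frequency via probabilistic YIN.
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                                 fmax=librosa.note_to_hz('C7'), sr=sr)
    f0 = f0[voiced]                      # keep voiced frames only
    periods = 1.0 / f0                   # approximate pitch periods
    # Jitter: relative mean absolute difference of consecutive periods.
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    # Shimmer: same idea applied to a frame-level amplitude envelope.
    rms = librosa.feature.rms(y=y)[0]
    shimmer = np.mean(np.abs(np.diff(rms))) / np.mean(rms)
    # Simple f0-contour statistics often used as prosodic features.
    contour = np.array([f0.mean(), f0.std(), f0.min(), f0.max(),
                        f0.max() - f0.min()])
    return np.concatenate([contour, [jitter, shimmer]])
```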
Article
Full-text available
Emotions represent a fundamental aspect when evaluating user satisfaction or collecting customer feedback in human interactions, as well as in the realm of human–computer interface (HCI) technologies. Moreover, as human beings, we possess a distinctive capacity for communication through spoken language. Recently, the realm of Speech Emotion Recognition (SER) has garnered substantial interest and gained significant traction within the domain of Natural Language Processing (NLP). Its primary objective remains the identification of various emotions, such as sadness, neutrality, and anger, from audio speech using a diverse array of classifiers. This paper conducts a comprehensive critical analysis of the existing Arabic SER studies. Furthermore, this research delves into the performance and constraints associated with these previous works. It also sheds light on the current promising trends aimed at enhancing methods for recognizing emotions in speech. To the best of our knowledge, this research stands as a pioneering contribution to the SER field, particularly in the context of reviewing existing Arabic studies.
... While these tasks are interrelated (emotion can influence speech patterns differently for males and females, and gender-specific vocal characteristics can impact emotion perception), most existing approaches fail to exploit these relationships [6]. Furthermore, the scarcity of high-quality Arabic-language datasets limits progress in this field, leaving critical gaps in developing culturally and linguistically diverse solutions [7]. ...
... The study extracts Mel Frequency Cepstral Coefficients (MFCCs) to capture speech spectral properties for emotion and gender recognition. While features like pitch and formants offer potential benefits, our previous work [7] showed minimal improvement with higher computational costs. Spectrograms and wavelet transforms also performed poorly with LSTMs, making MFCCs the best balance between performance and efficiency. ...
Conference Paper
This paper proposes a multitask learning framework for simultaneous emotion and gender recognition from Arabic speech. Leveraging a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture, the model captures both spectral and temporal characteristics of speech. By using Mel Frequency Cepstral Coefficients (MFCCs) as features, the model achieves an emotion classification accuracy of 86.05% and a gender classification accuracy of 97.80%. This approach provides insights into the dynamic interplay between gender and emotion in speech, highlighting how gender influences emotion recognition and vice versa. The results suggest that combining CNN-LSTM architectures for emotion and gender classification can improve performance over traditional methods, particularly for languages with underrepresented datasets like Arabic. The paper contributes a unique dataset and demonstrates the effectiveness of multitask learning in addressing the challenges of emotion and gender recognition in speech.
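A compact sketch of the kind of multitask CNN-LSTM described above is given below, written with tf.keras; the layer sizes, the MFCC input shape and the two output heads are illustrative assumptions rather than the paper's exact architecture.

```python
# Multitask CNN-LSTM over MFCC sequences with separate emotion and gender heads.
# Written with tf.keras; shapes and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_model(n_frames=200, n_mfcc=40, n_emotions=4):
    inp = layers.Input(shape=(n_frames, n_mfcc))           # (time, MFCC) sequence
    x = layers.Conv1D(64, 5, padding='same', activation='relu')(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, 5, padding='same', activation='relu')(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.LSTM(128)(x)                                 # temporal summary
    emotion = layers.Dense(n_emotions, activation='softmax', name='emotion')(x)
    gender = layers.Dense(1, activation='sigmoid', name='gender')(x)
    model = Model(inp, [emotion, gender])
    model.compile(optimizer='adam',
                  loss={'emotion': 'sparse_categorical_crossentropy',
                        'gender': 'binary_crossentropy'},
                  metrics={'emotion': 'accuracy', 'gender': 'accuracy'})
    return model
```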
... Commonly used features in SER include prosodic features such as pitch, intensity, and duration, which capture variations in speech rhythm, loudness, and frequency [17]. Formant frequencies, which are resonant frequencies in the vocal tract, also provide valuable insight into emotional states. ...
Conference Paper
Speech Emotion Recognition (SER) has garnered increasing attention in recent years due to its wide-ranging applications in human-computer interaction, affective computing, and healthcare. Despite significant progress, developing robust SER systems for Arabic speech remains a challenge, primarily due to the linguistic diversity and distinct acoustic characteristics of the Arabic language. This paper presents a comprehensive review of recent advancements in Arabic SER, emphasizing key aspects such as datasets, feature extraction methods, classification techniques, and emotion detection strategies specific to Arabic speech. We address the challenges associated with the scarcity of large-scale Arabic emotional speech datasets, highlighting notable contributions like ArabEmo, QCRI-ARABIC, and Tashkeela that aim to bridge this gap. Additionally, we analyze the role of prosodic and spectral features, including pitch, energy, duration, and MFCCs, and the increasing adoption of deep learning approaches, such as CNNs and RNNs, in enhancing emotion classification performance. The paper also explores the impact of Arabic dialectal variations and advocates for the development of dialect-specific models to improve recognition accuracy. Lastly, we discuss future directions for Arabic SER, underscoring the critical need for diverse and representative datasets, as well as the integration of advanced machine learning techniques, to achieve robust and scalable performance across various Arabic dialects.
... It has been used in several studies and has provided interesting recognition rates [33,35]. MSA was first evaluated in [42]; in order to improve the recognition rate, we performed a second study using ...
Article
Full-text available
The present study focuses on the evaluation of the degradation of emotional expression in speech transmitted over a wireless telephone network. Two assessment approaches were used: an objective one, deploying convolutional neural networks (CNNs) fed with spectrograms on three scales (Linear, Logarithmic, Mel), and a subjective method grounded in human perception. The study gathered expressive phrases in two different languages, from novice Arabic speakers and proficient German speakers. These utterances were transmitted over a real 4G network, which is a rarity, as the usual focus lies on bandwidth (BW) reduction or compression. Our innovation lies in utilizing the complete 4G infrastructure, accounting for all possible impairments. The results indeed reveal a significant impact of transmission over the real 4G network on emotion recognition. Prior to transmission, the highest recognition rates, measured by the objective method using the Mel frequency scale, were 76% for Arabic and 91% for German. After transmission, these rates decreased significantly, reaching 70% for Arabic and 82% for German (a degradation of 6% and 9%, respectively). As for the subjective method, the recognition rates were 75% for Arabic and 70% for German before transmission and dropped to 67% for Arabic and 68% for German after transmission (a degradation of 8% and 2%). Our results were also compared to those found in the literature using the same database.
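The three spectrogram scales mentioned in the abstract can be produced with librosa along the lines of the sketch below; the constant-Q transform is used here as a stand-in for a logarithmic frequency axis, which is our assumption rather than the study's exact front end.

```python
import numpy as np
import librosa

def spectrogram_inputs(y, sr):
    # Linear-frequency spectrogram: magnitude of the STFT, in dB.
    linear = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    # Log-frequency spectrogram: approximated here with a constant-Q transform.
    log_freq = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr)), ref=np.max)
    # Mel spectrogram, the scale that gave the best rates in the study above.
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr),
                              ref=np.max)
    # Each array can be resized/normalized and used as a single-channel CNN input.
    return linear, log_freq, mel
```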
... In total, we have a feature vector with 38 elements. This method was developed in a previous work [17]; it provides a competitive result for German (an accuracy of 80%) but a lower one for MSA (an accuracy of 47.41%). ...
Conference Paper
Abstract: This work contributes to the enhancement of the speaker-independent emotion recognition rate for the Arabic language, which is under-resourced in this field. The problem is still relevant, especially for databases created by non-professional speakers (the emotion produced by professionals is standardized, whereas that produced by naive speakers is closer to spontaneous emotion, because emotion depends very much on the speaker). For this reason, a comparative study was carried out between two databases: one in Modern Standard Arabic (MSA) recorded by non-professional speakers (students), and the other in German simulated by actors. The study used statistical methods (K-nearest neighbors, Support Vector Machine, ExtraTrees, Random Forest, Gradient Boosting) as well as neural network approaches, namely the Convolutional Neural Network and the Multi-Layer Perceptron, to identify four emotions (Anger, Happiness, Neutral and Sadness). The latter showed a clear improvement in accuracy, from 47.4% to 64%, for MSA. Index Terms: Emotion recognition, Spectral features, Acoustic features, CNN, KNN, SVM, ExtraTrees, Random Forest, Gradient Boosting.
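A sketch of this kind of classical-classifier comparison with scikit-learn is shown below, assuming a precomputed feature matrix X and label vector y; the models use default hyperparameters, not the settings of the paper.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier

def compare_classifiers(X, y, cv=5):
    # X: (n_utterances, n_features) acoustic/spectral features, y: emotion labels.
    models = {
        'KNN': KNeighborsClassifier(),
        'SVM': SVC(),
        'ExtraTrees': ExtraTreesClassifier(),
        'RandomForest': RandomForestClassifier(),
        'GradientBoosting': GradientBoostingClassifier(),
        'MLP': MLPClassifier(max_iter=1000),
    }
    return {name: cross_val_score(m, X, y, cv=cv).mean()
            for name, m in models.items()}
```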
Article
Full-text available
Automatic Speech Emotion Recognition (ASER) has recently garnered attention across various fields including artificial intelligence, pattern recognition, and human–computer interaction. However, ASER encounters numerous challenges such as a shortage of diverse datasets, appropriate feature selection, and suitable intelligent recognition techniques. To address these challenges, a systematic literature review (SLR) was conducted following established guidelines. A total of 60 primary research papers spanning from 2011 to 2023 were reviewed to investigate, interpret, and analyze the related literature by addressing five key research questions. Despite being an emerging area with applications in real-life scenarios, ASER still grapples with limitations in existing techniques. This SLR provides a comprehensive overview of existing techniques, datasets, and feature extraction tools in the ASER domain, shedding light on the weaknesses of current research studies. Additionally, it outlines a list of limitations for consideration in future work.
Article
Full-text available
The existence of a mapping between emotions and speech prosody is commonly assumed. We propose a Bayesian modelling framework to analyse this mapping. Our models are fitted to a large collection of intended emotional prosody, yielding more than 3,000 minutes of recordings. Our descriptive study reveals that the mapping within corpora is relatively constant, whereas the mapping varies across corpora. To account for this heterogeneity, we fit a series of increasingly complex models. Model comparison reveals that models taking into account mapping differences across countries, languages, sexes and individuals outperform models that only assume a global mapping. Further analysis shows that differences across individuals, cultures and sexes contribute more to the model prediction than a shared global mapping. Our models, which can be explored in an online interactive visualization, offer a description of the mapping between acoustic features and emotions in prosody. A mapping between emotions and speech prosody is commonly assumed. This study shows, using Bayesian modelling, that differences across individuals, cultures and sexes contribute more to the model prediction than a shared global mapping.
Article
Full-text available
Recognizing emotions from speech using machine learning algorithms has recently become an active research topic as a result of the demand for more interactive human-centered applications. Emotion recognition systems are mostly implemented for German, English, Spanish, Dutch, Danish, and other European and Asian languages due to the availability of datasets for these languages. For Arabic, however, the number of available speech emotion datasets is extremely limited. Therefore, this paper studies emotion recognition based on spoken data in the Saudi Arabic dialect. The dataset was created from freely available YouTube videos and labeled using four perceived emotions: anger, happiness, sadness, and neutral. Various spectral features, such as the mel-frequency cepstral coefficients (MFCCs) and the mel spectrogram, were extracted, and then the classification methods support vector machine (SVM), multi-layer perceptron (MLP), and k-nearest neighbor (KNN) were applied. The results were discussed, analyzed, and compared across the three models using different feature extractions. Experiments showed that SVM obtained the best accuracy, 77.14%, demonstrating an improvement in Arabic speech emotion recognition for this classification method.
Article
Full-text available
The overall recognition rate decreases as emotional confusion increases in multi-class speech emotion recognition. To address this problem, we propose a speech emotion recognition method based on a decision tree support vector machine (SVM) model with Fisher feature selection. At the feature selection stage, the Fisher criterion is used to filter out the feature parameters with higher discriminative ability. At the emotion classification stage, an algorithm is proposed to determine the structure of the decision tree. The decision tree SVM performs a two-step classification: a first rough classification followed by a fine classification. Thus redundant parameters are eliminated and emotion recognition performance is improved. In this method, the decision tree SVM framework is first established by calculating the confusion degree between emotions, and then the features with higher discriminative ability are selected for each SVM of the decision tree according to the Fisher criterion. Finally, speech emotion recognition is performed with this model. Decision tree SVMs with Fisher feature selection were built on the CASIA Chinese emotional speech corpus and the Berlin speech corpus to validate the effectiveness of the framework. The experimental results show that the average emotion recognition rate of the proposed method is 9% higher than the traditional SVM classification method on CASIA, and 8.26% higher on the Berlin speech corpus. This verifies that the proposed method can effectively reduce emotional confusion and improve the emotion recognition rate.
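As a rough illustration of the two ingredients above, the sketch below computes per-feature Fisher scores and builds a two-step SVM tree with scikit-learn, assuming X is a feature matrix and y a NumPy array of emotion labels; the coarse grouping into high- and low-arousal emotions is our own illustrative choice, whereas the paper derives the tree structure from measured emotion confusion and selects features separately for each node.

```python
import numpy as np
from sklearn.svm import SVC

def fisher_scores(X, y):
    # Fisher criterion: between-class scatter over within-class scatter, per feature.
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

def fit_tree_svm(X, y, k=20):
    # Two-step "decision tree" of SVMs: a coarse split followed by fine SVMs.
    # The high-/low-arousal grouping below is illustrative only.
    top = np.argsort(fisher_scores(X, y))[::-1][:k]       # top-k Fisher features
    high = np.isin(y, ['anger', 'joy'])
    coarse = SVC().fit(X[:, top], high.astype(int))
    fine_hi = SVC().fit(X[high][:, top], y[high])
    fine_lo = SVC().fit(X[~high][:, top], y[~high])
    return top, coarse, fine_hi, fine_lo
```

At prediction time, each utterance would first pass through the coarse SVM and then through the fine SVM of the selected branch.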
Article
Full-text available
In this article we present the methodology employed for the design and evaluation of a Basic Arabic Expressive Speech corpus (BAES-DB). The corpus, which has a total length of approximately 150 minutes, consists of 13 speakers uttering a set of 10 sentences while simulating 3 emotions (joy, anger and sadness) in addition to a neutral utterance. The 10 sentences were selected, from a corpus of sentences proposed in Boudraa [1], to meet criteria of phonetic balance and absence of emotional content. The corpus was evaluated through various guided categorization tests performed through a website. The overall recognition rate is currently 83.03%, with sadness being the best recognized type of expressive speech (90.93%) and joy the least well recognized (73.87%).
Article
Full-text available
Recently, studies have been performed on harmony features for speech emotion recognition. It is found in our study that the first- and second-order differences of harmony features also play an important role in speech emotion recognition. Therefore, we propose a new Fourier parameter model using the perceptual content of voice quality and the first- and second-order differences for speaker-independent speech emotion recognition. Experimental results show that the proposed Fourier parameter (FP) features are effective in identifying various emotional states in speech signals. They improve the recognition rates over the methods using Mel frequency cepstral coefficient (MFCC) features by 16.2, 6.8 and 16.6 points on the German database (EMODB), Chinese language database (CASIA) and Chinese elderly emotion database (EESDB). In particular, when combining FP with MFCC, the recognition rates can be further improved on the aforementioned databases by 17.5, 10 and 10.5 points, respectively.
Conference Paper
Full-text available
In this paper, we evaluate the use of appended jitter and shimmer speech features for the classification of human speaking styles and of animal vocalization arousal levels. Jitter and shimmer features are extracted from the fundamental frequency contour and added to baseline spectral features, specifically Mel-frequency cepstral coefficients (MFCCs) for human speech and Greenwood function cepstral coefficients (GFCCs) for animal vocalizations. Hidden Markov models (HMMs) with Gaussian mixture model (GMM) state distributions are used for classification. The appended jitter and shimmer features result in an increase in classification accuracy for several illustrative datasets, including the SUSAS dataset for human speaking styles as well as vocalizations labeled by arousal level for African elephant and Rhesus monkey species.
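A minimal sketch of the "appended features" idea follows, assuming librosa and hmmlearn: per-utterance jitter and shimmer values (for example computed as in the earlier prosodic sketch) are broadcast onto the frame-level MFCC matrix, and one GMM-HMM is trained per class; the model sizes are illustrative.

```python
import numpy as np
import librosa
from hmmlearn.hmm import GMMHMM

def appended_frames(y, sr, jitter, shimmer, n_mfcc=13):
    # Frame-level MFCCs with the per-utterance jitter/shimmer values appended.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)
    extra = np.tile([jitter, shimmer], (len(mfcc), 1))
    return np.hstack([mfcc, extra])

def train_class_models(sequences_by_class):
    # sequences_by_class: {label: [frame matrices from appended_frames]}
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        models[label] = GMMHMM(n_components=3, n_mix=2, n_iter=20).fit(X, lengths)
    return models

def classify(models, frames):
    # Pick the class whose GMM-HMM gives the highest log-likelihood.
    return max(models, key=lambda label: models[label].score(frames))
```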
Article
Full-text available
In emotion classification of speech signals, the popular features employed are statistics of the fundamental frequency, energy contour, duration of silence and voice quality. However, the performance of systems employing these features degrades substantially when more than two categories of emotion are to be classified. In this paper, a text-independent method of speech emotion classification is proposed. The proposed method uses short-time log frequency power coefficients (LFPC) to represent the speech signals and a discrete hidden Markov model (HMM) as the classifier. The emotions are classified into six categories: the archetypal emotions of Anger, Disgust, Fear, Joy, Sadness and Surprise. A database consisting of 60 emotional utterances from each of twelve speakers is constructed and used to train and test the proposed system. The performance of the LFPC feature parameters is compared with that of the linear prediction cepstral coefficients (LPCC) and mel-frequency cepstral coefficients (MFCC) feature parameters commonly used in speech recognition systems. Results show that the proposed system yields an average accuracy of 78% and a best accuracy of 96% in the classification of the six emotions, well above the 17% chance level for a six-category task. Results also reveal that LFPC is a better choice of feature parameters for emotion classification than the traditional feature parameters.
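Log frequency power coefficients can be approximated as the per-frame power in a bank of logarithmically spaced frequency bands, expressed in dB; the sketch below, using librosa and NumPy, makes assumptions about the band count and spacing and is not the paper's exact parameterization.

```python
import numpy as np
import librosa

def lfpc(y, sr, n_bands=12, fmin=100.0):
    # Power spectrogram from the STFT.
    S = np.abs(librosa.stft(y)) ** 2
    freqs = librosa.fft_frequencies(sr=sr)
    # Logarithmically spaced band edges between fmin and the Nyquist frequency.
    edges = np.geomspace(fmin, sr / 2, n_bands + 1)
    coeffs = np.zeros((n_bands, S.shape[1]))
    for b in range(n_bands):
        band = (freqs >= edges[b]) & (freqs < edges[b + 1])
        coeffs[b] = 10.0 * np.log10(S[band].sum(axis=0) + 1e-10)  # band power in dB
    return coeffs                                                 # (n_bands, frames)
```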
Chapter
Ensemble classification models have been widely used in machine learning to enhance the performance of single classifiers. In this paper, we study the effect of employing five ensemble models, namely Bagging, Adaboost, Logitboost, Random Subspace and Random Committee, on a vocal emotion recognition system. The system recognizes happy, angry, and surprised emotions from natural Arabic speech, where the highest accuracy among single classifiers, 95.52%, is obtained by SMO. After applying the ensemble models to 19 single classifiers, the best enhanced accuracy is 95.95%, achieved by SMO as well. The highest improvement in accuracy, 19.09%, was achieved by the Boosting technique with Naïve Bayes Multinomial as the base classifier.
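A scikit-learn sketch of wrapping base learners in such ensemble meta-models is given below; it stands in for the Weka setup of the chapter, uses default decision-tree base learners instead of SMO or Naïve Bayes Multinomial, and configures Random Subspace as bagging over feature subsets.

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

def ensemble_accuracies(X, y, cv=10):
    # X: feature matrix, y: emotion labels; default tree base learners throughout.
    ensembles = {
        'Bagging': BaggingClassifier(n_estimators=50),
        'AdaBoost': AdaBoostClassifier(n_estimators=50),
        # Random Subspace: sample feature subsets rather than bootstrap samples.
        'RandomSubspace': BaggingClassifier(n_estimators=50, max_features=0.5,
                                            bootstrap=False,
                                            bootstrap_features=True),
    }
    return {name: cross_val_score(m, X, y, cv=cv).mean()
            for name, m in ensembles.items()}
```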
Conference Paper
The majority of Speech Emotion Recognition (SER) results refer to full-band uncompressed speech signals. Potential applications of SER on various types of speech platforms pose important questions about the effects of bandwidth limitations and compression techniques used by speech communication systems on the accuracy of SER. The current study provides answers to these questions based on SER experiments with band-limited as well as compressed speech. Compression techniques included the AMR, AMR-WB, AMR-WB+ and mp3 methods. The modelling and classification of speech emotions was achieved using a benchmark approach based on a GMM classifier and speech features including MFCCs, TEO and glottal time- and frequency-domain parameters. The tests used the Berlin Emotional Speech database with speech signals sampled at 16 kHz. The results indicated that the low-frequency components (0–1 kHz) of speech, as well as the high-frequency components (above 4 kHz), play an important role in SER. The mp3 compression worked better with the MFCC features than with the TEO and glottal parameters. AMR-WB and AMR-WB+ outperformed AMR.
Article
Emotion recognition from speech has developed into a recent research area in Human-Computer Interaction. The objective of this paper is to use a 3-stage Support Vector Machine classifier to classify the seven different emotions present in the Berlin Emotional Database. For the purpose of classification, MFCC features are extracted from all 535 files present in the database. Nine statistical measurements are computed over these features from each frame of a sentence. Linear and RBF kernels are employed in the hierarchical SVM, with the RBF sigma value equal to one. For training and testing, 10-fold cross-validation is used. Performance analysis is done using the confusion matrix, and the accuracy obtained is 68%.
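A sketch of the "statistics over frame-level MFCCs" representation with an RBF-kernel SVM and 10-fold cross-validation follows, using librosa, SciPy and scikit-learn; the nine statistics chosen here are a plausible set rather than necessarily the paper's exact nine, and scikit-learn's gamma parameter is only a loose analogue of the RBF sigma.

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis, iqr
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def mfcc_statistics(y, sr, n_mfcc=13):
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, frames)
    # Nine statistical measurements per coefficient (illustrative choice).
    stats = [m.mean(axis=1), m.std(axis=1), m.min(axis=1), m.max(axis=1),
             np.median(m, axis=1), skew(m, axis=1), kurtosis(m, axis=1),
             np.ptp(m, axis=1), iqr(m, axis=1)]
    return np.concatenate(stats)                                 # 9 * n_mfcc values

def evaluate(X, y):
    # RBF SVM with 10-fold cross-validation on the utterance-level statistics.
    return cross_val_score(SVC(kernel='rbf', gamma=1.0), X, y, cv=10).mean()
```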
Article
This paper presents the principal phase of extraction and recognition of basic emotions in Arabic speech, applied to five emotional states: neutral, sadness, fear, anger and happiness. The emotional speech database REGIM_TES was created and evaluated to support all the practical extraction experiments; the database, described in section 3, was recorded and processed in this vein. The descriptors selected in our study are the pitch of the voice, energy, MFCCs, formants, LPC and the spectrogram. The descriptors showed the importance of the Arabic language with respect to physiological events and the influence of culture on emotional behavior. The results obtained in this work show that pooling together features extracted at different sites indeed improves classification performance. A comparative study of kernel functions led us to adopt the multiclass RBF-kernel SVM classifier for the classification phase.
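The descriptor set listed above can be pooled into a single vector along the lines of the sketch below, which uses librosa to estimate pitch, energy, MFCCs, LPC coefficients and LPC-derived formant candidates before a multiclass RBF SVM; the orders and sizes are illustrative assumptions and the spectrogram descriptor is omitted.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def descriptors(y, sr, n_mfcc=13, lpc_order=12):
    # Mean pitch over voiced frames (probabilistic YIN) and mean frame energy.
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    pitch = np.nanmean(f0)
    energy = float(np.mean(librosa.feature.rms(y=y)))
    # Utterance-level mean MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    # LPC coefficients and rough formant candidates from the LPC polynomial roots.
    a = librosa.lpc(y, order=lpc_order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                          # upper half-plane
    formants = np.sort(np.angle(roots) * sr / (2 * np.pi))[:3] # lowest 3 (Hz)
    return np.concatenate([[pitch, energy], mfcc, a[1:], formants])

# clf = SVC(kernel='rbf').fit(X, labels)   # one-vs-one multiclass by default
```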
Conference Paper
A Philippine LID system had not previously been created because of the limited amount of recorded speech data. This research initiates LID research using the Philippine Language Database (PLD) collected by the Digital Signal Processing Laboratory of the University of the Philippines Diliman (DSP-UPD). Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), Shifted Delta Cepstra (SDC) and Linear Predictive Cepstral Coefficients (LPCC) features are extracted from the speech segments. A Gaussian Mixture Model (GMM) approach using Expectation Maximization (EM) and a Universal Background Model (UBM) is used to model the acoustic characteristics of each language. The Maximum a Posteriori (MAP) probability is then used to determine the language of a speech utterance based on the language GMMs. PLP with a 16-mixture GMM-EM was found to produce the best performance among the four feature vectors in discriminating the languages.
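A condensed sketch of the GMM-based decision described above is given below, with scikit-learn standing in for the original implementation: one GMM is trained per language on frame-level features, and an utterance is assigned to the language maximizing the log-likelihood plus a log prior; the UBM and MAP-adaptation steps of the original system are omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_language_gmms(frames_by_language, n_components=16):
    # frames_by_language: {language: [frame-level feature matrices]}
    return {lang: GaussianMixture(n_components=n_components,
                                  covariance_type='diag').fit(np.vstack(frames))
            for lang, frames in frames_by_language.items()}

def map_decision(gmms, frames, log_priors):
    # score_samples returns per-frame log-likelihoods; sum over the utterance
    # and add the language log prior before taking the argmax.
    scores = {lang: g.score_samples(frames).sum() + log_priors.get(lang, 0.0)
              for lang, g in gmms.items()}
    return max(scores, key=scores.get)
```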
Article
This paper presents twenty lists of ten phonetically balanced Arabic sentences. This reference corpus fills a bibliographic void encountered when assessing objective and subjective speech quality in Arabic (quality of coders and speech synthesis, assessment of pitch detectors, characterization of Arabic sounds, psycholinguistic tests, etc.). The corpus is based on Combescure's work on French sentences [1]. Statistical studies by Moussa [2] and Mrayati [3] on Arabic roots and Arabic words are used. The lists account for all possible CV syllables and respect their occurrence frequencies in Arabic.
C. Clavel, État de l'art en reconnaissance des émotions dans la parole (thesis).
A. Watile, V. Alagdeve, S. Jain, Emotion Recognition in Speech by MFCC and SVM, International Journal of Science, Engineering and Technology Research (IJSETR), Vol. 6, Issue 3, March 2017.
M. Tahon, Les descripteurs audio pour la parole, Master ATAL, Traitement de la parole, 7 November 2017.
R. W. Picard, Affective Computing: Challenges, MIT Media Laboratory, Cambridge.
S. Klaylat, Z. Osman, L. Hamandi, R. Zantout, R. Hariri, Enhancement of an Arabic Speech Emotion Recognition System, International Journal of Applied Engineering Research, ISSN 0973-4562, Vol. 13, No. 5, pp. 2380-2389, 2018.
Meddeb et al., Automated Extraction of Features from Arabic Emotional Speech Corpus, International Journal of Computer Information Systems and Industrial Management Applications.