Conference Paper

Abstract

Emotion state tracking is an important aspect of human-computer and human-robot interaction. It is important to design task-specific emotion recognition systems for real-world applications. In this work, we propose a hierarchical structure loosely motivated by Appraisal Theory for emotion recognition. The levels in the hierarchical structure are carefully designed to place the easier classification task at the top level and delay the decision between highly ambiguous classes to the end. The proposed structure maps an input utterance into one of five emotion classes through subsequent layers of binary classifications. We obtain a balanced recall on each of the individual emotion classes using this hierarchical structure. The performance measure of the average unweighted recall percentage on the evaluation data set improves by 3.3% absolute (8.8% relative) over the baseline model.
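The full structure is not available here, but the layered binary routing described in the abstract can be illustrated with a minimal sketch. Everything specific below — the split ordering (neutral first, the ambiguous negative pair last), the class names, and the use of SVMs at each node — is an assumption for illustration rather than the authors' actual design, and only four of the five classes are shown for brevity.

# Hypothetical hierarchical emotion classifier built from binary stages.
import numpy as np
from sklearn.svm import SVC

class HierarchicalEmotionClassifier:
    def __init__(self):
        self.level1 = SVC()  # neutral vs. emotional (assumed split)
        self.level2 = SVC()  # negative vs. positive (assumed split)
        self.level3 = SVC()  # anger vs. sadness, the ambiguous pair, decided last

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        neutral = (y == "neutral")
        self.level1.fit(X, neutral.astype(int))
        emo = ~neutral
        negative = np.isin(y, ["anger", "sadness"])
        self.level2.fit(X[emo], negative[emo].astype(int))
        self.level3.fit(X[negative], (y[negative] == "anger").astype(int))
        return self

    def predict_one(self, x):
        x = np.asarray(x).reshape(1, -1)
        if self.level1.predict(x)[0] == 1:
            return "neutral"
        if self.level2.predict(x)[0] == 0:
            return "happiness"   # positive branch (illustrative)
        return "anger" if self.level3.predict(x)[0] == 1 else "sadness"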

... Models used in SER are classified into two types: traditional machine learning (ML) approaches and deep learning (DL) approaches. Commonly used traditional ML approaches are support vector machine [24,25], the hidden Markov model [26], the Gaussian mixture model [27], the K-nearest neighbor method [28,29], and decision tree [30]. Each of these traditional ML approaches has its own inherent advantages and disadvantages, but they share the commonality of requiring prior feature extraction from speech data [31]. ...
... The speech data contain 7 types of emotional voices performed in English by 10 actors (5 males and 5 females), segmented into improvised as well as scripted performances. Here, we employed the improvised speech data performed with four emotions, i.e., "anger", "happiness", "sadness", and "neutrality", to match the experimental conditions with previous studies [16,30,39–41] and to compare their results with ours. ...
... Lee et al. [30] applied a hierarchical binary decision tree to 384-dimensional acoustic parameters and obtained WA and UA of 0.5638 and 0.5846, respectively. Neuman et al. [41] applied a CNN with attention structure to MelSpec and achieved WA of 0.6195 and UA of 0.6211. ...
Article
Full-text available
The existing research on emotion recognition commonly uses mel spectrogram (MelSpec) and Geneva minimalistic acoustic parameter set (GeMAPS) as acoustic parameters to learn the audio features. MelSpec can represent the time-series variations of each frequency but cannot manage multiple types of audio features. On the other hand, GeMAPS can handle multiple audio features but fails to provide information on their time-series variations. Thus, this study proposes a speech emotion recognition model based on a multi-input deep neural network that simultaneously learns these two audio features. The proposed model comprises three parts, specifically, for learning MelSpec in image format, learning GeMAPS in vector format, and integrating them to predict the emotion. Additionally, a focal loss function is introduced to address the imbalanced data problem among the emotion classes. The results of the recognition experiments demonstrate weighted and unweighted accuracies of 0.6657 and 0.6149, respectively, which are higher than or comparable to those of the existing state-of-the-art methods. Overall, the proposed model significantly improves the recognition accuracy of the emotion “happiness”, which has been difficult to identify in previous studies owing to limited data. Therefore, the proposed model can effectively recognize emotions from speech and can be applied for practical purposes with future development.
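The two-branch design described in this abstract can be sketched roughly as follows with tf.keras; the input shapes, layer sizes, number of emotion classes, and the focal-loss gamma are assumptions for illustration, not values taken from the paper.

# Sketch: one CNN branch for the mel spectrogram "image", one dense branch for
# the GeMAPS vector, merged before a softmax output, trained with a focal loss.
import tensorflow as tf
from tensorflow.keras import layers, Model

def focal_loss(gamma=2.0):
    # Simple categorical focal loss that down-weights easy examples.
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        ce = -y_true * tf.math.log(y_pred)
        return tf.reduce_sum(tf.pow(1.0 - y_pred, gamma) * ce, axis=-1)
    return loss

mel_in = layers.Input(shape=(128, 128, 1), name="melspec")   # assumed shape
x = layers.Conv2D(16, 3, activation="relu")(mel_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

gemaps_in = layers.Input(shape=(88,), name="gemaps")         # eGeMAPS-sized vector
y = layers.Dense(64, activation="relu")(gemaps_in)

merged = layers.concatenate([x, y])
out = layers.Dense(4, activation="softmax")(merged)          # 4 emotion classes assumed

model = Model([mel_in, gemaps_in], out)
model.compile(optimizer="adam", loss=focal_loss(), metrics=["accuracy"])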
... In the past, several acoustic parameter feature sets were developed, which are now being used with linear/non-linear ML classifiers and complex Deep Neural Networks. For instance, Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbors (KNN), Decision Tree (DT) (i.e., "C4.5", see [15]), Linear Regression and Naïve-Bayes [16–20] are among the most popular classical machine-learning algorithms, usually showing good overall accuracy within the databases at relatively low computational cost. These algorithms contribute to the aim of real-time raw emotion recognition in speech. ...
... Even though accuracy was low in our C4.5 classical DT algorithm, it should be noted that the decision-tree concept seems to be successfully implemented in several other algorithms. For instance, Lee et al. [18] managed to reach up to 89.6% accuracy by implementing the Extreme Learning Machine (ELM) technique and the SVM binary decision-tree (DTSVM) algorithm with a correlational feature selection approach. Moreover, in a study with a new approach called DNN-decision tree, SVM algorithm reached 75.8% accuracy on the EMO-DB database [44]. ...
Article
Full-text available
Vocal emotion recognition (VER) in natural speech, often referred to as speech emotion recognition (SER), remains challenging for both humans and computers. Applied fields including clinical diagnosis and intervention, social interaction research or Human Computer Interaction (HCI) increasingly benefit from efficient VER algorithms. Several feature sets were used with machine-learning (ML) algorithms for discrete emotion classification. However, there is no consensus for which low-level-descriptors and classifiers are optimal. Therefore, we aimed to compare the performance of machine-learning algorithms with several different feature sets. Concretely, seven ML algorithms were compared on the Berlin Database of Emotional Speech: Multilayer Perceptron Neural Network (MLP), J48 Decision Tree (DT), Support Vector Machine with Sequential Minimal Optimization (SMO), Random Forest (RF), k-Nearest Neighbor (KNN), Simple Logistic Regression (LOG) and Multinomial Logistic Regression (MLR) with 10-fold cross validation using four openSMILE feature sets (i.e., IS-09, emobase, GeMAPS and eGeMAPS). Results indicated that SMO, MLP and LOG show better performance (reaching to 87.85%, 84.00% and 83.74% accuracies, respectively) compared to RF, DT, MLR and KNN (with minimum 73.46%, 53.08%, 70.65% and 58.69% accuracies, respectively). Overall, the emobase feature set performed best. We discuss the implications of these findings for applications in diagnosis, intervention or HCI.
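A minimal sketch of this kind of classifier comparison with scikit-learn, assuming a precomputed openSMILE feature matrix X and label vector y are already available; the classifier hyperparameters are left at library defaults rather than the settings used in the study.

# Compare several off-the-shelf classifiers with 10-fold cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

classifiers = {
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=1000),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "LOG": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=10)   # X, y assumed to be given
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")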
... So-called digitization is the process of converting an analog signal into a discrete digital signal, and its two most important steps are sampling and quantization. The reason for pre-emphasis is that the speech signal undergoes partial attenuation of high frequencies after passing through the lips and nostrils [27]. Compared with the low frequencies, the high-frequency spectrum is therefore harder to analyze. ...
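For reference, pre-emphasis is usually implemented as a simple first-order high-pass filter; the 0.97 coefficient below is a conventional choice, not a value taken from the cited work.

import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]  (boosts high frequencies attenuated at the lips)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])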
Article
Full-text available
Speech emotion recognition is a crucial work direction in speech recognition. To increase the performance of speech emotion detection, researchers have worked relentlessly to improve data augmentation, feature extraction, and pattern formation. To address the concerns of limited speech data resources and model training overfitting, A-CapsNet, a neural network model based on data augmentation methodologies, is proposed in this research. In order to solve the issue of data scarcity and achieve the goal of data augmentation, the noise from the Noisex-92 database is first combined with four different data division methods (emotion-independent random-division, emotion-dependent random-division, emotion-independent cross-validation and emotion-dependent cross-validation methods, abbreviated as EIRD, EDRD, EICV and EDCV, respectively). The database EMODB is then used to analyze and compare the performance of the model proposed in this paper under different signal-to-noise ratios, and the results show that the proposed model and data augmentation are effective.
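The additive-noise augmentation idea can be sketched as follows: a noise recording (e.g. a NOISEX-92 sample) is scaled and mixed into a clean utterance at a target signal-to-noise ratio. The function name and the example SNR are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Repeat/trim the noise to the utterance length, then scale it so that the
    # resulting mixture has the requested SNR in dB.
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example (hypothetical arrays): augmented = mix_at_snr(clean_utterance, noisex92_sample, snr_db=5)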
... These four basic emotions are considered to be universally recognizable emotions. Many speech emotion recognition studies (e.g., [4,24,50–53]) have been conducted using these basic emotions, and therefore they were also studied in the current paper. In the perceptual evaluation of the dimensional approach, the corresponding emotion clips were rated for the following four combinations: AP (active-positive), AN (active-negative), PP (passive-positive), and PN (passive-negative), and the neutral samples were labeled as NN (neutral-neutral). ...
Article
Full-text available
Understanding of the perception of emotions or affective states in humans is important to develop emotion-aware systems that work in realistic scenarios. In this paper, the perception of emotions in naturalistic human interaction (audio-visual data) is studied using perceptual evaluation. For this purpose, a naturalistic audio-visual emotion database collected from TV broadcasts such as soap-operas and movies, called the IIIT-H Audio-Visual Emotion (IIIT-H AVE) database, is used. The database consists of audio-alone, video-alone, and audio-visual data in English. Using data of all three modes, perceptual tests are conducted for four basic emotions (angry, happy, neutral, and sad) based on category labeling and for two dimensions, namely arousal (active or passive) and valence (positive or negative), based on dimensional labeling. The results indicated that the participants' perception of emotions was remarkably different between the audio-alone, video-alone, and audio-video data. This finding emphasizes the importance of emotion-specific features compared to commonly used features in the development of emotion-aware systems.
... From the above, it can be concluded that the purchase of a product conditioned by the emotion of fear in customers reduces the cognitive evaluation of a product's characteristics, and they most often buy a product without thinking about the long-term consequences of the purchase decision. Furthermore, in the scientific literature, emotions are observed in relation to the duration of ad exposure (Olney et al. 1991), to ad memorability (Amber and Burne 1999) and to the consistency of consumer preferences (Lee et al. 2009). The results show that emotions have a positive effect on all observed categories and encourage them. ...
Article
Full-text available
The focus of this paper is placed on the role of emotions in consumer behavior, specifically in the process of purchasing dietary supplements during the COVID-19 pandemic. The theoretical part is based on current knowledge from relevant Croatian and foreign scientific and professional literature on dietary supplements, the COVID-19 pandemic, consumer behavior, decision-making and the impact of emotions on it, while the empirical research portion of this paper details the attitudes of consumers who buy food supplements, the role and importance of different emotions that have a greater or a lesser impact on the purchase of food supplements, with special reference to the timing of the COVID-19 pandemic, and the factors that make consumers decide to purchase food supplements. This research was conducted in the form of a survey that included 257 respondents who were actual users of dietary supplements. It showed that the main drive for buying dietary supplements during the COVID-19 pandemic is the emotion of fear, as the consumers perceived this new disease as a threat to their health and life.
... As a comprehensive technology, emotional computing is a key step in the emotionalization of artificial intelligence, including emotion recognition, expression, and decision-making [8], [9]. "Recognition" means letting the machine accurately recognize human emotions and eliminate uncertainty and ambiguity; "Expression" means that artificial intelligence expresses emotions through suitable information carriers, such as language, sound, posture, and expression; "Decision-making" mainly studies how to use emotional mechanisms to make better decisions [10]. ...
Article
Full-text available
Enabling machines to recognize emotions in conversation is challenging, mainly because information in human dialogue innately conveys emotions through long-term experience, abundant knowledge, context, and the intricate patterns between affective states. We address the task of emotion recognition in conversations using external knowledge to enhance semantics. We propose the KES model, a new framework that incorporates different elements of external knowledge and conversational semantic role labeling, and builds upon them to learn interactions between the interlocutors participating in a conversation. We design a self-attention layer specialized for semantic text features enhanced with external commonsense knowledge. Then, two different LSTM-based networks are responsible for tracking the individual internal state and the external context state. In addition, the proposed model is evaluated on three datasets for emotion detection in conversation. The experimental results show that our model outperforms the state-of-the-art approaches on most of the tested datasets.
... Previous studies have identified various classification algorithms to generate robust emotional features. Researchers have proposed support vector machines [13,14], Gaussian mixture models [15], hidden Markov models [16], artificial neural networks [17], the K-nearest neighbor [18], and binary decision trees [19]. These classifiers have demonstrated good emotion classification performance. ...
Article
Full-text available
Convolutional neural networks (CNNs) are a state-of-the-art technique for speech emotion recognition. However, CNNs have mostly been applied to noise-free emotional speech data, and limited evidence is available for their applicability in emotional speech denoising. In this study, a cascaded denoising CNN (DnCNN)–CNN architecture is proposed to classify emotions from Korean and German speech in noisy conditions. The proposed architecture consists of two stages. In the first stage, the DnCNN exploits the concept of residual learning to perform denoising; in the second stage, the CNN performs the classification. The classification results for real datasets show that the DnCNN–CNN outperforms the baseline CNN in overall accuracy for both languages. For Korean speech, the DnCNN–CNN achieves an accuracy of 95.8%, whereas the accuracy of the CNN is marginally lower (93.6%). For German speech, the DnCNN–CNN has an overall accuracy of 59.3–76.6%, whereas the CNN has an overall accuracy of 39.4–58.1%. These results demonstrate the feasibility of applying the DnCNN with residual learning to speech denoising and the effectiveness of the CNN-based approach in speech emotion recognition. Our findings provide new insights into speech emotion recognition in adverse conditions and have implications for language-universal speech emotion recognition.
... The model could be used in call center applications for recognizing emotions over the telephone. Binary classification is implemented in [74] instead of a single multiclass classifier for emotion recognition. ...
Article
Full-text available
During the last decade, Speech Emotion Recognition (SER) has emerged as an integral component within Human-computer Interaction (HCI) and other high-end speech processing systems. Generally, an SER system targets the varied emotions present in a speaker's speech by extracting and classifying the prominent features from a preprocessed speech signal. However, the ways humans and machines recognize and correlate emotional aspects of speech signals are quite contrasting, quantitatively and qualitatively, which presents enormous difficulties in blending knowledge from interdisciplinary fields, particularly speech emotion recognition, applied psychology, and human-computer interfaces. The paper carefully identifies and synthesizes recent relevant literature related to the SER systems' varied design components/methodologies, thereby providing readers with a state-of-the-art understanding of this hot research topic. Furthermore, while scrutinizing the current state of understanding of SER systems, the prominent research gaps have been sketched out for consideration and analysis by other related researchers, institutions, and regulatory bodies.
... In addition to feature extraction and selection, the classification stage also plays an important role in building a successful SER system. In most SER studies, classifiers such as k-nearest neighbor (kNN) [17], decision trees [36] and SVM [37,38] are trained using acoustic features. Recently, deep learning approaches have also been used in SER studies for feature extraction and classification [39–43]. ...
Article
Feature selection plays an important role to build a successful speech emotion recognition system. In this paper, a feature selection approach which modifies the initial population generation stage of metaheuristic search algorithms, is proposed. The approach is evaluated on two metaheuristic search algorithms, a nondominated sorting genetic algorithm-II (NSGA-II) and Cuckoo Search in the context of speech emotion recognition using Berlin emotional speech database (EMO-DB) and Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. Results show that the presented feature selection algorithms reduce the number of features significantly and are still effective for emotion classification from speech. Specifically, in speaker-dependent experiments of the EMO-DB, recognition rates of 87.66% and 87.20% are obtained using selected features by modified Cuckoo Search and NSGA-II respectively, whereas, for the IEMOCAP database, the accuracies of 69.30% and 68.32% are obtained using SVM classifier. For the speaker-independent experiments, we achieved comparable results for both databases. Specifically, recognition rates of 76.80% and 76.82% for EMO-DB and 59.37% and 59.52% for IEMOCAP using modified NSGA-II and Cuckoo Search respectively.
Conference Paper
Machine learning (ML) and deep learning (DL) techniques have been used to study the changes in human physiological and non-physiological properties. DL has proven its efficiency in perceiving positive emotions (joy, surprise, pride, emotion) and negative emotions (anger, sadness, fear, disgust). Furthermore, DL is used to identify the emotions accordingly. First, this paper describes the different DL and ML algorithms applied in the emotion recognition field. Then, as a perspective, it proposes a three-layered emotion recognition architecture that leverages the massive data generated by IoT devices such as mobile phones, smart homes, and health monitoring. Finally, the potential of emerging technologies, such as 5G and 6G communication systems in a parallel Big Data infrastructure, is discussed.
Article
Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. Building a system that can accurately and stably recognize emotions from human languages can provide a better user experience. However, the current unimodal emotion feature representations are not distinctive enough to accomplish the recognition, and they do not effectively simulate the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors for text and audio modalities, and an autoencoder-based feature fusion. For audio modality, we propose a structure called Temporal Global Feature Extractor (TGFE) to extract the high-level features of the time-frequency domain relationship from the original speech signal. Considering that text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) and attention mechanism to simulate an intra-modal dynamic. Once these steps have been accomplished, the high-level text and audio features are sent to the autoencoder in parallel to learn their shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on Interactive Emotional Motion Capture (IEMOCAP) and Multimodal EmotionLines Dataset (MELD) outperform the existing method. Additionally, the results of CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) are competitive. Furthermore, experimental results show that compared to unimodal information and autoencoder-based feature level fusion, the joint multimodal information (audio and text) improves the overall performance and can achieve greater accuracy than simple feature concatenation.
Conference Paper
Emotions have always played a crucial role in human evolution, improving not only social contact but also the ability to adapt and react to a changing environment. In the field of social robotics, providing robots with the ability to recognize human emotions through the interpretation of non-verbal signals may represent the key to more effective and engaging interaction. However, the problem of emotion recognition has usually been addressed in limited and static scenarios, by classifying emotions using sensory data such as facial expressions, body postures, and voice. This work proposes a novel emotion recognition framework based on the appraisal theory of emotion. According to the theory, the person's expected appraisal of a given situation, depending on their needs and goals (henceforth referred to as "appraisal information"), is combined with sensory data. A pilot experiment was designed and conducted: participants were involved in spontaneous verbal interaction with the humanoid robot Pepper, programmed to elicit different emotions at various moments. Then, a Random Forest classifier was trained to classify positive and negative emotions using: (i) sensor data only; (ii) sensor data supplemented by appraisal information. Preliminary results confirm a performance improvement in emotion classification when appraisal information is considered.
Article
The most important factors in a student's success are the student's readiness for the lesson, motivation, and cognitive and emotional state. In face-to-face education, the educator can follow the student visually throughout the lesson and observe their emotional state. One of the most important disadvantages of distance learning is that the emotional state of the student cannot be followed instantly. In addition, for real-time emotion detection, the processing time should be short. In this study, a method for emotion recognition is proposed using distance and slope information between facial landmarks. In addition, the feature size was reduced by statistically selecting only the distance and slope features that are effective for emotion recognition. According to the results obtained, the proposed method and feature set achieved a success rate of 86.11%. Moreover, the processing time is low enough for real-time emotion detection in distance learning.
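A minimal sketch of the distance-and-slope features the abstract describes, assuming 2-D landmark coordinates are already available; which landmark pairs the study actually uses is not specified here, so the pair list is an assumption.

import numpy as np

def pairwise_distance_slope(landmarks, pairs):
    # landmarks: (N, 2) array of (x, y) points; pairs: list of (i, j) index tuples.
    feats = []
    for i, j in pairs:
        dx = landmarks[j, 0] - landmarks[i, 0]
        dy = landmarks[j, 1] - landmarks[i, 1]
        feats.append(np.hypot(dx, dy))      # Euclidean distance between the two points
        feats.append(np.arctan2(dy, dx))    # slope expressed as an angle
    return np.array(feats)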
Article
Full-text available
Identifying the massage techniques of the masseuse is a prerequisite for guiding robotic massage. It is difficult for current human action recognition algorithms to recognize multiple consecutive massage maps within a time series. To solve this problem, a method combining a convolutional neural network, a long short-term memory network, and an attention mechanism is proposed to identify the massage techniques in this paper. First, the pressure distribution massage map is collected by a massage glove, and the data are enhanced by a conditional variational auto-encoder. Then, the features of the massage map group in the spatial and temporal domains are extracted through the convolutional neural network and the long short-term memory network, respectively. The attention mechanism is introduced into the neural network, giving each massage map a different weight value to enhance the network's extraction of data features. Finally, the massage haptic dataset is collected by a massage data acquisition system. The experimental results show that a classification accuracy of 100% is achieved. The results demonstrate that the proposed method can identify sequential massage maps, mitigate network overfitting, and enhance the network's generalization ability effectively.
Article
Full-text available
The advancements of the Internet of Things (IoT) and voice-based multimedia applications have resulted in the generation of big data consisting of patterns, trends and associations capturing and representing many features of human behaviour. The latent representations of many aspects and the basis of human behaviour is naturally embedded within the expression of emotions found in human speech. This signifies the importance of mining audio data collected from human conversations for extracting human emotion. Ability to capture and represent human emotions will be an important feature in next-generation artificial intelligence, with the expectation of closer interaction with humans. Although the textual representations of human conversations have shown promising results for the extraction of emotions, the acoustic feature-based emotion detection from audio still lags behind in terms of accuracy. This paper proposes a novel approach for feature extraction consisting of Bag-of-Audio-Words (BoAW) based feature embeddings for conversational audio data. A Recurrent Neural Network (RNN) based state-of-the-art emotion detection model is proposed that captures the conversation-context and individual party states when making real-time categorical emotion predictions. The performance of the proposed approach and the model is evaluated using two benchmark datasets along with an empirical evaluation on real-time prediction capability. The proposed approach reported 60.87% weighted accuracy and 60.97% unweighted accuracy for six basic emotions for IEMOCAP dataset, significantly outperforming current state-of-the-art models.
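A Bag-of-Audio-Words embedding of the kind described can be sketched as follows: a KMeans codebook is learned over frame-level features, and each utterance then becomes a normalized histogram of codeword assignments. The codebook size and the choice of frame-level features are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(frame_features, n_words=256):
    # frame_features: (num_frames_total, feat_dim) stacked over the training set.
    return KMeans(n_clusters=n_words, n_init=10).fit(frame_features)

def boaw_embedding(codebook, utterance_frames):
    # Assign each frame to its nearest codeword and count occurrences.
    words = codebook.predict(utterance_frames)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)   # normalized histogram = utterance embedding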
Article
Full-text available
Recently, domain adversarial neural networks (DANN) have delivered promising results for out-of-domain data. This paper exploits DANN for speaker-independent emotion recognition, where the domain corresponds to speakers, i.e. the training and testing datasets contain different speakers. The result is a speaker adversarial neural network (SANN). The proposed SANN is used for extracting speaker-invariant and emotion-specific discriminative features for the task of speech emotion recognition. To extract speaker-invariant features, multi-tasking adversarial training of a deep neural network (DNN) is employed. The DNN framework consists of two sub-networks: one for emotion classification (primary task) and the other for speaker classification (secondary task). The gradient reversal layer (GRL) was introduced between (a) the layer common to both the primary and auxiliary classifiers and (b) the auxiliary classifier. The objective of the GRL layer is to reduce the variance among speakers by maximizing the speaker classification loss. The proposed framework jointly optimizes the above two sub-networks to minimize the emotion classification loss and mini-maximize the speaker classification loss. The proposed network was evaluated on the IEMOCAP and EMODB datasets. A total of 1582 features were extracted from the standard library openSMILE. A subset of these features was eventually selected using a genetic algorithm approach. On the IEMOCAP dataset, the proposed SANN model achieved relative improvements of +6.025% (weighted accuracy) and +5.62% (unweighted accuracy) over the baseline system. Similar results were observed for the EMODB dataset. Further, despite differences in models and features relative to state-of-the-art methods, significant improvements in accuracy were also obtained over them.
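The gradient reversal layer at the heart of such speaker-adversarial training can be sketched in PyTorch as below; the identity forward pass and sign-flipped, scaled backward pass are the standard GRL formulation, while the lambda value and the usage line are assumptions.

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)          # identity in the forward direction

    @staticmethod
    def backward(ctx, grad_output):
        # Flip and scale the gradient so the shared encoder unlearns speaker cues.
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Usage inside a model's forward pass (illustrative, hypothetical names):
#   speaker_logits = speaker_head(grad_reverse(shared_features, lamb=0.5))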
Article
Multi-modal speech emotion recognition is the study of predicting emotion categories by combining speech data with other types of data, such as video, speech text transcription, body action, or facial expression when speaking, which involves the fusion of multiple features. Most early studies, however, directly spliced multi-modal features in the fusion layer after single-modal modeling, thereby ignoring the connection between speech and other modal features. As a result, we propose a novel multi-modal speech emotion recognition model based on multi-head attention fusion networks, which employs transcribed text and motion capture (MoCap) data involving facial expression, head rotation, and hand action to supplement speech data and perform emotion recognition. For the unimodal branches, we use a two-layer Transformer encoder combination model to extract speech and text features separately, and MoCap is modeled using a deep residual shrinkage network. Simultaneously, we change the input of the Transformer encoder to learn the similarities between speech and text, and between speech and MoCap, and then output text and MoCap features that are more similar to the speech features; finally, the emotion category is predicted using the combined features. On the IEMOCAP dataset, our model outperformed earlier research with a recognition accuracy of 75.6%.
Article
Full-text available
Every human being has emotions for everything related to them. For every customer, their emotions can help the customer representative understand their requirements. Thus, speech emotion recognition plays an important role in interactions between humans. An intelligent system can help to improve this, for which we design a convolutional neural network (CNN)-based model that can classify emotions into categories such as positive, negative, or more specific classes. In this paper, we use the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio recordings. The Log Mel Spectrogram and Mel-Frequency Cepstral Coefficients (MFCCs) were used as features of the raw audio files. These features were used for the classification of emotions with techniques such as Long Short-Term Memory (LSTM), CNNs, Hidden Markov models (HMMs), and Deep Neural Networks (DNNs). In this paper, we have divided the emotions into three sections for males and females. In the first section, we divide the emotions into two classes, positive and negative. In the second section, we divide the emotions into three classes: positive, negative, and neutral. In the third section, we divide the emotions into 8 different classes: happy, sad, angry, fearful, surprise, disgust, calm, and neutral. For these three sections, we propose a model which contains eight consecutive layers of the 2D convolutional neural method. The proposed model performs better than previously reported models. Now, we can identify the emotions of consumers in better ways.
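A minimal sketch of extracting the log mel spectrogram and MFCC features mentioned above with librosa; the sampling rate, mel-band and coefficient counts, and the time-averaging step are common defaults assumed here, not the paper's settings.

import librosa
import numpy as np

def extract_features(path, sr=22050, n_mels=128, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Average over time to obtain a fixed-length utterance-level feature vector.
    return np.concatenate([log_mel.mean(axis=1), mfcc.mean(axis=1)])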
Article
Full-text available
Facial emotion recognition extracts human emotions from images and videos. As such, it requires an algorithm to understand and model the relationships between faces and facial expressions and to recognize human emotions. Recently, deep learning models have been utilized to improve the performance of facial emotion recognition. However, deep learning models suffer from overfitting and perform poorly on images with poor visibility and noise. Therefore, in this paper, an efficient deep learning-based facial emotion recognition model is proposed. Initially, contrast-limited adaptive histogram equalization (CLAHE) is applied to improve the visibility of input images. Thereafter, a modified joint trilateral filter is applied to the enhanced images to remove the impact of impulsive noise. Finally, an efficient deep convolutional neural network is designed. The Adam optimizer is utilized to optimize the cost function of the deep convolutional neural network. Experiments are conducted using a benchmark dataset and competitive human emotion recognition models. Comparative analysis demonstrates that the proposed facial emotion recognition model performs considerably better than the competitive models.
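The CLAHE preprocessing step can be sketched with OpenCV as below; the clip limit and tile size are conventional values assumed for illustration, not the paper's parameters.

import cv2

def enhance_contrast(gray_image):
    # CLAHE equalizes contrast locally while limiting noise amplification;
    # expects a single-channel uint8 image.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray_image)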
Article
In recent years, speech emotion recognition technology has become greatly significant in widespread applications such as call centers, social robots and health care. Thus, speech emotion recognition has attracted much attention in both industry and academia. Since the emotions present in an entire utterance may have varied probabilities, speech emotion is likely to be ambiguous, which poses great challenges to recognition tasks. However, previous studies commonly assigned a single label or multiple labels to each utterance with certainty. Therefore, their algorithms result in low accuracies because of the inappropriate representation. Inspired by the optimally interacting theory, we address ambiguous speech emotions by proposing a novel multi-classifier interactive learning (MCIL) method. In MCIL, multiple different classifiers first mimic several individuals, who have inconsistent cognitions of ambiguous emotions, and construct new ambiguous labels (the emotion probability distribution). Then, they are retrained with the new labels to interact with each other's cognitions. This procedure enables each classifier to learn better representations of ambiguous data from the others, and further improves its recognition ability. Experiments on three benchmark corpora (MAS, IEMOCAP, and FAU-AIBO) demonstrate that MCIL not only improves each classifier's performance, but also raises their recognition consistency from moderate to substantial.
Chapter
This chapter on multi speaker independent emotion recognition encompasses the use of perceptual features with filters spaced in Equivalent rectangular bandwidth (ERB) and BARK scale and vector quantization (VQ) classifier for classifying groups and artificial neural network with back propagation algorithm for emotion classification in a group. Performance can be improved by using the large amount of data in a pertinent emotion to adequately train the system. With the limited set of data, this proposed system has provided consistently better accuracy for the perceptual feature with critical band analysis done in ERB scale.
Conference Paper
Wearable sensors have made an impact on healthcare and medicine by enabling out-of-clinic health monitoring and prediction of pathological events. Further advancements made in the analysis of multimodal signals have been in emotion recognition which utilizes peripheral physiological signals captured by sensors in wearable devices. There is no universally accepted emotion model, though multidimensional methods are often used, the most popular of which is the two-dimensional Russell's model based on arousal and valence. Arousal and valence values are discrete, usually being either binary with low and high labels along each dimension creating four quadrants or 3-valued with low, neutral, and high labels. In day-to-day life, the neutral emotion class is the most dominant leaving emotion datasets with the inherent problem of class imbalance. In this study, we show how the choice of values in the two-dimensional model affects the emotion recognition using multiple machine learning algorithms. Binary classification resulted in an accuracy of 87.2% for arousal and up to 89.5% for valence. Maximal 3-class classification accuracy was 80.9% for arousal and 81.1% for valence. For the joined classification of arousal and valence, the four-quadrant model reached 87.8%, while the nine-class model had an accuracy of 75.8%. This study can be used as a basis for further research into feature extraction for better overall classification performance.
Article
Full-text available
Speech production can be regarded as a process where a time-varying vocal tract system (filter) is excited by a time-varying excitation. In addition to its linguistic message, the speech signal also carries information about, for example, the gender and age of the speaker. Moreover, the speech signal includes acoustical cues about several speaker traits, such as the emotional state and the state of health of the speaker. In order to understand the production of these acoustical cues by the human speech production mechanism and utilize this information in speech technology, it is necessary to extract features describing both the excitation and the filter of the human speech production mechanism. While the methods to estimate and parameterize the vocal tract system are well established, the excitation appears less studied. This article provides a review of signal processing approaches used for the extraction of excitation information from speech. This article highlights the importance of excitation information in the analysis and classification of phonation type and vocal emotions, in the analysis of nonverbal laughter sounds, and in studying pathological voices. Furthermore, recent developments of deep learning techniques in the context of extraction and utilization of the excitation information are discussed.
Chapter
During the last years, the field of affective computing that deals with identification, recording, interpreting, and processing of emotion and affective state of an individual has won ground in the scientific community. Thus, the incorporation of affective computing and the corresponding emotional intelligence in homecare services, which entail the social and healthcare delivery at the home of the patient via the utilization of information and communication technology, seems quite important. Among the available means of expression, speech constitutes one of the most natural mechanisms, providing adequate information for recognizing emotion. In this paper, we describe the design and implementation of an affective recognition service integrated in a holistic electronic homecare management system, covering the entire lifecycle of doctor-patient interaction, incorporating speech emotion recognition (SER)-oriented methods. Within this context, we evaluate the performance of several SER techniques deployed in the homecare system, from well-established machine learning algorithms to Deep Learning architectures and we report the corresponding results.
Article
In this paper, the intrinsic characteristics of speech modulations are estimated to propose the instant modulation spectral features for efficient emotion recognition. This feature representation is based on single frequency filtering (SFF) technique and higher order nonlinear energy operator. The speech signal is decomposed into frequency sub-bands using SFF, and associated nonlinear energies are estimated with higher order nonlinear energy operator. Then, the feature vector is realized using cepstral analysis. The high-resolution property of SFF technique is exploited to extract the amplitude envelope of the speech signal at a selected frequency with good time-frequency resolution. The fourth order nonlinear energy operator provides noise robustness in estimating the modulation components. The proposed feature set is tested for the emotion recognition task using the i-vector model with the probabilistic linear discriminant scoring scheme, support vector machine and random forest classifiers. The results demonstrate that the performance of this feature representation is better than the widely used spectral and prosody features, achieving detection accuracy of 85.75%, 59.88%, and 65.78% on three emotional databases, EMODB, FAU-AIBO, and IEMOCAP, respectively. Further, the proposed features are found to be robust in the presence of additive white Gaussian and vehicular noises.
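As a reference point for the nonlinear energy estimation mentioned above, the standard second-order Teager-Kaiser energy operator is sketched below; the paper uses a fourth-order variant applied within the SFF sub-bands, whose exact form is not reproduced here.

import numpy as np

def teager_kaiser_energy(x):
    # psi[n] = x[n]^2 - x[n-1] * x[n+1]  (second-order operator, for reference)
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]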
Chapter
Land surface temperature (LST) is an important climate indicator that shows the relationship between the atmosphere and land. Due to environmental problems, including global warming, determining the relationship between natural factors and LST is urgent. This study aimed to model LST using a random forest (RF) model and several independent factors, that is, altitude, slope, aspect, distance to major roads, parks, water bodies, waterways, farmlands, grasslands, land use, and the Normalized Difference Vegetation Index (NDVI) in Shiraz City, the capital of Fars Province, Iran. For this purpose, a series of Landsat 8 satellite images was used to extract LST data for the summer and winter of 2019. In addition, the importance of each factor was investigated using the RF model. The results indicated that distance from roads, parks, and water bodies were the most important factors affecting LST spatial variations in summer, whereas NDVI, distance to roads, and altitude were the most effective factors in winter. Performance evaluation of the model yielded R² values of 0.53 and 0.48 and Root Mean Square Error (RMSE) values of 2.61 and 2.58 for summer and winter, respectively. This study helps us to understand which factors increase or decrease LST in Shiraz City. In general, green spaces have a major role in decreasing LST; in contrast, bare lands had substantially higher temperatures than residential areas. Therefore, this research is crucial for understanding and monitoring the surface thermal environment in the study of climate change.
Article
Full-text available
Speech Emotion Recognition (SER) is a popular topic in academia and industry. Feature engineering plays a pivotal role in building an efficient SER. Although researchers have done a tremendous amount of work in this field, the issues of speech feature choice and the correct application of feature engineering remain to be solved in the domain of SER. In this research, a feature optimization approach that uses a clustering-based genetic algorithm is proposed. Instead of randomly selecting the new generation, clustering is applied at the fitness evaluation level to detect outliers for exclusion from the next generation. The approach is compared with the standard Genetic Algorithm in the context of audio emotion recognition using the Berlin Emotional Speech Database (EMO-DB), the Ryerson Audio-Visual Database of Speech and Song (RAVDESS) and the Surrey Audio-Visual Expressed Emotion Dataset (SAVEE). Results signify that the proposed technique effectively improved emotion classification in speech. Recognition rates of 89.6% for general speakers (both male and female), 86.2% for male speakers, and 88.3% for female speakers on EMO-DB, 82.5% for general speakers, 75.4% for male speakers, and 91.1% for female speakers on RAVDESS, and 77.7% for general speakers on SAVEE are obtained in speaker-dependent experiments. For speaker-independent experiments, we achieved recognition rates of 77.5% on EMO-DB, 76.2% on RAVDESS, and 69.8% on SAVEE. All experiments were performed in MATLAB and the Support Vector Machine (SVM) was used for classification. Results confirm that the proposed method is capable of discriminating emotions effectively and performed better than the other approaches used for comparison in terms of performance measures.
Conference Paper
With the progression of technology, the human-machine interaction research field has a growing need for robust automatic emotion recognition systems. Building machines that interact with humans by comprehending emotions paves the way for developing systems equipped with human-like intelligence. Previous architectures in this field often use RNN models. However, these models are unable to learn in-depth contextual features intuitively. This paper proposes a transformer-based model that utilizes speech data employed by previous works, alongside text and MoCap data, to optimize the performance of our emotion recognition system. Our experimental results show that the proposed model outperforms the previous state-of-the-art. The entire experiment was conducted on the IEMOCAP dataset.
Article
Speech emotion recognition (SER) plays a crucial role in improving the quality of man–machine interfaces in various fields like distance learning, medical science, virtual assistants, and automated customer services. A deep learning-based hierarchical approach is proposed for both unimodal and multimodal SER systems in this work. Of these, the audio-based unimodal system proposes using a combination of 33 features, which include prosody, spectral, and voice quality-based audio features. Further, for the multimodal system, both the above-mentioned audio features and additional textual features are used. Embeddings from Language Models v2 (ELMo v2) is implemented to extract word and character embeddings which helped to capture the context-dependent aspects of emotion in text. The proposed models’ performances are evaluated on two audio-only unimodal datasets – SAVEE and RAVDESS, and one audio-text multimodal dataset – IEMOCAP. The proposed hierarchical models offered SER accuracies of 81.2%, 81.7%, and 74.5% on the RAVDESS, SAVEE, and IEMOCAP datasets, respectively. Further, these results are also benchmarked against recently reported techniques, and the reported performances are found to be superior. Therefore, based on the presented investigations, it is concluded that the application of a deep learning-based network in a hierarchical manner significantly improves SER over generic unimodal and multimodal systems.
Conference Paper
Recognizing the patient's emotions using deep learning techniques has attracted significant attention recently due to technological advancements. Automatically identifying the emotions can help build smart healthcare centers that can detect depression and stress among the patients in order to start the medication early. Using advanced technology to identify emotions is one of the most exciting topics as it defines the relationships between humans and machines. Machines learned how to predict emotions by adopting various methods. In this survey, we present recent research in the field of using neural networks to recognize emotions. We focus on studying emotions' recognition from speech, facial expressions, and audio-visual input and show the different techniques of deploying these algorithms in the real world. These three emotion recognition techniques can be used as a surveillance system in healthcare centers to monitor patients. We conclude the survey with a presentation of the challenges and the related future work to provide an insight into the applications of using emotion recognition.
Article
Full-text available
Emotion recognition is critical in everyday interpersonal human interactions. Understanding a person's emotions through their speech can do wonders for shaping social interactions. Because of the rapid development of social media, single-modal emotion recognition finds it difficult to meet the demands of current emotion recognition systems. A multimodal emotion recognition model from speech and text is proposed in this paper to optimize the performance of the emotion recognition system. This paper explores a comprehensive analysis of speech emotion recognition using text and audio. The results show an enhancement in accuracy compared with using either audio or text alone. The results were obtained using a deep learning model, i.e., an LSTM. The experimental analysis is performed on the RAVDESS and SAVEE datasets. The implementation is done in Python.
Article
Stress responses vary drastically for a given set of stimuli, individuals, or points in time. A potential source of this variance that is not well characterized arises from the theory of stress as a dynamical system, which implies a complex, nonlinear relationship between environmental/situational inputs and the development/experience of stress. In this framework, stress vs. non-stress states exist as attractor basins in a physiologic phase space. Here, we develop a model of stress as a dynamical system by coupling closed loop physiologic control to a dynamic oscillator in an attractor landscape. By characterizing the evolution of this model through phase space, we demonstrate strong sensitivity to the parameters controlling the dynamics and demonstrate multiple features of stress responses found in current research, implying that these parameters may contribute to a significant source of variability observed in empiric stress research.
Article
Full-text available
Speech emotion recognition (SER) is a difficult task because emotions are subjective and recognizing the affective state of the speaker is challenging. To tackle this issue, Broad Learning System is presented to balance the training of networks that are substantially faster than those used previously. Furthermore, we performed experiments on the standard IEMOCAP dataset and achieved the state-of-the-art performance in terms of weighted accuracy and unweighted accuracy. Taken together, the experimental results demonstrated that applying Broad Learning System to SER is reasonable and useful.
Chapter
Emotion recognition and analysis is an essential part of affective computing which plays a vital role nowadays in healthcare, security systems, education, etc. Numerous scientific researches have been conducted developing various types of strategies, utilizing methods in different areas to identify human emotions automatically. Different types of emotions are distinguished through the combination of data from facial expressions, speech, and gestures. Also, physiological signals, e.g., EEG (Electroencephalogram), EMG (Electromyogram), EOG (Electrooculogram), blood volume pulse, etc. provide information on emotions. The main idea of this paper is to identify various emotion recognition techniques and denote relevant benchmark data sets and specify algorithms with state-of-the-art results. We have also given a review of multimodal emotion analysis, which deals with various fusion techniques of the available emotion recognition modalities. The results of the existing literature show that emotion recognition works best and gives satisfactory accuracy if it uses multiple modalities in context. At last, a survey of the rest of the problems, challenges, and corresponding openings in this field is given.
Article
Full-text available
Background: Although visual locomotion scoring is inexpensive and simplistic, it is also time consuming and subjective. Automated lameness detection methods have been developed to replace visual locomotion scoring and aid in early and accurate detection. Several types of sensors measure traits such as activity, lying behavior or temperature. Previous studies on automatic lameness detection have been unable to achieve high accuracy in combination with practical implementation in an on-farm commercial setting. The objective of our research was to develop a prediction model for lameness in dairy cattle using a combination of remote sensor technology and other animal records that will translate sensor data into easy-to-interpret classified locomotion information for the farmer. During an 11-month period, data were gathered from 164 Holstein-Friesian dairy cows housed at an Irish research farm. A neck-mounted accelerometer was used to gather behavioral metrics; additional automatically recorded data consisted of milk production and live weight. Locomotion scoring data were manually recorded using a one-to-five scale (1 = non-lame, 5 = severely lame). Locomotion scores were then used to label the cows as sound (locomotion score 1) or unsound (locomotion score ≥ 2). Four supervised classification models, using a gradient boosted decision tree machine learning algorithm, were constructed to investigate whether cows could be classified as sound or unsound. Data available for model building included behavioral metrics, milk production and animal characteristics. Results: The resulting models were constructed using various combinations of the data sources. The accuracy of the models was then compared using confusion matrices, receiver-operator characteristic curves and calibration plots. The model which achieved the highest performance according to the accuracy measures was the model combining all the available data, resulting in an area under the curve of 85% and a sensitivity and specificity of 78%. Conclusion: These results show that 85% of this model's predictions were correct in identifying cows as sound or unsound, indicating that a neck-mounted accelerometer, in combination with production and other animal data, has the potential to replace visual locomotion scoring as a lameness detection method in dairy cows.
Article
Full-text available
In this paper an attempt has been made to prepare an automatic tonal and non-tonal pre-classification-based Indian language identification (LID) system using multi-level prosody and spectral features. Languages are first categorized into tonal and non-tonal groups, and then, from among the languages of the respective groups, individual languages are identified. The system uses syllable, word (tri-syllable) and phrase level (multi-word) prosody (collectively called multi-level prosody) along with spectral features, namely Mel-frequency cepstral coefficients (MFCCs), Mean Hilbert envelope coefficients (MHEC), and shifted delta cepstral coefficients of MFCCs and MHECs for the pre-classification task. Multi-level analysis of spectral features has also been proposed and the complementarity of the syllable, word and phrase level (spectral + prosody) has been examined for pre-classification-based LID task. Four different models, particularly, Gaussian Mixture Model (GMM)-Universal Background Model (UBM), Artificial Neural Network (ANN), i-vector based support vector machine (SVM) and Deep Neural Network (DNN) have been developed to identify the languages. Experiments have been carried out on National Institute of Technology Silchar language database (NITS-LD) and OGI Multi-language Telephone Speech corpus (OGI-MLTS). The experiments confirm that both prosody and (spectral + prosody) obtained from syllable-, word- and phrase-level carry complementary information for pre-classification-based LID task. At the pre-classification stage, DNN models based on multi-level (prosody + MFCC) features, coupled with score combination technique results in the lowest EER value of 9.6% for NITS-LD. For OGI-MLTS database, the lowest EER value of 10.2% is observed for multi-level (prosody + MHEC). The pre-classification module helps to improve the performance of baseline single-stage LID system by 3.2% and 4.2% for NITS-LD and OGI-MLTS database respectively.
Article
Recent advances have demonstrated that a machine learning technique known as “reservoir computing” is a significantly effective method for modelling chaotic systems. Going beyond short-term prediction, we show that long-term behaviors of an observed chaotic system are also preserved in the trained reservoir system by virtue of network measurements. Specifically, we find that a broad range of network statistics induced from the trained reservoir system is nearly identical with that of a learned chaotic system of interest. Moreover, we show that network measurements of the trained reservoir system are sensitive to distinct dynamics and can in turn detect the dynamical transitions in complex systems. Our findings further support that rather than dynamical equations, reservoir computing approach in fact provides an alternative way for modelling chaotic systems.
Article
While speech emotion recognition (SER) has been an active research field since the last three decades, the techniques that deal with the natural environment have only emerged in the last decade. These techniques have reduced the mismatch in the distribution of the training and testing data, which occurs due to the difference in speakers, texts, languages, and recording environments between the training and testing datasets. Although a few good surveys exist for SER, they either don't cover all aspects of SER in natural environments or don't discuss the specifics in detail. This survey focuses on SER in a natural environment, discussing SER techniques for natural environment along with their advantages and disadvantages in terms of speaker, text, language, and recording environments. In the recent past, the deep learning techniques have become very popular due to minimal speech processing and enhanced accuracy. Special attention has been given to deep-learning techniques and the related issues in this survey. Recent databases, features, and feature selection algorithms for SER, which have not been discussed in the existing surveys and can be promising for SER in a natural environment, have also been discussed in this paper.
Article
Full-text available
Training a support vector machine (SVM) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. Despite the fact that this type of problem is well understood, there are many issues to be considered in designing an SVM learner. In particular, for large learning tasks with many training examples, off-the-shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. SVMlight is an implementation of an SVM learner which addresses the problem of large tasks. This chapter presents algorithmic and computational results developed for SVMlight V2.0, which make large-scale SVM training more practical. The results give guidelines for the application of SVMs to large domains. Also published in: 'Advances in Kernel Methods - Support Vector Learning', Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola (eds.), MIT Press, Cambridge, USA, 1998.
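As a reminder of the problem structure the abstract refers to, the standard soft-margin SVM dual is a quadratic program in the multipliers α with box (bound) constraints and a single linear equality constraint:

```latex
\max_{\alpha}\; \sum_{i=1}^{n}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
\alpha_i\alpha_j\, y_i y_j\, K(x_i, x_j)
\qquad \text{s.t.}\qquad 0 \le \alpha_i \le C,\qquad \sum_{i=1}^{n}\alpha_i y_i = 0 .
```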
Article
Full-text available
Takayuki Kanda is a computer scientist with interests in intelligent robots and human-robot interaction; he is a researcher in the Intelligent Robotics and Communication Laboratories at ATR (Advanced Telecommunications Research Institute), Kyoto, Japan. Takayuki Hirano is a computer scientist with an interest in human-robot interaction; he is an intern researcher in the Intelligent Robotics and Communication Laboratories at ATR, Kyoto, Japan. Daniel Eaton is a computer scientist with an interest in human-robot interaction; he is an intern researcher in the Intelligent Robotics and Communication Laboratories at ATR, Kyoto, Japan. Hiroshi Ishiguro is a computer scientist with interests in computer vision and intelligent robots; he is Professor of Adaptive Machine Systems in the School of Engineering at Osaka University, Osaka, Japan, and a visiting group leader in the Intelligent Robotics and Communication Laboratories at ATR, Kyoto, Japan. ABSTRACT Robots increasingly have the potential to interact with people in daily life. It is believed that, based on this ability, they will play an essential role in human society in the not-so-distant future. This article examined the proposition that robots could form relationships with children and that children might learn from robots as they learn from other children. In this article, this idea is studied in an 18-day field trial held at a Japanese elementary school. Two English-speaking "Robovie" robots interacted with first- and sixth-grade pupils at the perimeter of their respective classrooms. Using wireless identification tags and sensors, these robots identified and interacted with children who came near them. The robots gestured and spoke English with the children, using a vocabulary of about 300 sentences for speaking and 50 words for recognition.
Article
Full-text available
This paper describes analyses of a corpus of speech recorded during psychotherapy. The therapy sessions were focused on addressing unresolved anger towards an attachment figure. Speech from the therapy sessions of 22 young adult females was initially recorded, from which 283 stimuli were extracted and submitted for evaluation of emotional content by 14 judges. The emotional content was rated on three scales: Activation, Valence and Dominance. A set of acoustic features was then extracted: statistic features, F0 features based on the Fujisaki model and perceptual speech rate features. The relationship between acoustics and emotional content was examined through correlation analysis and automatic classification. Results of the model-based analysis show significant correlations between the strength and frequency of accents and Activation, as well as between base F0 and Dominance. Automatic classification showed that the acoustic features were better at predicting Activation than Valence and Dominance, and that the dominant features were those based on F0.
Article
Full-text available
With the widespread use of technologies directed towards children, child-machine interactions have become a topic of great interest. Computers must interpret relevant contextual user cues in order to provide a more natural interactive environment. Our focus in this paper is analyzing audio-visual user uncertainty cues using spontaneous conversations between a child and computer in a problem-solving setting. We hypothesize that we can predict when a child is uncertain in a given turn using a combination of acoustic, lexical, and visual gestural cues. First, we carefully annotated the audio-visual uncertainty cues. Next, we trained decision trees using leave-one-speaker-out cross-validation to find the more universal uncertainty cues across different children, attaining 0.494 kappa agreement with ground-truth uncertainty labels. Lastly, we trained decision trees using leave-one-turn-out cross-validation for each child to determine which cues had more intra-child predictive power and attained 0.555 kappa agreement. Both of these results were significantly higher than a voting baseline method but worse than average human kappa agreement of 0.744. We explain which annotated features produced the best results, so that future research can concentrate on automatically recognizing these uncertainty cues from the audio/video signal.
Article
Full-text available
Logistic regression analysis of high-dimensional data, such as natural language text, poses computational and statistical challenges. Maximum likelihood estimation often fails in these applications. We present a simple Bayesian logistic regression approach that uses a Laplace prior to avoid overfitting and produces sparse predictive models for text data. We apply this approach to a range of document classification problems and show that it produces compact predictive models at least as effective as those produced by support vector machine classifiers or ridge logistic regression combined with feature selection. We describe our model fitting algorithm, our open source implementations (BBR and BMR), and experimental results.
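For orientation, MAP estimation under a Laplace (double-exponential) prior on the weights is equivalent to L1-penalized logistic regression, which is what produces the sparsity described above. The sketch below illustrates that equivalence with scikit-learn rather than the authors' BBR/BMR implementations; the tiny corpus and the regularization strength are placeholder assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for a real text classification dataset.
docs = ["great product", "terrible support", "works as expected"]
labels = [1, 0, 1]

# L1 penalty == MAP estimate with a Laplace prior; C controls the prior scale.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
model.fit(docs, labels)

# Sparsity shows up as exactly-zero coefficients in the fitted model.
n_nonzero = (model.named_steps["logisticregression"].coef_ != 0).sum()
print("non-zero weights:", n_nonzero)
```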
Conference Paper
Full-text available
The last decade has seen a substantial body of literature on the recognition of emotion from speech. However, in comparison to related speech processing tasks such as Automatic Speech and Speaker Recognition, practically no standardised corpora and test conditions exist to compare performances under exactly the same conditions. Instead, a multiplicity of evaluation strategies is employed, such as cross-validation or percentage splits without proper instance definition, which prevents exact reproducibility. Further, in order to face more realistic scenarios, the community is in desperate need of more spontaneous and less prototypical data. This INTERSPEECH 2009 Emotion Challenge aims at bridging such gaps between excellent research on human emotion recognition from speech and low compatibility of results. The FAU Aibo Emotion Corpus [1] serves as basis with clearly defined test and training partitions incorporating speaker independence and different room acoustics as needed in most real-life settings. This paper introduces the challenge, the corpus, the features, and benchmark results of two popular approaches towards emotion recognition from speech.
Article
Full-text available
During expressive speech, the voice is enriched to convey not only the intended semantic message but also the emotional state of the speaker. The pitch contour is one of the important properties of speech that is affected by this emotional modulation. Although pitch features have been commonly used to recognize emotions, it is not clear what aspects of the pitch contour are the most emotionally salient. This paper presents an analysis of the statistics derived from the pitch contour. First, pitch features derived from emotional speech samples are compared with the ones derived from neutral speech, by using symmetric Kullback-Leibler distance. Then, the emotionally discriminative power of the pitch features is quantified by comparing nested logistic regression models. The results indicate that gross pitch contour statistics such as mean, maximum, minimum, and range are more emotionally prominent than features describing the pitch shape. Also, analyzing the pitch statistics at the utterance level is found to be more accurate and robust than analyzing the pitch statistics for shorter speech regions (e.g., voiced segments). Finally, the best features are selected to build a binary emotion detection system for distinguishing between emotional versus neutral speech. A new two-step approach is proposed. In the first step, reference models for the pitch features are trained with neutral speech, and the input features are contrasted with the neutral model. In the second step, a fitness measure is used to assess whether the input speech is similar to, in the case of neutral speech, or different from, in the case of emotional speech, the reference models. The proposed approach is tested with four acted emotional databases spanning different emotional categories, recording settings, speakers and languages. The results show that the recognition accuracy of the system is over 77% just with the pitch features (baseline 50%). When compared to conventional classification schemes, the proposed approach performs better in terms of both accuracy and robustness.
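The two-step detection scheme can be sketched as follows, under the simplifying assumptions that the neutral reference model is a Gaussian mixture over utterance-level pitch statistics and that the fitness measure is the log-likelihood under that model; the paper's exact reference models and fitness measure may differ.

```python
from sklearn.mixture import GaussianMixture

# Step 1: train a reference model on utterance-level pitch statistics
# (e.g. mean, max, min, range) computed from NEUTRAL speech only.
def fit_neutral_model(neutral_pitch_stats, n_components=4, seed=0):
    # neutral_pitch_stats: numpy array of shape (n_utterances, n_stats)
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(neutral_pitch_stats)
    return gmm

# Step 2: score an input utterance against the neutral model; a low fitness
# (log-likelihood) suggests the utterance deviates from neutral, i.e. is emotional.
def is_emotional(gmm, pitch_stats, threshold):
    fitness = gmm.score(pitch_stats.reshape(1, -1))  # mean log-likelihood
    return fitness < threshold
```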
Conference Paper
Full-text available
Emotion expression is an essential part of human interaction. Rich emotional information is conveyed through the human face. In this study, we analyze detailed motion-captured facial information of ten speakers of both genders during emotional speech. We derive compact facial representations using methods motivated by Principal Component Analysis and speaker face normalization. Moreover, we model emotional facial movements by conditioning on knowledge of speech-related movements (articulation). We achieve average classification accuracies on the order of 75% for happiness, 50-60% for anger and sadness and 35% for neutrality in speaker independent experiments. We also find that dynamic modeling and the use of viseme information improves recognition accuracy for anger, happiness and sadness, as well as for the overall unweighted performance.
Conference Paper
Full-text available
Emotion is expressed and perceived through multiple modalities. In this work, we model face, voice and head movement cues for emotion recognition and we fuse classifiers using a Bayesian framework. The facial classifier is the best performing followed by the voice and head classifiers and the multiple modalities seem to carry complementary information, especially for happiness. Decision fusion significantly increases the average total unweighted accuracy, from 55% to about 62%. Overall, we achieve average accuracy on the order of 65-75% for emotional states and 30-40% for neutral state using a large multi-speaker, multimodal database. Performance analysis for the case of anger and neutrality suggests a positive correlation between the number of classifiers that performed well and the perceptual salience of the expressed emotion.
Conference Paper
Full-text available
Social and emotional intelligence are aspects of human intelligence that have been argued to be better predictors than IQ for measuring aspects of success in life, especially in social interactions, learning, and adapting to what is important. When it comes to machines, not all of them will need such skills. Yet for machines like computers, broadcast systems, and cars to be capable of adapting to their users and of anticipating their wishes, endowing them with the ability to recognize users' affective states is necessary. This article discusses the components of human affect, how they might be integrated into computers, and how far we are from realizing affective multimodal human-computer interaction.
Conference Paper
Full-text available
In this paper, we report on classification results for emotional user states (4 classes, German database of children interacting with a pet robot). Six sites computed acoustic and linguistic features independently from each other, following in part different strategies. A total of 4244 features were pooled together and grouped into 12 low level descriptor types and 6 functional types. For each of these groups, classification results using Support Vector Machines and Random Forests are reported for the full set of features, and for 150 features each with the highest individual Information Gain Ratio. The performance for the different groups varies mostly between ≈ 50% and ≈ 60%. Index Terms: emotional user states, automatic classification, feature types, functionals
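A filter-style selection of the 150 individually highest-ranked features before classification can be sketched as below; mutual information is used here as a stand-in for Information Gain Ratio, which scikit-learn does not provide directly, and the classifier settings are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Rank each feature individually and keep only the top 150 before classification.
svm_pipeline = make_pipeline(SelectKBest(mutual_info_classif, k=150), SVC())
rf_pipeline = make_pipeline(SelectKBest(mutual_info_classif, k=150),
                            RandomForestClassifier(n_estimators=200))
# svm_pipeline.fit(X_train, y_train); rf_pipeline.fit(X_train, y_train)
```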
Conference Paper
Full-text available
This paper extends binary support vector machines to multi-class classification for recognising emotions from speech. We apply two standard schemes (one-versus-one and one-versus-rest) and two schemes that form a hierarchy of classifiers each making a distinct binary decision about class membership, on three publicly-available databases. Using the OpenEAR toolkit to extract more than 6000 features per speech sample, we have been able to outperform the state-of-the-art classification methods on all three databases.
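The two standard schemes mentioned above map directly onto off-the-shelf meta-estimators; the sketch below is an illustrative Python example with random placeholder features rather than the 6000+ OpenEAR features used in the paper.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Placeholder utterance-level feature vectors and 5-class emotion labels.
X = np.random.rand(200, 50)
y = np.random.randint(0, 5, size=200)

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # one binary SVM per class pair
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # one binary SVM per class vs. rest
print(ovo.predict(X[:3]), ovr.predict(X[:3]))
```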
Conference Paper
Full-text available
Interaction synchrony among interlocutors happens naturally as people gradually adapt their speaking style to promote efficient communication. In this work, we quantify one aspect of interaction synchrony - prosodic entrainment, specifically in pitch and energy - in married couples' problem-solving interactions using speech signal-derived measures. Statistical tests demonstrate that some of these measures capture useful information; they show higher values in interactions with couples having a highly positive attitude compared to a highly negative attitude. Further, by using quantized entrainment measures employed with statistical symbol sequence matching in a maximum likelihood framework, we obtained 76% accuracy in predicting positive affect vs. negative affect.
Conference Paper
Full-text available
In this study, we investigate politeness and frustration behavior of children during their spoken interaction with computer characters in a game. We focus on automatically detecting frustrated, polite and neutral attitudes from the child's speech (acoustic and language) communication cues and study their differences as a function of age and gender. The study is based on a Wizard-of-Oz dialog corpus of 103 children playing a voice activated computer game. Statistical analysis revealed that there was a significant gender effect on politeness, with girls in this data exhibiting more explicit politeness markers. The analysis also showed that there is a positive correlation between frustration and the number of dialog turns, reflecting the fact that longer time spent solving the puzzle of the game led to a more frustrated child. By combining acoustic and language cues for the task of automatic detection of politeness and frustration, we obtain average accuracy of 84.7% and 71.3%, respectively, by using age dependent models and 85% and 72%, respectively, for gender dependent models.
Article
Full-text available
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
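The synthetic over-sampling step described here can be summarized in a few lines of Python. This is a minimal sketch of the core interpolation idea only (the full method additionally combines it with under-sampling of the majority class) and is not the authors' implementation; the function name and interface are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority examples by interpolating between a sampled
    minority point and one of its k nearest minority-class neighbours.
    `minority` is a numpy array of shape (n_samples, n_features), n_samples > k."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, idx = nn.kneighbors(minority)               # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))            # pick a minority example
        j = idx[i, rng.integers(1, k + 1)]         # pick one of its k neighbours
        gap = rng.random()                         # random point on the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```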
Article
Full-text available
Since emotions are expressed through a combination of verbal and non-verbal channels, a joint analysis of speech and gestures is required to understand expressive human communication. To facilitate such investigations, this paper describes a new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). This database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration and neutral state). The corpus contains approximately 12 h of data. The detailed motion capture information, the interactive setting to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
Article
Full-text available
In order to improve the recognition accuracy of speech emotion recognition, this paper proposes a novel hierarchical method based on an improved Decision Directed Acyclic Graph SVM (improved DDAGSVM). The improved DDAGSVM is constructed according to the confusion degrees of emotion pairs. In addition, a geodesic distance-based testing algorithm is proposed for the improved DDAGSVM so that test samples that are hard to distinguish receive multiple decision chances. The informative features and optimized SVM parameters used in each node of the improved DDAGSVM are obtained simultaneously by a Genetic Algorithm (GA). On the Chinese Speech Emotion Database (CSED) and the Audio-Video Emotion Database (AVED) recorded by our workgroup, the recognition experiments reveal that, compared with multi-class SVM, a binary decision tree and the traditional DDAGSVM, the improved DDAGSVM achieves higher recognition accuracy for 7 emotions with few selected informative features and moderate computation time.
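For context, a plain DDAG evaluates K(K-1)/2 pairwise binary classifiers and eliminates one candidate class per node. The Python sketch below shows only this basic elimination scheme, not the confusion-degree ordering, geodesic-distance test, or GA-based feature selection described above; the classifier-table interface is hypothetical.

```python
def ddag_predict(pairwise_clf, classes, x):
    """Classify x with a Decision Directed Acyclic Graph built from pairwise
    binary classifiers. `pairwise_clf[(a, b)]` is a callable that returns either
    a or b, whichever class it prefers for sample x (hypothetical interface)."""
    remaining = list(classes)
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]          # compare the two "outer" classes
        winner = pairwise_clf[(a, b)](x)
        remaining.remove(a if winner == b else b)   # drop the losing class
    return remaining[0]

# usage: ddag_predict(clf_table, ["anger", "happiness", "sadness", "neutral"], features)
```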
Conference Paper
Full-text available
Speech emotion is high-level semantic information and its automatic analysis may have many applications, such as smart human-computer interaction or multimedia indexing. As a pattern recognition problem, feature selection and the structure of the classifier are two important aspects of automatic speech emotion classification. In this paper, we propose a novel feature selection scheme based on evidence theory. Furthermore, we also present a new automatic approach for constructing a hierarchical classifier, which allows better performance than the global classifiers mostly used in the literature. Evaluated on the Berlin database, our approach showed its effectiveness, scoring a recognition rate of up to 78.64%.
Article
Full-text available
Creating conversational interfaces for children is challenging in several respects. These include acoustic modeling for automatic speech recognition (ASR), language and dialog modeling, and multimodal-multimedia user interface design. First, issues in ASR of children's speech are introduced by an analysis of developmental changes in the spectral and temporal characteristics of the speech signal using data obtained from 456 children, ages five to 18 years. Acoustic modeling adaptation and vocal tract normalization algorithms that yielded state-of-the-art ASR performance on children's speech are described. Second, an experiment designed to better understand how children interact with machines using spoken language is described. Realistic conversational multimedia interaction data were obtained from 160 children who played a voice-activated computer game in a Wizard of Oz (WoZ) scenario. Results of using these data in developing novel language and dialog models as well as in a unified maximum likelihood framework for acoustic decoding in ASR and semantic classification for spoken language understanding are described. Leveraging the lessons learned from the WoZ study and a concurrent user experience evaluation, a multimedia personal agent prototype for children was designed. Details of the architecture and the application are described. Informal evaluation by children was positive, especially for the animated agent and the speech interface.
Chapter
In this chapter we consider bounds on the rate of uniform convergence. We consider upper bounds (there exist lower bounds as well (Vapnik and Chervonenkis, 1974); however, they are not as important for controlling the learning processes as the upper bounds).
Article
The main purpose of this chapter is to portray the evolution of the author's own approach to appraisal, with respect first to psychological stress and then to the emotions. The author first discusses the origins and terminology of the appraisal construct and his version of appraisal theory as applied to psychological stress. The author's analysis of stress and coping led to confusion about the differences between appraisal and coping and about the way the process of appraising works. These questions led to the author's change in focus from stress to emotion. The author discusses his cognitive-motivational-relational theory of the emotions and examines what distinguishes his approach from other appraisal theories.
Article
Attention is drawn to three interrelated types of error that are committed with high frequencies in the description and analysis of studies of nonverbal behavior. The errors involve the calculation of inappropriate measures of accuracy, the use in statistical analyses of inappropriate chance levels, and misapplications of χ² and binomial statistical tests. Almost all papers published between 1979 and 1991 that reported performance separately for different stimulus and response classes suffer from one or more of these errors. The potential consequences of these errors are described, and a variety of proposed measures of performance is examined. Since all measures formerly proposed have weaknesses, a new and easily calculated measure, an unbiased hit rate (Hu), is proposed. This measure is the joint probability that a stimulus category is correctly identified given that it is presented at all and that a response is correctly used given that it is used at all. Two available data sets are reanalyzed using this measure, and the differences in the conclusions reached compared to those reached with an analysis of hit rates are described.
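Written out for a confusion matrix with entries n_ij (stimulus category i, response j), the joint probability described above is commonly rendered as follows; the notation here is illustrative rather than taken from the paper:

```latex
H_u(i) \;=\; \frac{n_{ii}}{\sum_{j} n_{ij}} \cdot \frac{n_{ii}}{\sum_{j} n_{ji}}
\;=\; \frac{n_{ii}^{2}}{\Bigl(\sum_{j} n_{ij}\Bigr)\Bigl(\sum_{j} n_{ji}\Bigr)},
```

where the first factor is the hit rate given that stimulus i was presented and the second is the proportion of correct uses of response i.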
Article
Embodied computer agents are becoming an increasingly popular human–computer interaction technique. Often, these agents are programmed with the capacity for emotional expression. This paper investigates the psychological effects of emotion in agents upon users. In particular, two types of emotion were evaluated: self-oriented emotion and other-oriented, empathic emotion. In a 2 (self-oriented emotion: absent vs. present) by 2 (empathic emotion: absent vs. present) by 2 (gender dyad: male vs. female) between-subjects experiment (N=96), empathic emotion was found to lead to more positive ratings of the agent by users, including greater likeability and trustworthiness, as well as greater perceived caring and felt support. No such effect was found for the presence of self-oriented emotion. Implications for the design of embodied computer agents are discussed and directions for future research suggested.
Article
The aim of the experimental study described in this article is to investigate the effect of a life-like character with subtle expressivity on the affective state of users. The character acts as a quizmaster in the context of a mathematical game. This application was chosen as a simple, and for the sake of the experiment, highly controllable, instance of human–computer interfaces and software. Subtle expressivity refers to the character's affective response to the user's performance by emulating multimodal human–human communicative behavior such as different body gestures and varying linguistic style. The impact of empathic behavior, which is a special form of affective response, is examined by deliberately frustrating the user during the game progress. There are two novel aspects in this investigation. First, we employ an animated interface agent to address the affective state of users rather than a text-based interface, which has been used in related research. Second, while previous empirical studies rely on questionnaires to evaluate the effect of life-like characters, we utilize physiological information of users (in addition to questionnaire data) in order to precisely associate the occurrence of interface events with users’ autonomic nervous system activity. The results of our study indicate that empathic character response can significantly decrease user stress and that affective behavior may have a positive effect on users’ perception of the difficulty of a task.
Conference Paper
We propose a multi-sensor affect recognition system and evaluate it on the challenging task of classifying interest (or disinterest) in children trying to solve an educational puzzle on the computer. The multimodal sensory information from facial expressions and postural shifts of the learner is combined with information about the learner's activity on the computer. We propose a unified approach, based on a mixture of Gaussian Processes, for achieving sensor fusion under the problematic conditions of missing channels and noisy labels. This approach generates separate class labels corresponding to each individual modality. The final classification is based upon a hidden random variable, which probabilistically combines the sensors. The multimodal Gaussian Process approach achieves accuracy of over 86%, significantly outperforming classification using the individual modalities, and several other combination schemes.
Article
The importance of automatically recognizing emotions from human speech has grown with the increasing role of spoken language interfaces in human-computer interaction applications. This paper explores the detection of domain-specific emotions using language and discourse information in conjunction with acoustic correlates of emotion in speech signals. The specific focus is on a case study of detecting negative and non-negative emotions using spoken language data obtained from a call center application. Most previous studies in emotion recognition have used only the acoustic information contained in speech. In this paper, a combination of three sources of information (acoustic, lexical, and discourse) is used for emotion recognition. To capture emotion information at the language level, an information-theoretic notion of emotional salience is introduced. Optimization of the acoustic correlates of emotion with respect to classification error was accomplished by investigating different feature sets obtained from feature selection, followed by principal component analysis. Experimental results on our call center data show that the best results are obtained when acoustic and language information are combined. Results show that combining all the information, rather than using only acoustic information, improves emotion classification by 40.7% for males and 36.4% for females (linear discriminant classifier used for acoustic information).
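A common way to formalize an information-theoretic emotional salience of a word w with respect to emotion classes e_1, ..., e_K is the mutual information between the word and the class variable, shown below; this is a standard rendering and may differ in detail from the paper's exact definition:

```latex
\mathrm{sal}(w) \;=\; \sum_{k=1}^{K} P(e_k \mid w)\,\log\frac{P(e_k \mid w)}{P(e_k)} .
```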
Eyben, F., Woellmer, M., Schuller, B.: Speech and music interpretation by large-space extraction. Tech. rep., Institute for Human-Machine Communication.
Wagner: Measuring performance in category judgment studies on nonverbal behavior.
Steidl, S.: Automatic classification of emotion-related user states in spontaneous children's speech.