Article

Abstract

Emotion recognition is one of the latest challenges in intelligent human/computer communication. Most previous work on emotion recognition has focused on extracting emotions from visual or audio information separately. In this paper, a novel approach that combines both the visual and the audio streams of video clips is presented to recognize human emotion. Facial Animation Parameters (FAPs)-compliant facial feature tracking based on GASM (GPU-based Active Shape Model) is performed on the video to generate two vector streams, one representing the expression features and the other the visual speech features. To extract effective speech features, we develop an enhanced Lipschitz embedding based on geodesic distance estimation to embed high-dimensional acoustic features into a low-dimensional space. Combined with the visual vectors, the audio vector is thus extracted in terms of low-dimensional features. A tripled Hidden Markov Model is then introduced to perform the recognition; it allows state asynchrony between the audio and visual observation sequences while preserving their natural correlation over time. The experimental results show that this approach outperforms conventional approaches to emotion recognition.
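For readers who want a concrete picture of the dimensionality reduction step described above, the following is a minimal sketch of a basic Lipschitz embedding, in which each low-dimensional coordinate is the minimum distance from a sample to a randomly chosen reference subset. The paper's enhanced variant replaces Euclidean distances with geodesic distance estimates, which is not reproduced here; the reference-set sizes and feature dimensions are illustrative assumptions.

```python
import numpy as np

def lipschitz_embedding(X, n_reference_sets=10, set_size=5, rng=None):
    """Embed rows of X into a low-dimensional space.

    Each output coordinate is the minimum distance from a sample to one
    randomly chosen reference subset of the data (basic Lipschitz embedding).
    The paper's enhanced variant uses geodesic rather than Euclidean
    distances; plain Euclidean distances are used here for brevity.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Draw the reference subsets from the available samples.
    ref_sets = [rng.choice(n, size=set_size, replace=False)
                for _ in range(n_reference_sets)]
    embedded = np.empty((n, n_reference_sets))
    for j, idx in enumerate(ref_sets):
        # Distance of every sample to every member of reference set j ...
        d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=-1)
        # ... and the minimum of those distances gives coordinate j.
        embedded[:, j] = d.min(axis=1)
    return embedded

# Toy usage: 200 frames of 48-dimensional acoustic features -> 10-D vectors.
acoustic = np.random.rand(200, 48)
low_dim = lipschitz_embedding(acoustic, n_reference_sets=10, set_size=5, rng=0)
print(low_dim.shape)  # (200, 10)
```

In the paper's setting, the rows of the input matrix would be high-dimensional acoustic feature vectors, and the embedded output would then be combined with the visual feature streams.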


... Over the last few years, we have witnessed a great deal of research [109,110] on emotion recognition addressing the problems of facial expression detection and/or audio affect recognition. Audio affect recognition of the speech signal aims to identify the emotional states of humans by analyzing their voices. ...
... In addition to the above studies, which focused only on individual audio or video modalities, there is a growing body of work that includes both video and audio emotion recognition [109,110]. The features used by those methods are mainly low-level features, such as tracking points for collecting visual data or audio features extracted at the pitch level. ...
... The geometric features are distances between facial landmarks: between the right eye and the left eye; between the inner and outer corners of the left eye; between the upper and lower lines of the left eye; between the left and right iris corners of the left eye; between the inner and outer corners of the right eye; between the upper and lower lines of the right eye; between the left and right iris corners of the right eye; between the inner and outer corners of the left eyebrow; between the inner and outer corners of the right eyebrow; and between the top and the bottom of the mouth [110]. A thread-based approach is applied to detect and sharpen the edge of the contour of the face components. ...
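The excerpt above enumerates geometric distance features between facial landmarks. As a hedged illustration (the 68-point landmark indices below follow a common convention and are assumptions, not the cited work's exact scheme), such features reduce to Euclidean distances between selected landmark pairs:

```python
import numpy as np

# Hypothetical index pairs in a 68-point landmark array (0-based); the exact
# points used by the cited work may differ.
DISTANCE_PAIRS = {
    "inter_eye":        (39, 42),  # inner corner of left eye to inner corner of right eye
    "left_eye_width":   (36, 39),
    "left_eye_height":  (37, 41),
    "right_eye_width":  (42, 45),
    "right_eye_height": (43, 47),
    "left_brow_width":  (17, 21),
    "right_brow_width": (22, 26),
    "mouth_height":     (51, 57),
}

def distance_features(landmarks: np.ndarray) -> dict:
    """landmarks: array of shape (68, 2) with (x, y) pixel coordinates."""
    return {name: float(np.linalg.norm(landmarks[i] - landmarks[j]))
            for name, (i, j) in DISTANCE_PAIRS.items()}

# Toy usage with random coordinates standing in for a detected face.
feats = distance_features(np.random.rand(68, 2) * 100)
print(feats["inter_eye"], feats["mouth_height"])
```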
Preprint
We utilize commonsense knowledge bases to address the problem of real-time multimodal analysis. In particular, we focus on the problem of multimodal sentiment analysis, which consists in the simultaneous analysis of different modalities, e.g., speech and video, for emotion and polarity detection. Our approach takes advantage of the massively parallel processing power of modern GPUs to enhance the performance of feature extraction from different modalities. In addition, in order to extract important textual features from multimodal sources, we generate domain-specific graphs based on commonsense knowledge and apply GPU-based graph traversal for fast feature detection. Then, powerful ELM classifiers are applied to build the sentiment analysis model based on the extracted features. We conduct our experiments on the YouTube dataset and achieve an accuracy of 78%, which outperforms all previous systems. In terms of processing speed, our method shows improvements of several orders of magnitude for feature extraction compared to CPU-based counterparts.
... Over the last few years, we have witnessed a great deal of research [2,23] on emotion recognition addressing the problems of facial expression detection and/or audio affect recognition. Audio affect recognition of the speech signal aims to identify the emotional states of humans by analyzing their voices. ...
... Some common examples of approaches that use FACS to understand expressed facial expressions are the active appearance model [22] and the active shape model [12]. By employing AUs as features with classifiers such as k-nearest neighbors, Bayesian networks, hidden Markov models (HMMs), and artificial neural networks (ANNs) [23], many research works have successfully managed to infer emotions from facial expressions. ...
... In addition to the above studies, which focused only on individual audio or video modalities, there is a growing body of work that includes both video and audio emotion recognition [2,23]. The features used by those methods are mainly low-level features, such as tracking points for collecting visual data or audio features extracted at the pitch level. ...
Article
Full-text available
The enormous number of videos posted every day on multimedia websites such as Facebook and YouTube makes the Internet an infinite source of information. Collecting and processing such information, however, is a very challenging task, as it involves dealing with a huge amount of information that is changing at a very high speed. To this end, we leverage the processing speed of the extreme learning machine and the graphics processing unit to overcome the limitations of standard learning algorithms and the central processing unit (CPU) and, hence, perform real-time multimodal sentiment analysis, i.e., harvesting sentiments from web videos by taking into account the audio, visual and textual modalities as sources of information. For the sentiment classification, we leveraged sentic memes, i.e., basic units of sentiment whose combination can potentially describe the full range of emotional experiences that are rooted in any of us, including different degrees of polarity. We used both feature- and decision-level fusion methods to fuse the information extracted from the different modalities. Using the sentiment-annotated dataset generated from YouTube video reviews, our proposed multimodal system is shown to achieve an accuracy of 78%. In terms of processing speed, our method shows improvements of several orders of magnitude for feature extraction compared to CPU-based counterparts.
... To deal with this problem, model-level fusion strategies [31,32,50,51,75-77] were proposed to emphasize the information of correlation among multiple modalities and to explore the temporal relationship between the audio and visual signal streams (as shown in Fig. 6). There are several distinctive examples, such as the Coupled HMM (C-HMM) [77,98], Tripled HMM (T-HMM) [75], Multistream Fused HMM (MFHMM) [76], and Semi-Coupled HMM (SC-HMM) [51]. In C-HMM, which is a traditional example, two component HMMs are linked through cross-time and cross-chain conditional probabilities. ...
... Further, Lu et al. [77] designed an AdaBoost-CHMM strategy, which boosts the performance of component C-HMM classifiers with a modified expectation maximization (EM) training algorithm to generate a strong ensemble classifier. Song et al. [75] extended the C-HMM to a T-HMM that connects three HMMs for two visual input sequences and one audio sequence. Similarly, Jiang et al. [32] proposed a Triple-stream audio-visual Asynchronous Dynamic Bayesian Network (T_AsyDBN) to combine MFCC features, local prosodic features and visual emotion features in a reasonable manner. ...
Article
Full-text available
Emotion recognition is the ability to identify what people would think someone is feeling from moment to moment and understand the connection between his/her feelings and expressions. In today's world, human–computer interaction (HCI) interface undoubtedly plays an important role in our daily life. Toward harmonious HCI interface, automated analysis and recognition of human emotion has attracted increasing attention from the researchers in multidisciplinary research fields. In this paper, a survey on the theoretical and practical work offering new and broad views of the latest research in emotion recognition from bimodal information including facial and vocal expressions is provided. First, the currently available audiovisual emotion databases are described. Facial and vocal features and audiovisual bimodal data fusion methods for emotion recognition are then surveyed and discussed. Specifically, this survey also covers the recent emotion challenges in several conferences. Conclusions outline and address some of the existing emotion recognition issues. Copyright © The Authors, 2014 This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
... Multimodal systems can achieve higher recognition rates compared with separate speech or visual systems [38]. Meza-Kubo et al. proposed an emotion recognition method using qualitative analysis, questionnaire surveys and additional self-reports to improve the recognition accuracy [39]. ...
... The eye feature points among the 68 facial landmarks are numbered p37-p42 and p43-p48; the equation for calculating the eye aspect ratio is shown in equation (4). ...
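For reference, the eye aspect ratio referred to as equation (4) above is commonly computed from the six landmarks of one eye, denoted p_1, ..., p_6 here (the exact point numbering used by the cited study is assumed, not verified):

\mathrm{EAR} = \frac{\lVert p_2 - p_6 \rVert + \lVert p_3 - p_5 \rVert}{2\,\lVert p_1 - p_4 \rVert}

The numerator measures eye opening (vertical landmark distances) and the denominator normalizes by eye width, so the ratio drops sharply during a blink.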
Article
Full-text available
The accessibility of online teaching makes it popular in various teaching scenarios. Most research on online interaction focuses mainly on communication network stability, facial expression/gesture recognition algorithms and statistical description analysis, and its expertise mainly comes from the fields of computer technology and algorithm engineering. The evaluation and analysis of the effect of online teaching is an important test of the adaptability of such tools in the field of education. However, from the perspective of teachers, there is still a lack of literature on data interpretation after the application of this technology. An experiment based on real online teaching was carried out in this paper. This study uses image recognition technology to process video and extract five kinds of head movement data from dozens of student samples, and then develops a statistical description and interpretation. Some novel and interesting conclusions indicate that diversified behaviors occur in real-time online learning. This study obtained data on five high-frequency online learning behaviors: blinking, yawning, nodding, shaking the head and leaving. These behaviors are related to learning state and time. Teaching features, students' personal characteristics and the learning environment have a comprehensive impact on online learning behaviors. The results provide a basis for personalized learning and teaching scheme design in the future. They also help to enrich online teaching evaluation methods and accelerate the construction of online education frameworks and rules.
... Speech is a complex signal comprising data about the message, the speaker, and his/her emotion while speaking. Human-Computer Interaction (HCI) [1] plays an important role in understanding and conveying each other's purposes more naturally. The main task that HCI accomplishes is to develop the capability to recognize the emotion of the speaker very precisely, which is usually very similar to the capability of human-robot interaction [2]. ...
... Additionally, the accuracy rate for the speaker-dependent GMM was about 89.12%, whereas that for the speaker-independent case reached 75%. Similar experiments using ANNs yielded an accuracy of about 52.87% for the speaker-independent case, which is considerably less than that of other classification techniques, and of about 51.19% for speaker-dependent classification. Further classification experiments were conducted using the KNN classification technique, which had an accuracy rate of 64% for four different emotional states using feature vectors such as energy contours and pitch [1], [2], [3], [5], [14]. ...
... Further improvement in emotion recognition can be achieved by the development of multimodal systems that integrate expressions and voice patterns. In (Song et al., 2008) a multimodal system was built with Tripled HMMs (THMMs) for the recognition of the following emotions: Surprise, Happiness, Anger, Fear, Sadness and Neutral. THMMs were used to synchronize voice and facial features in the time domain. ...
... • In general, the recognition rates of vision systems are higher than those of speech systems (Anagnostopoulos et al., 2015). However, a multimodal system can achieve higher recognition rates than those of the individual vision or speech systems (Song et al., 2008). • Most of the emotion recognition works on facial expressions considered specialized databases like FEEDTUM (Filko & Martinovic, 2013; Pal & Hasan, 2014), JAFFE (Gosavi & Khot, 2013; Kaur, Vashisht, & Neeru, 2010; Pooja & Kaur, 2010; Rasoulzadeh, 2012; Thuseethan & Kuhanesan, 2014), FACES (Tayal & Vijay, 2012), CK+ (Chaturvedi & Tripathi, 2014), and RaFD (Ilbeygi & Hosseini, 2012). ...
Article
Service robotics is an important field of research for the development of assistive technologies. In particular, humanoid robots will play an increasing and important role in our society. More natural assistive interaction with humanoid robots can be achieved if the emotional aspect is considered. However, emotion recognition is one of the most challenging topics in pattern recognition, and improved intelligent techniques have to be developed to accomplish this goal. Recent research has addressed the emotion recognition problem with techniques such as Artificial Neural Networks (ANNs)/Hidden Markov Models (HMMs), and the reliability of the proposed approaches has been assessed (in most cases) with standard databases. In this work we (1) explored the implications of using standard databases for the assessment of emotion recognition techniques, (2) extended the evolutionary optimization of ANNs and HMMs for the development of a multimodal emotion recognition system, (3) set guidelines for the development of emotional databases of speech and facial expressions, (4) set rules for the phonetic transcription of Mexican speech, and (5) evaluated the suitability of the multimodal system within the context of spoken dialogue between a humanoid robot and human users. The development of intelligent systems for emotion recognition can be improved by the findings of the present work: (a) emotion recognition depends on the structure of the database sub-sets used for training and testing, and it also depends on the type of technique used for recognition, where a specific emotion can be highly recognized by a specific technique; (b) optimization of HMMs led to a Bakis structure which is more suitable for acoustic modeling of emotion-specific vowels, while optimization of ANNs led to an ANN structure more suitable for recognition of facial expressions; (c) some emotions can be better recognized based on speech patterns instead of visual patterns; and (d) the weighted integration of the multimodal emotion recognition system optimized with these observations can achieve a recognition rate of up to 97.00% in live dialogue tests with a humanoid robot.
... Most of the existing studies of automatic emotion recognition focus on recognizing these basic emotions. These seven emotional states are common and have been used in the majority of previous works [5,7,14,21,30,31,37,38,46]. Our method is general and can be extended to more emotional states. ...
... Song et al. [46] used a tripled hidden Markov model (THMM) to model joint dynamics of the three signals perceived from the subject: a) pitch and energy as speech features, b) motion of eyebrow, eyelid, and cheek as facial expression features, and c) lips and jaw as visual speech signals. The proposed THMM architecture was tested for seven basic emotions (surprise, anger, joy, sadness, disgust, fear, and neutral), and its overall performance was 85 %. ...
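As a hedged illustration of the speech features named above (pitch and energy contours), the sketch below uses librosa with illustrative frame parameters; it shows one plausible front end, not the cited paper's exact feature extraction.

```python
import librosa
import numpy as np

def pitch_energy_contours(wav_path, sr=16000, frame_length=400, hop_length=160):
    """Return per-frame fundamental-frequency and RMS-energy contours."""
    y, sr = librosa.load(wav_path, sr=sr)
    # Fundamental frequency via the YIN estimator (speech range assumed).
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                     frame_length=frame_length, hop_length=hop_length)
    # Short-time energy as root-mean-square amplitude per frame.
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    n = min(len(f0), len(rms))
    return np.stack([f0[:n], rms[:n]], axis=1)  # shape (frames, 2)

# Example (hypothetical file name):
# features = pitch_energy_contours("utterance.wav")
```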
Article
Full-text available
Humans use many modalities such as the face, speech and body gestures to express their feelings. So, to make emotional computers and make human-computer interaction (HCI) more natural and friendly, computers should be able to understand human feelings using speech and visual information. In this paper, we recognize emotions from audio and visual information using a fuzzy ARTMAP neural network (FAMNN). The audio and visual systems are fused at the decision and feature levels. Finally, particle swarm optimization (PSO) is employed to determine the optimum values of the choice parameter (α), the vigilance parameters (ρ), and the learning rate (β) of the FAMNN. Experimental results showed that the feature-level and decision-level fusions improve the outcome of the unimodal systems, and that PSO improved the recognition rate. By using the PSO-optimized FAMNN at feature-level fusion, the recognition rate was improved by about 57% with respect to the audio system and by about 4.5% with respect to the visual system. The final emotion recognition rate on the SAVEE database reached 98.25% using audio and visual features with the optimized FAMNN.
... Busso et al. [2] present a system for recognizing emotions through facial expressions displayed in live video streams and video sequences. However, works such as those presented in [3] suggest that developing a good methodology for characterizing emotional states based on facial expressions leads to more robust recognition systems. The Facial Action Coding System (FACS) proposed by Ekman et al. [1] is a comprehensive and anatomically based system that is used to measure all visually discernible facial movements in terms of atomic facial actions called Action Units (AUs). ...
... As AUs are independent of interpretation, they can be used for any high-level decision-making process, including the recognition of basic emotions according to the Emotional FACS (EM-FACS) and the recognition of various affective states according to the FACS Affect Interpretation Database (FACSAID) introduced by Ekman et al. [4], [5], [6], [7]. From the detected features, it is possible to estimate the emotion present in a particular subject, based on an analysis of the estimated facial expression shape in comparison with a set of facial expressions for each emotion [3], [8], [9]. ...
Conference Paper
Full-text available
This work presents a framework for emotion recognition based on facial expression analysis, using Bayesian Shape Models (BSM) for facial landmark localization. Facial Action Coding System (FACS)-compliant facial feature tracking is based on the Bayesian Shape Model, which estimates the parameters of the model with an implementation of the EM algorithm. We describe the characterization methodology derived from the parametric model and evaluate the accuracy of feature detection and of the estimation of the parameters associated with facial expressions, analyzing its robustness to pose and local variations. Then, a methodology for emotion characterization is introduced to perform the recognition. The experimental results show that the proposed model can effectively detect the different facial expressions, outperforming conventional approaches for emotion recognition and obtaining high performance in the estimation of the emotion present in a given subject. The model used and the characterization methodology proved efficient, detecting the emotion type in 95.6% of the cases.
... while the architecture accepts compound input features, it adapts for emotion-specific modalities. Song et al. [14] used a Tripled Hidden Markov Model (THMM) to model the joint dynamics of three signals perceived from the subject, namely speech, facial expression and visual speech signals. The proposed HMM architecture allows state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. ...
... Using multistream HMMs, Zeng et al. [8] improved the emotion recognition rate by up to 6.5% over the unimodal systems. Song et al. [14] applied a tripled HMM to the face and speech information and obtained an improvement of about 6% over the best unimodal system (i.e., the mean recognition rate by face was 87% and by the multimodal system was about 93%). ...
Article
Full-text available
A hybrid feature- and decision-level information fusion architecture is proposed for human emotion recognition from facial expression and speech prosody. An active buffer stores the most recent information extracted from the face and the speech. This buffer allows fusion of asynchronous information by keeping track of individual modality updates. The contents of the buffer are fused at the feature level if their respective update times are close to each other. Based on the classifiers' reliability, a decision-level fusion block combines the results of the unimodal speech- and face-based systems and of the feature-level fusion based classifier. Experimental results on a database of 12 people show that the proposed fusion architecture performs better than unimodal classification, pure feature-level fusion and decision-level fusion.
... Researchers have confirmed that speech can convey the psychological state of an interlocutor [8]∼ [10]. Therefore, researchers have proposed research on speech emotion recognition [11] [12]. Researchers have confirmed the application of speech emotion recognition in human-computer interaction (HCI). ...
Article
Full-text available
Speech emotion recognition makes a huge contribution to human-computer interaction, and feature extraction is one of its central problems. Influenced by the success of computer vision, this research proposes a new feature extraction method. The method uses the Gramian Angular Field (GAF) to convert the time series into images and, based on the properties of the resulting symmetric matrix, extracts feature values from the imaged data. Four algorithms are used to pre-process the speech signal. The results show that, compared with the VGG and ResNet network models, the new method has a better classification effect, which illustrates that the method has great potential for identifying speech emotion in the future.
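As a hedged sketch of the transform described above, the Gramian Angular (Summation) Field of a 1-D series can be computed in a few lines of numpy; the rescaling range and the summation variant are illustrative assumptions rather than the cited paper's exact configuration.

```python
import numpy as np

def gramian_angular_field(series: np.ndarray) -> np.ndarray:
    """Convert a 1-D time series into a Gramian Angular Summation Field image."""
    x = np.asarray(series, dtype=float)
    # Rescale to [-1, 1] so the values can be read as cosines of angles.
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    x = np.clip(x, -1.0, 1.0)
    s = np.sqrt(1.0 - x ** 2)          # sin(phi) for each sample
    # cos(phi_i + phi_j) = cos(phi_i)cos(phi_j) - sin(phi_i)sin(phi_j)
    return np.outer(x, x) - np.outer(s, s)

image = gramian_angular_field(np.sin(np.linspace(0, 6.28, 128)))
print(image.shape)  # (128, 128) symmetric matrix
```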
... In most HMM-based emotion recognition schemes, the left-to-right topology of the HMM structure was used [4], [19]-[21], and has been proven useful in modeling the signal streams (i.e., audio or visual) for describing the temporal courses of emotional expressions. However, it may be invalid for utterance-based emotion recognition, especially in natural conversation. ...
... In [16], a multimodal emotion recognition system for Anger, Happiness, Surprise, Fear, Sadness and Neutral was developed using FAPs (Facial Animation Parameters) and the Lipschitz technique for acoustic features. Tripled Hidden Markov Models (THMMs) were implemented to synchronize the audio with the visual pattern sequences and to perform the classification. ...
... There are various ways to detect the emotional states of humans, such as directly asking the user, tracking and investigating implicit parameters, voice signal processing [4]-[5], vital signal processing [6]-[7], facial expression recognition [8] and gesture recognition [9]. Sometimes two or more techniques are used to form a hybrid method by fusing multi-modal emotional cues [10]-[12]. The most natural way of displaying human emotional states is through facial expressions. ...
Article
Full-text available
Emotion recognition has many applications in the relation between humans and machines. A facial emotion recognition framework for the six basic emotions of happiness, sadness, disgust, surprise, anger and fear is proposed in this paper. The proposed framework utilizes histogram estimates of the shape and textural characteristics of the face image. Instead of processing the original gray levels of the face image directly, which may carry no significant information about facial expression, the processing is done on transformed images containing informative features. The shape features are extracted by morphological operators by reconstruction, and the texture features are acquired by computing the gray-level co-occurrence matrix (GLCM) and applying Gabor filters. Using the whole face image may provide non-informative and redundant information, so the proposed emotion recognition method uses only the most important components of the face, such as the eyes, nose and mouth. After textural and shape feature extraction, the histogram function is applied to the shape and texture features containing the emotional states of the face. The simple and powerful nearest neighbor classifier is used for classification of the fused histogram features. The experiments show the good performance of the proposed framework compared to some state-of-the-art facial expression methods such as local linear embedding (LLE), Isomap, Morphmap and the local directional pattern (LDP).
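As a hedged illustration of the texture features named in the abstract above (GLCM statistics and Gabor filter responses), the following sketch uses scikit-image with illustrative offsets, orientations and frequencies; it is not the paper's configuration. (In scikit-image versions before 0.19 the GLCM functions are spelled greycomatrix/greycoprops.)

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from skimage.filters import gabor

def texture_features(patch: np.ndarray) -> np.ndarray:
    """patch: 2-D uint8 gray-level image of one facial component (eye, nose, mouth)."""
    # GLCM statistics at one pixel offset and four orientations.
    glcm = graycomatrix(patch, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    glcm_feats = np.concatenate([graycoprops(glcm, prop).ravel()
                                 for prop in ("contrast", "homogeneity",
                                              "energy", "correlation")])
    # Mean magnitude of Gabor responses at a few frequencies and orientations.
    gabor_feats = []
    for freq in (0.1, 0.2, 0.3):
        for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
            real, imag = gabor(patch, frequency=freq, theta=theta)
            gabor_feats.append(np.mean(np.hypot(real, imag)))
    return np.concatenate([glcm_feats, np.array(gabor_feats)])

feats = texture_features((np.random.rand(48, 48) * 255).astype(np.uint8))
print(feats.shape)  # 16 GLCM statistics + 12 Gabor magnitudes
```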
... Within the set of studies analyzed, some authors choose to analyze a small set of facial expressions (FEs), while others work with a larger set, depending on the application context that motivates their study and on the type of emotion or human behavior to be analyzed. Among the studies in which the authors choose to reduce the set of analyzed FEs are: [Huang and Lin, 2008], in which the authors analyzed only the FEs related to the emotions of surprise, happiness and anger, plus the neutral expression; [Tews et al., 2011], in which the authors worked with the expressions for happiness, anger and neutral; [Song et al., 2008], where the authors analyzed the neutral, joy, anger, surprise, sadness and fear expressions through a multimodal analysis, that is, in addition to images, audio information was incorporated into the input of the analysis models; [Siddiqui et al., 2009] ...
Thesis
Full-text available
Facial expression recognition has attracted much of researchers' attention over the last years because it can be very useful in many applications. Sign Language is a spatio-visual language and does not have the support of speech intonation, so facial expressions gain relative importance to convey grammatical information in a signed sentence, contributing at the morphological and/or syntactic level of a sign language. These expressions are called Grammatical Facial Expressions, and they help to resolve the ambiguity between signs and give meaning to sentences. Thus, this research project aims to develop models that make it possible to automatically recognize Grammatical Facial Expressions from Brazilian Sign Language (Libras).
... Classifier (references; count): Linear Discriminant Classifiers (LDC) ([6]; 1); k-Nearest Neighbors (k-NN) ([6], [10], [24], [82], [94]; 5); Decision Tree ([57]; 1); Bayesian Classifier ([11], [22], [42], [90], [107]; 5); Long Short-Term Memory (LSTM) Networks ([107], [111], [112] [38], [42]; 2); Fuzzy ARTMAP Neural Network (FAMNN) ([53]; 1); Gaussian Mixture Model (GMM) ([10], [10], [92]; 2); Multi-layer Perceptrons (MLPs) ([12], [38], [84]; 3); Fuzzy Logic ([94]; 1); Hidden Markov Model (HMM) ([9], [12], [82], [90], [95], [115], [116]; 7). It is seen in Table 11 that SVM, GMM, HMM, k-NN and the Bayesian classifier are the most common classifiers. ...
Article
Full-text available
Numerous researchers have conducted studies on the recognition of emotion from human speech with different study designs. Speech Emotion Recognition (SER) is a specific class of signal processing whose main goal is to identify the emotional state of people from their voice. SER processes typically begin with the extraction of acoustic features from the speech signal via signal processing. Subsequent to the selection of the most relevant speech features, a model explaining the relations between the emotions and the voice is sought. The effects of acoustic parameters, the validity of the data used, and the performance of the classifiers have been the vital issues for the emotion recognition research field. In this study, a content analysis of studies on SER based on acoustic parameters was performed. 81 articles (published in indexed journals) have been assessed with respect to the approaches used for emotion labelling, the acoustic features, the classifiers, and the databases used. In addition to that analysis, the effect of the acoustic parameters on emotional status is also summarized. The main aim of this study is to describe the features of the databases in use and to create a brief on the efficiency of the acoustic parameters and the classifiers employed by previous studies. Thereby, it is expected to shed light on the study design for future studies.
... Even with all of the readily available databases out there, there is still a need for creating self-collected databases for emotion recognition, as the existing ones do not always fulfil all of the criteria [130-133]. Table 7. 3D and RGB-D databases. ...
... To understand and convey each other's intentions in a natural way, human-computer interaction (HCI) has received more attention in recent years [1]. The primary problem that HCI faces is how to master the ability to identify emotional information accurately, which is similar to the emotional intelligence capability in human-robot interaction [2]. ...
... Then, the best features were selected using different filter and wrapper methods. To reduce the influence of noise, the authors in [6,7] used a feature dimensionality reduction method, called enhanced Lipschitz embedding. ...
Article
Full-text available
Automatic recognition of speech emotional states in noisy conditions has become an important research topic in the emotional speech recognition area, in recent years. This paper considers the recognition of emotional states via speech in real environments. For this task, we employ the power normalized cepstral coefficients (PNCC) in a speech emotion recognition system. We investigate its performance in emotion recognition using clean and noisy speech materials and compare it with the performances of the well-known MFCC, LPCC, RASTA-PLP, and also TEMFCC features. Speech samples are extracted from the Berlin emotional speech database (Emo DB) and Persian emotional speech database (Persian ESD) which are corrupted with 4 different noise types under various SNR levels. The experiments are conducted in clean train/noisy test scenarios to simulate practical conditions with noise sources. Simulation results show that higher recognition rates are achieved for PNCC as compared with the conventional features under noisy conditions. © 2016, Iran University of Science and Technology. All Rights Reserved.
... Previous research [1], [24], [25] has demonstrated that a complete emotional expression can be divided into three sequential temporal phases, onset (application), apex (release), and offset (relaxation), which consider the manner and intensity of an expression. In most Hidden Markov Model (HMM)-based emotion recognition schemes, the single left-to-right topology of the HMM structure was used [11], [17], [35], [36] and was demonstrated to facilitate the modeling of a signal stream (i.e., audio or visual) to describe the temporal courses of emotional expressions. However, a single HMM with left-to-right topology may be invalid for recognizing utterance-based emotions, especially in a natural conversation. ...
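A left-to-right (Bakis) topology of the kind discussed in the excerpts above simply constrains the HMM transition matrix so that a state can only be held or advanced. The sketch below, using hmmlearn with an arbitrary 3-state Gaussian HMM and toy data, is purely illustrative and not any cited system's configuration; transitions fixed to zero before training remain zero under Baum-Welch re-estimation.

```python
import numpy as np
from hmmlearn import hmm

n_states = 3
model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                        init_params="mc",   # let fit() initialize means/covars only
                        params="stmc")      # but still re-estimate all parameters

# Start in the first state and only allow self-loops or forward moves
# (upper-triangular, banded transition matrix = Bakis topology).
model.startprob_ = np.array([1.0, 0.0, 0.0])
model.transmat_ = np.array([[0.6, 0.4, 0.0],
                            [0.0, 0.6, 0.4],
                            [0.0, 0.0, 1.0]])

# Toy training data: one observation sequence of 2-D feature frames.
X = np.random.rand(100, 2)
model.fit(X, lengths=[100])
print(model.transmat_.round(2))  # entries below the diagonal stay at zero
```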
Article
Full-text available
Determining how a speaker is engaged in a conversation is crucial for achieving harmonious interaction between computers and humans. In this study, a fusion approach was developed based on psychological factors to recognize Interaction Style (IS) in spoken conversation, which plays a key role in creating natural dialogue agents. The proposed Fused Cross-Correlation Model (FCCM) provides a unified probabilistic framework to model the relationships among the psychological factors of emotion, personality trait (PT), transient IS, and IS history, for recognizing IS. An emotional arousal-dependent speech recognizer was used to obtain the recognized spoken text for extracting linguistic features to estimate transient IS likelihood and recognize PT. A temporal course modeling approach and an emotional sub-state language model, based on the temporal phases of an emotional expression, were employed to obtain a better emotion recognition result. The experimental results indicate that the proposed FCCM yields satisfactory results in IS recognition and also demonstrate that combining psychological factors effectively improves IS recognition accuracy.
... Song et al. [14] used a tripled hidden Markov model (THMM) to model joint dynamics of the three signals perceived from the subject: pitch and energy as speech features; motion of eyebrow, eyelid and cheek as facial expression features; and lips and jaw as visual speech signals. The proposed THMM architecture was tested for seven basic emotions (surprise, anger, joy, sadness, disgust, fear and neutral), and its overall performance was 85 %. ...
Article
Full-text available
To make human-computer interaction more natural and friendly, computers must have the ability to understand human affective states the same way humans do. There are many modalities, such as the face, body gestures and speech, that people use to express their feelings. In this study, we simulate human perception of emotion by combining emotion-related information from facial expressions and speech. The speech emotion recognition system is based on prosody features and mel-frequency cepstral coefficients (a representation of the short-term power spectrum of a sound), and the facial expression recognition is based on the integrated time motion image and the quantized image matrix, which can be seen as an extension of temporal templates. Experimental results showed that using the hybrid features and decision-level fusion improves the outcome of the unimodal systems. This method can improve the recognition rate by about 15% with respect to the speech unimodal system and by about 30% with respect to the facial expression system. By using the proposed multi-classifier system, which is an improved hybrid system, the recognition rate increases by up to 7.5% over the hybrid features and decision-level fusion with RBF, up to 22.7% over the speech-based system and up to 38% over the facial expression-based system.
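Decision-level fusion of the kind used above can be sketched as a weighted combination of the per-class probabilities produced by the unimodal classifiers; the class list and weights below are placeholders, and in practice the weights would be tuned on validation data.

```python
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def decision_level_fusion(p_speech, p_face, w_speech=0.4, w_face=0.6):
    """Combine per-class probabilities from the speech and face classifiers.

    p_speech, p_face: arrays of shape (n_classes,) that each sum to 1.
    """
    fused = w_speech * np.asarray(p_speech) + w_face * np.asarray(p_face)
    return EMOTIONS[int(np.argmax(fused))], fused

label, scores = decision_level_fusion(
    p_speech=[0.10, 0.05, 0.15, 0.40, 0.20, 0.10],
    p_face=[0.05, 0.05, 0.10, 0.60, 0.10, 0.10])
print(label)  # "happiness"
```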
... Yet, a large range of classifiers has been used for speech emotion recognition. First of all, Hidden Markov Models (HMMs) represent a standard practice (Nwe et al., 2003; Song et al., 2008; Inoue et al., 2011). El Ayadi et al. (2011) state: "Based on several studies (...), we can conclude that HMM is the most used classifier in emotion classification probably because it is widely used in almost all speech applications." ...
Conference Paper
Full-text available
Emotion recognition from speech means determining the emotional state of a speaker from his or her voice. Today's most used classifiers in this field are Hidden Markov Models (HMMs) and Support Vector Machines. Neither architecture is made to consider the full dynamic character of speech. HMMs are able to capture the temporal characteristics of speech at the phoneme, word, or utterance level but fail to learn the dynamics of the input signal on short time scales (e.g., at the frame rate). The use of dynamical features (first and second derivatives of speech features) attenuates this problem. We propose the use of Segmented-Memory Recurrent Neural Networks to learn the full spectrum of speech dynamics, so that the dynamical features can be removed from the input data. The resulting neural network classifier is compared to HMMs that use the reduced feature set as well as to HMMs that work with the full set of features. The networks perform comparably to HMMs while using significantly fewer features.
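The dynamical features mentioned above (first and second derivatives of frame-level speech features) are commonly computed as delta and delta-delta coefficients. The sketch below pairs them with an MFCC front end via librosa; the parameters are illustrative, not those of the cited work.

```python
import librosa
import numpy as np

def static_and_dynamic_features(wav_path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    delta = librosa.feature.delta(mfcc, order=1)             # first derivative
    delta2 = librosa.feature.delta(mfcc, order=2)            # second derivative
    # Stack static + dynamic coefficients per frame: (frames, 3 * n_mfcc).
    return np.vstack([mfcc, delta, delta2]).T

# Example (hypothetical file name):
# feats = static_and_dynamic_features("utterance.wav")
```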
... Much research has been done on the recognition of human emotion expression. Song and colleagues present a system for a robust multi-modal approach to emotion recognition [18]. Here, Song outlines how the detection of human emotion is important for intelligent communication between a human and a computer. ...
Conference Paper
Full-text available
In this paper we present a socially interactive multi-modal robotic head, ERWIN - Emotional Robot With Intelligent Networks, capable of emotion expression and interaction via speech and vision. The model presented shows how a robot can learn to attend to the voice of a specific speaker, providing a relevant emotional expressive response based on previous interactions. We show three aspects of the system; first, the learning phase, allowing the robot to learn faces and voices from interaction. Second, recognition of the learnt faces and voices, and third, the emotion expression aspect of the system. We show this from the perspective of an adult and child interacting and playing a small game, much like an infant and caregiver situation. We also discuss the importance of speaker recognition in terms of human-robot-interaction and emotion, showing how the interaction process between a participant and ERWIN can allow the robot to prefer to attend to that person.
... psychology, computer science, linguistics, neuroscience, and related disciplines [1]. Multiple channels of information during communication are used to analyze emotional states of humans including speech, facial expression, body gesture, posture, and so on [2][3][4][5]. Few studies, however, concentrate on communication atmosphere which is easily felt but hard to define, and even more difficult to evaluate, and it is never finished, static or at rest [6,7]. In [8], human expressive communication is analyzed by exploiting interrelation between speech and gestures. ...
Conference Paper
Full-text available
An emotional-states-based three-dimensional (3-D) Fuzzy Atmosfield (FA) is proposed to express the feeling between humans and robots in communication, and is built with 3-D coordinates, i.e., the "friendly-hostile", "lively-calm", and "casual-formal" axes. The FA aims to be an effective tool for proceeding with communication while paying attention to the atmosphere generated by individual emotional states, which are calculated from bimodal communication cues, namely emotional speech and emotional gesture, by using weighted fusion and fuzzy inference. A novel emotion recognition approach is presented which consists of two steps, i.e., classification of six basic emotions for initial emotional states in the Affinity-Pleasure-Arousal emotion space, and emotional transition from the prior state by using fuzzy logic. A home-party demonstration confirms the FA's availability empirically through questionnaires answered by persons unrelated to the authors' group, in which smooth communication between four humans and five eye robots is realized in the Mascot Robot System.
... Experiments have shown that SDLA performs better than DLA. It is worth emphasizing that the proposed DLA and SDLA algorithms can also be utilized in other interesting applications, e.g., pose estimation [17], emotion recognition [13], and 3D face modeling [12]. ...
Conference Paper
Full-text available
Fisher’s linear discriminant analysis (LDA), one of the most popular dimensionality reduction algorithms for classification, has three particular problems: it fails to find the nonlinear structure hidden in the high dimensional data; it assumes all samples contribute equivalently to reduce dimension for classification; and it suffers from the matrix singularity problem. In this paper, we propose a new algorithm, termed Discriminative Locality Alignment (DLA), to deal with these problems. The algorithm operates in the following three stages: first, in part optimization, discriminative information is imposed over patches, each of which is associated with one sample and its neighbors; then, in sample weighting, each part optimization is weighted by the margin degree, a measure of the importance of a given sample; and finally, in whole alignment, the alignment trick is used to align all weighted part optimizations to the whole optimization. Furthermore, DLA is extended to the semi-supervised case, i.e., semi-supervised DLA (SDLA), which utilizes unlabeled samples to improve the classification performance. Thorough empirical studies on the face recognition demonstrate the effectiveness of both DLA and SDLA.
... Whilst the architecture accepts compound input features, it adapts for emotion-specific modalities. Song et al. [28] used a tripled hidden Markov model (THMM) to model joint dynamics of the three signals perceived from the subject; namely speech, facial expression and visual speech signals. The proposed HMM architecture allows the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. ...
Article
Full-text available
A multimedia content is composed of several streams that carry information in audio, video or textual channels. Classification and clustering multimedia contents require extraction and combination of information from these streams. The streams constituting a multimedia content are naturally different in terms of scale, dynamics and temporal patterns. These differences make combining the information sources using classic combination techniques difficult. We propose an asynchronous feature level fusion approach that creates a unified hybrid feature space out of the individual signal measurements. The target space can be used for clustering or classification of the multimedia content. As a representative application, we used the proposed approach to recognize basic affective states from speech prosody and facial expressions. Experimental results over two audiovisual emotion databases with 42 and 12 subjects revealed that the performance of the proposed system is significantly higher than the unimodal face based and speech based systems, as well as synchronous feature level and decision level fusion approaches.
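One simple way to build a unified feature space from streams sampled at different rates is to resample each stream onto a common timeline before concatenation. The sketch below (linear interpolation with numpy) is only a baseline illustration of combining heterogeneous streams; the cited approach instead keeps the streams asynchronous through its hybrid feature space rather than forcing frame-level alignment, and none of the rates or dimensions below come from the paper.

```python
import numpy as np

def align_and_concatenate(streams, timestamps, target_times):
    """Resample each stream onto target_times and concatenate per time step.

    streams:      list of arrays, stream k has shape (n_k, d_k)
    timestamps:   list of arrays, stream k has shape (n_k,)
    target_times: common timeline, shape (T,)
    Returns an array of shape (T, sum of d_k).
    """
    aligned = []
    for feats, times in zip(streams, timestamps):
        # Interpolate every feature dimension independently onto the timeline.
        cols = [np.interp(target_times, times, feats[:, d])
                for d in range(feats.shape[1])]
        aligned.append(np.stack(cols, axis=1))
    return np.concatenate(aligned, axis=1)

# Toy example: 100 Hz prosody features and 25 Hz facial features over 2 seconds.
prosody = np.random.rand(200, 3); t_prosody = np.linspace(0, 2, 200)
face = np.random.rand(50, 10);    t_face = np.linspace(0, 2, 50)
hybrid = align_and_concatenate([prosody, face], [t_prosody, t_face],
                               np.linspace(0, 2, 100))
print(hybrid.shape)  # (100, 13)
```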
Article
The rapid growth of the internet has reached the fourth generation, i.e., web 4.0, which supports Sentiment Analysis (SA) in many applications such as social media, marketing, risk management, healthcare, businesses, websites, data mining, e-learning, psychology, and many more. Sentiment analysis is a powerful tool for governments, businesses, and researchers to analyse users' emotions and mental states in order to generate opinions and reviews about products, services, and daily activities. In the past years, several SA techniques based on Machine Learning (ML), Deep Learning (DL), and other soft computing approaches were proposed. However, growing data size, subjectivity, and diversity pose a significant challenge to enhancing the efficiency of existing techniques and incorporating current development trends, such as Multimodal Sentiment Analysis (MSA) and fusion techniques. With the aim of assisting the enthusiastic researcher in navigating the current trends, this article presents a comprehensive study of the literature handling different aspects of SA, including current trends and techniques across multiple domains. In order to clarify the future prospects of MSA, this article also highlights open issues and research directions that lead to a number of unresolved challenges.
Chapter
Emotion is a significant aspect of the progress of human-computer interaction systems. To achieve the best functionality through HCI, the computer should be able to understand human emotions effectively. To do so, there is a need for designing an effective emotion recognition system that takes the social behavior of human beings into account. The signals through which human beings try to express emotions are called social signals; examples are facial expressions, speech, and gestures. A vast amount of research has been carried out to achieve effective results in emotion recognition through social signal processing. This paper outlines the details of earlier developed approaches based on this aspect. Since there are a number of social signals, the complete survey is categorized as audio-based and image-based. A further classification is based on the modality of the input, i.e., single-modal (a single social signal) or multimodal (multiple social signals). Based on the methodology used to achieve the objectives, this survey is further classified into different classes and details are provided more clearly. Brief details about the databases involved are also explained.
Article
Full-text available
Emotions play an important role in the learning process. Considering the learner's emotions is essential for electronic learning (e-learning) systems. Some researchers have proposed that the system should induce and guide the learner's emotions to a suitable state, but, first, the learner's emotions have to be recognized by the system. There are different methods in the context of human emotion recognition. Emotions can be recognized by asking the user, tracking implicit parameters, voice recognition, facial expression recognition, vital signals and gesture recognition. Moreover, hybrid methods have also been proposed which use two or more of these methods by fusing multi-modal emotional cues. In e-learning systems, the system's user is the learner. For reasons discussed in this study, some of the user emotion recognition methods are more suitable for e-learning systems and some of them are inappropriate. In this work, different emotion theories are reviewed. Then, various emotion recognition methods are presented and their advantages and disadvantages are discussed with regard to their use in e-learning systems. According to the findings of this research, multi-modal emotion recognition systems that fuse information such as facial expressions, body gestures and the user's messages provide better efficiency than single-modal ones.
Article
Emotion recognition is challenging due to the emotional gap between emotions and audio-visual features. Motivated by the powerful feature learning ability of deep neural networks, this paper proposes to bridge the emotional gap by using a hybrid deep model, which first produces audio-visual segment features with Convolutional Neural Networks (CNN) and 3D-CNN, and then fuses the audio-visual segment features in a Deep Belief Network (DBN). The proposed method is trained in two stages. First, CNN and 3D-CNN models pre-trained on corresponding large-scale image and video classification tasks are fine-tuned on emotion recognition tasks to learn audio and visual segment features, respectively. Second, the outputs of the CNN and 3D-CNN models are combined into a fusion network built with a DBN model. The fusion network is trained to jointly learn a discriminative audio-visual segment feature representation. After average-pooling the segment features learned by the DBN to form a fixed-length global video feature, a linear Support Vector Machine (SVM) is used for video emotion classification. Experimental results on three public audio-visual emotional databases, including the acted RML database, the acted eNTERFACE05 database, and the spontaneous BAUM-1s database, demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues with CNN, 3D-CNN and DBN for audio-visual emotion recognition.
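The final stage described above, average-pooling learned segment features into a fixed-length video vector and classifying it with a linear SVM, can be sketched with scikit-learn; the feature dimensionality, number of videos and labels below are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pool_segments(segment_features: np.ndarray) -> np.ndarray:
    """Average-pool (n_segments, dim) segment features into one video vector."""
    return segment_features.mean(axis=0)

# Toy data: 40 videos, each with a variable number of 256-D segment features.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(rng.integers(5, 15), 256)) for _ in range(40)]
labels = rng.integers(0, 6, size=40)          # six placeholder emotion classes

X = np.stack([pool_segments(v) for v in videos])
clf = LinearSVC(C=1.0).fit(X, labels)
print(clf.predict(X[:5]))
```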
Chapter
This chapter introduces the current data fusion strategies among audiovisual signals for bimodal emotion recognition. Face detection, in the chapter, is performed based on the adaboost cascade face detector and can be used to provide initial facial position and reduce the time for error convergence in feature extraction. In the chapter, active appearance model (AAM) is employed to extract the 68 labeled facial feature points (FPs) from 5 facial regions including eyebrow, eye, nose, mouth, and facial contours for later facial animation parameters (FAPs) calculation. Three kinds of primary prosodic features are adopted, including pitch, energy, and formants F1-F5 in each speech frame for emotion recognition. Finally, a semi-coupled hidden Markov model (SC-HMM) is proposed for emotion recognition based on state-based alignment strategy for audiovisual bimodal features.
Article
Full-text available
Emotions deeply affect learning achievement. In the case of students with high-functioning autism (HFA), negative emotions such as anxiety and anger can impair the learning process due to the inability of these individuals to control their emotions. Once negative emotions have occurred in HFA students, subsequent regulation is often ineffective because it is difficult to calm them down. Hence, detecting emotional transitions and providing adaptive emotional regulation strategies in a timely manner to regulate negative emotions can be especially important for students with HFA in an e-learning environment. In this study, a facial expression-based emotion recognition method with transition detection was proposed. An emotion elicitation experiment was performed to collect facial landmark signals for the purpose of building classifiers for emotion recognition. The proposed method used a sliding window technique and support vector machines (SVM) to build classifiers in order to recognize emotions. For the purpose of determining robust features for emotion recognition, Information Gain (IG) and Chi-square were used for feature evaluation. The effectiveness of classifiers with different sliding-window parameters was also examined. The experimental results confirmed that the proposed method has sufficient discriminatory capability. The recognition rates for basic emotions and transitional emotions were 99.13 and 92.40%, respectively. Also, through feature selection, training time was accelerated by 4.45 times, and the recognition rates for basic emotions and transitional emotions were 97.97 and 87.49%, respectively. The method was applied in an adaptive e-learning environment for mathematics to demonstrate its application effectiveness.
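A minimal sketch of the sliding-window-plus-SVM pipeline described above is given below, with hypothetical window sizes and simple per-window statistics of landmark-derived signals; the chi-square selection step assumes non-negative features, and none of the parameter values are the study's actual settings.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def window_features(signal, window=30, step=10):
    """signal: (frames, n_landmark_distances). Returns per-window mean/std/range."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        w = signal[start:start + window]
        feats.append(np.concatenate([w.mean(0), w.std(0), w.max(0) - w.min(0)]))
    return np.array(feats)

# Toy data: 600 frames of 20 non-negative landmark distances, windowed into samples.
rng = np.random.default_rng(1)
signal = np.abs(rng.normal(size=(600, 20)))
X = window_features(signal)                       # (n_windows, 60), non-negative
y = rng.integers(0, 3, size=len(X))               # placeholder emotion labels

clf = make_pipeline(SelectKBest(chi2, k=20), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))
```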
Article
Full-text available
People watching a video can almost always suppress their speech but they cannot suppress their body language and manage their physiological and behavioral parameters. Affects/emotions, sensory processing, actions/motor behavior and motivation link to the limbic system responsible for instinctive and instantaneous human reactions to their environment or to other people. Limbic reactions are immediate, sure, time-tested and occur among all people. Such reactions are highly spontaneous and reflect the video viewer's real feelings and desires, rather than deliberately calculated ones. The limbic system also links to emotions, usually conveyed by facial expressions and movements of legs, arms and/or other body parts. All physiological and behavioral parameters require consideration to determine a video viewer's emotions and wishes. This is the reason an Affect-based multimodal video recommendation system (ARTIST), developed by the authors of the article, is very suitable. The ARTIST was developed and fine-tuned during the course of conducting the TEMPUS project "Reformation of the Curricula on Built Environment in the Eastern Neighbouring Area". ARTIST can analyze the facial expressions and physiological parameters of a viewer while watching a video. An analysis of a video viewer's facial expressions and physiological parameters leads to better control over alternative sequences of film clips for a video clips. It can even prompt ending the video, if nothing suitable for the viewer is available in the database. This system can consider a viewer's emotions (happy, sad, angry, surprised, scared, disgusted and neutral) and choose rational video clips in real time. The analysis of a video viewer's facial expressions and physiological parameters can indicate possible offers to viewers for video clips they prefer at the moment.
Article
In a natural conversation, a complete emotional expression is typically composed of a complex temporal course representing temporal phases of onset, apex, and offset. In this study, subemotional states are defined to model the temporal course of an emotional expression in natural conversation. Hidden Markov Models (HMMs) are adopted to characterize the subemotional states; each represents one temporal phase. A subemotion language model, which considers the temporal transition between sub-emotional states (HMMs), is further constructed to provide a constraint on allowable temporal structures to determine an optimal emotional state. Experimental results show that the proposed approach yielded satisfactory results on the MHMC conversation-based affective speech corpus, and confirmed that considering the complex temporal structure in natural conversation is useful for improving the emotion recognition performance from speech.
Article
Emotion recognition is one of the latest challenges in intelligent human/computer communication. In this paper we present our framework for emotional state classification based on Ekman's study and facial expression analysis. Facial Action Coding System (FACS)-based facial feature tracking, built on an Active Appearance Model, is presented for facial expression analysis of features extracted from the parametric model Candide-3. We describe the characterization methodology derived from the parametric model to obtain the best set of facial feature points and improve the emotion recognition process. We also quantitatively evaluated the accuracy of both feature detection and estimation of the parameters associated with facial expressions, analyzing its robustness to variations in pose and local variations in the regions of interest. Then, a methodology for emotion characterization is introduced to perform the recognition. The experimental results show that the proposed model can effectively detect the different facial expressions. This approach also outperforms conventional approaches for emotion recognition, obtaining high performance in the estimation of the emotion present in a given subject. The model used and the characterization methodology showed high accuracy, detecting the emotion type in 95.6% of the cases.
Article
In this paper we present our framework for facial expression analysis using static models and kernel methods for classification. We describe the characterization methodology derived from the parametric model and quantitatively evaluate the accuracy of feature detection and of the estimation of the parameters associated with facial expressions, analyzing its robustness to variations in pose. Then, a methodology for emotion characterization is introduced to perform the recognition. Furthermore, a cascade of classifiers using kernel methods is employed for emotion recognition. The experimental results show that the proposed model can effectively detect the different facial expressions. The model used and the characterization methodology proved efficient, detecting the emotion type in 93.4% of the cases.
Article
Full-text available
A complete emotional expression in natural face-to-face conversation typically contains a complex temporal course. In this paper, we propose a temporal course modeling-based error weighted cross-correlation model (TCM-EWCCM) for speech emotion recognition. In TCM-EWCCM, a TCM-based cross-correlation model (CCM) is first used to not only model the temporal evolution of the extracted acoustic and prosodic features individually but also construct the statistical dependencies among paired acoustic-prosodic features in different emotional states. Then, a Bayesian classifier weighting scheme named error weighted classifier combination is adopted to explore the contributions of the individual TCM-based CCM classifiers for different acoustic-prosodic feature pairs to enhance the speech emotion recognition accuracy. The results of experiments on the NCKU-CASC corpus demonstrate that modeling the complex temporal structure and considering the statistical dependencies as well as contributions among paired features in natural conversation speech can indeed improve the speech emotion recognition performance.
Chapter
Emotional states play an important role in Human-Computer Interaction. An emotion recognition framework is proposed to extract and fuse features from both video sequences and speech signals. This framework is constructed from two Hidden Markov Models (HMMs) used to estimate emotional states from video and audio respectively; an Artificial Neural Network (ANN) is applied as the overall fusion mechanism. Two important phases for the HMMs are Facial Animation Parameters (FAPs) extraction from video sequences based on the Active Appearance Model (AAM), and pitch and energy feature extraction from speech signals. Experiments indicate that the proposed approach has better performance and robustness than methods using video or audio separately.
Article
Full-text available
The field of Affective Computing (AC) expects to narrow the communicative gap between the highly emotional human and the emotionally challenged computer by developing computational systems that recognize and respond to the affective states of the user. Affect-sensitive interfaces are being developed in a number of domains, including gaming, mental health, and learning technologies. Emotions are part of human life. Recently, interest has been growing among researchers in finding ways to detect subjective information used in blogs and other online social media. This paper is concerned with the automatic detection of emotions in Arabic text. The construction is based on a moderate-sized Arabic emotion lexicon used to annotate Arabic children's stories for the six basic emotions: Joy, Fear, Sadness, Anger, Disgust, and Surprise. Our approach achieves 65% accuracy for emotion detection in Arabic text.
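A lexicon-based detector of the kind described can be reduced to counting emotion-bearing words and taking the dominant category. The toy lexicon and scoring rule below are purely illustrative, with English words standing in for the Arabic lexicon.

```python
# Toy sketch of lexicon-based emotion detection: count lexicon hits per emotion
# and return the most frequent one. The tiny English lexicon stands in for the
# Arabic emotion lexicon used in the paper.
from collections import Counter

lexicon = {
    "joy": {"happy", "laugh", "smile"},
    "fear": {"afraid", "scared", "dark"},
    "sadness": {"cry", "sad", "alone"},
    "anger": {"angry", "shout", "hate"},
    "disgust": {"dirty", "rotten"},
    "surprise": {"suddenly", "unexpected"},
}

def detect_emotion(text):
    words = text.lower().split()
    counts = Counter()
    for emotion, vocab in lexicon.items():
        counts[emotion] = sum(w in vocab for w in words)
    emotion, hits = counts.most_common(1)[0]
    return emotion if hits > 0 else "neutral"

print(detect_emotion("She was afraid of the dark and started to cry"))
```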
Conference Paper
Emotion recognition is the ability to detect what people are feeling from moment to moment and to understand the connection between their feelings and verbal/non-verbal expressions. When you are aware of your emotions, you can think clearly and creatively, manage stress and challenges, communicate well with others, and display trust, empathy, and confidence. In today's world, the human-computer interaction (HCI) interface undoubtedly plays an important role in our daily life. Toward a harmonious HCI interface, automated analysis of human emotion has attracted increasing attention from researchers in multidisciplinary research fields. In this paper, we present a survey of theoretical and practical work offering new and broad views of the latest research in emotion recognition from multi-modal information, including facial and vocal expressions. A variety of theoretical background and applications, ranging from salient emotional features and emotional-cognitive models to multi-modal data fusion strategies, is surveyed for emotion recognition on these modalities. Conclusions outline some of the existing emotion recognition challenges.
Article
A complete emotional expression typically contains a complex temporal course in face-to-face natural conversation. To address this problem, a bimodal hidden Markov model (HMM)-based emotion recognition scheme, constructed in terms of sub-emotional states, which are defined to represent temporal phases of onset, apex, and offset, is adopted to model the temporal course of an emotional expression for audio and visual signal streams. A two-level hierarchical alignment mechanism is proposed to align the relationship within and between the temporal phases in the audio and visual HMM sequences at the model and state levels in a proposed semi-coupled hidden Markov model (SC-HMM). Furthermore, by integrating a sub-emotion language model, which considers the temporal transition between sub-emotional states, the proposed two-level hierarchical alignment-based SC-HMM (2H-SC-HMM) can provide a constraint on allowable temporal structures to determine an optimal emotional state. Experimental results show that the proposed approach can yield satisfactory results in both the posed MHMC and the naturalistic SEMAINE databases, and shows that modeling the complex temporal structure is useful to improve the emotion recognition performance, especially for the naturalistic database (i.e., natural conversation). The experimental results also confirm that the proposed 2H-SC-HMM can achieve an acceptable performance for the systems with sparse training data or noisy conditions.
Article
Emotion recognition in speech signals is currently a very active research topic and has attracted much attention within the engineering application area. This paper presents a new approach to robust emotion recognition in speech signals in noisy environments. By using a weighted sparse representation model based on maximum likelihood estimation, an enhanced sparse representation classifier is proposed for robust emotion recognition in noisy speech. The effectiveness and robustness of the proposed method are investigated on clean and noisy emotional speech. The proposed method is compared with six typical classifiers, including the linear discriminant classifier, K-nearest neighbor, C4.5 decision tree, radial basis function neural networks, support vector machines and the sparse representation classifier. Experimental results on two publicly available emotional speech databases, that is, the Berlin database and the Polish database, demonstrate the promising performance of the proposed method on the task of robust emotion recognition in noisy speech, outperforming the other methods.
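Leaving aside the maximum-likelihood weighting, the underlying sparse representation classifier can be sketched as follows: a test utterance's features are expressed as a sparse linear combination of all training samples (here via a Lasso solver), and the class with the smallest reconstruction residual wins. Data, class means, and the regularization strength are assumptions.

```python
# Sketch of a (non-weighted) sparse representation classifier for emotion
# recognition: represent the test sample sparsely over the training dictionary
# and pick the class with the lowest reconstruction residual.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n_per_class, dim, classes = 30, 40, ["anger", "happiness", "sadness"]

# Hypothetical acoustic feature vectors, one block of columns per class.
train = {c: rng.normal(loc=i, size=(dim, n_per_class)) for i, c in enumerate(classes)}
D = np.hstack([train[c] for c in classes])              # dictionary (dim x N)
test = rng.normal(loc=1, size=dim)                      # a new utterance's features

coef = Lasso(alpha=0.1, fit_intercept=False, max_iter=10000).fit(D, test).coef_

residuals = {}
start = 0
for c in classes:
    block = np.zeros_like(coef)
    block[start:start + n_per_class] = coef[start:start + n_per_class]
    residuals[c] = np.linalg.norm(test - D @ block)     # class-wise residual
    start += n_per_class

print(min(residuals, key=residuals.get))
```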
Article
This paper proposes a type-2 fuzzy functional inference method which extends the fuzzy sets of the antecedent parts to type-2 fuzzy sets. The paper then explains that the type-2 fuzzy functional inference method can be classified into three models: Model 1 is the most common method, and Model 2 is a special case of Model 1. We show that the inference results of both Models 1 and 2 can be easily obtained from the area and center of gravity of the fuzzy sets of the consequent parts. Moreover, we show that Model 3 can obtain results different from those of the simplified fuzzy inference method and the T-S inference method.
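The center-of-gravity quantity the inference results are reduced to is easy to write down: for a discretized consequent fuzzy set with membership values mu(x), the defuzzified output is sum(mu*x)/sum(mu). The triangular membership function below is only an example, not taken from the paper.

```python
# Sketch of center-of-gravity defuzzification of a (type-1) consequent fuzzy
# set, the quantity the inference models above are reduced to. The triangular
# membership function is illustrative.
import numpy as np

x = np.linspace(0.0, 10.0, 1001)                        # discretized output universe
mu = np.clip(1.0 - np.abs(x - 6.0) / 2.0, 0.0, 1.0)     # triangular set centred at 6

area = np.sum(mu)                                       # (discrete) area of the set
cog = np.sum(mu * x) / area                             # center of gravity

print(round(cog, 3))                                    # ~6.0 for a symmetric set
```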
Conference Paper
We present a triple stream DBN model (T_AsyDBN) for audio visual emotion recognition, in which the two audio feature streams are synchronous, while they are asynchronous with the visual feature stream within controllable constraints. MFCC features and the principal component analysis (PCA) coefficients of local prosodic features are used for the audio streams. For the visual stream, 2D facial features as well as 3D facial animation unit features are defined and concatenated, and the feature dimensions are reduced by PCA. Emotion recognition experiments on the eNTERFACE’05 database show that by adjusting the asynchrony constraint, the proposed T_AsyDBN model obtains an 18.73% higher correct recognition rate than the traditional multi-stream state synchronous HMM (MSHMM), and 10.21% higher than the two stream asynchronous DBN model (Asy_DBN). Keywords: triple stream DBN model, asynchronous, MSHMM, Asy_DBN
Conference Paper
This paper presents an approach to bi-modal emotion recognition based on a semi-coupled hidden Markov model (SC-HMM). A simplified state-based bi-modal alignment strategy in SC-HMM is proposed to align the temporal relation of states between audio and visual streams. Based on this strategy, the proposed SC-HMM can alleviate the problem of data sparseness and achieve better statistical dependency between states of audio and visual HMMs in most real world scenarios. For performance evaluation, audio-visual signals with four emotional states (happy, neutral, angry and sad) were collected. Each of the invited seven subjects was asked to utter 30 types of sentences twice to generate emotional speech and facial expression for each emotion. Experimental results show the proposed bi-modal approach outperforms other fusion-based bi-modal emotion recognition methods.
Conference Paper
This paper presents an audio visual multi-stream DBN model (Asy_DBN) for emotion recognition with constrained asynchrony, in which the audio state and visual state transit individually in their corresponding streams but the transition is constrained by the allowed maximum audio visual asynchrony. Emotion recognition experiments with Asy_DBN under different asynchrony constraints are carried out on an audio visual speech database of four emotions, and compared with the single stream HMM, state synchronous HMM (Syn_HMM) and state synchronous DBN model, as well as the state asynchronous DBN model without an asynchrony constraint. Results show that by setting the appropriate maximum asynchrony constraint between the audio and visual streams, the proposed audio visual asynchronous DBN model achieves the highest emotion recognition performance, with an improvement of 15% over Syn_HMM.
Conference Paper
Full-text available
This paper presents a speech emotion recognition system on a nonlinear manifold. Instead of straight-line distance, geodesic distance was adopted to preserve the intrinsic geometry of the speech corpus. Based on geodesic distance estimation, we developed an enhanced Lipschitz embedding to embed the 64-dimensional acoustic features into a six-dimensional space. In this space, speech data with the same emotional state are located close to one plane, which is beneficial to emotion classification. The compressed testing data were classified into six archetypal emotional states (neutral, anger, fear, happiness, sadness and surprise) by a trained linear support vector machine (SVM) system. Experimental results demonstrate that, compared with traditional methods of feature extraction on a linear manifold and feature selection, the proposed system achieves a 9%-26% relative improvement in speaker-independent emotion recognition and a 5%-20% improvement in speaker-dependent emotion recognition.
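The embedding described above can be sketched in a few steps: build a k-nearest-neighbour graph over the acoustic feature vectors, approximate geodesic distances by shortest paths in that graph, and then give each sample one coordinate per reference subset, namely its minimum geodesic distance to that subset, before training an SVM on the low-dimensional vectors. The sample data, neighbourhood size and choice of reference sets below are assumptions, not the paper's configuration.

```python
# Sketch of geodesic-distance-based Lipschitz embedding followed by SVM
# classification. Data, neighbourhood size and reference sets are illustrative.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 64))                    # 64-dim acoustic features
y = rng.integers(0, 6, size=120)                  # six emotional states

# Geodesic distances approximated by shortest paths on a kNN graph.
graph = kneighbors_graph(X, n_neighbors=8, mode="distance")
geo = shortest_path(graph, method="D", directed=False)
geo[~np.isfinite(geo)] = geo[np.isfinite(geo)].max()   # handle disconnected parts

# Lipschitz embedding: one coordinate per reference subset = min geodesic
# distance from the sample to that subset (here 6 random subsets -> 6-D space).
subsets = [rng.choice(len(X), size=10, replace=False) for _ in range(6)]
embedded = np.stack([geo[:, s].min(axis=1) for s in subsets], axis=1)

clf = SVC(kernel="linear").fit(embedded, y)
print(clf.predict(embedded[:3]))
```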
Conference Paper
Full-text available
In this paper, a novel system is proposed to recognize facial expression based on a face sketch, which is produced by programmable graphics hardware (GPU, Graphics Processing Unit). Firstly, an expression subspace is set up from a corpus of images consisting of seven basic expressions. Secondly, by applying a GPU based edge detection algorithm, real-time facial expression sketch extraction is performed. Subsequently, noise elimination is carried out by a tone mapping operation on the GPU. Then, an ASM instance is trained to track the facial feature points in the sketched face image more efficiently and precisely than on a grey level image directly. Finally, from the normalized key feature points, an Eigen expression vector is derived as the input to an MSVM (Multi-SVM) based expression recognition model, which is introduced to perform the expression classification. Test expression images are categorized by the MSVM into one of the seven basic expression subspaces. Experiments on a data set containing 500 pictures clearly show the efficacy of the algorithm.
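The GPU edge-detection and tone-mapping stages have straightforward CPU analogues; the fragment below shows the general idea (Sobel gradient magnitude followed by a simple normalization) on a synthetic image, with no claim to match the shaders used in the paper.

```python
# CPU sketch of the edge-detection + tone-mapping step used to produce a face
# "sketch" image: Sobel gradient magnitude, then a simple normalization.
# The synthetic image stands in for a real face frame.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(4)
frame = rng.random((128, 128))                    # stand-in for a grey-level face

gx = ndimage.sobel(frame, axis=0)
gy = ndimage.sobel(frame, axis=1)
edges = np.hypot(gx, gy)                          # gradient magnitude

# Simple tone mapping: compress the dynamic range and rescale to [0, 1].
sketch = np.log1p(edges)
sketch = (sketch - sketch.min()) / (sketch.max() - sketch.min())

print(sketch.shape, float(sketch.max()))
```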
Conference Paper
Full-text available
Automatic multimodal recognition of spontaneous affective expressions is a largely unexplored and challenging problem. In this paper, we explore audio-visual emotion recognition in a realistic human conversation setting, the Adult Attachment Interview (AAI). Based on the assumption that facial expression and vocal expression reflect the same coarse affective state, positive and negative emotion sequences are labeled according to Facial Action Coding System Emotion Codes. Facial texture in the visual channel and prosody in the audio channel are integrated in the framework of an Adaboost multi-stream hidden Markov model (AMHMM), in which the Adaboost learning scheme is used to build the component HMM fusion. Our approach is evaluated in preliminary AAI spontaneous emotion recognition experiments.
Conference Paper
Full-text available
FEELTRACE is an instrument developed to let observers track the emotional content of a stimulus as they perceive it over time, allowing the emotional dynamics of speech episodes to be examined. It is based on activation-evaluation space, a representation derived from psychology. The activation dimension measures how dynamic the emotional state is; the evaluation dimension is a global measure of the positive or negative feeling associated with the state. Research suggests that the space is naturally circular, i.e. states which are at the limit of emotional intensity define a circle, with alert neutrality at the centre. To turn those ideas into a recording tool, the space was represented by a circle on a computer screen, and observers described perceived emotional state by moving a pointer (in the form of a disc) to the appropriate point in the circle, using a mouse. Prototypes were tested, and in the light of results, refinements were made to ensure that outputs were as consistent and meaningful as possible. They include colour coding the pointer in a way that users readily associate with the relevant emotional state; presenting key emotion words as 'landmarks' at strategic points in the space; and developing an induction procedure to introduce observers to the system. An experiment assessed the reliability of the developed system. Stimuli were 16 clips from TV programs, two showing relatively strong emotions in each quadrant of activation-evaluation space, each paired with one of the same person in a relatively neutral state. 24 raters took part. Differences between clips chosen to contrast were statistically robust. Results were plotted in activation-evaluation space as ellipses, each with its centre at the mean co-ordinates for the clip, and its width proportional to standard deviation across raters. The size of the ellipses meant that about 25 could be fitted into the space, i.e. FEELTRACE has resolving power comparable to an emotion vocabulary of 20 non-overlapping words, with the advantage of allowing intermediate ratings, and above all, the ability to track impressions continuously.
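The activation-evaluation recording can be pictured as mapping the on-screen pointer position to a point in the unit circle. The helper below (coordinate conventions and label names assumed, not the tool's own code) converts a pointer position into activation/evaluation values and a coarse quadrant label.

```python
# Sketch of mapping a FEELTRACE-style pointer position inside a circular
# widget to activation/evaluation coordinates and a coarse quadrant label.
# Screen geometry and label names are assumptions, not the tool's own code.
import math

def pointer_to_emotion(px, py, cx, cy, radius):
    evaluation = (px - cx) / radius          # negative .. positive feeling
    activation = (cy - py) / radius          # passive .. active (screen y grows down)
    r = math.hypot(evaluation, activation)
    if r > 1.0:                              # clamp to the circle boundary
        evaluation, activation = evaluation / r, activation / r
    quadrant = {
        (True, True): "active / positive",
        (True, False): "active / negative",
        (False, True): "passive / positive",
        (False, False): "passive / negative",
    }[(activation >= 0, evaluation >= 0)]
    return evaluation, activation, quadrant

print(pointer_to_emotion(px=420, py=180, cx=320, cy=240, radius=200))
```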
Article
Full-text available
J. A. Russell (1994) misrepresents what universality means, misinterprets the evidence from past studies, and fails to consider or report findings that disagree with his position. New data are introduced that decisively answer the central question that Russell raises about the use of a forced-choice format in many of the past studies. This article also shows that his many other qualms about other aspects of the design of the studies of literate cultures have no merit. Russell's critique of the preliterate cultures is inaccurate; he does not fully disclose what those who studied preliterate subjects did or what they concluded that they had found. Taking account of all of Russell's qualms, my analysis shows that the evidence from both literate and preliterate cultures is overwhelming in support of universals in facial expressions.
Article
Full-text available
Cross-cultural research on facial expression and the developments of methods to measure facial expression are briefly summarized. What has been learned about emotion from this work on the face is then elucidated. Four questions about facial expression and emotion are discussed: What information does an expression typically convey? Can there be emotion without facial expression? Can there be a facial expression of emotion without emotion? How do individuals differ in their facial expressions of emotion?
Conference Paper
Full-text available
Advances in computer processing power and emerging algorithms are allowing new ways of envisioning human computer interaction. This paper focuses on the development of a computing algorithm that uses audio and visual sensors to detect and track a user's affective state to aid computer decision making. Using our multi-stream fused hidden Markov model (MFHMM), we analyzed coupled audio and visual streams to detect 11 cognitive/emotive states. The MFHMM allows the building of an optimal connection among multiple streams according to the maximum entropy principle and the maximum mutual information criterion. Person-independent experimental results from 20 subjects in 660 sequences show that the MFHMM approach performs with an accuracy of 80.61% which outperforms face-only HMM, pitch-only HMM, energy-only HMM, and independent HMM fusion.
Conference Paper
Full-text available
Emotion recognition is one of the latest challenges in intelligent human/computer communication. Most of the previous work on emotion recognition focused on extracting emotions from visual or audio information separately. A novel approach is presented in this paper, including both visual and audio from video clips, to recognize the human emotion. The facial animation parameters (FAPs) compliant facial feature tracking based on an active appearance model is performed on the video to generate two vector streams which represent the expression feature and the visual speech one. Combined with the visual vectors, the audio vector is extracted in terms of low level features. Then, a tripled hidden Markov model is introduced to perform the recognition which allows the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. The experimental results show that this approach outperforms using visual or audio information alone.
Conference Paper
Full-text available
Understanding human emotions is one of the necessary skills for the computer to interact intelligently with human users. The most expressive way humans display emotions is through facial expressions. In this paper, we report on several advances we have made in building a system for classification of facial expressions from continuous video input. We use Bayesian network classifiers for classifying expressions from video. One of the motivating factors in using Bayesian network classifiers is their ability to handle missing data, both during inference and training. In particular, we are interested in the problem of learning with both labeled and unlabeled data. We show that when using unlabeled data to learn classifiers, using correct modeling assumptions is critical for achieving improved classification performance. Motivated by this, we introduce a classification driven stochastic structure search algorithm for learning the structure of Bayesian network classifiers. We show that with moderate-size labeled training sets and a large amount of unlabeled data, our method can utilize unlabeled data to improve classification performance. We also provide results using the Naive Bayes (NB) and the Tree-Augmented Naive Bayes (TAN) classifiers, showing that the two can achieve good performance with labeled training sets, but perform poorly when unlabeled data are added to the training set.
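The stochastic structure search itself is hard to condense, but the labeled-plus-unlabeled setting it targets can be illustrated with a much simpler baseline: a Gaussian Naive Bayes classifier wrapped in scikit-learn's self-training loop, where unlabeled samples carry the label -1. This is only an analogy for the semi-supervised setup, not the authors' algorithm, and the data are synthetic.

```python
# Simple semi-supervised analogy for the labeled + unlabeled setting: a
# Gaussian Naive Bayes classifier inside a self-training loop (unlabeled
# samples are marked with label -1). Data are synthetic; this is not the
# structure-search algorithm of the paper.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(2, 1, (100, 10))])
y = np.array([0] * 100 + [1] * 100)

# Keep labels for only a small subset; the rest are treated as unlabeled (-1).
y_semi = y.copy()
unlabeled = rng.choice(len(y), size=170, replace=False)
y_semi[unlabeled] = -1

model = SelfTrainingClassifier(GaussianNB()).fit(X, y_semi)
print((model.predict(X) == y).mean())             # accuracy on the full set
```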
Conference Paper
Full-text available
This paper describes the use of statistical techniques and hidden Markov models (HMM) in the recognition of emotions. The method aims to classify 6 basic emotions (anger, dislike, fear, happiness, sadness and surprise) from both facial expressions (video) and emotional speech (audio). The emotions of 2 human subjects were recorded and analyzed. The findings show that the audio and video information can be combined using a rule-based system to improve the recognition rate.
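A rule-based combination of the two channels can be as simple as the helper below: keep the label when both channels agree, otherwise back off to the channel with the higher classifier confidence. The rule and the confidence values are illustrative, not the paper's exact rules.

```python
# Toy rule-based fusion of audio and video emotion decisions: keep the label
# when both channels agree, otherwise trust the more confident channel.
# Rules and confidence values are illustrative.
def fuse(audio_label, audio_conf, video_label, video_conf):
    if audio_label == video_label:
        return audio_label
    return audio_label if audio_conf >= video_conf else video_label

print(fuse("anger", 0.62, "anger", 0.80))       # agreement -> anger
print(fuse("fear", 0.45, "surprise", 0.71))     # disagreement -> surprise
```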
Conference Paper
Full-text available
We have developed a computer vision system, including both facial feature extraction and recognition, that automatically discriminates among subtly different facial expressions. Expression classification is based on Facial Action Coding System (FACS) action units (AUs), and discrimination is performed using Hidden Markov Models (HMMs). Three methods are developed to extract facial expression information for automatic recognition. The first method is facial feature point tracking using a coarse-to-fine pyramid method. This method is sensitive to subtle feature motion and is capable of handling large displacements with sub-pixel accuracy. The second method is dense flow tracking together with principal component analysis (PCA) where the entire facial motion information per frame is compressed to a low-dimensional weight vector. The third method is high gradient component (i.e., furrow) analysis in the spatio-temporal domain, which exploits the transient variation associated with the facial expression. Upon extraction of the facial information, non-rigid facial expression is separated from the rigid head motion component, and the face images are automatically aligned and normalized using an affine transformation. This system also provides expression intensity estimation, which has significant effect on the actual meaning of the expression
Conference Paper
Full-text available
Eigen-points estimates the image-plane locations of fiduciary points on an object. By estimating multiple locations simultaneously, eigen-points exploits the interdependence between these locations. This is done by associating neighboring, inter-dependent control-points with a model of the local appearance. The model of local appearance is used to find the feature in new unlabeled images. Control-point locations are then estimated from the appearance of this feature in the unlabeled image. The estimation is done using an affine manifold model of the coupling between the local appearance and the local shape. Eigen-points uses models aimed specifically at recovering shape from image appearance. The estimation equations are solved non-iteratively, in a way that accounts for noise in the training data and the unlabeled images and that accounts for uncertainty in the distribution and dependencies within these noise sources
Article
Full-text available
We propose a method for implementing a high-level interface for the synthesis and animation of animated virtual faces that is in full compliance with MPEG-4 specifications. This method allows us to implement the simple facial object profile and part of the calibration facial object profile. In fact, starting from a facial wireframe and from a set of configuration files, the developed system is capable of automatically generating the animation rules suited for model animation driven by a stream of facial animation parameters. If the calibration parameters (feature points and texture) are available, the system is able to exploit this information for suitably modifying the geometry of the wireframe and for performing its animation by means of calibrated rules computed ex novo on the adapted somatics of the model. Evidence of the achievable performance is reported at the end of this paper by means of figures showing the capability of the system to reshape its geometry according to the decoded MPEG-4 facial calibration parameters and its effectiveness in performing facial expressions
Article
Model-based vision is firmly established as a robust approach to recognizing and locating known rigid objects in the presence of noise, clutter, and occlusion. It is more problematic to apply model-based methods to images of objects whose appearance can vary, though a number of approaches based on the use of flexible templates have been proposed. The problem with existing methods is that they sacrifice model specificity in order to accommodate variability, thereby compromising robustness during image interpretation. We argue that a model should only be able to deform in ways characteristic of the class of objects it represents. We describe a method for building models by learning patterns of variability from a training set of correctly annotated images. These models can be used for image search in an iterative refinement algorithm analogous to that employed by Active Contour Models (Snakes). The key difference is that our Active Shape Models can only deform to fit the data in ways consistent with the training set. We show several practical examples where we have built such models and used them to locate partially occluded objects in noisy, cluttered images.
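At the core of an Active Shape Model is a point distribution model: landmark shapes are aligned, a mean shape and principal modes of variation are computed, and any new shape is constrained to mean + P·b with each parameter b_i limited to roughly three standard deviations. The sketch below builds that model from synthetic landmark sets under assumed dimensions; the iterative image search of a full ASM is omitted.

```python
# Sketch of the point distribution model behind an Active Shape Model:
# mean shape plus principal modes, with shape parameters b clamped to
# +/- 3 standard deviations. Landmarks are synthetic; the iterative image
# search of a full ASM is omitted.
import numpy as np

rng = np.random.default_rng(6)
n_shapes, n_points = 50, 20
shapes = rng.normal(size=(n_shapes, 2 * n_points))     # (x1, y1, ..., xn, yn)

mean_shape = shapes.mean(axis=0)
centered = shapes - mean_shape
# Principal modes of shape variation via SVD of the centered training shapes.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
eigvals = (s ** 2) / (n_shapes - 1)
P = vt[:5].T                                           # keep 5 modes (assumed)

def constrain(shape):
    """Project a shape onto the model and clamp the parameters b."""
    b = P.T @ (shape - mean_shape)
    limit = 3.0 * np.sqrt(eigvals[:5])
    b = np.clip(b, -limit, limit)
    return mean_shape + P @ b

print(constrain(shapes[0]).shape)
```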
Article
In recent years several speech recognition systems that use visual together with audio information showed a significant increase in performance over standard speech recognition systems. The use of visual features is justified by both the bimodality of speech generation and by the need for features that are invariant to acoustic noise perturbation. The audio-visual speech recognition system presented in this paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM). The statistical properties of the coupled HMM allow us to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time. The experimental results show that the coupled HMM outperforms the multistream HMM in audio visual speech recognition.
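A coupled HMM can be understood through its forward recursion over the product state space, where each stream's next state depends on the previous states of both streams. The minimal sketch below evaluates the joint likelihood of a pair of short discrete observation sequences under made-up parameters; it is intended only to show the coupling structure, not any trained model from the paper.

```python
# Minimal forward-pass sketch of a coupled HMM over two discrete streams.
# Each stream's next state depends on the previous states of BOTH streams.
# All parameters and observations are made up; this only shows the structure.
import numpy as np

Na, Nv = 2, 2                                    # hidden states per stream
A_a = np.full((Na, Nv, Na), 1.0 / Na)            # P(s_a' | s_a, s_v)
A_v = np.full((Na, Nv, Nv), 1.0 / Nv)            # P(s_v' | s_a, s_v)
B_a = np.array([[0.8, 0.2], [0.3, 0.7]])         # P(o_a | s_a), 2 audio symbols
B_v = np.array([[0.6, 0.4], [0.1, 0.9]])         # P(o_v | s_v), 2 visual symbols
pi = np.full((Na, Nv), 1.0 / (Na * Nv))          # initial joint distribution

obs_a = [0, 1, 1]
obs_v = [1, 1, 0]

# alpha[i, j] = P(observations up to t, s_a = i, s_v = j)
alpha = pi * B_a[:, obs_a[0]][:, None] * B_v[:, obs_v[0]][None, :]
for oa, ov in zip(obs_a[1:], obs_v[1:]):
    new_alpha = np.zeros((Na, Nv))
    for i in range(Na):                          # next audio state
        for j in range(Nv):                      # next visual state
            trans = A_a[:, :, i] * A_v[:, :, j]  # coupled transition into (i, j)
            new_alpha[i, j] = (alpha * trans).sum() * B_a[i, oa] * B_v[j, ov]
    alpha = new_alpha

print(alpha.sum())                               # joint sequence likelihood
```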
Article
Tensor representation is helpful to reduce the small sample size problem in discriminative subspace selection. As pointed out in this paper, this is mainly because the structure information of objects in computer vision research is a reasonable constraint to reduce the number of unknown parameters used to represent a learning model. Therefore, we apply this information to vector-based learning and generalize vector-based learning to tensor-based learning as the supervised tensor learning (STL) framework, which accepts tensors as input. To obtain the solution of STL, the alternating projection optimization procedure is developed. The STL framework is a combination of convex optimization and operations in multilinear algebra. The tensor representation helps reduce the overfitting problem in vector-based learning. Based on STL and its alternating projection optimization procedure, we generalize support vector machines, the minimax probability machine, Fisher discriminant analysis, and distance metric learning, to support tensor machines, the tensor minimax probability machine, tensor Fisher discriminant analysis, and multiple distance metrics learning, respectively. We also study the iterative procedure for feature extraction within STL. To examine the effectiveness of STL, we implement the tensor minimax probability machine for image classification. By comparing with the minimax probability machine, the tensor version reduces the overfitting problem.
Conference Paper
The paper explores several statistical pattern recognition techniques to classify utterances according to their emotional content. The authors have recorded a corpus containing emotional speech with over 1000 utterances from different speakers. They present a new method of extracting prosodic features from speech, based on a smoothing spline approximation of the pitch contour. To make maximal use of the limited amount of training data available, they introduce a novel pattern recognition technique: majority voting of subspace specialists. Using this technique, they obtain classification performance that is close to human performance on the task
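The smoothing-spline treatment of the pitch contour translates fairly directly into code: fit a smoothing spline to the (time, F0) points and read prosodic statistics off the smooth curve. The contour below is synthetic and the smoothing factor is a guess, so this only illustrates the general approach.

```python
# Sketch of smoothing-spline approximation of a pitch (F0) contour and a few
# prosodic statistics read from the smoothed curve. The contour is synthetic.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 100)                        # time (s)
f0 = 180 + 30 * np.sin(2 * np.pi * 2 * t) + rng.normal(0, 8, t.size)  # noisy F0 (Hz)

spline = UnivariateSpline(t, f0, s=len(t) * 20.0)     # smoothing factor: a guess
smooth_f0 = spline(t)
slope = spline.derivative()(t)                        # local pitch slope

features = {
    "f0_mean": float(smooth_f0.mean()),
    "f0_range": float(smooth_f0.max() - smooth_f0.min()),
    "slope_mean": float(np.abs(slope).mean()),
}
print(features)
```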
Article
The most expressive way humans display emotions is through facial expressions. In this work we report on several advances we have made in building a system for classification of facial expressions from continuous video input. We introduce and test different Bayesian network classifiers for classifying expressions from video, focusing on changes in distribution assumptions and feature dependency structures. In particular we use Naive–Bayes classifiers and change the distribution from Gaussian to Cauchy, and use Gaussian Tree-Augmented Naive Bayes (TAN) classifiers to learn the dependencies among different facial motion features. We also introduce facial expression recognition from live video input using temporal cues. We exploit the existing methods and propose a new architecture of hidden Markov models (HMMs) for automatically segmenting and recognizing human facial expression from video sequences. The architecture performs both segmentation and recognition of the facial expressions automatically using a multi-level architecture composed of an HMM layer and a Markov model layer. We explore both person-dependent and person-independent recognition of expressions and compare the different methods.
Conference Paper
The paper describes an experimental study on vocal emotion expression and recognition and the development of a computer agent for emotion recognition. The study deals with a corpus of 700 short utterances expressing five emotions: happiness, anger, sadness, fear, and the normal (unemotional) state, which were portrayed by thirty subjects. The utterances were evaluated by twenty-three subjects, twenty of whom participated in the recording. The accuracy of recognizing emotions in speech is as follows: happiness - 61.4%, anger - 72.2%, sadness - 68.3%, fear - 49.5%, and normal - 66.3%. The human ability to portray emotions is approximately at the same level (happiness - 59.8%, anger - 71.7%, sadness - 68.1%, fear - 49.7%, and normal - 65.1%), but the standard deviation is much larger. The human ability to recognize one's own emotions was also evaluated. It turned out that people are good at recognizing anger (98.1%), sadness (80%) and fear (78.8%), but are less confident for the normal state (71.9%) and happiness (71.2%). A part of the corpus was used for extracting features and training computer-based recognizers. Some statistics of the pitch, the first and second formants, energy and the speaking rate were selected, and several types of recognizers were created and compared. The best results were obtained using ensembles of neural network recognizers, which demonstrated the following accuracy: normal state - 55-75%, happiness - 60-70%, anger - 70-80%, sadness - 75-85%, and fear - 35-55%. The total average accuracy is about 70%. An emotion recognition agent was created that is able to analyze telephone-quality speech signals and distinguish between two emotional states, "agitation" and "calm", with an accuracy of 77%. The agent was used as part of a decision support system for prioritizing voice messages and assigning an appropriate human agent to respond to the message in a call center environment. The architecture of the system is presented and discussed.
Article
This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions.
Article
In this paper, we outline the approach we have developed to construct an emotion-recognising system. It is based on guidance from psychological studies of emotion, as well as from the nature of emotion in its interaction with attention. A neural network architecture is constructed to be able to handle the fusion of different modalities (facial features, prosody and lexical content in speech). Results from the network are given and their implications discussed, as are implications for the future direction of the research.
Article
The traditional image representations are not suited to conventional classification methods, such as linear discriminant analysis (LDA), because of the under sample problem (USP): the dimensionality of the feature space is much higher than the number of training samples. Motivated by the successes of the two-dimensional LDA (2DLDA) for face recognition, we develop a general tensor discriminant analysis (GTDA) as a preprocessing step for LDA. The benefits of GTDA compared with existing preprocessing methods, e.g., principal component analysis (PCA) and 2DLDA, include 1) the USP is reduced in subsequent classification by, for example, LDA; 2) the discriminative information in the training tensors is preserved; and 3) GTDA provides stable recognition rates because the alternating projection optimization algorithm to obtain a solution of GTDA converges, while that of 2DLDA does not. We use human gait recognition to validate the proposed GTDA. The averaged gait images are utilized for gait representation. Given the popularity of Gabor function based image decompositions for image understanding and object recognition, we develop three different Gabor function based image representations: 1) the GaborD representation is the sum of Gabor filter responses over directions, 2) GaborS is the sum of Gabor filter responses over scales, and 3) GaborSD is the sum of Gabor filter responses over scales and directions. The GaborD, GaborS and GaborSD representations are applied to the problem of recognizing people from their averaged gait images. A large number of experiments were carried out to evaluate the effectiveness (recognition rate) of gait recognition based on first obtaining a Gabor, GaborD, GaborS or GaborSD image representation, then using GTDA to extract features and finally using LDA for classification. The proposed methods achieved good performance for gait recognition based on image sequences from the USF HumanID Database. Experimental comparisons are made with nine state of the art classification methods in gait recognition.
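The GaborD/GaborS/GaborSD representations amount to summing Gabor filter responses over directions and/or scales. A small sketch with skimage's gabor filter on a synthetic image illustrates the GaborSD case; the image, frequencies and orientations are assumptions, not the paper's filter bank.

```python
# Sketch of a GaborSD-style representation: sum of Gabor filter response
# magnitudes over a few scales (frequencies) and directions (orientations).
# The image is synthetic and the filter bank parameters are assumptions.
import numpy as np
from skimage.filters import gabor

rng = np.random.default_rng(8)
image = rng.random((64, 64))                       # stand-in for an averaged gait image

frequencies = [0.1, 0.2, 0.3]                      # "scales"
orientations = [k * np.pi / 4 for k in range(4)]   # "directions"

gabor_sd = np.zeros_like(image)
for f in frequencies:
    for theta in orientations:
        real, imag = gabor(image, frequency=f, theta=theta)
        gabor_sd += np.hypot(real, imag)           # accumulate response magnitude

print(gabor_sd.shape)
```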
Conference Paper
Visual and auditory modalities are two of the most commonly used media in interactions between humans. The authors describe a system to continuously monitor the user's voice and facial motions for recognizing emotional expressions. Such an ability is crucial for intelligent computers that take on a social role such as an actor or a companion. We outline methods to extract audio and visual features useful for classifying emotions. Audio and visual information must be handled appropriately in single-modal and bimodal situations. We report audio-only and video-only emotion recognition on the same subjects, in person-dependent and person-independent fashions, and outline methods to handle bimodal recognition
Article
Facial expression analysis and synthesis have experienced increased attention recently. Most current research focuses on techniques for capturing, synthesizing, and retargeting facial expressions. Little attention has been paid to the problem of controlling and modifying the expression itself. We present techniques that separate video data into expressive features and underlying content. This allows, for example, a sequence originally recorded with a happy expression to be modified so that the speaker appears to be speaking with an angry or neutral expression. Although the expression has been modified, the new sequences maintain the same visual speech content as the original sequence. The facial expression space that allows these transformations is learned with the aid of a factorization model.
Article
Classifiers based on Bayesian networks are usually learned with a fixed structure or a small subset of possible structures. In the presence of unlabeled data this strategy can be detrimental to classification performance, when the assumed classifier structure is incorrect. In this paper we present a classification driven learning method for Bayesian network classifiers that is based on Metropolis-Hastings sampling. We first show that this learning method outperforms existing approaches for fully labeled datasets. We then show that the method is successful in dealing with unlabeled data. Provided we have abundant unlabeled data, the learning method can process scarce labeled data to produce classifiers that attain performance comparable to classifiers learned with large amounts of fully labeled data.
Feeltrace: an instrument for recording perceived emotion in real time
  • R Cowie
  • E Douglas-Cowie
  • S Savvidou
  • E McMahon
  • M Sawey
  • M Schröder
Expression space learning
  • E S Chuang
  • H Deshpande
  • C Bregler