Thesis

From Synchronous to Asynchronous Event-driven Fusion Approaches in Multi-modal Affect Recognition


Abstract

The cues that describe emotional conditions are encoded within multiple modalities, and the fusion of multi-modal information is a natural way to improve the automated recognition of emotions. Throughout many studies, we see traditional fusion approaches in which decisions are synchronously forced for fixed time segments across all considered modalities and generic combination rules are applied. Varying success is reported; sometimes performance is even worse than uni-modal classification. Starting from these premises, this thesis investigates and compares the performance of various synchronous fusion techniques. We enrich the traditional set with custom and emotion-adapted fusion algorithms that are tailored towards the affect recognition domain they are used in. These developments enhance recognition quality to a certain degree, but do not solve the occasionally occurring performance problems. To isolate the issue, we conduct a systematic investigation of synchronous fusion techniques on acted and natural data and conclude that the synchronous fusion approach shows a crucial weakness, especially on non-acted emotions: the implicit assumption that relevant affective cues happen at the same time across all modalities only holds if emotions are depicted very coherently and clearly, which we cannot expect in a natural setting. This implies a switch to asynchronous fusion approaches. This change can be realized by applying classification models with memory capabilities (e.g., recurrent neural networks), but these are often data-hungry and non-transparent. We consequently present an alternative approach to asynchronous modality treatment: the event-driven fusion strategy, in which modalities decide when to contribute information to the fusion process in the form of affective events. These events can be used to introduce an additional abstraction layer into the recognition process, as the provided events do not necessarily need to match the sought target class but can be cues that indicate the final assessment. Furthermore, we will see that the architecture of an event-driven fusion system is well suited for real-time usage, is very tolerant to temporarily missing input from single modalities, and is therefore a good choice for affect recognition in the wild. We demonstrate these capabilities in various comparison and prototype studies and present the application of event-driven fusion strategies in multiple European research projects.
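The core mechanism of the event-driven strategy can be illustrated with a minimal, hypothetical sketch (not the thesis implementation): modalities push time-stamped affective events whose cues need not equal the target class, and the fusion component lets their evidence decay over time, so a temporarily silent modality simply fades out of the estimate. All names, weights and the decay scheme below are assumptions.

```python
import time
from dataclasses import dataclass

@dataclass
class AffectiveEvent:
    """A cue emitted by one modality whenever it has something to contribute."""
    modality: str      # e.g. "face" or "voice"
    cue: str           # e.g. "smile", "laughter" -- need not equal the target class
    confidence: float  # detector confidence in [0, 1]
    timestamp: float   # seconds

class EventDrivenFusion:
    """Accumulates asynchronous events; evidence decays exponentially, so missing
    input from a modality only weakens, never breaks, the running estimate."""

    def __init__(self, cue_weights, half_life=5.0):
        self.cue_weights = cue_weights  # cue -> signed contribution to the target dimension
        self.half_life = half_life      # seconds until an event's evidence halves
        self.events = []

    def push(self, event):
        self.events.append(event)

    def estimate(self, now):
        score = 0.0
        for ev in self.events:
            decay = 0.5 ** ((now - ev.timestamp) / self.half_life)
            score += self.cue_weights.get(ev.cue, 0.0) * ev.confidence * decay
        return max(-1.0, min(1.0, score))  # clamp to a valence-like range

# toy usage: two modalities report cues at their own pace
fusion = EventDrivenFusion({"smile": 0.4, "laughter": 0.6, "sigh": -0.3})
t0 = time.time()
fusion.push(AffectiveEvent("face", "smile", 0.9, t0))
fusion.push(AffectiveEvent("voice", "laughter", 0.7, t0 + 1.2))
print(fusion.estimate(t0 + 2.0))
```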

References
Article
Full-text available
We present work in progress on an intelligent embodied conversation agent that is supposed to act as a social companion with linguistic and emotional competence in the context of basic and health care. The core of the agent is an ontology-based knowledge model that supports flexible reasoning-driven conversation planning strategies. A dedicated search engine ensures the provision of background information from the web, necessary for conducting a conversation on a specific topic. Multimodal communication analysis and generation modules analyze and generate, respectively, facial expressions, gestures and multilingual speech. The assessment of the prototypical implementation of the agent shows that users accept it as a natural and trustworthy conversation counterpart. For the final release, all involved technologies will be further improved and matured.
Article
Full-text available
This paper details the methodology and results of the EmotioNet challenge. This challenge is the first to test the ability of computer vision algorithms in the automatic analysis of a large number of images of facial expressions of emotion in the wild. The challenge was divided into two tracks. The first track tested the ability of current computer vision algorithms in the automatic detection of action units (AUs). Specifically, we tested the detection of 11 AUs. The second track tested the algorithms' ability to recognize emotion categories in images of facial expressions. Specifically, we tested the recognition of 16 basic and compound emotion categories. The results of the challenge suggest that current computer vision and machine learning algorithms are unable to reliably solve these two tasks. The limitations of current algorithms are more apparent when trying to recognize emotion. We also show that current algorithms are not affected by mild resolution changes, small occluders, gender or age, but that 3D pose is a major limiting factor on performance. We provide an in-depth discussion of the points that need special attention moving forward.
Article
Full-text available
Throughout many present studies dealing with multi-modal fusion, decisions are synchronously forced for fixed time segments across all modalities. Varying success is reported, sometimes performance is worse than unimodal classification. Our goal is the synergistic exploitation of multimodality whilst implementing a real-time system for affect recognition in a naturalistic setting. Therefore we present a categorization of possible fusion strategies for affect recognition on continuous time frames of complete recording sessions and we evaluate multiple implementations from resulting categories. These involve conventional fusion strategies as well as novel approaches that incorporate the asynchronous nature of observed modalities. Some of the latter algorithms consider temporal alignments between modalities and observed frames by applying asynchronous neural networks that use memory blocks to model temporal dependencies. Others use an indirect approach that introduces events as an intermediate layer to accumulate evidence for the target class through all modalities. Recognition results gained on a naturalistic conversational corpus show a drop in recognition accuracy when moving from unimodal classification to synchronous multimodal fusion. However, with our proposed asynchronous and event-based fusion techniques we are able to raise the recognition system’s accuracy by 7.83% compared to video analysis and 13.71% in comparison to common fusion strategies.
Conference Paper
Full-text available
Over the last years, mobile devices have become an integral part of people's everyday life. At the same time, they provide more and more computational power and memory capacity to perform complex calculations that formerly could only be accomplished with bulky desktop machines. These capabilities combined with the willingness of people to permanently carry them around open up completely new perspectives to the area of Social Signal Processing. To allow for an immediate analysis and interaction, real-time assessment is necessary. To exploit the benefits of multiple sensors, fusion algorithms are required that are able to cope with data loss in asynchronous data streams. In this paper we present MobileSSI, a port of the Social Signal Interpretation (SSI) framework to Android and embedded Linux platforms. We will test to what extent it is possible to run sophisticated synchronization and fusion mechanisms in an everyday mobile setting and compare the results with similar tasks in a laboratory environment.
Conference Paper
Full-text available
The fourth Emotion Recognition in the Wild (EmotiW) challenge is a grand challenge in the ACM International Conference on Multimodal Interaction 2016, Tokyo. EmotiW is a series of benchmarking and competition efforts for researchers working in the area of automatic emotion recognition in the wild. The fourth EmotiW has two sub-challenges: video based emotion recognition (VReco) and group-level emotion recognition (GReco). The VReco sub-challenge is being run for the fourth time, and GReco is a new sub-challenge this year.
Article
Full-text available
Interest makes one hold her attention on the object of interest. Automatic recognition of interest has numerous applications in human-computer interaction. In this paper, we study the facial expressions associated with interest and its underlying and closely related components, namely, curiosity, coping potential, novelty and complexity. We develop a method for automatic recognition of visual interest in response to images and micro-videos. To this end, we conducted an experiment in which participants watched images and micro-videos while their frontal videos were recorded. After each item they self-reported their level of interest, coping potential and perceived novelty and complexity. We used OpenFace to track facial action units (AU) and studied the presence of AUs with interest and its related components. We then tracked the facial landmarks and extracted features from each response. We trained random forest regression models to detect the level of interest, curiosity, and appraisals. We obtained promising results on coping potential, novelty and complexity detection. With this work, we demonstrate the feasibility of detecting cognitive appraisals from facial expressions which will open the door for appraisal-driven emotion recognition methods.
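As a rough, hedged illustration of the regression step described above: the column names follow OpenFace's action-unit intensity outputs (AU01_r etc.), but the data, labels and model settings are synthetic stand-ins rather than the study's actual material.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Random values stand in for an OpenFace export (one row per response,
# AU intensity columns) and for the self-reported interest labels.
rng = np.random.default_rng(0)
au_ids = (1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25, 26, 45)
X = pd.DataFrame(rng.random((200, len(au_ids))),
                 columns=[f"AU{i:02d}_r" for i in au_ids])
y = rng.random(200)  # stand-in for self-reported interest

model = RandomForestRegressor(n_estimators=200, random_state=0)
print("cross-validated R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```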
Conference Paper
Full-text available
Persuasiveness is a high-level personality trait that quantifies the influence a speaker has on the beliefs, attitudes, intentions, motivations, and behavior of the audience. With social multimedia becoming an important channel in propagating ideas and opinions, analyzing persuasiveness is very important. In this work, we use the publicly available Persuasive Opinion Multimedia (POM) dataset to study persuasion. One of the challenges associated with this problem is the limited amount of annotated data. To tackle this challenge, we present a deep multimodal fusion architecture which is able to leverage complementary information from individual modalities for predicting persuasiveness. Our methods show significant improvement in performance over previous approaches.
Chapter
Full-text available
With the growing number of conversational systems that find their way in our daily life, new questions and challenges arise. Even though natural conversation with agent-based systems has been improved in the recent years, e.g., by better speech recognition algorithms, they still lack the ability to understand nonverbal behavior and conversation dynamics—a key part of human natural interaction. To make a step towards intuitive and natural interaction with virtual agents, social robots, and other conversational systems, this chapter proposes a probabilistic framework that models the dynamics of interpersonal cues reflecting the user’s social attitude within the context they occur.
Conference Paper
Full-text available
The third Emotion Recognition in the Wild (EmotiW) challenge 2015 consists of an audio-video based emotion and static image based facial expression recognition sub-challenges, which mimics real-world conditions. The two sub-challenges are based on the Acted Facial Expression in the Wild (AFEW) 5.0 and the Static Facial Expression in the Wild (SFEW) 2.0 databases, respectively. The paper describes the data, baseline method, challenge protocol and the challenge results. A total of 12 and 17 teams participated in the video based emotion and image based expression sub-challenges, respectively.
Article
Full-text available
Tremendous research has been done on Speech Emotion Recognition (SER) in recent years, with the main aim of improving human-machine interaction. In this work, the effect of cepstral coefficients on the detection of emotions is examined. Also, a comparative analysis of the cepstrum, Mel-frequency Cepstral Coefficients (MFCC) and synthetically enlarged MFCC coefficients on emotion classification is carried out. Using a compact feature vector, our algorithm achieved better recognition rates in identifying seven emotions from the Berlin speech corpus compared to the earlier work by Firoz Shah, where only four emotions were recognized with good accuracy. The proposed method has facilitated a considerable reduction in misclassification and outperforms the algorithm by Inma Mohino, where the feature vector included only synthetically enlarged MFCC coefficients.
Article
Full-text available
Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods allowing appropriate comparison of results across studies and potential integration and combination of extraction and recognition systems. In this paper we propose a basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to a large brute-force parameter set, we present a minimalistic set of voice parameters here. These were selected based on a) their potential to index affective physiological changes in voice production, b) their proven value in former studies as well as their automatic extractability, and c) their theoretical significance. The set is intended to provide a common baseline for evaluation of future research and eliminate differences caused by varying parameter sets or even different implementations of the same parameters. Our implementation is publicly available with the openSMILE toolkit. Comparative evaluations of the proposed feature set and large baseline feature sets of INTERSPEECH challenges show a high performance of the proposed set in relation to its size.
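A short sketch of how such a parameter set can be extracted in practice; it assumes the opensmile Python wrapper of the openSMILE toolkit and the eGeMAPS set shipped with it, so the exact set and API names should be checked against the installed version.

```python
import numpy as np
import opensmile  # Python wrapper around the openSMILE toolkit (assumed available)

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # minimalistic GeMAPS-style set
    feature_level=opensmile.FeatureLevel.Functionals,  # one summary vector per input
)

# A synthetic 220 Hz tone stands in for a real speech recording.
sr = 16000
signal = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
features = smile.process_signal(signal, sr)
print(features.shape)  # one row of acoustic functionals
```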
Conference Paper
Full-text available
The Second Emotion Recognition In The Wild Challenge (EmotiW) 2014 consists of an audio-video based emotion classification challenge, which mimics real-world conditions. Traditionally, emotion recognition has been performed on data captured in constrained, lab-controlled environments. While this data was a good starting point, such lab-controlled data poorly represents the environment and conditions faced in real-world situations. With the exponential increase in the number of video clips being uploaded online, it is worthwhile to explore the performance of emotion recognition methods that work 'in the wild'. The goal of this Grand Challenge is to carry forward the common platform defined during EmotiW 2013, for evaluation of emotion recognition methods in real-world conditions. The database in the 2014 challenge is the Acted Facial Expression in the Wild (AFEW) 4.0, which has been collected from movies showing close-to-real-world conditions. The paper describes the data partitions, the baseline method and the experimental protocol.
Article
Full-text available
Automatic emotion recognition systems based on supervised machine learning require reliable annotation of affective behaviours to build useful models. Whereas the dimensional approach is getting more and more popular for rating affective behaviours in continuous time domains, e.g., arousal and valence, methodologies to take into account reaction lags of the human raters are still rare. We therefore investigate the relevance of using machine learning algorithms able to integrate contextual information in the modelling, like long short-term memory recurrent neural networks do, to automatically predict emotion from several (asynchronous) raters in continuous time domains, i.e., arousal and valence. Evaluations are performed on the recently proposed RECOLA multimodal database (27 subjects, 5 min of data and six raters for each), which includes audio, video, and physiological (ECG, EDA) data. In fact, studies uniting audiovisual and physiological information are still very rare. Features are extracted with various window sizes for each modality and performance for the automatic emotion prediction is compared for both different architectures of neural networks and fusion approaches (feature-level/decision-level). The results show that: (i) LSTM network can deal with (asynchronous) dependencies found between continuous ratings of emotion with video data, (ii) the prediction of the emotional valence requires longer analysis window than for arousal and (iii) a decision-level fusion leads to better performance than a feature-level fusion. The best performance (concordance correlation coefficient) for the multimodal emotion prediction is 0.804 for arousal and 0.528 for valence.
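For reference, the reported metric and a plain decision-level fusion can be sketched as follows; the random arrays are placeholders, and the simple average merely stands in for the fusion actually learned in the paper.

```python
import numpy as np

def concordance_cc(pred, gold):
    """Concordance correlation coefficient between a prediction and a gold-standard trace."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    cov = np.mean((pred - pred.mean()) * (gold - gold.mean()))
    return 2 * cov / (pred.var() + gold.var() + (pred.mean() - gold.mean()) ** 2)

# decision-level fusion: combine the continuous unimodal predictions, here by averaging
audio_pred = np.random.rand(100)  # placeholder per-frame arousal predictions
video_pred = np.random.rand(100)
fused = (audio_pred + video_pred) / 2.0
gold = np.random.rand(100)
print(concordance_cc(fused, gold))
```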
Conference Paper
Full-text available
Social signals and the interpretation of the information they carry are of high importance in Human-Computer Interaction. Often used for affect recognition, the cues within these signals are displayed in various modalities. Fusion of multi-modal signals is a natural and interesting way to improve the automatic classification of emotions transported in social signals. Throughout most present studies, in uni-modal affect recognition as well as multi-modal fusion, decisions are forced for fixed annotation segments across all modalities. In this paper, we investigate the less prevalent approach of event-driven fusion, which indirectly accumulates asynchronous events in all modalities for final predictions. We present a fusion approach handling short-timed events in a vector space, which is of special interest for real-time applications. We compare results of segmentation-based uni-modal classification and fusion schemes to the event-driven fusion approach. The evaluation is carried out via detection of enjoyment episodes within the audiovisual Belfast Story-Telling Corpus.
Conference Paper
Full-text available
The development of systems that allow multimodal interpretation of human-machine interaction is crucial to advance our understanding and validation of theoretical models of user behavior. In particular, a system capable of collecting, perceiving and interpreting unconscious behavior can provide rich contextual information for an interactive system. One possible application for such a system is in the exploration of complex data through immersion, where massive amounts of data are generated every day both by humans and computer processes that digitize information at different scales and resolutions thus exceeding our processing capacity. We need tools that accelerate our understanding and generation of hypotheses over the datasets, guide our searches and prevent data overload. We describe XIM-engine, a bio-inspired software framework designed to capture and analyze multi-modal human behavior in an immersive environment. The framework allows performing studies that can advance our understanding on the use of conscious and unconscious reactions in interactive systems.
Conference Paper
Full-text available
Emotion recognition is a very active field of research. The Emotion Recognition In The Wild Challenge and Workshop (EmotiW) 2013 Grand Challenge consists of an audio-video based emotion classification challenge, which mimics real-world conditions. Traditionally, emotion recognition has been performed on laboratory-controlled data. While undoubtedly worthwhile at the time, such laboratory-controlled data poorly represents the environment and conditions faced in real-world situations. The goal of this Grand Challenge is to define a common platform for evaluation of emotion recognition methods in real-world conditions. The database in the 2013 challenge is the Acted Facial Expression in the Wild (AFEW), which has been collected from movies showing close-to-real-world conditions.
Article
Full-text available
An improved guided image fusion method for magnetic resonance and computed tomography imaging is proposed. The existing guided filtering scheme uses a Gaussian filter and two-level weight maps, due to which the scheme has limited performance for noisy images. Different modifications of the filter (based on a linear minimum mean square error estimator) and of the weight maps (with different levels) are proposed to overcome these limitations. Simulation results based on visual and quantitative analysis show the significance of the proposed scheme.
Conference Paper
Full-text available
Facial actions cause local appearance changes over time, and thus dynamic texture descriptors should inherently be more suitable for facial action detection than their static variants. In this paper we propose the novel dynamic appearance descriptor Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP), combining the previous success of LGBP-based expression recognition with TOP extensions of other descriptors. LGBP-TOP combines spatial and dynamic texture analysis with Gabor filtering to achieve unprecedented levels of recognition accuracy in real-time. While TOP features risk being sensitive to misalignment of consecutive face images, a rigorous analysis of the descriptor shows the relative robustness of LGBP-TOP to face registration errors caused by errors in rotational alignment. Experiments on the MMI Facial Expression and Cohn-Kanade databases show that for the problem of FACS Action Unit detection, LGBP-TOP outperforms both its static variant LGBP and the related dynamic appearance descriptor LBP-TOP.
Conference Paper
Full-text available
Classifying continuous signals from multiple channels poses several challenges: different sample rates from different types of channels have to be incorporated. Furthermore, when leaping from the laboratory to the real world, it is mandatory to deal with failing sensors and also uncertain or even incorrect classifications. We propose a new Multi Classifier System (MCS) based on the application of classifiers making use of a reject option and a Markov Fusion Network (MFN), which is evaluated in an off-line and on-line manner. The architecture is tested using the publicly available AVEC corpus, which collects affectively labeled episodes of human-computer interaction. The MCS achieved a significant improvement compared to the results obtained on the single modalities.
Conference Paper
Full-text available
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and deals with autism and its manifestations in speech. Finally, emotion is revisited as task, albeit with a broader range of overall twelve enacted emotional states. In this paper, we describe these four Sub-Challenges, their conditions, baselines, and a new feature set by the openSMILE toolkit, provided to the participants.
Conference Paper
Full-text available
The aim of the Multimodal and Multiperson Corpus of Laughter in Interaction (MMLI) was to collect multimodal data of laughter with the focus on full body movements and different laughter types. It contains both induced and interactive laughs from human triads. In total we collected 500 laugh episodes of 16 participants. The data consists of 3D body position information, facial tracking, multiple audio and video channels as well as physiological data. In this paper we discuss methodological and technical issues related to this data collection including techniques for laughter elicitation and synchronization between different independent sources of data. We also present the enhanced visualization and segmentation tool used to segment captured data. Finally we present data annotation as well as preliminary results of the analysis of the nonverbal behavior patterns in laughter.
Conference Paper
Full-text available
Automatic detection and interpretation of social signals carried by voice, gestures, mimics, etc. will play a key role for next-generation interfaces as it paves the way towards a more intuitive and natural human-computer interaction. The paper at hand introduces Social Signal Interpretation (SSI), a framework for real-time recognition of social signals. SSI supports a large range of sensor devices, filter and feature algorithms, as well as machine learning and pattern recognition tools. It encourages developers to add new components using SSI's C++ API, but also addresses front-end users by offering an XML interface to build pipelines with a text editor. SSI is freely available under GPL at http://openssi.net.
Article
Full-text available
Two large facial-expression databases depicting challenging real-world conditions were constructed using a semi-automatic approach via a recommender system based on subtitles.
Conference Paper
Full-text available
Laughter is a frequently occurring social signal and an important part of human non-verbal communication. However, it is often overlooked as a serious topic of scientific study. While the lack of research in this area is mostly due to laughter's non-serious nature, it is also a particularly difficult social signal to produce on demand in a convincing manner, thus making it a difficult topic for study in laboratory settings. In this paper we provide some techniques and guidance for inducing both hilarious laughter and conversational laughter. These techniques were devised with the goal of capturing motion information related to laughter while the person laughing was either standing or seated. Comments on the value of each of the techniques and general guidance as to the importance of atmosphere, environment and social setting are provided.
Article
Full-text available
Recent technological advances have enabled human users to interact with computers in ways previously unimaginable. Beyond the confines of the keyboard and mouse, new modalities for human-computer interaction such as voice, gesture, and force-feedback are emerging. Despite important advances, one necessary ingredient for natural interaction is still missing: emotions. Emotions play an important role in human-to-human communication and interaction, allowing people to express themselves beyond the verbal domain. The ability to understand human emotions is desirable for the computer in several applications. This paper explores new ways of human-computer interaction that enable the computer to be more aware of the user's emotional and attentional expressions. We present the basic research in the field and the recent advances into the emotion recognition from facial, voice, and physiological signals, where the different modalities are treated independently. We then describe the challenging problem of multimodal emotion recognition and we advocate the use of probabilistic graphical models when fusing the different modalities. We also discuss the difficult issues of obtaining reliable affective data, obtaining ground truth for emotion recognition, and the use of unlabeled data.
Conference Paper
We present an intelligent embodied conversation agent with linguistic, social and emotional competence. Unlike the vast majority of the state-of-the-art conversation agents, the proposed agent is constructed around an ontology-based knowledge model that allows for flexible reasoning-driven dialogue planning, instead of using predefined dialogue scripts. It is further complemented by multimodal communication analysis and generation modules and a search engine for the retrieval of multimedia background content from the web needed for conducting a conversation on a given topic. The evaluation of the 1st prototype of the agent shows a high degree of acceptance of the agent by the users with respect to its trustworthiness, naturalness, etc. The individual technologies are being further improved in the 2nd prototype.
Article
Physiological response is an important component of an emotional episode. In this paper, we introduce a Toolbox for Emotional feAture Extraction from Physiological signals (TEAP). This open source toolbox can preprocess and calculate emotionally relevant features from multiple physiological signals, namely, electroencephalogram (EEG), galvanic skin response (GSR), electromyogram (EMG), skin temperature, respiration pattern, and blood volume pulse. The features from this toolbox are tested on two publicly available databases, i.e., MAHNOB-HCI and DEAP. We demonstrate that we achieve similar performance to the original work with the features from this toolbox. The toolbox is implemented in MATLAB and is also compatible with Octave. We hope this toolbox to be further developed and accelerate research in affective physiological signal analysis.
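TEAP itself is implemented in MATLAB/Octave; purely as an illustration of the kind of emotionally relevant features such a toolbox computes, here is a small Python sketch for a galvanic skin response trace (sampling rate and peak-prominence threshold are assumptions).

```python
import numpy as np
from scipy.signal import find_peaks

def gsr_features(signal, sample_rate):
    """A few simple features of a galvanic skin response trace."""
    signal = np.asarray(signal, float)
    peaks, _ = find_peaks(signal, prominence=0.05)  # candidate skin-conductance responses
    duration_min = len(signal) / sample_rate / 60.0
    return {
        "mean_level": signal.mean(),
        "std_level": signal.std(),
        "scr_rate_per_min": len(peaks) / duration_min,
    }

print(gsr_features(np.random.rand(2560), sample_rate=128))  # 20 s of fake data
```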
Article
The book that launched the Dempster–Shafer theory of belief functions appeared 40 years ago. This intellectual autobiography looks back on how I came to write the book and how its ideas played out in my later work.
Chapter
An approach enabling the detection, tracking and fine analysis (e.g. gender and facial expression classification) of faces using a single web camera is described. One focus of the contribution lies in the description of the concept of a framework (the so-called Sophisticated High-speed Object Recognition Engine – SHORE), designed in order to create a flexible environment for varying detection tasks. The functionality and the setup of the framework are described, and a coarse overview of the algorithms used for the classification tasks is given. Benchmark results have been obtained on both standard and publicly available face data sets. Even though the framework has been designed for general object recognition tasks, the focus of this contribution lies in the field of face detection and facial analysis. In addition, a demonstration application based on the described framework is given to show analysis of still images, movies or video streams.
Chapter
This chapter describes the development of a multimodal, ensemble-based system for emotion recognition covering the major steps in processing: emotion modeling, data segmentation and annotation, feature extraction and selection, classification and multimodal fusion techniques. It specifically focuses on the problem of temporarily missing data in one or more observed modalities. In offline evaluation the issue can easily be solved by excluding those parts of the corpus where one or more channels are corrupted or not suitable for evaluation. In online applications, however, we cannot neglect the challenge of missing data and have to find adequate ways to handle it. The presented system solves the problem at the multimodal fusion stage using established and novel emotion-specific ensemble techniques, and is enriched with strategies on how to compensate for temporarily unavailable modalities. Extensive evaluation, including the application of different annotation schemes, is carried out on the CALLAS Expressivity Corpus, featuring facial and vocal modalities.
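One simple way to realize the described compensation for temporarily unavailable modalities is to renormalize the fusion weights over whatever modalities delivered a decision for the current segment; the sketch below only illustrates this idea and is not the chapter's ensemble method (weights and class probabilities are made up).

```python
import numpy as np

def fuse_available(predictions, weights):
    """Weighted decision-level fusion that skips modalities without a prediction.

    predictions: modality -> class-probability vector, or None if the channel is missing.
    """
    available = {m: p for m, p in predictions.items() if p is not None}
    if not available:
        return None  # nothing usable for this segment
    total = sum(weights[m] for m in available)
    return sum(weights[m] / total * np.asarray(p, float) for m, p in available.items())

# toy usage: the voice channel is temporarily missing
probs = {"face": [0.7, 0.3], "voice": None, "gesture": [0.4, 0.6]}
print(fuse_available(probs, {"face": 0.5, "voice": 0.3, "gesture": 0.2}))
```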
Article
In this article, we introduce CURRENNT, an open-source parallel implementation of deep recurrent neural networks (RNNs) supporting graphics processing units (GPUs) through NVIDIA's Compute Unified Device Architecture (CUDA). CURRENNT supports uni- and bidirectional RNNs with Long Short-Term Memory (LSTM) memory cells which overcome the vanishing gradient problem. To our knowledge, CURRENNT is the first publicly available parallel implementation of deep LSTM-RNNs. Benchmarks are given on a noisy speech recognition task from the 2013 2nd CHiME Speech Separation and Recognition Challenge, where LSTM-RNNs have been shown to deliver the best performance. As a result, double-digit speedups in bidirectional LSTM training are achieved with respect to a reference single-threaded CPU implementation. CURRENNT is available under the GNU General Public License from http://sourceforge.net/p/currennt.
Article
In this paper, we propose a novel method for highly efficient exploitation of unlabeled data: Cooperative Learning. Our approach consists of combining Active Learning and Semi-Supervised Learning techniques, with the aim of reducing the costly effects of human annotation. The core underlying idea of Cooperative Learning is to share the labeling work between human and machine efficiently, in such a way that instances predicted with an insufficient confidence value are subject to human labeling, and those with high confidence values are machine labeled. We conducted various test runs on two emotion recognition tasks with a variable number of initial supervised training instances and two different feature sets. The results show that Cooperative Learning consistently outperforms individual Active and Semi-Supervised Learning techniques in all test cases. In particular, we show that our method based on the combination of Active Learning and Co-Training leads to the same performance as a model trained on the whole training set, but using 75% fewer labeled instances. Therefore, our method efficiently and robustly reduces the need for human annotations.
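The confidence-based split at the heart of Cooperative Learning can be sketched as follows; the threshold, base classifier and toy data are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cooperative_round(model, X_lab, y_lab, X_pool, y_oracle, threshold=0.9):
    """One round: the machine labels confident pool instances, a human (oracle) labels the rest."""
    model.fit(X_lab, y_lab)
    proba = model.predict_proba(X_pool)
    confident = proba.max(axis=1) >= threshold
    machine_labels = model.classes_[proba.argmax(axis=1)]
    y_new = np.where(confident, machine_labels, y_oracle)  # mix of machine and human labels
    X_all = np.vstack([X_lab, X_pool])
    y_all = np.concatenate([y_lab, y_new])
    return model.fit(X_all, y_all), int((~confident).sum())  # retrained model, human effort

# toy usage with synthetic two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
model, asked = cooperative_round(LogisticRegression(), X[:40], y[:40], X[40:], y[40:])
print("instances sent to the human annotator:", asked)
```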
Article
Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a ‘black art’ in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We find that best results are obtained when the higher-level model combines the confidence (and not just the predictions) of the lower-level ones. We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms for classification tasks. We also compare the performance of stacked generalization with majority vote and published results of arcing and bagging.
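With current tooling, the paper's central recommendation (let the meta-learner see the base models' class probabilities rather than their hard predictions) can be illustrated with scikit-learn's stacking classifier; the chosen base models and dataset are arbitrary examples, not those of the original study.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("nb", GaussianNB()),
]
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # feed confidences, not hard labels, to the meta-learner
)
X, y = load_iris(return_X_y=True)
print("stacked accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```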
Conference Paper
The EU-ICT FET Project ILHAIRE is aimed at endowing machines with automated detection, analysis, and synthesis of laughter. This paper describes the Body Laughter Index (BLI) for automated detection of laughter starting from the analysis of body movement captured by a video source. The BLI algorithm is described, and the index is computed on a corpus of videos. The assessment of the algorithm by means of subjects' ratings is also presented. Results show that BLI can successfully distinguish between different videos of laughter, even if improvements are needed with respect to the perception of subjects, multimodal fusion, cultural aspects, and generalization to a broad range of social contexts.
Article
Significance Though people regularly recognize many distinct emotions, for the most part, research studies have been limited to six basic categories—happiness, surprise, sadness, anger, fear, and disgust; the reason for this is grounded in the assumption that only these six categories are differentially represented by our cognitive and social systems. The results reported herein propound otherwise, suggesting that a larger number of categories is used by humans.
Conference Paper
Recognition of emotions from speech is one of the most important subdomains in the field of affective computing. Six basic emotional states are considered for the classification of emotions from speech in this work. Features are extracted from the audio characteristics of emotional speech using the Mel-frequency Cepstral Coefficient (MFCC) and Subband based Cepstral Parameter (SBC) methods. These features are then classified using a Gaussian Mixture Model (GMM). The SAVEE audio database is used in this work for testing. In the experimental results, the SBC method outperforms with a recognition rate of 70% compared to 51% for the MFCC algorithm.
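A minimal sketch of the MFCC-plus-GMM pipeline described above, training one Gaussian mixture per emotion and classifying by maximum likelihood; parameters, file layout and the use of librosa are assumptions rather than the authors' exact setup.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, n_mfcc=13):
    """Frame-wise MFCCs of one utterance (frames x coefficients)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_gmms(files_by_emotion, n_components=8):
    """Fit one diagonal-covariance GMM per emotion on pooled MFCC frames."""
    gmms = {}
    for emotion, paths in files_by_emotion.items():
        frames = np.vstack([mfcc_features(p) for p in paths])
        gmms[emotion] = GaussianMixture(n_components=n_components,
                                        covariance_type="diag").fit(frames)
    return gmms

def classify(path, gmms):
    """Pick the emotion whose GMM gives the highest average frame log-likelihood."""
    frames = mfcc_features(path)
    return max(gmms, key=lambda e: gmms[e].score(frames))

# usage (paths are hypothetical):
# gmms = train_gmms({"anger": ["anger_01.wav"], "happiness": ["happy_01.wav"]})
# print(classify("test_utterance.wav", gmms))
```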
Article
The recognition of patterns in real-time scenarios has become an important trend in the field of multi-modal user interfaces in human computer interaction. Cognitive technical systems aim to improve the human computer interaction by means of recognizing the situative context, e.g. by activity recognition (Ahad et al. in IEEE, 1896-1901, 2008), or by estimating the affective state (Zeng et al., IEEE Trans Pattern Anal Mach Intell 31(1):39-58, 2009) of the human dialogue partner. Classifier systems developed for such applications must operate on multiple modalities and must integrate the available decisions over large time periods. We address this topic by introducing the Markov fusion network (MFN), which is a novel classifier combination approach for the integration of multi-class and multi-modal decisions continuously over time. The MFN combines results while meeting real-time requirements, weighting decisions of the modalities dynamically, and dealing with sensor failures. The proposed MFN has been evaluated in two empirical studies: the recognition of objects involved in human activities, and the recognition of emotions, where we successfully demonstrate its outstanding performance. Furthermore, we show how the MFN can be applied in a variety of different architectures and the several options available to configure the model in order to meet the demands of a distinct problem.
Book
With the advent of new technology, the most important impetus for the proliferation of studies on vocal affect expression has been the recent interest in the large-scale application of speech technology in automatic speech and speaker recognition and speech synthesis. This chapter advocates the Brunswikian lens model as a meta-structure for studies in this area, especially because it alerts researchers to important design considerations in studies of vocal affect expression. Vocal expression involves the joint operation of push and pull effects, and the interaction of psychobiological and sociocultural factors, both of which urgently need to be addressed in future studies. So far there is very little cross-language and cross-cultural research in this area, which is surprising because phonetic features of language may constrain the affect signaling potential of voice cues.
Article
This reference work provides broad and up-to-date coverage of the major perspectives - ethological, neurobehavioral, developmental, dynamic systems, componential - on facial expression. It reviews Darwin's legacy in the theories of Izard and Tomkins and in Fridlund's recently proposed Behavioral Ecology theory. It explores continuing controversies on universality and innateness. It also updates the research guidelines of Ekman, Friesen and Ellsworth. This book anticipates emerging research questions: what is the role of culture in children's understanding of faces? In what precise ways do faces depend on the immediate context? What is the ecology of facial expression: when do different expressions occur and in what frequency? The Psychology of Facial Expressions is aimed at students, researchers and educators in psychology, anthropology, and sociology who are interested in the emotive and communicative uses of facial expression.