Conference Paper

Alone versus In-a-group: A Comparative Analysis of Facial Affect Recognition


Abstract

Automatic affect analysis and understanding has become a well established research area in the last two decades. Recent works have started moving from individual to group scenarios. However, little attention has been paid to comparing the affect expressed in individual and group settings. This paper presents a framework to investigate the differences in affect recognition models along arousal and valence dimensions in individual and group settings. We analyse how a model trained on data collected from an individual setting performs on test data collected from a group setting, and vice versa. A third model combining data from both individual and group settings is also investigated. A set of experiments is conducted to predict the affective states along both arousal and valence dimensions on two newly collected databases that contain sixteen participants watching affective movie stimuli in individual and group settings, respectively. The experimental results show that (1) the affect model trained with group data performs better on individual test data than the model trained with individual data tested on group data, indicating that facial behaviours expressed in a group setting capture more variation than in an individual setting; and (2) the combined model does not show better performance than the affect model trained with a specific type of data (i.e., individual or group), but proves a good compromise. These results indicate that in settings where multiple affect models trained with different types of data are not available, using the affect model trained with group data is a viable solution.
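As a schematic of the evaluation protocol described in the abstract, the sketch below trains a regressor on individual-setting data, on group-setting data, and on their combination, then scores each model on both test conditions. The features, labels and the SVR regressor are placeholders (assumptions), and the paper evaluates with proper train/test partitions rather than on the training data itself.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Hypothetical pre-extracted facial features and valence labels for the two settings.
rng = np.random.default_rng(0)
X_ind, y_ind = rng.normal(size=(200, 64)), rng.uniform(-1, 1, 200)   # individual setting
X_grp, y_grp = rng.normal(size=(200, 64)), rng.uniform(-1, 1, 200)   # group setting

models = {
    'individual': SVR().fit(X_ind, y_ind),
    'group': SVR().fit(X_grp, y_grp),
    'combined': SVR().fit(np.vstack([X_ind, X_grp]), np.concatenate([y_ind, y_grp])),
}

# Cross-setting evaluation: every model is tested on both kinds of data.
for name, model in models.items():
    for test_name, (Xt, yt) in [('individual', (X_ind, y_ind)), ('group', (X_grp, y_grp))]:
        rmse = np.sqrt(mean_squared_error(yt, model.predict(Xt)))
        print(f'train={name:10s} test={test_name:10s} RMSE={rmse:.3f}')
```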


... From the psychological perspective, affect analysis in group settings is more complex than in individual settings due to the influence of the overall group as well as influences by each group member [2]. From the automatic analysis perspective, it has been shown that the degree of variation between individual and group settings is significant in terms of differences in facial and bodily behaviors, timing, and dynamics [41,42]. To obtain further insights into this challenging problem, it is important to study the affect expressed in group settings. ...
... In this article, we aim to investigate the following: (1) whether it is possible to recognize the affect expressed by each participant while presented with movie stimuli; (2) whether the affect recognition performance is affected by different settings or databases, i.e., individual vs. group setting; (3) what kind of body and face features work better for different tasks; (4) whether the fusion of body and facial features is able to improve the recognition results; (5) whether it is possible to predict the context information (a person being alone or in-a-group) using facial and body behavioral cues. This work is an extended version of our previous works [41] and [42]. Different from the aforementioned papers, the contributions of this work are: ...
... (2) The temporal modeling method, Long Short-term Memory Networks (LSTM) combined with QLZM facial features, is utilized to analyze affect in terms of both arousal and valence dimensions. Specifically, in our previous work [41], we introduced a framework to analyze individual affect in individual and group videos along arousal and valence using facial features only. In Reference [42], we proposed a method to recognize affect and group membership in group videos. ...
Article
Recognition and analysis of human affect have been researched extensively within the field of computer science in the past two decades. However, most of the past research in automatic analysis of human affect has focused on the recognition of affect displayed by people in individual settings and little attention has been paid to the analysis of the affect expressed in group settings. In this article, we first analyze the affect expressed by each individual in terms of arousal and valence dimensions in both individual and group videos and then propose methods to recognize the contextual information, i.e., whether a person is alone or in-a-group, by analyzing their face and body behavioral cues. For affect analysis, we first devise affect recognition models separately in individual and group videos and then introduce a cross-condition affect recognition model that is trained by combining the two different types of data. We conduct a set of experiments on two datasets that contain both individual and group videos. Our experiments show that (1) the proposed Volume Quantized Local Zernike Moments Fisher Vector outperforms other unimodal features in affect analysis; (2) the temporal learning model, Long Short-Term Memory Networks, works better than the static learning model, Support Vector Machine; (3) decision fusion helps to improve affect recognition, indicating that body behaviors carry emotional information that is complementary rather than redundant to the emotion content in facial behaviors; and (4) it is possible to predict the context, i.e., whether a person is alone or in-a-group, using their non-verbal behavioral cues.
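Point (3) above refers to decision-level fusion. A minimal sketch of one common form, a weighted sum of face-model and body-model scores with the weight chosen on validation data, is given below; all values and the weighting scheme are illustrative assumptions rather than the article's exact fusion rule.

```python
import numpy as np

# Hypothetical validation-set valence scores from separately trained face and body models.
val_face = np.array([0.40, 0.10, -0.25, 0.60])
val_body = np.array([0.20, 0.30, -0.05, 0.45])
val_true = np.array([0.35, 0.25, -0.20, 0.55])

# Choose the fusion weight that minimises the validation error.
weights = np.linspace(0.0, 1.0, 11)
errors = [np.mean(np.abs(w * val_face + (1 - w) * val_body - val_true)) for w in weights]
best_w = weights[int(np.argmin(errors))]

# Apply the same weighted sum to (hypothetical) test-set scores.
test_face, test_body = np.array([0.15, -0.40]), np.array([0.25, -0.10])
fused = best_w * test_face + (1 - best_w) * test_body
print(best_w, fused)
```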
... Additionally, depth and full-body video was recorded using Microsoft's Kinect V1 placed at the top of the screen. In this study we have not used the visual modality for prediction, but Mou et al. [40], [41] have already shown the utility of our dataset using the visual information for prediction of affect, social context and group belonging. ...
... By employing the Welch method with windows of 128 samples (1.0 s), PSDs between 3 and 47 Hz of the signals of every clip were calculated for each of the 14 EEG channels. The obtained PSDs were then averaged over the frequency bands of theta (3-7 Hz), slow alpha (8-10 Hz), alpha (8-13 Hz), beta (14-29 Hz), and gamma (30-47 Hz), and their logarithms were obtained as features. Additionally, the spectral power asymmetry between the 7 pairs of symmetrical electrodes, in the five bands, was calculated. ...
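A minimal sketch of this feature-extraction step is given below, assuming a 128 Hz sampling rate, a (14, n_samples) EEG array per clip and a hypothetical left/right pairing of the 14 channels; the exact windowing, averaging and electrode pairing of the cited work may differ.

```python
import numpy as np
from scipy.signal import welch

FS = 128                                   # assumed EEG sampling rate (Hz)
BANDS = {'theta': (3, 7), 'slow_alpha': (8, 10), 'alpha': (8, 13),
         'beta': (14, 29), 'gamma': (30, 47)}
# Assumed left/right symmetric channel index pairs for a 14-channel headset layout.
SYM_PAIRS = [(0, 13), (1, 12), (2, 11), (3, 10), (4, 9), (5, 8), (6, 7)]

def eeg_band_features(eeg):
    """eeg: (14, n_samples) array for one clip -> log band powers plus asymmetries."""
    f, psd = welch(eeg, fs=FS, nperseg=128, axis=-1)          # 1.0 s windows
    log_bp = np.empty((eeg.shape[0], len(BANDS)))
    for b, (lo, hi) in enumerate(BANDS.values()):
        mask = (f >= lo) & (f <= hi)
        log_bp[:, b] = np.log(psd[:, mask].mean(axis=-1) + 1e-12)
    asym = np.array([log_bp[l] - log_bp[r] for l, r in SYM_PAIRS])  # (7, 5)
    return np.concatenate([log_bp.ravel(), asym.ravel()])          # 70 + 35 dims

features = eeg_band_features(np.random.randn(14, 128 * 20))        # toy 20 s clip
print(features.shape)
```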
Article
We present a database for research on affect, personality traits and mood by means of neuro-physiological signals. Different to other databases, we elicited affect using both short and long videos in two settings, one with individual viewers and one with groups of viewers. The database allows the multimodal study of the affective responses of individuals in relation to their personality and mood, and the analysis of how these responses are affected by (i) the individual/group setting, and (ii) the duration of the videos (short vs long). The data is collected in two experiments. In the first one, 40 participants watched 16 short emotional videos while they were alone. In the second one, the same participants watched 4 long videos, some of them alone and the rest in groups. Participants' signals, namely, Electroencephalogram (EEG), Electrocardiogram (ECG), and Galvanic Skin Response (GSR), were recorded using wearable sensors. Frontal, full-body and depth videos were also recorded. Participants have been profiled for personality using the Big-five personality traits, and for mood with the baseline Positive Affect and Negative Affect Schedules. Participants emotions have been annotated with both, self-assessment of affective levels (valence, arousal, control, familiarity, like/dislike, and selection of basic emotion) felt by the participants during the first experiment, and external-assessment of participants' levels of valence and arousal for both experiments. We present a detailed correlation analysis between the different scores of personality, mood and affect. We also present baseline methods and results for single-trial classification of valence and arousal, and for single-trial classification of personality traits, mood and social context (alone vs group), using EEG, GSR and ECG and fusion of modalities for both experiments. The database has been made publicly available.
... Existing research using multiple cameras and front-facing Kinects and/or wearable sensors addresses fine-grained analyses of group interaction patterns and the automatic determination of social constructs such as agreement/disagreement [8], cohesion [18], dominance [19,22], leadership [3,4,44] or emotion [31] in group interactions. In [29], the authors discuss some of these automatic analysis techniques and how automating the process saves hours of manual annotation effort. ...
Conference Paper
Full-text available
Group meetings are often inefficient, unorganized and poorly documented. Factors including "group-think," fear of speaking, unfocused discussion, and bias can affect the performance of a group meeting. In order to actively or passively facilitate group meetings, automatically analyzing group interaction patterns is critical. Existing research on group dynamics analysis still heavily depends on video cameras in the lines of sight of participants or wearable sensors, both of which could affect the natural behavior of participants. In this thesis, we present a smart meeting room that combines microphones and unobtrusive ceiling-mounted Time-of-Flight (ToF) sensors to understand group dynamics in team meetings. Since the ToF sensors are ceiling-mounted and out of the lines of sight of the participants, we posit that their presence would not disrupt the natural interaction patterns of individuals. We collect a new multi-modal dataset of group interactions where participants have to complete a task by reaching a group consensus, and then fill out a post-task questionnaire. We use this dataset for the development of our algorithms and analysis of group meetings. In this paper, we combine the ceiling-mounted ToF sensors and lapel microphones to: (1) estimate the seated body orientation of participants, (2) estimate the head pose and visual focus of attention (VFOA) of meeting participants, (3) estimate the arm pose and body posture of participants, and (4) analyze the multimodal data for passive understanding of group meetings, with a focus on perceived leadership and contribution.
... Computer science scholars (i.e., geeks) working in the area of social signal processing (see Vinciarelli, Pantic, & Bourlard, 2009 for a review) or affective computing (see Picard, 1997 for an introduction and Gunes & Pantic, 2010; Gunes & Schuller, 2013; and Sariyanidi, Gunes, & Cavallaro, 2015 for more focused surveys) have been making significant advances in the identification and analysis of small group interaction, particularly in controlled settings (see the survey by Gatica-Perez, 2009). Thus, it has been possible to provide fine-grained analyses of group interaction patterns and use these to automatically determine social constructs such as agreement/disagreement (e.g., Bousmalis, Mehu, & Pantic, 2013), cohesion (e.g., Hung & Gatica-Perez, 2010), dominance (e.g., Hung, Huang, Friedland, & Gatica-Perez, 2011), leadership (e.g., Scherer, Weibel, Morency, & Oviatt, 2012), or emotion (e.g., Mou, Gunes, & Patras, 2016) in group interactions. However, these innovations remain out of the reach of group scholars as considerable expertise is required to understand the practicalities of how data captured for human interpretation differ from data captured for automation. ...
Article
Full-text available
This special issue on advancing interdisciplinary collaboration between computer scientists and social scientists documents the joint results of the international Lorentz workshop, “Interdisciplinary Insights into Group and Team Dynamics,” which took place in Leiden, The Netherlands, July 2016. An equal number of scholars from social and computer science participated in the workshop and contributed to the papers included in this special issue. In this introduction, we first identify interaction dynamics as the core of group and team models and review how scholars in social and computer science have typically approached behavioral interactions in groups and teams. Next, we identify key challenges for interdisciplinary collaboration between social and computer scientists, and we provide an overview of the different articles in this special issue aimed at addressing these challenges.
... Ibrahim et al. [2] focus on group activity recognition. More recently, other research fields, including emotion recognition, have also started to shift their focus from individual to group settings [3], [4]. Research works focusing on the analysis of social dimensions, such as engagement and rapport in group settings have also been introduced [5], [6]. ...
Conference Paper
Full-text available
Automatic understanding and analysis of groups has attracted increasing attention in the vision and multimedia communities in recent years. However, little attention has been paid to the automatic analysis of group membership, i.e., recognizing which group the individual in question is part of. This paper presents a novel two-phase Support Vector Machine (SVM) based specific recognition model that is learned using an optimized generic recognition model. We conduct a set of experiments using a database collected to study group analysis from multimodal cues while each group (i.e., four participants together) were watching a number of long movie segments. Our experimental results show that the proposed specific recognition model (52%) outperforms the generic recognition model trained across all different videos (35%) and the independent recognition model trained directly on each specific video (33%) using linear SVM.
... Mou et al. [12] perform an interesting study of human affect on individual and group scenarios. They created three models as mentioned below: 1) individual model, which is trained with an individual-level data-set. ...
Conference Paper
Full-text available
This paper proposes a pipeline for automatic group-level affect analysis. A deep neural network-based approach is proposed that leverages facial-expression information, scene information and high-level facial visual attribute information. A capsule network-based architecture is used to predict the facial expression. Transfer learning is used on Inception-V3 to extract global image-based features which contain scene information. Another network is trained for inferring the facial attributes of the group members. Further, these attributes are pooled at a group level to train a network for inferring the group-level affect. The facial attribute prediction network, although simple, is effective and generates results comparable to the state-of-the-art methods. Later, model integration is performed from the three channels. The experiments show the effectiveness of the proposed techniques on three 'in the wild' databases: Group Affect Database, HAPPEI and UCLA-Protest database.
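The group-level pooling and three-channel integration steps can be illustrated with a small sketch. The attribute vectors, channel posteriors and fusion weights below are invented placeholders; the networks that would produce them are not reproduced here.

```python
import numpy as np

# Hypothetical per-face attribute predictions for one image with 4 detected faces
# (rows: faces, columns: attribute scores from a face-attribute network).
face_attributes = np.array([[0.7, 0.1, 0.3],
                            [0.6, 0.2, 0.2],
                            [0.9, 0.0, 0.4],
                            [0.5, 0.3, 0.1]])
group_attribute_vector = face_attributes.mean(axis=0)   # pooling at group level

# Hypothetical class posteriors (negative / neutral / positive) from the three channels.
expression_scores = np.array([0.2, 0.3, 0.5])   # facial-expression channel
scene_scores      = np.array([0.1, 0.4, 0.5])   # scene-information channel
attribute_scores  = np.array([0.3, 0.2, 0.5])   # channel fed with the pooled attributes

# Late integration: a weighted average of the channel scores (weights are assumptions).
weights = np.array([0.5, 0.3, 0.2])
fused = weights @ np.vstack([expression_scores, scene_scores, attribute_scores])
print(group_attribute_vector, fused.argmax())
```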
... Huang et al. [19] model the group using a conditional random field and represent faces with a local binary pattern variant. Mou et al. [20] perform an interesting study of human-affect on individual and group scenarios. They create three models as mentioned below: 1) Individual model which is trained with an individual level dataset. ...
Conference Paper
Analysis of a group of people has been an important aspect of affective computing, multimedia and computer vision for the past few years. Generally, the estimation of the group affect, emotional responses, eye gaze and position of people in images are the important cues to identify an important person from a group of people. The main focus of this paper is to explore the importance of group affect in finding the representative of a group. We call that person the "Most Influential Person" (for first impression) or "leader" of a group. In order to identify the main visual cues for the "Most Influential Person", we conducted a user survey. Based on the survey statistics, we annotate the "influential persons" in 1000 images of the Group AFfect database (GAF 2.0) via the LabelMe toolbox and propose the "GAF-personage database". In order to identify the "Most Influential Person", we propose a DNN-based Multiple Instance Learning (Deep MIL) method which takes deep facial features as input. To leverage the deep facial features, we first predict the individual emotion probabilities via CapsNet and rank the detected faces on the basis of it. Then, we extract deep facial features of the top-3 faces via the VGG-16 network. Our method performs better than maximum facial area and saliency-based importance methods and achieves human-level perception of the "Most Influential Person" at group level.
... Huang et al. [19] model the group using a conditional random field and represent faces with a local binary pattern variant. Mou et al. [20] perform an interesting study of human-affect on individual and group scenarios. They create three models as mentioned below: 1) Individual model which is trained with an individual level dataset. ...
... Automatic analysis of group-level images in the wild has received much attention in computer vision in recent years. Several research fields, including emotion recognition, have started to shift their focus from individual to group-level [10,11]. It has a variety of applications; for example, a car can monitor the emotion of all occupants and engage in additional safety measures. ...
Chapter
Group emotion recognition in the wild has received much attention in the computer vision community. It is a very challenging issue, due to the interactions taking place between varying numbers of people and different occlusions. According to human cognitive and behavioral research, background and facial expression play a dominating role in the perception of a group's mood. Hence, in this paper, we propose a novel approach that combines these two features for image-based group emotion recognition with feature correlation enhancement. The feature enhancement is mainly reflected in two parts. For facial expression feature extraction, we plug non-local blocks into the Xception network to enhance the feature correlation between different positions in low-level layers, which can avoid the fast loss of position information of traditional CNNs and effectively enhance the network's feature representation capability. For global scene information, we build a bilinear convolutional neural network (B-CNN) consisting of VGG16 networks to model local pairwise feature interactions in a translationally invariant manner. The experimental results show that the fused feature could effectively improve the performance.
... Ibrahim et al. (2015) focus on group activity recognition. More recently, other research fields, including emotion recognition, have also started to shift their focus from individual to group settings (Mou et al., 2016a, 2015). Research works focusing on the analysis of social dimensions, such as engagement and rapport in group settings, have also been reported in Leite et al. (2015) and Hagad et al. (2011). ...
Article
Full-text available
Automatic understanding and analysis of groups has attracted increasing attention in the vision and multimedia communities in recent years. However, little attention has been paid to the automatic analysis of the non-verbal behaviors and how this can be utilized for analysis of group membership, i.e., recognizing which group each individual is part of. This paper presents a novel Support Vector Machine (SVM) based Deep Specific Recognition Model (DeepSRM) that is learned based on a generic recognition model. The generic recognition model refers to the model trained with data across different conditions, i.e., when people are watching movies of different types. Although the generic recognition model can provide a baseline for the recognition model trained for each specific condition, the different behaviors people exhibit in different conditions limit the recognition performance of the generic model. Therefore, the specific recognition model is proposed for each condition separately and built on top of the generic recognition model. A number of experiments are conducted using a database aiming to study group analysis while each group (i.e., four participants together) were watching a number of long movie segments. Our experimental results show that the proposed deep specific recognition model (44%) outperforms the generic recognition model (26%). The recognition of group membership also indicates that the non-verbal behaviors of individuals within a group share commonalities.
... Hybrid approaches use both holistic level and individual level information for study. Mou et al. [23] performed an interesting study of human affect on individual and group scenarios. They created three models as mentioned below: 1) individual model, which is trained with an individual level database. ...
Preprint
Full-text available
Cohesiveness of a group is an essential indicator of the emotional state, structure and success of a group of people. We study the factors that influence the perception of group-level cohesion and propose methods for estimating the human-perceived cohesion on the Group Cohesiveness Scale (GCS). Image analysis is performed at a group level via a multi-task convolutional neural network. For analyzing the contribution of facial expressions of the group members for predicting GCS, a capsule network is explored. In order to identify the visual cues (attributes) for cohesion, we conducted a user survey. Based on the Group Affect database, we add GCS and propose the 'GAF-Cohesion database'. The proposed model performs well on the database and is able to achieve near human-level performance in predicting a group's cohesion score. It is interesting to note that GCS as an attribute, when jointly trained for group-level emotion prediction, helps in increasing the performance for the latter task. This suggests that group-level emotion and GCS are correlated.
... Gatica-Perez [12] reviewed around a hundred papers dealing with small social interactions with a focus on non-verbal behavior, computational models, social constructs, and face-to-face interactions. The fine-grained analyses of group interaction patterns using these automated methods help in understanding social constructs such as agreement/disagreement [6], cohesion [15], dominance [16], leadership [3,18,38] and emotion [30] in group interactions. ...
Conference Paper
Full-text available
Studying group dynamics requires fine-grained spatial and temporal understanding of human behavior. Social psychologists studying human interaction patterns in face-to-face group meetings often find themselves struggling with huge volumes of data that require many hours of tedious manual coding. There are only a few publicly available multi-modal datasets of face-to-face group meetings that enable the development of automated methods to study verbal and non-verbal human behavior. In this paper, we present a new, publicly available multi-modal dataset for group dynamics study that differs from previous datasets in its use of ceiling-mounted, unobtrusive depth sensors. These can be used for fine-grained analysis of head and body pose and gestures, without any concerns about participants' privacy or inhibited behavior. The dataset is complemented by synchronized and time-stamped meeting transcripts that allow analysis of spoken content. The dataset comprises 22 group meetings in which participants perform a standard collaborative group task designed to measure leadership and productivity. Participants' post-task questionnaires, including demographic information, are also provided as part of the dataset. We show the utility of the dataset in analyzing perceived leadership, contribution, and performance, by presenting results of multi-modal analysis using our sensor-fusion algorithms designed to automatically understand audio-visual interactions.
... Hybrid approaches use both holistic level and individual level information. Mou et al. [48] performed an interesting study of human affect on individual and group scenarios. They created three models: 1) first trained with an individual level database. ...
Article
This paper discusses the prediction of cohesiveness of a group of people in images. The cohesiveness of a group is an essential indicator of the emotional state, structure and success of the group. We study the factors that influence the perception of group-level cohesion and propose methods for estimating the human-perceived cohesion on the group cohesiveness scale. To identify the visual cues (attributes) for cohesion, we conducted a user survey. Image analysis is performed at a group level via a multi-task convolutional neural network. A capsule network is explored for analyzing the contribution of facial expressions of the group members on predicting the Group Cohesion Score (GCS). We add GCS to the Group Affect database and propose the 'GAF-Cohesion database'. The proposed model performs well on the database and achieves near human-level performance in predicting a group's cohesion score. It is interesting to note that group cohesion as an attribute, when jointly trained for group-level emotion prediction, helps in increasing the performance for the latter task. This suggests that group-level emotion and cohesion are correlated. Further, we investigate the effect of face-level similarity, body pose and subset of a group on the task of automatic cohesion perception.
Article
Full-text available
A person's behavior significantly influences their health and wellbeing. It also contributes to the social environment in which humans interact, with cascading impacts to the health and behaviors of others. During social interactions, our understanding and awareness of vital nonverbal messages expressing beliefs, emotions, and intentions can be obstructed by a variety of factors including greatly flawed self-awareness. For these reasons, human behavior is a very important topic to study using the most advanced technology. Moreover, technology offers a breakthrough opportunity to improve people's social awareness and self-awareness through machine-enhanced recognition and interpretation of human behaviors. This paper reviews (1) the social psychology theory that has established the framework to study human behaviors and their manifestations during social interactions and (2) the technologies that have contributed to the monitoring of human behaviors. State-of-the-art in sensors, signal features, and computational models are categorized, summarized, and evaluated from a comprehensive transdisciplinary perspective. This review focuses on assessing technologies most suitable for real-time monitoring while highlighting their challenges and opportunities in near-future applications. Although social behavior monitoring has been highly reported in psychology literature and in engineering literature, this paper uniquely aims to serve as a disciplinary convergence bridge and a guide for engineers capable of bringing new technologies to bear against the current challenges in real-time human behavior monitoring.
Chapter
Group affect analysis is an important cue for predicting various group traits. Generally, the estimation of the group affect, emotional responses, eye gaze and position of people in images are the important cues to identify an important person from a group of people. The main focus of this paper is to explore the importance of group affect in finding the representative of a group. We call that person the “Most Influential Person” (for the first impression) or “leader” of a group. In order to identify the main visual cues for “Most Influential Person”, we conducted a user survey. Based on the survey statistics, we annotate the “influential persons” in 1000 images of Group AFfect database (GAF 2.0) via LabelMe toolbox and propose the “GAF-personage database”. In order to identify “Most Influential Person”, we proposed a DNN based Multiple Instance Learning (Deep MIL) method which takes deep facial features as input. To leverage the deep facial features, we first predict the individual emotion probabilities via CapsNet and rank the detected faces on the basis of it. Then, we extract deep facial features of the top-3 faces via VGG-16 network. Our method performs better than maximum facial area and saliency-based importance methods and achieves the human-level perception of “Most Influential Person” at group-level.
Article
Video-conferencing is becoming an essential part in everyday life. The visual channel allows for interactions which were not possible over audio-only communication systems such as the telephone. However, being a de-facto over-the-top service, the quality of the delivered video-conferencing experience is subject to variations, dependent on network conditions. Video-conferencing systems adapt to network conditions by changing for example encoding bitrate of the video. For this adaptation not to hamper the benefits related to the presence of a video channel in the communication, it needs to be optimized according to a measure of the Quality of Experience (QoE) as perceived by the user. The latter is highly dependent on the ongoing interaction and individual preferences, which have hardly been investigated so far. In this paper, we focus on the impact video quality has on conversations that revolve around objects that are presented over the video channel. To this end we conducted an empirical study where groups of 4 people collaboratively build a Lego® model over a video-conferencing system. We examine the requirements for such a task by showing when the interaction, measured by visual and auditory cues, changes depending on the encoding bitrate and loss. We then explore the impact that prior experience with the technology and affective state have on QoE of participants. We use these factors to construct predictive models which double the accuracy compared to a model based on the system factors alone. We conclude with a discussion how these factors could be applied in real world scenarios.
Article
This review presents, in a tutorial-like manner, the current state of research in group affects. We focus on the automatic processing of affects and their estimation, which is, of course, based on foundations in social sciences regarding multiple modalities. In addition, we highlight recent developments in methods and classification approaches. Further, a collection of suitable corpora for affect estimation is presented. Based on the literature, we discuss current achievements in relation to parameters that influence the processing of groups. Finally, in the sense of a road map, different perspectives on further research are presented.
Conference Paper
Full-text available
Depression and other mood disorders are common, disabling disorders with a profound impact on individuals and families. In spite of its high prevalence, depression is easily missed during the early stages. Automatic depression analysis has become a very active field of research in the affective computing community in the past few years. This paper presents a framework for depression analysis based on unimodal visual cues. Temporally piece-wise Fisher Vectors (FV) are computed on temporal segments. As a low-level feature, block-wise Local Binary Pattern-Three Orthogonal Planes descriptors are computed. Statistical aggregation techniques are analysed and compared for creating a discriminative representation for a video sample. The paper explores the strength of FV in representing temporal segments in spontaneous clinical data. This creates a meaningful representation of the facial dynamics in a temporal segment. The experiments are conducted on the Audio Video Emotion Challenge (AVEC) 2014 German-speaking depression database. The superior results of the proposed framework show the effectiveness of the technique as compared to the current state-of-the-art.
Conference Paper
Full-text available
Changes in type of interaction (e.g., individual vs. group interactions) can potentially impact data-driven models developed for social robots. In this paper, we provide a first investigation into the effects of changing group size in data-driven models for HRI, by analyzing how a model trained on data collected from participants interacting individually performs on test data collected from group interactions, and vice-versa. Another model combining data from both individual and group interactions is also investigated. We perform these experiments in the context of predicting disengagement behaviors in children interacting with two social robots. Our results show that a model trained with group data generalizes better to individual participants than the other way around. The mixed model seems a good compromise, but it does not achieve the performance levels of the models trained for a specific type of interaction.
Article
Full-text available
The recent advancement of social media has given users a platform to socially engage and interact with a larger population. Millions of images and videos are being uploaded everyday by users on the web from different events and social gatherings. There is an increasing interest in designing systems capable of understanding human manifestations of emotional attributes and affective displays. As images and videos from social events generally contain multiple subjects, it is an essential step to study these groups of people. In this paper, we study the problem of happiness intensity analysis of a group of people in an image using facial expression analysis. A user perception study is conducted to understand various attributes, which affect a person’s perception of the happiness intensity of a group. We identify the challenges in developing an automatic mood analysis system and propose three models based on the attributes in the study. An ‘in the wild’ image-based database is collected. To validate the methods, both quantitative and qualitative experiments are performed and applied to the problem of shot selection, event summarisation and album creation. The experiments show that the global and local attributes defined in the paper provide useful information for theme expression analysis, with results close to human perception results.
Article
Full-text available
Automatic affect analysis has attracted great interest in various contexts including the recognition of action units and basic or non-basic emotions. In spite of major efforts, there are several open questions on what the important cues to interpret facial expressions are and how to encode them. In this paper, we review the progress across a range of affect recognition applications to shed light on these fundamental questions. We analyse the state-of-the-art solutions by decomposing their pipelines into fundamental components, namely face registration, representation, dimensionality reduction and recognition. We discuss the role of these components and highlight the models and new trends that are followed in their design. Moreover, we provide a comprehensive analysis of facial representations by uncovering their advantages and limitations; we elaborate on the type of information they encode and discuss how they deal with the key challenges of illumination variations, registration errors, head-pose variations, occlusions, and identity bias. This survey allows us to identify open issues and to define future directions for designing real-world affect recognition systems.
Conference Paper
Full-text available
In this paper, we propose to use local Zernike Moments (ZMs) for facial affect recognition and introduce a representation scheme based on performing non-linear encoding on ZMs via quantization. Local ZMs provide a useful and compact description of image discontinuities and texture. We demonstrate the use of this ZM-based representation for posed and discrete as well as naturalistic and continuous affect recognition on standard datasets, and show that ZM-based representations outperform well-established alternative approaches for both tasks. To the best of our knowledge, the performance we achieved on the CK+ dataset is superior to all results reported to date.
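The QLZM scheme described above additionally applies non-linear encoding (quantization) to the local ZM responses; the sketch below only illustrates the underlying block-wise Zernike moment computation, using the mahotas library and an assumed 4x4 grid, not the paper's own implementation.

```python
import numpy as np
import mahotas

def local_zernike_descriptor(face, grid=4, degree=8):
    """Block-wise Zernike moment descriptor for a grayscale face crop (a sketch)."""
    h, w = face.shape
    bh, bw = h // grid, w // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = face[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            radius = min(block.shape) // 2          # disk over which the ZMs are computed
            feats.append(mahotas.features.zernike_moments(block, radius, degree=degree))
    return np.concatenate(feats)

# Toy usage on a random 64x64 "face"; degree 8 gives 25 moments per block.
face = np.random.randint(0, 256, size=(64, 64)).astype(np.float64)
print(local_zernike_descriptor(face).shape)         # 16 blocks x 25 moments = (400,)
```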
Article
Full-text available
This paper introduces a video representation based on dense trajectories and motion boundary descriptors. Trajectories capture the local motion information of the video. A dense representation guarantees a good coverage of foreground motion as well as of the surrounding context. A state-of-the-art optical flow algorithm enables a robust and efficient extraction of dense trajectories. As descriptors we extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, we introduce a descriptor based on motion boundary histograms (MBH) which rely on differential optical flow. The MBH descriptor shows to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We evaluate our video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets our approach outperforms current state-of-the-art results.
Article
Full-text available
A standard approach to describe an image for classification and retrieval purposes is to extract a set of local patch descriptors, encode them into a high dimensional vector and pool them into an image-level signature. The most common patch encoding strategy consists in quantizing the local descriptors into a finite set of prototypical elements. This leads to the popular Bag-of-Visual-Words representation. In this work, we propose to use the Fisher Kernel framework as an alternative patch encoding strategy: we describe patches by their deviation from a “universal” generative Gaussian mixture model. This representation, which we call the Fisher vector, has many advantages: it is efficient to compute, it leads to excellent results even with efficient linear classifiers, and it can be compressed with a minimal loss of accuracy using product quantization. We report experimental results on five standard datasets (PASCAL VOC 2007, Caltech 256, SUN 397, ILSVRC 2010 and ImageNet10K) with up to 9M images and 10K classes, showing that the FV framework is a state-of-the-art patch encoding technique.
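A compact sketch of the Fisher Vector encoding described above, using scikit-learn's diagonal-covariance GaussianMixture as the "universal" generative model, keeping the mean and variance gradients and applying power and L2 normalisation; the descriptor dimensionality and number of Gaussians are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Fisher Vector (mean + variance gradients) for one set of local descriptors."""
    X = np.atleast_2d(descriptors)                   # (T, D) local descriptors
    T, _ = X.shape
    gamma = gmm.predict_proba(X)                     # (T, K) soft assignments
    w, mu = gmm.weights_, gmm.means_                 # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)                # (K, D), diagonal model

    diff = (X[:, None, :] - mu[None, :, :]) / sigma  # (T, K, D)
    g_mu = np.einsum('tk,tkd->kd', gamma, diff) / (T * np.sqrt(w)[:, None])
    g_sigma = np.einsum('tk,tkd->kd', gamma, diff ** 2 - 1.0) / (T * np.sqrt(2 * w)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)         # L2 normalisation

# Toy usage: fit the "universal" GMM on pooled training descriptors,
# then encode each sample's descriptor set as one fixed-length vector.
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=8, covariance_type='diag',
                      random_state=0).fit(rng.normal(size=(2000, 16)))
sample_fv = fisher_vector(rng.normal(size=(120, 16)), gmm)
print(sample_fv.shape)                               # 2 * 8 * 16 = (256,)
```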
Article
Full-text available
Since emotions are expressed through a combination of verbal and non-verbal channels, a joint analysis of speech and gestures is required to understand expressive human communication. To facilitate such investigations, this paper describes a new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). This database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration and neutral state). The corpus contains approximately 12 h of data. The detailed motion capture information, the interactive setting to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
Article
Full-text available
Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology, computer science, linguistics, neuroscience, and related disciplines. However, the existing methods typically handle only deliberately displayed and exaggerated expressions of prototypical emotions despite the fact that deliberate behaviour differs in visual appearance, audio profile, and timing from spontaneously occurring behaviour. To address this problem, efforts to develop algorithms that can process naturally occurring human affective behaviour have recently emerged. Moreover, an increasing number of efforts are reported toward multimodal fusion for human affect analysis including audiovisual fusion, linguistic and paralinguistic fusion, and multi-cue visual fusion based on facial expressions, head movements, and body gestures. This paper introduces and surveys these recent advances. We first discuss human emotion perception from a psychological perspective. Next we examine available approaches to solving the problem of machine understanding of human affective behavior, and discuss important issues like the collection and availability of training and test data. We finally outline some of the scientific and engineering challenges to advancing human affect sensing technology.
Conference Paper
Automatic affect analysis and understanding has become a well-established research area in the last two decades. However, little attention has been paid to the analysis of the affect expressed in group settings, either in the form of affect expressed by the whole group collectively or affect expressed by each individual member of the group. This paper presents a framework which, in group settings, automatically classifies the affect expressed by each individual group member along both arousal and valence dimensions. We first introduce a novel Volume Quantised Local Zernike Moments Fisher Vectors (vQLZM-FV) descriptor to represent the facial behaviours of individuals in the spatio-temporal domain and then propose a method to recognize the group membership of each individual (i.e., which group the individual in question is part of) by using their face and body behavioural cues. We conduct a set of experiments on a newly collected dataset that contains fourteen recordings of four groups, each consisting of four people watching affective movie stimuli. Our experimental results show that (1) the proposed vQLZM-FV outperforms the other feature representations in affect recognition, and (2) group membership can be recognized using the non-verbal face and body features, indicating that individuals influence each other's behaviours within a group setting.
Conference Paper
Automatic analysis of affect has become a well-established research area in the last two decades. However, little attention has been paid to analysing the affect expressed by a group of people in a scene or an interaction setting, either in the form of the individual group member’s affect or the overall affect expressed collectively. In this paper, we (i) introduce a framework for analysing an image that contains multiple people and recognizing the arousal and valence expressed at the group-level; (ii) present a dataset of images annotated along arousal and valence dimensions; and (iii) extract and evaluate a multitude of face, body and context features. We conduct a set of experiments to classify the overall affect expressed at the group-level along arousal (high, medium, low) and valence (positive, neutral, negative) using k-Nearest Neighbour classifier and integrate the information provided by the face, body and context features using decision level fusion. Our experimental results show the viability of the proposed framework compared to other in-the-wild recognition works - we obtain 54% and 55% recognition accuracy for individual arousal and valence dimensions, respectively.
Article
A solution is suggested for an old unresolved social psychological problem.
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates and parameter selection are discussed in detail.
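scikit-learn's SVC is built on top of LIBSVM, so a minimal usage sketch of training an RBF-kernel classifier with probability estimates and selecting C and gamma by cross-validated grid search can be written as follows; the data and parameter grid are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 10)), rng.integers(0, 3, 300)   # hypothetical 3-class data

# Parameter selection over C and gamma via cross-validation.
search = GridSearchCV(SVC(kernel='rbf', probability=True),
                      param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.predict_proba(X[:3]))   # per-class probability estimates
```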
Many computer vision problems (e.g., camera calibration, image alignment, structure from motion) are solved through a nonlinear optimization method. It is generally accepted that 2nd order descent methods are the most robust, fast and reliable approaches for nonlinear optimization of a general smooth function. However, in the context of computer vision, 2nd order descent methods have two main drawbacks: (1) The function might not be analytically differentiable and numerical approximations are impractical. (2) The Hessian might be large and not positive definite. To address these issues, this paper proposes a Supervised Descent Method (SDM) for minimizing a Non-linear Least Squares (NLS) function. During training, the SDM learns a sequence of descent directions that minimizes the mean of NLS functions sampled at different points. In testing, SDM minimizes the NLS objective using the learned descent directions without computing the Jacobian or the Hessian. We illustrate the benefits of our approach in synthetic and real examples, and show how SDM achieves state-of-the-art performance in the problem of facial feature detection. The code is available at www.humansensing.cs.cmu.edu/intraface.
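The core idea, learning a cascade of descent directions by regression instead of computing Jacobians or Hessians, can be illustrated on a toy one-dimensional problem; the features, data and cascade depth below are invented for illustration and do not reproduce the SIFT-based face-alignment setup of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Hand-crafted non-linear "features" of the current estimate (an assumption;
    # the original work samples SIFT descriptors around facial landmarks).
    x = np.atleast_1d(x)
    return np.stack([np.sin(x), x ** 2, x, np.ones_like(x)], axis=-1)

# Training pairs: ground-truth optima x_star and perturbed initialisations x.
x_star = rng.uniform(-1.0, 1.0, size=500)
x = x_star + rng.normal(scale=0.4, size=500)

cascade = []
for _ in range(4):                                          # a sequence of descent maps
    F = phi(x)                                              # features at current estimates
    R, *_ = np.linalg.lstsq(F, x_star - x, rcond=None)      # regress the ideal updates
    cascade.append(R)
    x = x + F @ R                                           # apply the learned step

# Test: refine a new, perturbed initialisation without any gradient computation.
xt_star, xt = 0.3, 0.65
for R in cascade:
    xt = xt + (phi(xt) @ R)[0]
print(f'estimate {xt:.3f}  target {xt_star:.3f}')
```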
Article
In the context of affective human behavior analysis, we use the term continuous input to refer to naturalistic settings where explicit or implicit input from the subject is continuously available, where in a human–human or human–computer interaction setting, the subject plays the role of a producer of the communicative behavior or the role of a recipient of the communicative behavior. As a result, the analysis and the response provided by the automatic system are also envisioned to be continuous over the course of time, within the boundaries of digital machine output. The term continuous affect analysis is used as analysis that is continuous in time as well as analysis that uses affect phenomenon represented in dimensional space. The former refers to acquiring and processing long unsegmented recordings for detection of an affective state or event (e.g., nod, laughter, pain), and the latter refers to prediction of an affect dimension (e.g., valence, arousal, power). In line with the Special Issue on Affect Analysis in Continuous Input, this survey paper aims to put the continuity aspect of affect under the spotlight by investigating the current trends and provide guidance towards possible future directions.
Article
The explosion of user-generated, untagged multimedia data in recent years generates a strong need for efficient search and retrieval of this data. The predominant method for content-based tagging is through slow, labor-intensive manual annotation. Consequently, automatic tagging is currently a subject of intensive research. However, it is clear that the process will not be fully automated in the foreseeable future. We propose to involve the user and investigate methods for implicit tagging, wherein users' responses to the interaction with the multimedia content are analyzed in order to generate descriptive tags. Here, we present a multi-modal approach that analyses both facial expressions and electroencephalography (EEG) signals for the generation of affective tags. We perform classification and regression in the valence-arousal space and present results for both feature-level and decision-level fusion. We demonstrate improvement in the results when using both modalities, suggesting the modalities contain complementary information.
Conference Paper
Viewers' preference for multimedia selection depends highly on their emotional experience. In this paper, we present an emotion detection method for music videos using central and peripheral nervous system physiological signals as well as multimedia content analysis. A set of 40 music clips eliciting a broad range of emotions was first selected. After extracting the one-minute-long emotional highlight of each video, the clips were shown to 32 participants while their physiological responses were recorded. Participants self-reported their felt emotions after watching each clip by means of arousal, valence, dominance, and liking ratings. The physiological signals included electroencephalogram, galvanic skin response, respiration pattern, skin temperature, electromyograms and blood volume pulse measured with a plethysmograph. Emotional features were extracted from the signals and the multimedia content. The emotional features were used to train a linear ridge regressor to detect emotions for each participant using a leave-one-out cross-validation strategy. The performance of the personalized emotion detection is shown to be significantly superior to a random regressor.
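The per-participant training and leave-one-out evaluation can be sketched as below, with hypothetical feature vectors standing in for the physiological and multimedia features and with illustrative dimensions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

# Hypothetical data for one participant: 40 clips, 100-dimensional features, arousal targets.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, 100)), rng.uniform(1, 9, 40)

preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

print(np.sqrt(np.mean((preds - y) ** 2)))   # leave-one-out RMSE for this participant
```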
Article
Dynamic texture (DT) is an extension of texture to the temporal domain. Description and recognition of DTs have attracted growing attention. In this paper, a novel approach for recognizing DTs is proposed and its simplifications and extensions to facial image analysis are also considered. First, the textures are modeled with volume local binary patterns (VLBP), which are an extension of the LBP operator widely used in ordinary texture analysis, combining motion and appearance. To make the approach computationally simple and easy to extend, only the co-occurrences of the local binary patterns on three orthogonal planes (LBP-TOP) are then considered. A block-based method is also proposed to deal with specific dynamic events such as facial expressions in which local information and its spatial locations should also be taken into account. In experiments with two DT databases, DynTex and Massachusetts Institute of Technology (MIT), both the VLBP and LBP-TOP clearly outperformed the earlier approaches. The proposed block-based method was evaluated with the Cohn-Kanade facial expression database with excellent results. The advantages of our approach include local processing, robustness to monotonic gray-scale changes, and simple computation.
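A simplified sketch of the LBP-TOP idea, computing uniform-pattern LBP histograms over the XY, XT and YT planes of a grey-level video volume and concatenating them, is given below using scikit-image; the block-based variant, neighbourhood parameters and histogram details of the paper are not reproduced.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(volume, P=8, R=1, n_bins=59):
    """Simplified LBP-TOP: uniform-pattern LBP histograms on three orthogonal planes."""
    hists = []
    for planes in (volume,                        # XY planes: one slice per frame
                   volume.transpose(1, 0, 2),     # XT planes: one slice per row y
                   volume.transpose(2, 0, 1)):    # YT planes: one slice per column x
        codes = np.concatenate(
            [local_binary_pattern(p, P, R, method='nri_uniform').ravel() for p in planes])
        hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
        hists.append(hist)
    return np.concatenate(hists)                  # 3 * 59 = 177-dimensional descriptor

# Toy usage on a random 20-frame 48x48 clip (a real pipeline would use aligned face crops).
clip = np.random.randint(0, 256, size=(20, 48, 48), dtype=np.uint8)
print(lbp_top(clip).shape)
```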