Conference Paper

Affective Computing in Computer Vision: A Study on Facial Expression Recognition

Article
Introduction: Nowadays, several robots have been developed to provide not only companionship to older adults, but also to cooperate with them during health and lifestyle activities. Despite the undeniable wealth of socially assistive robots (SARs), there is an increasing need to customize the tools used for measuring their acceptance in real-life applications. Methods: Within the Robot-Era project, a scale was developed to understand the degree of acceptance of the robotic platform. A preliminary test with 21 participants was performed to assess the statistical validity of the Robot-Era Inventory (REI) scales. Results: Based on the criteria observed in the literature, 41 items were developed and grouped into different scales (perceived robot personality, human–robot interaction, perceived benefit, ease of use, and perceived usefulness). The reliability of the Robot-Era Inventory scale was analyzed with Cronbach's alpha, with a mean value of 0.79 (range = 0.61–0.91). Furthermore, the preliminary validity of this scale has been tested using correlation analysis with a gold standard, the Unified Theory of Acceptance and Use of Technology (UTAUT) model. Discussion: The Robot-Era Inventory represents a useful tool that can be easily personalized and included in the assessment of any SARs that cooperate with older people in real-environment applications.
Article
When designing social robots, it is crucial to understand the diverse expectations of different kinds of innovation adopters. Different factors influence early adopters of innovations and mass market representatives’ perception of the usefulness of social robots. The first aim of the study was to test how applicable the technology acceptance model 3 (TAM3) is in the context of social robots. Participants’ acceptance of social robotics in a workplace environment in the fuzzy front-end (FFE) innovation phase of a robot development project was examined. Based on the findings for the model, we developed a reduced version of the TAM3 that is more applicable for social robots. The second objective was to analyze how early adopters’ and mass market representatives’ acceptance of social robots differs. Quantitative research methods were used. For early adopters, result demonstrability has a significant influence on perceived usefulness of social robots, while for mass market representatives, perceived enjoyment has a more significant influence on perceived usefulness. The findings indicate that users’ innovation adoption style influences the factors that users consider important in the usefulness of social robots. Robot developers should take these into account during the FFE innovation phase.
Article
Purpose: This paper aims to explore the implications of integrating humanoid service robots into hospitality service encounters by evaluating two service prototypes using Softbank Robotics’ popular service robot Pepper™: to provide information (akin to a receptionist) and to facilitate order-taking (akin to a server). Drawing both studies together, the paper puts forward novel, theory-informed yet context-rooted design principles for humanoid robot adoption in hospitality service encounters. Design/methodology/approach: Adopting a multiple-method qualitative approach, two service prototypes are evaluated with hospitality and tourism experts (N = 30, Prototype 1) and frontline hospitality employees (N = 18, Prototype 2) using participant observation, in situ feedback, semi-structured interviews and photo-elicitation. Findings: The adoption of humanoid service robots in hospitality is influenced by the following four layers of determinants: contextual, social, interactional and psychological factors, as well as extrinsic and intrinsic drivers of adoption. These empirical findings both confirm and extend previous conceptualizations of human-robot interaction (HRI) in hospitality service. Research limitations/implications: Despite using photo-elicitation to evoke insight regarding the use of different types of service robots in hospitality, the paper mostly focuses on anthropomorphized service robots such as Pepper™. Practical implications: Adopting humanoid service robots will transform hospitality operations, whereby the most routine, unpleasant tasks such as taking repeat orders or dealing with complaints may be delegated to service robots or human-robot teams. Social implications: Working with and receiving service from Pepper™ changes the service encounter from direct practical, technical considerations to more nuanced social and psychological implications, particularly around feelings of self-esteem, social pressure and social judgment. Originality/value: This paper presents one of the first empirical studies on HRI in hospitality service encounters using Softbank Robotics’ Pepper™. In doing so, the paper presents a novel framework for service robot adoption rooted in first-hand user interaction as opposed to previous, theory-driven conceptualizations of behavior or empirical studies exploring behavioral intention.
Article
The rapid growth of social networks and the propensity of users to communicate their physical activities, thoughts, expressions, and viewpoints in text, visual, and audio material have opened up new possibilities and opportunities in sentiment and activity analysis. Although sentiment and activity analysis of text streams has been extensively studied in the literature, it is relatively recent yet challenging to evaluate sentiment and physical activities together from visuals such as photographs and videos. This paper emphasizes human sentiment in a socially crucial field, namely social media disaster/catastrophe analysis, with associated physical activity analysis. We propose a multi-tagging sentiment and associated-activity analyzer fused with a deep human count tracker, a pragmatic technique for multiple-object tracking and counting in occluded circumstances with a reduced number of identity switches, applied to disaster-related videos and images. A crowdsourcing study has been conducted to analyze and annotate human activity and sentiments towards natural disasters and related images in social networks. The crowdsourcing study resulted in a large-scale benchmark dataset with three annotation sets, each addressing a distinct task. The presented analysis and dataset will anchor a baseline for future research in the domain. We believe that the proposed system will contribute to more viable communities by benefiting different stakeholders, such as news broadcasters, emergency relief organizations, and the public in general.
Article
This paper presents the interaction between humans and an NAO robot using deep convolutional neural networks (CNNs), based on an end-to-end pipeline that applies two optimized CNNs, one for face recognition (FR) and another for facial expression recognition (FER), in order to obtain real-time inference speed for the entire process. Two different models for FR are considered: one known to be very accurate but with low inference speed (faster region-based convolutional neural network), and one that is less accurate but has high inference speed (single shot detector convolutional neural network). For emotion recognition, transfer learning and fine-tuning of three CNN models (VGG, Inception V3, and ResNet) were used. The overall results show that the single shot detector convolutional neural network (SSD CNN) and faster region-based convolutional neural network (Faster R-CNN) models for face detection share almost the same accuracy: 97.8% for Faster R-CNN on PASCAL visual object classes (PASCAL VOC) evaluation metrics and 97.42% for SSD Inception. In terms of FER, ResNet obtained the highest training accuracy (90.14%), while the visual geometry group (VGG) network had 87% accuracy and Inception V3 reached 81%. The results show improvements of over 10% when using two serialized CNNs instead of only the FER CNN, while the recent optimizer, rectified adaptive moment optimization (RAdam), led to better generalization and a 3-4% accuracy improvement on each emotion recognition CNN.
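The transfer-learning step described above can be sketched in a few lines. The following is a minimal, hypothetical PyTorch/torchvision example of fine-tuning a pretrained ResNet backbone for seven expression classes; the class count, learning rate, and frozen-backbone choice are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: fine-tuning a pretrained ResNet for 7-class facial expression
# recognition, one plausible reading of the transfer-learning step described above.
import torch
import torch.nn as nn
from torchvision import models

NUM_EXPRESSIONS = 7  # assumption: anger, disgust, fear, happiness, sadness, surprise, neutral

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# Freeze the convolutional backbone so only the new classifier head is trained first.
for param in model.parameters():
    param.requires_grad = False
# Replace the ImageNet classifier with an expression head.
model.fc = nn.Linear(model.fc.in_features, NUM_EXPRESSIONS)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a batch of face crops (N, 3, 224, 224)."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```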
Chapter
This study proposes an ensemble approach to sentiment classification of text into a score in the range 1–5, from negative to positive. A high-performing model is produced from TripAdvisor restaurant reviews via a generated dataset of 684 word stems, gathered by information-gain attribute selection from the entire corpus. The best-performing classifier was an ensemble of Random Forest, Naive Bayes Multinomial and Multilayer Perceptron (neural network) methods combined via a vote on average probabilities. The best ensemble produced a classification accuracy of 91.02%, higher than the best single classifier, a Random Tree model with an accuracy of 78.6%. Other ensembles based on Adaptive Boosting, Random Forests and Voting are explored with ten-fold cross-validation. All ensemble methods far outperformed the best single-classifier methods. Even though very high accuracy is achieved, analysis of the model's error matrix shows that the few misclassified instances fall almost entirely close to their true class.
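The vote-on-average-probabilities ensemble maps naturally onto scikit-learn's soft-voting classifier. The sketch below is a hedged illustration under assumed TF-IDF features and hyperparameters; the paper's actual attribute selection (684 word stems by information gain) and tuning are not reproduced.

```python
# Hedged sketch: a vote-on-average-probabilities ensemble over TF-IDF features,
# mirroring the Random Forest + Multinomial Naive Bayes + MLP combination above.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("nb", MultinomialNB()),
        ("mlp", MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)),
    ],
    voting="soft",  # average the members' predicted class probabilities
)
model = make_pipeline(TfidfVectorizer(max_features=684), ensemble)

# model.fit(train_reviews, train_scores)   # scores in 1-5
# predictions = model.predict(test_reviews)
```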
Article
As an emerging research topic for Proximity Service (ProSe), automatic emotion recognition enables machines to understand the emotional changes of human beings, which can not only facilitate natural, effective, seamless, and advanced human-robot interaction (HRI) or human-computer interfaces (HCI), but also promote emotional health. Facial expression recognition (FER) is a vital task for emotion recognition. However, a significant gap between humans and machines remains in the FER task. In this paper, we present a conditional generative adversarial network (cGAN) based approach that alleviates intra-class variations by individually controlling the facial expressions and learning generative and discriminative representations simultaneously. The proposed framework consists of a generator G and three discriminators (Di, Da, and Dexp). The generator G transforms any query face image into another prototypic facial expression image with other factors preserved. Conditioned on action units (AUs), the generator G pays more attention to information relevant to facial expression. Three loss functions (LI, La, and Lexp) corresponding to the three discriminators (Di, Da, and Dexp) were designed to learn generative and discriminative representations. Moreover, after rendering the generated expression back to its original facial expression, a cycle consistency loss is also applied to preserve identity and produce more constrained visual representations. Optimized by combining both synthesis and classification loss functions, the learnt representation is explicitly disentangled from other variations such as identity, head pose, and illumination. Qualitative and quantitative experimental results demonstrate that the proposed FER system is effective for expression recognition.
Article
Purpose – The service sector is at an inflection point with regard to productivity gains and service industrialization similar to the industrial revolution in manufacturing that started in the 18th century. Robotics in combination with rapidly improving technologies like artificial intelligence (AI), mobile, cloud, big data and biometrics will bring opportunities for a wide range of innovations that have the potential to dramatically change service industries. This conceptual paper explores the potential role service robots will play in the future and advances a research agenda for service researchers. Design/methodology/approach – This paper uses a conceptual approach that is rooted in the service, robotics, and AI literature. Findings – The contribution of this article is threefold. First, it provides a definition of service robots, describes their key attributes, contrasts their features and capabilities with those of frontline employees, and provides an understanding of which types of service tasks robots will dominate and where humans will dominate. Second, this article examines consumer perceptions, beliefs and behaviors as related to service robots, and advances the service robot acceptance model (sRAM). Third, it provides an overview of the ethical questions surrounding robot-delivered services at the individual, market and societal level. Practical implications – This article helps service organizations and their management, service robot innovators, programmers and developers, and policymakers better understand the implications of a ubiquitous deployment of service robots. Originality/value – This is the first conceptual article that systematically examines key dimensions of robot-delivered frontline service and explores how these will differ in the future.
Article
Background: Depression is a common illness worldwide. Traditional procedures have drawn controversy and criticism regarding the accuracy and inter-clinician consistency of depression diagnosis and assessment. More objective biomarkers are needed for better treatment evaluation and monitoring. Hypothesis: Depression will leave recognizable markers in a patient’s acoustic, linguistic, and facial patterns, all of which have demonstrated increasing promise for more objectively evaluating and predicting a patient’s mental state. Methods: We applied a multi-modality fusion model to combine the audio, video, and text modalities, to identify the biomarkers that are predictive of depression with consideration of gender differences. Results: We identified promising biomarkers through a successive search over feature extraction analyses for each modality. We found that gender disparity in vocal and facial expressions plays an important role in detecting depression. Conclusion: Audio, video and text biomarkers provided the possibility of detecting depression in addition to traditional clinical assessments. Biomarkers detected in the gender-dependent analyses were not identical, indicating that gender can affect how depression manifests.
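A minimal late-fusion baseline in the spirit of the multimodal model above might simply concatenate per-modality feature vectors before classification. The sketch assumes pre-extracted audio, video, and text features and a logistic-regression classifier; it is illustrative only, not the authors' fusion model.

```python
# Hedged sketch of a simple late-fusion baseline: per-subject modality features
# are concatenated and fed to a single classifier. Feature extractors and
# dimensions are assumptions, not the authors' pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse(audio_feat, video_feat, text_feat):
    """Concatenate one subject's modality feature vectors into a single vector."""
    return np.concatenate([audio_feat, video_feat, text_feat])

# X_audio, X_video, X_text: per-subject feature arrays; y: 0/1 depression labels
# X = np.stack([fuse(a, v, t) for a, v, t in zip(X_audio, X_video, X_text)])
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```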
Conference Paper
Human emotions are a universal mode of interaction, and automated identification of human facial expressions has its own advantages. In this paper, the author proposes and develops a methodology to identify facial emotions using facial landmarks and a random forest classifier. First, faces are detected in each image using a histogram of oriented gradients with a linear classifier, an image pyramid, and a sliding-window detection scheme. Facial landmarks are then located using a model trained on the iBUG 300-W dataset. A feature vector is calculated from the identified facial landmarks and normalized using a proposed method in order to remove facial-size variations. The same feature vector is calculated for the neutral pose, and the vector difference is used to identify emotions with a random forest classifier. The well-known Extended Cohn-Kanade database has been used to train the random forest and to test the accuracy of the system.
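A landmark-difference pipeline of this kind can be sketched with dlib and scikit-learn. The predictor file name, the normalization scheme, and the forest size below are assumptions; the paper's own feature construction may differ.

```python
# Hedged sketch of the landmark-based pipeline described above: detect the face,
# locate 68 landmarks, form a size-normalised feature vector, subtract the
# neutral-pose vector, and classify with a random forest.
import numpy as np
import dlib
from sklearn.ensemble import RandomForestClassifier

detector = dlib.get_frontal_face_detector()  # HOG + linear classifier + image pyramid
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmark_vector(gray_image):
    """Return a scale-normalised landmark vector for the first detected face."""
    face = detector(gray_image, 1)[0]
    shape = predictor(gray_image, face)
    pts = np.array([[p.x, p.y] for p in shape.parts()], dtype=float)
    pts -= pts.mean(axis=0)     # remove translation
    pts /= np.linalg.norm(pts)  # remove face-size variation (assumed scheme)
    return pts.ravel()

# features = expression_vector - neutral_vector, one row per CK+ sample
# clf = RandomForestClassifier(n_estimators=300).fit(features, labels)
```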
Article
Facial emotion recognition (FER) is an important topic in the fields of computer vision and artificial intelligence owing to its significant academic and commercial potential. Although FER can be conducted using multiple sensors, this review focuses on studies that exclusively use facial images, because visual expressions are one of the main information channels in interpersonal communication. This paper provides a brief review of research in the field of FER conducted over the past decades. First, conventional FER approaches are described along with a summary of the representative categories of FER systems and their main algorithms. Deep-learning-based FER approaches using deep networks enabling “end-to-end” learning are then presented. This review also focuses on an up-to-date hybrid deep-learning approach combining a convolutional neural network (CNN) for the spatial features of an individual frame and long short-term memory (LSTM) for temporal features of consecutive frames. In the later part of this paper, a brief review of publicly available evaluation metrics is given, and a comparison with benchmark results, which are a standard for a quantitative comparison of FER studies, is described. This review can serve as a brief guidebook to newcomers in the field of FER, providing basic knowledge and a general understanding of the latest state-of-the-art studies, as well as to experienced researchers looking for productive directions for future work.
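The CNN-LSTM hybrid highlighted in the review can be sketched as a frame-level CNN encoder followed by an LSTM over the sequence. The layer sizes and seven-class output below are illustrative assumptions.

```python
# Hedged sketch of a CNN-LSTM hybrid: a CNN encodes each frame, an LSTM models
# the temporal evolution across the sequence, and a linear head classifies.
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmFER(nn.Module):
    def __init__(self, num_classes=7, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # 512-d frame embedding
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                  # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)           # (B*T, 3, H, W)
        feats = self.cnn(frames).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])              # logits from the final time step

# logits = CnnLstmFER()(torch.randn(2, 16, 3, 224, 224))
```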
Article
In order to avoid the complex explicit feature extraction process and the problem of low-level data operations involved in traditional facial expression recognition, we propose a Faster R-CNN (Faster Regions with Convolutional Neural Network Features) method for facial expression recognition in this paper. First, the facial expression image is normalized and implicit features are extracted using trainable convolution kernels. Then, max pooling is used to reduce the dimensionality of the extracted implicit features. After that, Region Proposal Networks (RPNs) are used to generate high-quality region proposals, which are used by Faster R-CNN for detection. Finally, the Softmax classifier and regression layer are used to classify the facial expressions and predict the bounding box of the test sample, respectively. The dataset is provided by the Chinese Linguistic Data Consortium (CLDC) and is composed of multimodal emotional audio and video data. Experimental results show the performance and generalization ability of Faster R-CNN for facial expression recognition. The mAP value is around 0.82.
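One plausible way to realize such a pipeline today is to adapt torchvision's Faster R-CNN implementation so that its detection head predicts expression classes. The sketch below follows the standard torchvision fine-tuning recipe; the class count (7 expressions plus background) is an assumption, and the original paper's implementation may differ.

```python
# Hedged sketch: swapping the box predictor of a pretrained Faster R-CNN so the
# detector localizes faces and labels them with expression classes.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=8)  # 7 expressions + background

# Training expects a list of images and a list of targets with 'boxes' and 'labels';
# at inference, model(images) returns per-image boxes, labels, and scores.
```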
Article
We present a baseline convolutional neural network (CNN) structure and image preprocessing methodology to improve facial expression recognition algorithms using CNNs. To analyze the most efficient network structure, we investigated four network structures that are known to show good performance in facial expression recognition. Moreover, we also investigated the effect of input image preprocessing methods. Five types of data input (raw, histogram equalization, isotropic smoothing, diffusion-based normalization, difference of Gaussian) were tested, and the accuracy was compared. We trained 20 different CNN models (4 networks × 5 data input types) and verified the performance of each network with test images from five different databases. The experimental results showed that a three-layer structure consisting of a simple convolutional and a max pooling layer with histogram equalization image input was the most efficient. We describe the detailed training procedure and analyze the test accuracy results based on extensive observation.
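The best-performing configuration reported above (histogram-equalized input into a shallow convolution/max-pooling network) can be sketched as follows; the filter counts and the 48 x 48 input size are assumptions.

```python
# Hedged sketch: histogram-equalised grayscale input feeding a shallow
# convolution + max-pooling network for 7 expression classes.
import cv2
import torch
import torch.nn as nn

def preprocess(gray_face):
    """Histogram equalisation, the input variant that performed best above."""
    eq = cv2.equalizeHist(gray_face)                            # uint8 HxW face crop
    return torch.from_numpy(eq).float().div(255).unsqueeze(0)   # (1, H, W)

model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(128 * 6 * 6, 7),   # assumes a 48x48 input (48 -> 24 -> 12 -> 6)
)
```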
Article
Although robots are starting to enter into our professional and private lives, little is known about the emotional effects which robots elicit. However, insights into this topic are an important prerequisite when discussing, for example, ethical issues regarding the question of what role we (want to) allow robots to play in our lives. In line with the Media Equation, humans may react towards robots as they do towards humans, making it all the more important to carefully investigate the preconditions and consequences of contact with robots. Based on assumptions on the socialness of reactions towards robots and anecdotal evidence of emotional attachments to robots (e.g. Klamer and BenAllouch in Trappl R. (ed.), Proceedings of EMCSR 2010, Vienna, 2010; Klamer and BenAllouch in Proceedings of the 27th International Conference on Human Factors in Computing Systems (CHI-2010), Atlanta, GA. ACM, New York, 2010; Kramer et al. in Appl. Artif. Intell. 25(6): 474-502, 2011), we conducted a study that provides further insights into the question of whether humans show emotional reactions towards Ugobe's Pleo, which is shown in different situations. We used a 2 × 2 design with one between-subjects factor "prior interaction with the robot" (never seen the robot before vs. 10-minute interaction with the robot) and a within-subject factor "type of video" (friendly interaction video vs. torture video). Following a multi-method approach, we assessed participants' physiological arousal and self-reported emotions as well as their general evaluation of the videos and the robot. In line with our hypotheses, participants showed increased physiological arousal during the reception of the torture video as compared to the normal video. They also reported fewer positive and more negative feelings after the torture video and expressed empathic concern for the robot. It appears that the acquaintance with the robot does not play a role, as "prior interaction with the robot" showed no effect.
Article
Factor-analytic evidence has led most psychologists to describe affect as a set of dimensions, such as displeasure, distress, depression, excitement, and so on, with each dimension varying independently of the others. However, there is other evidence that rather than being independent, these affective dimensions are interrelated in a highly systematic fashion. The evidence suggests that these interrelationships can be represented by a spatial model in which affective concepts fall in a circle in the following order: pleasure (0°), excitement (45°), arousal (90°), distress (135°), displeasure (180°), depression (225°), sleepiness (270°), and relaxation (315°). This model was offered both as a way psychologists can represent the structure of affective experience, as assessed through self-report, and as a representation of the cognitive structure that laymen utilize in conceptualizing affect. Supportive evidence was obtained by scaling 28 emotion-denoting adjectives in 4 different ways: R. T. Ross's (1938) technique for a circular ordering of variables, a multidimensional scaling procedure based on perceived similarity among the terms, a unidimensional scaling on hypothesized pleasure–displeasure and degree-of-arousal dimensions, and a principal-components analysis of 343 subjects' self-reports of their current affective states.
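The reported angular placement translates directly into coordinates on the valence-arousal plane. A small sketch, using only the angles quoted above:

```python
# Hedged sketch: placing the eight affect concepts on the circumplex by
# converting the reported angles to unit-circle (pleasure, arousal) coordinates.
import math

CIRCUMPLEX_DEGREES = {
    "pleasure": 0, "excitement": 45, "arousal": 90, "distress": 135,
    "displeasure": 180, "depression": 225, "sleepiness": 270, "relaxation": 315,
}

def circumplex_position(term):
    """Return (pleasure, arousal) coordinates on the unit circle for a term."""
    theta = math.radians(CIRCUMPLEX_DEGREES[term])
    return math.cos(theta), math.sin(theta)

# circumplex_position("excitement") -> (0.707..., 0.707...): pleasant and activated
```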
Article
Facial expression is a powerful, natural, and universal signal for human beings to convey their emotional states and intentions. Numerous studies have been conducted on automatic facial expression analysis because of its practical importance in sociable robotics, medical treatment, driver fatigue surveillance, and many other human-computer interaction systems. Various facial expression recognition (FER) systems have been explored to encode expression information from facial representations in the field of computer vision and machine learning. Traditional methods typically use handcrafted features or shallow learning for FER. However, related studies have collected training samples from challenging real-world scenarios, which implicitly promote the transition of FER from laboratory-controlled to in-the-wild settings since 2013. Meanwhile, studies in various fields have increasingly used deep learning methods, which achieve state-of-the-art recognition accuracy and remarkably exceed the results of previous investigations due to considerably improved chip processing abilities (e.g., GPU units) and appropriately designed network architectures. Moreover, deep learning techniques are increasingly utilized to handle challenging factors for emotion recognition in the wild because of the effective training of facial expression data. The transition of facial expression recognition from being laboratory-controlled to challenging in-the-wild conditions and the recent success of deep learning techniques in various fields have promoted the use of deep neural networks to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on the following important issues. 1) Deep neural networks require a large amount of training data to avoid overfitting. However, existing facial expression databases are insufficient for training common neural networks with deep architecture, which achieve promising results in object recognition tasks. 2) Expression-unrelated variations are common in unconstrained facial expression scenarios, such as illumination, head pose, and identity bias. These disturbances are nonlinearly confounded with facial expressions and therefore strengthen the requirement of deep networks to address the large intraclass variability and learn effective expression-specific representations. We provide a comprehensive review of deep FER, including datasets and algorithms that provide insights into these intrinsic problems, in this survey. First, we introduce the background of fields of FER and summarize the development of available datasets widely used in the literature as well as FER algorithms in the past 10 years. Second, we divide the FER system into two main categories according to feature representations, namely, static image and dynamic sequence FER. The feature representation in static-based methods is encoded with only spatial information from the current single image, whereas dynamic-based methods consider temporal relations among contiguous frames in input facial expression sequences. On the basis of these two vision-based methods, other modalities, such as audio and physiological channels, have also been used in multimodal sentiment analysis systems to assist in FER. Although pure expression recognition based on visible face images can achieve promising results, incorporating it with other models into a high-level framework can provide complementary information and further enhance the robustness. 
We introduce existing novel deep neural networks and related training strategies, which are designed for FER based on both static and dynamic image sequences, and discuss their advantages and limitations in state-of-the-art deep FER. Competitive performance and experimental comparisons of these deep FER systems in widely used benchmarks are also summarized. We then discuss relative advantages and disadvantages of these different types of methods with respect to two open issues (data size requirement and expression-unrelated variations) and other focuses (computation efficiency, performance, and network training difficulty). Finally, we review and summarize the following challenges in this field and future directions for the design of robust deep FER systems. 1) A lack of training data in terms of both quantity and quality is a main challenge in deep FER systems. Abundant sample images with diverse head poses and occlusions as well as precise face attribute labels, including expression, age, gender, and ethnicity, are crucial for practical applications. The crowdsourcing model under the guidance of expert annotators is a reasonable approach for massive annotations. 2) Data bias and inconsistent annotations are very common among different facial expression datasets due to various collecting conditions and the subjectiveness of annotating. Furthermore, the FER performance fails to improve when training data is enlarged by directly merging multiple datasets due to inconsistent expression annotations. Cross-database performance is an important evaluation criterion of generalizability and practicability of FER systems. Deep domain adaptation and knowledge distillation are promising trends to address this bias. 3) Another common issue is imbalanced class distribution in facial expression due to the practicality of sample acquirement. One solution is to resample and balance the class distribution on the basis of the number of samples for each class during the preprocessing stage using data augmentation and synthesis. Another alternative is to develop a cost-sensitive loss layer for reweighting during network training. 4) Although FER within the categorical model has been extensively investigated, the definition of prototypical expressions covers only a small portion of specific categories and cannot capture the full repertoire of expressive behavior for realistic interactions. Incorporating other affective models, such as FACS (facial action coding system) and dimensional models, can facilitate the recognition of facial expressions and allow them to learn expression-discriminative representations. 5) Human expressive behavior in realistic applications involves encoding from different perspectives, with facial expressions as only one modality. Although pure expression recognition based on visible face images can achieve promising results, incorporating it with other models into a high-level framework can provide complementary information and further enhance the robustness. For example, the fusion of other modalities, such as the audio information, infrared images, and depth information from 3D face models and physiological data, has become a promising research direction due to the large complementarity of facial expressions and the good application value of human-computer interaction (HCI) applications.
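The cost-sensitive reweighting mentioned in point 3) is commonly implemented as a class-weighted cross-entropy. A brief sketch, with purely illustrative class counts:

```python
# Hedged sketch of cost-sensitive reweighting: each expression class gets a
# weight inversely proportional to its frequency, so rare classes contribute
# more to the loss. The counts below are assumptions for illustration.
import torch
import torch.nn as nn

class_counts = torch.tensor([7000., 500., 900., 6000., 4000., 3000., 5000.])  # per-class sample counts (assumed)
weights = class_counts.sum() / (len(class_counts) * class_counts)             # inverse-frequency weights
criterion = nn.CrossEntropyLoss(weight=weights)

# loss = criterion(logits, labels)   # rare classes (e.g. index 1) are upweighted
```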
Article
Service robots continue to permeate and automate the hospitality sector. In doing so, these technological innovations are poised to radically change current service production and delivery practices and, consequently, service management and marketing strategies. This study explores the various impacts of robotization in the sector by offering one of the first empirical accounts of the current state of the art of service robotics as deployed in hospitality service encounters. The results suggest that service robots either support or substitute for employees in service encounters. They also offer hospitality businesses a novel point of differentiation, but only if properly integrated as part of wider marketing efforts. Finally, the automation of tasks, processes, and, ultimately, jobs has serious socioeconomic implications both at the microlevel and macrolevel. Consequently, hospitality executives need to consider where and how to apply robotization to strike a balance between operational efficiency and customer expectations. Displaying ethical leadership is key to reaping the benefits of the robot revolution.
Chapter
Human action recognition has gained popularity because of its worldwide applications such as video surveillance, video retrieval and human–computer interaction. This paper provides a comprehensive overview of notable advances made by deep neural networks in this field. First, the basic concept of action recognition and its common applications are introduced. Second, action recognition is categorized into action classification and action detection according to their respective research goals, and various deep learning frameworks for these recognition tasks are discussed in detail while the most challenging datasets and taxonomies are briefly reviewed. Finally, the limitations of the state of the art and promising research directions are briefly outlined.
Article
With the transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions and the recent success of deep learning techniques in various fields, deep neural networks have increasingly been leveraged to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting caused by a lack of sufficient training data and expression-unrelated variations, such as illumination, head pose and identity bias. In this paper, we provide a comprehensive survey on deep FER, including datasets and algorithms that provide insights into these intrinsic problems. First, we describe the standard pipeline of a deep FER system with the related background knowledge and suggestions of applicable implementations for each stage. We then introduce the available datasets that are widely used in the literature and provide accepted data selection and evaluation principles for these datasets. For the state of the art in deep FER, we review existing novel deep neural networks and related training strategies that are designed for FER based on both static images and dynamic image sequences, and discuss their advantages and limitations. Competitive performances on widely used benchmarks are also summarized in this section. We then extend our survey to additional related issues and application scenarios. Finally, we review the remaining challenges and corresponding opportunities in this field as well as future directions for the design of robust deep FER systems.
Article
Automated affective computing in the wild is a challenging problem in computer vision. Existing annotated databases of facial expressions in the wild are small and mostly cover discrete emotions (aka the categorical model). There are very limited annotated facial databases for affective computing in the continuous dimensional model (e.g., valence and arousal). To meet this need, we collected, annotated, and prepared for public distribution a new database of facial emotions in the wild (called AffectNet). AffectNet contains more than 1,000,000 facial images collected from the Internet by querying three major search engines using 1,250 emotion-related keywords in six different languages. About half of the retrieved images were manually annotated for the presence of seven discrete facial expressions and the intensity of valence and arousal. AffectNet is by far the largest database of facial expression, valence, and arousal in the wild, enabling research on automated facial expression recognition in two different emotion models. Two baseline deep neural networks are used to classify images in the categorical model and predict the intensity of valence and arousal. Various evaluation metrics show that our deep neural network baselines can perform better than conventional machine learning methods and off-the-shelf facial expression recognition systems.
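The two baselines described (categorical classification and valence/arousal regression) suggest a shared backbone with two heads. The sketch below is a hedged approximation; the backbone choice and layer sizes are assumptions rather than AffectNet's published baselines.

```python
# Hedged sketch of a dual-task model: one head classifies the discrete
# expression, another regresses continuous valence and arousal.
import torch
import torch.nn as nn
from torchvision import models

class AffectDualHead(nn.Module):
    def __init__(self, num_expressions=7):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                        # 512-d image embedding
        self.backbone = backbone
        self.expr_head = nn.Linear(512, num_expressions)   # categorical model
        self.va_head = nn.Linear(512, 2)                    # valence, arousal in [-1, 1]

    def forward(self, x):
        feats = self.backbone(x)
        return self.expr_head(feats), torch.tanh(self.va_head(feats))

# expr_logits, valence_arousal = AffectDualHead()(torch.randn(4, 3, 224, 224))
```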
Conference Paper
We report our image-based static facial expression recognition method for the Emotion Recognition in the Wild Challenge (EmotiW) 2015. We focus on the sub-challenge of the SFEW 2.0 dataset, where one seeks to automatically classify a set of static images into 7 basic emotions. The proposed method contains a face detection module based on the ensemble of three state-of-the-art face detectors, followed by a classification module with the ensemble of multiple deep convolutional neural networks (CNNs). Each CNN model is initialized randomly and pre-trained on a larger dataset provided by the Facial Expression Recognition (FER) Challenge 2013. The pre-trained models are then fine-tuned on the training set of SFEW 2.0. To combine multiple CNN models, we present two schemes for learning the ensemble weights of the network responses: by minimizing the log likelihood loss, and by minimizing the hinge loss. Our proposed method generates state-of-the-art results on the FER dataset. It also achieves 55.96% and 61.29%, respectively, on the validation and test sets of SFEW 2.0, surpassing the challenge baselines of 35.96% and 39.13% with significant gains.
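The first ensemble scheme (learning weights by minimizing the log-likelihood loss over the networks' softmax outputs) can be sketched as a small optimization over held-out predictions. Shapes and hyperparameters below are assumptions.

```python
# Hedged sketch: learn one non-negative weight per CNN so the weighted average
# of their softmax outputs minimizes cross-entropy on a held-out set.
import torch

def learn_ensemble_weights(member_probs, labels, steps=500, lr=0.05):
    """member_probs: (K, N, C) softmax outputs of K CNNs; labels: (N,) class ids."""
    logits_w = torch.zeros(member_probs.shape[0], requires_grad=True)
    opt = torch.optim.Adam([logits_w], lr=lr)
    for _ in range(steps):
        w = torch.softmax(logits_w, dim=0)                    # weights sum to 1
        mixed = (w[:, None, None] * member_probs).sum(dim=0)  # (N, C) averaged probabilities
        loss = torch.nn.functional.nll_loss(torch.log(mixed + 1e-12), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits_w, dim=0).detach()
```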
Chapter
Sentiment analysis is the task of automatically determining from text the attitude, emotion, or some other affectual state of the author. This chapter summarizes the diverse landscape of tasks and applications associated with sentiment analysis. We outline key challenges stemming from the complexity and subtlety of language use, the prevalence of creative and non-standard language, and the lack of paralinguistic information, such as tone and stress markers. We describe automatic systems and datasets commonly used in sentiment analysis. We summarize several manual and automatic approaches to creating valence- and emotion-association lexicons. We also discuss preliminary approaches for sentiment composition (how smaller units of text combine to express sentiment) and approaches for detecting sentiment in figurative and metaphoric language—these are the areas where we expect to see significant work in the near future.
Article
Social smiles of 10 visually impaired infants, ages 4 to 12 months, were examined longitudinally in play interactions with their mothers. Characteristics examined included the cognitive skills of the infants when the social smile was first seen, the parental behaviors that elicited and followed social smiles, and the frequency of social smiles in play interactions across the first year of life. All infants demonstrated both the presence of social smiles and the second Piagetian stage of cognitive development at the start of the study. Social smiling appeared to increase in frequency from 6 to 12 months except for a drop at 8 months. Smiles occurred in response to social and environmental events and were consistently followed by another parental social behavior.
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
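The training-time behavior described above is commonly implemented as "inverted" dropout, which rescales surviving activations so that no change is needed at test time; this is equivalent in expectation to the smaller-weights averaging in the abstract. A minimal sketch:

```python
# Hedged sketch of inverted dropout: randomly zero units during training and
# scale the survivors so the test-time forward pass needs no adjustment.
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Zero each unit with probability p; scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)
```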
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has low memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
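The update rule summarized above (adaptive estimates of the first and second gradient moments with bias correction) can be written compactly; the hyperparameter defaults below are the commonly used ones.

```python
# Hedged sketch of a single Adam update step.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update biased moment estimates, correct their bias, then take an
    adaptive gradient step. Returns the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)               # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```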
Article
Our goal is to reveal temporal variations in videos that are difficult or impossible to see with the naked eye and display them in an indicative manner. Our method, which we call Eulerian Video Magnification, takes a standard video sequence as input, and applies spatial decomposition, followed by temporal filtering to the frames. The resulting signal is then amplified to reveal hidden information. Using our method, we are able to visualize the flow of blood as it fills the face and also to amplify and reveal small motions. Our technique can run in real time to show phenomena occurring at the temporal frequencies selected by the user.
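A heavily simplified sketch of the idea (spatially smooth each frame, temporally band-pass every pixel, amplify, and add back) is given below; the single blur level, filter order, pass band, and gain are assumptions, and the published method uses a full spatial pyramid.

```python
# Hedged sketch of Eulerian-style magnification on a grayscale clip.
import numpy as np
import cv2
from scipy.signal import butter, filtfilt

def magnify(frames, fps, low_hz=0.8, high_hz=3.0, alpha=50, blur=15):
    """frames: (T, H, W) float array of a grayscale video."""
    # Spatial decomposition (here a single Gaussian blur level, not a full pyramid).
    lowpass = np.stack([cv2.GaussianBlur(f, (blur, blur), 0) for f in frames])
    # Temporal band-pass around the frequencies of interest (e.g. the pulse).
    b, a = butter(2, [low_hz, high_hz], btype="band", fs=fps)
    band = filtfilt(b, a, lowpass, axis=0)
    # Amplify the filtered signal and add it back to the input.
    return frames + alpha * band
```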
Article
A neural network model for visual pattern recognition, called the “neocognitron,” was previously proposed by the author. In this paper, we discuss the mechanism of the model in detail. In order to demonstrate the ability of the neocognitron, we also discuss a pattern-recognition system which works with the mechanism of the neocognitron. The system has been implemented on a minicomputer and has been trained to recognize handwritten numerals. The neocognitron is a hierarchical network consisting of many layers of cells, and has variable connections between the cells in adjoining layers. It can acquire the ability to recognize patterns by learning, and can be trained to recognize any set of patterns. After finishing the process of learning, pattern recognition is performed on the basis of similarity in shape between patterns, and is not affected by deformation, nor by changes in size, nor by shifts in the position of the input patterns. In the hierarchical network of the neocognitron, local features of the input pattern are extracted by the cells of a lower stage, and they are gradually integrated into more global features. Finally, each cell of the highest stage integrates all the information of the input pattern, and responds only to one specific pattern. Thus, the response of the cells of the highest stage shows the final result of the pattern-recognition of the network. During this process of extracting and integrating features, errors in the relative position of local features are gradually tolerated. The operation of tolerating positional error a little at a time at each stage, rather than all in one step, plays an important role in endowing the network with an ability to recognize even distorted patterns.