Article

Facial action coding system: A technique for the measurement of facial movement

... The Facial Action Coding System (FACS) [16] is one of the most impactful methods for analyzing facial behavior. It is a comprehensive, anatomy-based system capable of encoding diverse facial movements through combinations of fundamental Action Units (AUs). ...
... AUs represent specific facial configurations resulting from the contraction of one or more facial muscles and are not influenced by emotional interpretation. The earlier version of FACS [16] included 44 Action Units (AUs), with 30 of them anatomically linked to specific facial muscles, while the remaining 14 were categorized as miscellaneous actions. In a later version [17], the criteria were updated: AU25, AU26, and AU27 were merged based on intensity, as were AU41, AU42, and AU43. ...
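As a concrete illustration of the revision described above, the sketch below represents the two merge groups as data and maps a legacy AU code onto its merged group. The choice of Python, the helper name, and the convention of labeling each group by its lowest AU number are assumptions for illustration only, not part of the FACS manual.

```python
# Sketch: harmonizing legacy AU codes with the 2002 FACS revision described above.
# The two merge groups ({25, 26, 27} and {41, 42, 43}) come from the cited passage;
# labelling each group by its lowest AU number is an illustrative assumption.

MERGE_GROUPS = [frozenset({25, 26, 27}), frozenset({41, 42, 43})]

def harmonize_au(au: int) -> int:
    """Map a legacy AU code to a single group label if it was merged in the 2002 revision."""
    for group in MERGE_GROUPS:
        if au in group:
            return min(group)  # assumption: represent the merged action by its lowest code
    return au

assert harmonize_au(27) == 25   # merged with AU25/AU26 by intensity
assert harmonize_au(12) == 12   # unaffected AU passes through unchanged
```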
... Since FaceAdapter is designed for face swapping and reenactment, we modify the conditioning of their model so that only the expression of the generated images varies. For a fair comparison with these methods, we use the activated AUs associated with these two labels [16] to test our model, namely AU6+AU12 for happiness and AU1+AU4+AU15 for sadness. In each row of each emotion label, the AU intensity grows from 1 to 9 in steps of 2 from left to right. ...
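The test protocol in the passage above (AU6+AU12 for happiness, AU1+AU4+AU15 for sadness, intensities swept from 1 to 9 in steps of 2) can be sketched as follows. The 64-dimensional AU vector indexed directly by AU number is an assumed encoding for illustration, not the authors' actual implementation.

```python
import numpy as np

# Sketch of the test conditions described above: activate AU6+AU12 for happiness and
# AU1+AU4+AU15 for sadness, sweeping intensity from 1 to 9 in steps of 2.
# The 64-dimensional AU vector indexed by AU number is an assumed convention.

EMOTION_AUS = {"happiness": [6, 12], "sadness": [1, 4, 15]}

def au_condition(emotion: str, intensity: float, dim: int = 64) -> np.ndarray:
    """Return an AU-variation vector with the given intensity on the activated AUs."""
    vec = np.zeros(dim, dtype=np.float32)
    for au in EMOTION_AUS[emotion]:
        vec[au] = intensity
    return vec

conditions = {emo: [au_condition(emo, i) for i in range(1, 10, 2)]
              for emo in EMOTION_AUS}   # five intensity steps per emotion
```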
Preprint
We address the problem of facial expression editing by controlling the relative variation of facial action units (AUs) of the same person. This enables us to edit this specific person's expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details of high consistency. Specifically, to preserve the facial details with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of the current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at https://github.com/weimengting/MagicFace.
... For example, anger involves the brows being lowered, eyelids raised, and lips tightened. Surprise involves eyebrows high and curved and the eyes stretching open, with (white) sclera visible between the iris and eyelid (Ekman, 1992;Ekman et al., 2002). In general, the perception of surface shape is tied closely to the effect of illumination on an object's appearance, which includes shading, shadows, and highlights. ...
... The visual cues underlying expression recognition in humans can be described at various levels of analysis, including as visible configurations of facial actions (Ekman et al., 2002) or image-based landmark and textural cues (Sormaz et al., 2016;Young, 2021). Here, it is useful to delineate between visual features that result from the interaction between 3D shape and lighting direction (namely, shading gradients, shadows, and specular highlights) and features related to the reflectance properties of the face (i.e., the visual patterns caused by the sclera, eyebrows, and skin reflecting light to differing extent due to their different material properties). ...
... Changes in vertical lighting direction also alter the edge information present around the eye region, nose, and mouth ( Figure 8). The changes in appearance of the sclera and brows in particular can help to explain why neutral faces appear more negatively valenced (and more stimulated) when lit from below, as they align with featural characteristics that define emotions such as anger, fear, and surprise (i.e., exposed sclera, raised eyebrows; Ekman, 1992;Ekman et al., 2002). For example, when we express fear, we tend to widen our eyes, exposing more of the sclera (eye whites) to those around us. ...
Article
Full-text available
Our daily interactions draw on a shared language of what facial expressions mean, but accurate perception of these signals may be subject to the same challenges that characterize visual perception in general. One such challenge is that faces vary in their appearance with the context, partly due to the interaction between environmental lighting and the characteristic geometry of the human face. Here, we examine how asymmetries in lighting across the horizontal and vertical axes of the face influence the perception of facial expressions in human observers. In Experiment 1, we find that faces with neutral expression appear to bear a negatively valenced expression and appear higher in emotional arousal when lit from below—an illusion of facial expression where none really exists. In Experiment 2, we find that faces performing common emotional expressions are more often miscategorized when lit from below compared to when lit from above, specifically for angry and neutral expressions. These data show that changes in facial appearance related to illumination direction can modify visual cues relevant to social communication—and suggest that facial expression recognition in humans is partially adapted to (naturalistic) environments in which light arrives predominately from overhead.
... Spontaneous facial mimicry to emotional faces has been shown in a number of studies using facial electromyography (Dimberg et al., 2000;Kret et al., 2013;Tamietto et al., 2009) and methods such as the Facial Action Coding System (Ekman et al., 2002;Ekman & Friesen, 1978) or the Maximally Discriminative Facial Movement Coding System (Izard, 1979). The mimicry of facial muscle patterns occurs quickly (Dimberg, 1982;Lundqvist & Dimberg, 1995), and the perceiver does not need to be aware of one's own facial changes or even mimicked expressions (Bornemann et al., 2012;Dimberg et al., 2000;Moody et al., 2007;Tamietto et al., 2009). ...
... This limitation does not apply to experiments which focused on specific emotion categories by Hawk et al. (2012). To examine the participants' facial expression in more detail, the authors used the Facial Action Coding System (Ekman et al., 2002;Ekman & Friesen, 1978). In their study, the authors used nonverbal expressions of anger, disgust, sadness, and joy, as well as neutral sounds, represented by grunts, retching, laughter, sobs, and cries, among others. ...
... This finding is consistent with the emphasis that mimicry is particularly concerned with positive emotions (Bourgeois & Hess, 2008; Fischer et al., 2012; Hinsz & Tomhave, 1991; Olszanowski et al., 2020) and the argument that the zygomaticus major muscle can be considered "more social." As indicated earlier, although the smile is considered by some researchers primarily as an expression of experienced positive emotions (Ekman, 1992, 1993; Ekman & Friesen, 1978), according to the position of behavioral ecology (Fridlund, 1994, 2017), it also has important functions in social interactions. The social-functional approach (Martin et al., 2017) lists three elementary functions of smiling: It can reinforce desirable behavior in the recipient, signal dominance, and serve to sustain relationships by communicating an open attitude and willingness to establish rapport. ...
Article
Full-text available
Facial mimicry of visually observed emotional facial actions is a robust phenomenon. Here, we examined whether such facial mimicry extends to auditory emotional stimuli. We also examined if participants’ facial responses differ to sounds that are more strongly associated with congruent facial movements, such as vocal emotional expressions (e.g., laughter, screams), or less associated with movements, such as nonvocal emotional sounds (e.g., happy, scary instrumental sounds). Furthermore, to assess whether facial mimicry of sounds reflects visual–motor or auditory–motor associations, we compared individuals that vary on lifetime visual experience (sighted vs. blind). To measure spontaneous facial responding, we used facial electromyography to record the activity of the corrugator supercilii (frowning) and the zygomaticus major (smiling) muscles. During measurement, participants freely listened to the two types of emotional sounds. Both types of sounds were rated similarly on valence and arousal. Notably, only vocal, but not instrumental, sounds elicited robust congruent and selective facial responses. The facial responses were observed in both sighted and blind participants. However, the muscles’ responses of blind participants showed less differentiation between emotion categories of human vocalizations. Furthermore, the groups differed in the shape of the time courses of the zygomatic activity to human vocalizations. Overall, the study shows that emotion-congruent facial responses occur to nonvisual stimuli and are more robust to human vocalizations than instrumental sounds. Furthermore, the amount of lifetime visual experience matters little for the occurrence of cross-channel facial mimicry, but it shapes response differentiation.
... Since then, one pioneering work that has significantly impacted today's automatic FEA systems was done by Charles Darwin, who provided evidence for the existence of some basic emotions that are universal across cultures and ethnicities. Another was done by Ekman and his colleagues [Ekman 1978], who designed the Facial Action Coding System (FACS) to encode the states of facial expressions using facial Action Units (AUs). Until the 1980s, most FEA work was conducted by philosophers and psychologists (see a review of early work [Keltner et al. 2003]). ...
... Methods based on the movement of facial components use the movements of individual facial muscles to encode facial expression states. Examples of this approach include FACS [Ekman 1978], the Emotional Facial Action Coding System (EMFACS), the MAXimally discriminative facial movement coding system (MAX) [Izard et al. 1979], and the probability-based AU space [Zhao et al. 2016]. FACS, which was originally developed by Paul Ekman and his colleagues in 1978, defines a total of 44 AUs to encode movements of facial muscles. ...
... Early psychological studies [Dunlap 1927], [Ruckmick 1921] focused on the question of whether there is one facial area which can best distinguish among facial expressions. This question was later largely answered by the predominant evidence from researchers such as Ekman [Boucher and Ekman 1975], [Ekman et al. 2013] and Hanawalt [Hanawalt 1942] that the most distinctive facial component varies with each emotion. Using static photographs of posed facial expressions, it was generally found that the most important facial components are mouth/cheeks, eyes/eyelids, and brows/forehead, and that disgust is best distinguished from the mouth, fear from the eyes, sadness from both brows and eyes, happiness from both mouth and eyes, anger from mouth and brows, and surprise from all three components. ...
Preprint
Automatic machine-based Facial Expression Analysis (FEA) has made substantial progress in the past few decades driven by its importance for applications in psychology, security, health, entertainment and human computer interaction. The vast majority of completed FEA studies are based on non-occluded faces collected in a controlled laboratory environment. Automatic expression recognition tolerant to partial occlusion remains less understood, particularly in real-world scenarios. In recent years, efforts investigating techniques to handle partial occlusion for FEA have seen an increase. The context is right for a comprehensive perspective of these developments and the state of the art from this perspective. This survey provides such a comprehensive review of recent advances in dataset creation, algorithm development, and investigations of the effects of occlusion critical for robust performance in FEA systems. It outlines existing challenges in overcoming partial occlusion and discusses possible opportunities in advancing the technology. To the best of our knowledge, it is the first FEA survey dedicated to occlusion and aimed at promoting better informed and benchmarked future work.
... • A dataset generation method based on the Facial Action Coding System (FACS) [15]. • A spatiotemporal normalization and stabilization method, and a seed-based automated labeling method (semi-supervised learning). ...
... A common approach to mitigate the lack of face diversity is to base the analysis on theoretical muscle deformation and movement using cataloged studies grounded in human anatomy, such as the Facial Action Coding System (FACS), initially proposed by Ekman and Friesen in 1976 [15]. FACS contains a catalog of the main muscle deformations summarized into a set of standard Action Units (AUs); while FACS assumes these AUs are independent of human diversity variations, differences in intensity need to be addressed for each variation. ...
... While the FACS began as a static catalog in the 1970s due to technological limitations, it has progressively evolved into short-duration videos in recent years [13,15]. However, many of these videos are recorded in controlled environments, which may not fully capture the variability and noise typically encountered in real-world scenarios, as discussed in [9,10,13]. ...
Article
Full-text available
Visual biosignals can be used to analyze human behavioral activities and serve as a primary resource for Facial Expression Recognition (FER). FER computational systems face significant challenges, arising from both spatial and temporal effects. Spatial challenges include deformations or occlusions of facial geometry, while temporal challenges involve discontinuities in motion observation due to high variability in poses and dynamic conditions such as rotation and translation. To enhance the analytical precision and validation reliability of FER systems, several datasets have been proposed. However, most of these datasets focus primarily on spatial characteristics, rely on static images, or consist of short videos captured in highly controlled environments. These constraints significantly reduce the applicability of such systems in real-world scenarios. This paper proposes the Facial Biosignals Time–Series Dataset (FBioT), a novel dataset providing temporal descriptors and features extracted from common videos recorded in uncontrolled environments. To automate dataset construction, we propose Visual–Temporal Facial Expression Recognition (VT-FER), a method that stabilizes temporal effects using normalized measurements based on the principles of the Facial Action Coding System (FACS) and generates signature patterns of expression movements for correlation with real-world temporal events. To demonstrate feasibility, we applied the method to create a pilot version of the FBioT dataset. This pilot resulted in approximately 10,000 s of public videos captured under real-world facial motion conditions, from which we extracted 22 direct and virtual metrics representing facial muscle deformations. During this process, we preliminarily labeled and qualified 3046 temporal events representing two emotion classes. As a proof of concept, these emotion classes were used as input for training neural networks, with results summarized in this paper and available in an open-source online repository.
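To give a sense of what such FACS-inspired temporal descriptors can look like, the sketch below computes a few normalized landmark-distance signals over a video. The landmark pairs, the 68-point layout, and the inter-ocular normalization are illustrative assumptions, not the paper's 22 metrics.

```python
import numpy as np

# Sketch of the kind of FACS-inspired temporal descriptor the FBioT/VT-FER pipeline
# describes: per-frame distances between facial landmarks, normalized to reduce scale
# effects. The landmark pairs below (assuming the common 68-point layout) and the
# inter-ocular normalization are illustrative, not the paper's actual metrics.

LANDMARK_PAIRS = [(48, 54),   # mouth corner to mouth corner (lip stretch)
                  (51, 57),   # upper lip to lower lip (mouth opening)
                  (21, 22)]   # inner eyebrow points (brow gatherer)

def distance_signals(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (n_frames, 68, 2) array of 2D points -> (n_frames, n_pairs) signals."""
    left_eye, right_eye = landmarks[:, 36], landmarks[:, 45]
    scale = np.linalg.norm(right_eye - left_eye, axis=1, keepdims=True)  # inter-ocular distance
    dists = np.stack([np.linalg.norm(landmarks[:, a] - landmarks[:, b], axis=1)
                      for a, b in LANDMARK_PAIRS], axis=1)
    return dists / scale  # normalized time series, one column per virtual metric
```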
... On the other hand, there are descriptive coding alternatives that focus on the types of movement the human face can make. One of the most influential is the Facial Action Coding System (FACS) developed by Ekman and Friesen [8], which codes every possible movement of an expression as facial muscle actions called Action Units (AUs). Building on this, a hybrid coding can be defined by mixing both ideas, as in EMFACS [28], which defines emotion categories in terms of AUs. ...
... In addition, to compensate for the possible loss of performance caused by choosing a less complex design, they employ Label Distribution Learning (LDL) [38,9], an alternative to single-label training that aims to make training more robust, given that most facial expressions can be understood as a combination of emotions of different intensities [30]. Relatedly, Chen et al. [3] use this same strategy but incorporate information directly from the topology of an auxiliary descriptive label space, such as Action Units [8] or Landmarks, to address the problem of training with inconsistent annotations. ...
... As mentioned in Sec. 1.1, Action Units (AUs) encode movements of groups of facial muscles according to the Facial Action Coding System (FACS) [8]. In particular, the latest revision from 2002 defines 9 Action Units in the upper part of the face and 18 in the lower part, which can be seen in Figure 7. In addition, it includes 14 AUs related to head position and movement, 9 related to eye position and movement, and others for miscellaneous actions. ...
Article
Full-text available
Nowadays, the search for lightweight solutions that achieve results comparable to robust deep learning models has received particular attention due to their feasible deployment on mobile devices. One problem that could benefit from this quality is Facial Expression Recognition (FER). Considering that many facial expression datasets are annotated with categorical emotions even though most expressions exhibited in 'in the wild' scenarios occur as combinations or compositions of basic emotions, Label Distribution Learning (LDL) can be used as a training strategy. This work addresses the FER problem through lightweight neural networks trained with LDL. Under the assumption that images of facial expressions should have an emotion distribution similar to that of their neighborhood in a suitable auxiliary label space, such as the one determined by the Action Unit recognition task, the distribution information can be exploited and incorporated into the loss function. Concretely, we study the lightweight EfficientFace architecture in depth and analyze the impact of different approaches to implementing LDL on 'in the wild' datasets such as RAF-DB, CAER-S, FER+ and AffectNet.
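As a rough illustration of the Label Distribution Learning objective described above, the sketch below trains against a soft emotion distribution with a KL-divergence loss in PyTorch. The seven-class setup and the example target distribution are assumptions; this is not the EfficientFace training code.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the Label Distribution Learning (LDL) objective mentioned above:
# instead of a one-hot emotion label, the target is a distribution over basic emotions
# and the model is trained with a KL-divergence loss. Seven classes and the example
# target below are illustrative assumptions.

def ldl_loss(logits: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    """KL divergence between the target distribution and the predicted one, batch-averaged."""
    log_pred = F.log_softmax(logits, dim=1)
    return F.kl_div(log_pred, target_dist, reduction="batchmean")

logits = torch.randn(4, 7)                      # batch of 4, 7 basic-emotion classes
target = torch.full((4, 7), 0.05)
target[:, 3] = 0.70                             # e.g. mostly one emotion with some ambiguity
target = target / target.sum(dim=1, keepdim=True)
loss = ldl_loss(logits, target)
```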
... Eliciting spontaneous micro-expressions is a real challenge because it can be very difficult to induce emotions in participants and also get them to conceal them effectively in a lab-controlled environment. Micro-expression datasets need decent ground-truth labelling with Action Units (AUs) using the Facial Action Coding System (FACS) [27]. FACS objectively assigns AUs to the muscle movements of the face. ...
... Class VII relates to contempt and other AUs that have no emotional link in EMFACS [29]. It should be noted that the classes do not correspond directly to these emotions; however, the links used are informed by previous research [27,29,45]. Each movement in both datasets was classified based on the AU categories of Table 2, with the resulting frequency of movements shown in Table 3. ...
Preprint
Micro-expressions are brief spontaneous facial expressions that appear on a face when a person conceals an emotion, making them different to normal facial expressions in subtlety and duration. Currently, emotion classes within the CASME II dataset are based on Action Units and self-reports, creating conflicts during machine learning training. We will show that classifying expressions using Action Units, instead of predicted emotion, removes the potential bias of human reporting. The proposed classes are tested using LBP-TOP, HOOF and HOG 3D feature descriptors. The experiments are evaluated on two benchmark FACS-coded datasets: CASME II and SAMM. The best result achieves 86.35% accuracy when classifying the proposed 5 classes on CASME II using HOG 3D, outperforming the result of the state-of-the-art 5-class emotion-based classification on CASME II. Results indicate that classification based on Action Units provides an objective method to improve micro-expression recognition.
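The AU-based class assignment described above can be sketched as a simple rule table mapping coded AUs to objective classes. The groupings below are illustrative placeholders informed by common EMFACS-style links, not the paper's actual Table 2.

```python
# Sketch of the AU-based (rather than self-report) class assignment the abstract
# describes: each FACS-coded movement is mapped to an objective class from its AUs.
# The groupings below are illustrative placeholders, not the paper's actual Table 2.

CLASS_RULES = [
    ("I",   {6, 12}),      # e.g. smile-related AUs
    ("II",  {1, 2, 5}),    # e.g. brow raise / upper-lid raise
    ("III", {4, 7}),       # e.g. brow lowerer / lid tightener
]

def assign_class(coded_aus: set[int]) -> str:
    """Return the first class whose AU set intersects the movement's coded AUs."""
    for label, aus in CLASS_RULES:
        if coded_aus & aus:
            return label
    return "Other"

print(assign_class({4, 7, 9}))   # -> "III"
```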
... For each emotion, different facial muscles move to create the facial expression. The Facial Action Coding System (FACS) specifies the facial action units (AU) corresponding to each emotional expression [63]. The facial landmarks were used to segment each of the six basic emotions, as described below. ...
... Using fewer landmarks also reduces computational complexity. Landmarks numbered 48, 50, 52, 54, 57, 60, 61, 62, 63, 64, 65, 66, and 67 around the lips and landmarks on the nose numbered 27, 28, and 29 were excluded. This excluded landmark set affected the calculation of several distances given in Table 4. Therefore, we used the minor adjustments given in Table 9 to calculate 11 distances using the remaining 40 landmarks. ...
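A small sketch of this exclusion step, under the common 68-point landmark convention: drop the excluded indices and check which distance definitions still have both endpoints available. The distance definitions shown are illustrative, not those of Table 4 or Table 9.

```python
# Sketch of the landmark-exclusion step described above: drop the listed lip and nose
# indices from the 68-point layout and keep only distance definitions whose endpoints
# survive. The distance definitions below are illustrative, not the paper's tables.

EXCLUDED = {48, 50, 52, 54, 57, 60, 61, 62, 63, 64, 65, 66, 67, 27, 28, 29}

DISTANCES = {
    "eye_opening_left":  (37, 41),
    "brow_to_eye_left":  (19, 37),
    "mouth_width":       (48, 54),   # uses excluded points -> must be adjusted or dropped
}

usable = {name: pair for name, pair in DISTANCES.items()
          if not (set(pair) & EXCLUDED)}
needs_adjustment = {name: pair for name, pair in DISTANCES.items()
                    if set(pair) & EXCLUDED}
print(usable, needs_adjustment)
```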
Article
Full-text available
Psychological studies have demonstrated that the facial dynamics play a significant role in recognizing an individual’s identity. This study introduces a novel database (MYFED) and approach for person identification based on facial dynamics, to extract the identity-related information associated with the facial expressions of the six basic emotions (happiness, sadness, surprise, anger, disgust, and fear). Our contribution includes the collection of the MYFED database, featuring facial videos capturing both spontaneous and deliberate expressions of the six basic emotions. The database is uniquely tailored for person identification using facial dynamics of emotional expressions, ensuring an average of ten repetitions for each emotional expression per subject, a characteristic often absent in existing facial expression databases. Additionally, we present a novel person identification method leveraging dynamic features extracted from videos depicting the six basic emotions. Experimental results confirm that dynamic features of all emotional expressions contain identity-related information. Notably, surprise, happiness, and sadness expressions exhibit the highest levels of identity-related data in descending order. To our knowledge, this is the first research that comprehensively analyzes facial expressions of all six basic emotions for person identification. For further research and exploration, the MYFED database is made accessible to researchers via the MYFED database website.
... We found that existing emotional talking head datasets typically provide only tag-level annotations with limited emotion categories (e.g., "happy" or "sad") for talking videos, so we obtained fine-grained AU labels and natural text instructions through Action Unit Extraction and VLM Paraphrase. Action Units (AUs) are defined by Facial Action Coding System (FACS) [7] and are used to describe facial muscle movements. We extracted AU labels for each frame and the corresponding action descriptions using the ME-GraphAU model [20]. ...
Preprint
The generation of talking avatars has achieved significant advancements in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face fundamental challenges: a)the absence of frameworks for modeling single basic emotional expressions, which restricts the generation of complex emotions such as compound emotions; b)the lack of comprehensive datasets rich in human emotional expressions, which limits the potential of models. To address these challenges, we propose the following innovations: 1)the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states; 2)the DH-FaceEmoVid-150 dataset, specifically curated to include six prevalent human emotional expressions as well as four types of compound emotions, thereby expanding the training potential of emotion-driven models. Furthermore, to enhance the flexibility of emotion control, we propose an emotion-to-latents module that leverages multimodal inputs, aligning diverse control signals-such as audio, text, and labels-to ensure more varied control inputs as well as the ability to control emotions using audio alone. Through extensive quantitative and qualitative evaluations, we demonstrate that the MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in generating complex emotional expressions and nuanced facial details, setting a new benchmark in the field. These datasets will be publicly released.
... In the 1960s, Ekman et al. [5] discovered the phenomenon of MEs through the analysis of facial expressions exhibited by a patient with severe depression. Building on this discovery, Ekman et al. [6] proposed the Facial Action Coding System (FACS). This system divides the face into multiple independent action units (AUs) based on the anatomical principles of the human face. ...
Article
Full-text available
Micro-expressions often reveal more genuine emotions but are challenging to recognize due to their brief duration and subtle amplitudes. To address these challenges, this paper introduces a micro-expression recognition method leveraging regions of interest (ROIs). Firstly, four specific ROIs are selected based on an analysis of the optical flow and relevant action units activated during micro-expressions. Secondly, effective feature extraction is achieved using the optical flow method. Thirdly, a block partition module is integrated into a convolutional neural network to reduce computational complexity, thereby enhancing model accuracy and generalization. The proposed model achieves notable performance, with accuracies of 93.96%, 86.15%, and 81.17% for three-class recognition on the CASME II, SAMM, and SMIC datasets, respectively. For five-class recognition, the model achieves accuracies of 81.63% on the CASME II dataset and 84.31% on the SMIC dataset. Experimental results validate the effectiveness of using ROIs in improving micro-expression recognition accuracy.
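A short OpenCV sketch of the optical-flow step described above: dense Farneback flow between an onset frame and an apex frame, cropped to a region of interest. The file paths and ROI coordinates are placeholders; the paper's four ROIs are derived from its AU analysis.

```python
import cv2
import numpy as np

# Sketch of the optical-flow feature step described above: dense Farneback flow between
# an onset frame and an apex frame, then cropped to a region of interest (ROI).
# File paths and ROI coordinates are placeholders for illustration.

onset = cv2.imread("onset.png", cv2.IMREAD_GRAYSCALE)   # placeholder paths
apex = cv2.imread("apex.png", cv2.IMREAD_GRAYSCALE)

# Arguments: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(onset, apex, None, 0.5, 3, 15, 3, 5, 1.2, 0)

y0, y1, x0, x1 = 40, 80, 30, 90              # placeholder ROI (e.g. around one eyebrow)
roi_flow = flow[y0:y1, x0:x1]                # (h, w, 2) horizontal/vertical displacement
magnitude, angle = cv2.cartToPolar(roi_flow[..., 0], roi_flow[..., 1])
features = np.concatenate([magnitude.ravel(), angle.ravel()])   # simple flow descriptor
```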
... Among these, facial emotion recognition offers a direct, realtime window into an individual's emotional state. Unlike text, facial expressions reflect spontaneous, involuntary responses that are difficult to consciously control, making them a more immediate and less filtered measure of emotion [7]. ...
Preprint
Full-text available
Sentiment analysis of textual content has become a well-established solution for analyzing social media data. However, with the rise of images and videos as primary modes of expression, more information on social media is conveyed visually. Among these, facial expressions serve as one of the most direct indicators of emotional content in images. This study analyzes a dataset of Instagram posts related to the 2024 U.S. presidential election, spanning April 5, 2024, to August 9, 2024, to compare the relationship between textual and facial sentiment. Our findings reveal that facial expressions generally align with text sentiment, although neutral and negative facial expressions provide critical information beyond valence. Furthermore, during politically significant events such as Donald Trump's conviction and assassination attempt, posts depicting Trump showed a 12% increase in negative sentiment. Crucially, Democrats use their opponent's fear to depict weakness whereas Republicans use their candidate's anger to depict resilience. Our research highlights the potential of integrating facial expression analysis with textual sentiment analysis to uncover deeper insights into social media dynamics.
... Facial expression analysis is a critical tool for understanding human emotions and intentions. Research on facial expression analysis started in the 1970s [41,42]. Over years of development, facial expression analysis has made its way into everyday life. ...
Preprint
Full-text available
Facial expressions convey human emotions and can be categorized into macro-expressions (MaEs) and micro-expressions (MiEs) based on duration and intensity. While MaEs are voluntary and easily recognized, MiEs are involuntary, rapid, and can reveal concealed emotions. The integration of facial expression analysis with Internet-of-Thing (IoT) systems has significant potential across diverse scenarios. IoT-enhanced MaE analysis enables real-time monitoring of patient emotions, facilitating improved mental health care in smart healthcare. Similarly, IoT-based MiE detection enhances surveillance accuracy and threat detection in smart security. This work aims at providing a comprehensive overview of research progress in facial expression analysis and explores its integration with IoT systems. We discuss the distinctions between our work and existing surveys, elaborate on advancements in MaE and MiE techniques across various learning paradigms, and examine their potential applications in IoT. We highlight challenges and future directions for the convergence of facial expression-based technologies and IoT systems, aiming to foster innovation in this domain. By presenting recent developments and practical applications, this study offers a systematic understanding of how facial expression analysis can enhance IoT systems in healthcare, security, and beyond.
... [5] can be used to combine AUs into basic emotions. ...
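For illustration, combining AUs into basic emotions can be sketched as a rule table of prototypical AU sets. The combinations below follow commonly cited EMFACS-style links (e.g., AU6+AU12 for happiness) and are not reproduced from the cited work.

```python
# Illustrative sketch of combining AUs into basic emotions as referenced above.
# The combinations (e.g. AU6+AU12 -> happiness) follow commonly cited EMFACS-style
# links; the exact table used in the cited work is not reproduced here.

EMOTION_RULES = {
    "happiness": {6, 12},
    "sadness":   {1, 4, 15},
    "surprise":  {1, 2, 5, 26},
    "fear":      {1, 2, 4, 5, 7, 20, 26},
}

def emotions_from_aus(active_aus: set[int]) -> list[str]:
    """Return emotions whose prototypical AU set is fully contained in the active AUs."""
    return [emo for emo, aus in EMOTION_RULES.items() if aus <= active_aus]

print(emotions_from_aus({1, 4, 15, 17}))   # -> ["sadness"]
```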
Preprint
Full-text available
The ability to display rich facial expressions is crucial for human-like robotic heads. While manually defining such expressions is intricate, there already exist approaches to automatically learn them. In this work one such approach is applied to evaluate and control a robot head different from the one in the original study. To improve the mapping of facial expressions from human actors onto a robot head, it is proposed to use 3D landmarks and their pairwise distances as input to the learning algorithm instead of the previously used facial action units. Participants of an online survey preferred mappings from our proposed approach in most cases, though there are still further improvements required.
... Each API call returns an array of 63 floating point numbers ranging from 0 to 1. Each number indicates the activation strength of one of 63 facial expressions, such as jaw drop or left inner brow raiser, defined based on the Facial Action Coding System (FACS) [22]. As visualized in Fig.2, these measurements are typically used to adjust blend shapes to accurately map a user's facial expressions onto a 3D model. ...
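A minimal sketch of how such a 63-value activation array could drive blendshape weights follows. The expression-name list (truncated here) and the direct one-to-one mapping are assumptions for illustration, not the headset API or the study's pipeline.

```python
import numpy as np

# Minimal sketch of driving blendshapes from the 63 facial-expression activations
# described above (each in [0, 1]). The expression-name list and the one-to-one mapping
# from activations to blendshape weights are illustrative assumptions.

EXPRESSION_NAMES = ["jaw_drop", "inner_brow_raiser_l", "inner_brow_raiser_r"]  # ...up to 63 names

def activations_to_blendshapes(fea: np.ndarray) -> dict[str, float]:
    """Clip activations to [0, 1] and use them directly as blendshape weights.

    Note: zip pairs activations only with the names listed above (truncated here).
    """
    fea = np.clip(np.asarray(fea, dtype=np.float32), 0.0, 1.0)
    return {name: float(w) for name, w in zip(EXPRESSION_NAMES, fea)}

weights = activations_to_blendshapes(np.random.rand(63))
```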
Preprint
Full-text available
In this study, we explored the potential of utilizing Facial Expression Activations (FEAs) captured via the Meta Quest Pro Virtual Reality (VR) headset for Facial Expression Recognition (FER) in VR settings. Leveraging the EmojiHeroVR Database (EmoHeVRDB), we compared several unimodal approaches and achieved up to 73.02% accuracy for the static FER task with seven emotion categories. Furthermore, we integrated FEA and image data in multimodal approaches, observing significant improvements in recognition accuracy. An intermediate fusion approach achieved the highest accuracy of 80.42%, significantly surpassing the baseline evaluation result of 69.84% reported for EmoHeVRDB's image data. Our study is the first to utilize EmoHeVRDB's unique FEA data for unimodal and multimodal static FER, establishing new benchmarks for FER in VR settings. Our findings highlight the potential of fusing complementary modalities to enhance FER accuracy in VR settings, where conventional image-based methods are severely limited by the occlusion caused by Head-Mounted Displays (HMDs).
... A variety of facial motion capture and retargeting systems [Seol et al. 2012] [Huang et al. 2011] [Weise et al. 2011] use a blendshape representation. This is known as the Facial Action Coding System (FACS) [Ekman and Friesen 1978]. Generally, the previously mentioned methodologies establish mappings between examples of a user's facial expressions and examples of facial expressions represented by blendshapes, which are then used to control a face model. ...
Preprint
Inspired by kernel methods that have been used extensively in achieving efficient facial animation retargeting, this paper presents a solution to retargeting facial animation on a virtual character's face model based on the kernel projection of latent structure (KPLS) regression between semantically similar facial expressions. Specifically, a given number of corresponding semantically similar facial expressions are projected into the latent space. By using the Nonlinear Iterative Partial Least Square method, decomposition of the latent variables is achieved. Finally, the KPLS is achieved by solving a kernelized version of the eigenvalue problem. By evaluating our methodology against other kernel-based solutions, the efficiency of the presented methodology in transferring facial animation to face models with different morphological variations is demonstrated.
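A simplified, hedged sketch of the retargeting idea follows: ordinary PLS regression applied to RBF-kernel features of the source expressions. This approximates the kernel-PLS approach described above but is not the paper's NIPALS-based implementation; all data shapes below are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

# Simplified sketch of kernel-based expression retargeting: PLS regression on RBF-kernel
# features of the source expressions. This approximates the KPLS idea described above but
# is NOT the paper's NIPALS-based implementation; data shapes below are placeholders.

rng = np.random.default_rng(0)
X_src = rng.random((30, 50))       # 30 corresponding source expressions (e.g. rig weights)
Y_tgt = rng.random((30, 40))       # matching target-character expressions

K_train = rbf_kernel(X_src, X_src, gamma=0.1)      # kernel features in sample space
pls = PLSRegression(n_components=10).fit(K_train, Y_tgt)

x_new = rng.random((1, 50))                        # a new source expression to retarget
K_new = rbf_kernel(x_new, X_src, gamma=0.1)
y_retargeted = pls.predict(K_new)                  # predicted target expression weights
```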
... Facial behavior is the most powerful and natural means of expressing the affective and emotional states of human beings [1]. The Facial Action Coding System (FACS) developed by Ekman and Friesen [2] is a comprehensive and widely used system for facial behavior analysis, where a set of facial action units (AUs) are defined. According to the FACS [3], each facial AU is anatomically related to the contraction of a specific set of facial muscles, and combinations of AUs can describe rich and complex facial behaviors. ...
Preprint
It is challenging to recognize facial action unit (AU) from spontaneous facial displays, especially when they are accompanied by speech. The major reason is that the information is extracted from a single source, i.e., the visual channel, in the current practice. However, facial activity is highly correlated with voice in natural human communications. Instead of solely improving visual observations, this paper presents a novel audiovisual fusion framework, which makes the best use of visual and acoustic cues in recognizing speech-related facial AUs. In particular, a dynamic Bayesian network (DBN) is employed to explicitly model the semantic and dynamic physiological relationships between AUs and phonemes as well as measurement uncertainty. A pilot audiovisual AU-coded database has been collected to evaluate the proposed framework, which consists of a "clean" subset containing frontal faces under well controlled circumstances and a challenging subset with large head movements and occlusions. Experiments on this database have demonstrated that the proposed framework yields significant improvement in recognizing speech-related AUs compared to the state-of-the-art visual-based methods especially for those AUs whose visual observations are impaired during speech, and more importantly also outperforms feature-level fusion methods by explicitly modeling and exploiting physiological relationships between AUs and phonemes.
... For each video frame (recorded at 15 fps), OpenFace estimates head pose, a number of facial landmark positions and a number of Facial Action Coding System (FACS) [15] outputs. OpenFace provides features based on the facial action coding system [38] which attempts to measure the movement of facial landmarks which roughly correspond to individual facial muscles (see Table 3). OpenFace's performance has been benchmarked on public manually coded databases [37]. ...
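As a sketch of working with such OpenFace output, the snippet below reads per-frame AU intensities from the tool's CSV export. The column names (e.g., AU12_r, success) follow OpenFace's usual convention but should be verified against your own files, and the file path is a placeholder.

```python
import pandas as pd

# Sketch of reading per-frame AU estimates from an OpenFace output CSV. Column names such
# as "AU12_r" (intensity) and "success" follow OpenFace's usual convention, but should be
# checked against the header of your own output files; the file path is a placeholder.

df = pd.read_csv("openface_output.csv")
df.columns = df.columns.str.strip()                 # OpenFace headers may contain spaces

valid = df[df["success"] == 1]                      # keep frames where tracking succeeded
au12 = valid["AU12_r"]                              # lip-corner-puller intensity per frame
smile_frames = valid[au12 > 1.0]                    # simple threshold on AU12 intensity
print(len(smile_frames), "frames with raised lip corners")
```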
Preprint
We developed an online framework that can automatically pair two crowd-sourced participants, prompt them to follow a research protocol, and record their audio and video on a remote server. The framework comprises two web applications: an Automatic Quality Gatekeeper for ensuring only high quality crowd-sourced participants are recruited for the study, and a Session Controller which directs participants to play a research protocol, such as an interrogation game. This framework was used to run a research study for analyzing facial expressions during honest and deceptive communication using a novel interrogation protocol. The protocol gathers two sets of nonverbal facial cues in participants: features expressed during questions relating to the interrogation topic and features expressed during control questions. The framework and protocol were used to gather 151 dyadic conversations (1.3 million video frames). Interrogators who were lied to expressed the smile-related lip corner puller cue more often than interrogators who were being told the truth, suggesting that facial cues from interrogators may be useful in evaluating the honesty of witnesses in some contexts. Overall, these results demonstrate that this framework is capable of gathering high quality data which can identify statistically significant results in a communication study.
... Following this observation, Paul Ekman researched patterns of facial behavior among different cultures of the world. In 1978, Ekman and Friesen developed the Facial Action Coding System (FACS) to model human facial expressions [3]. In an updated form, this descriptive system is still in use today. Facial expressions are important markers of the emotional and cognitive inner state of a person. ...
Preprint
Automatic emotion recognition has become a trending research topic in the past decade. While works based on facial expressions or speech abound, recognizing affect from body gestures remains a less explored topic. We present a new comprehensive survey hoping to boost research in the field. We first introduce emotional body gestures as a component of what is commonly known as "body language" and comment on general aspects such as gender differences and culture dependence. We then define a complete framework for automatic emotional body gesture recognition. We introduce person detection and comment on static and dynamic body pose estimation methods both in RGB and 3D. We then comment on the recent literature related to representation learning and emotion recognition from images of emotionally expressive gestures. We also discuss multi-modal approaches that combine speech or face with body gestures for improved emotion recognition. While pre-processing methodologies (e.g. human detection and pose estimation) are nowadays mature technologies fully developed for robust large-scale analysis, we show that for emotion recognition the quantity of labelled data is scarce, there is no agreement on clearly defined output spaces, and the representations are shallow and largely based on naive geometrical representations.
... These methods do not need any labels, although they may depend on pretrained generators to yield high-resolution images. Some other studies rely on Action Unit (AU) labels [10], which are used to train the model to generate the edited expression images [11], [12], [13]. These methods are good both for continuous editing with the AU labels and for keeping the identity information. ...
Preprint
Full-text available
Facial expression manipulation aims to change human facial expressions without affecting face recognition. In order to transform facial expressions into target expressions, previous methods relied on expression labels to guide the manipulation process. However, these methods failed to preserve the details of facial features, which causes the weakening or loss of identity information in the output image. In our work, we propose WEM-GAN, short for wavelet-based expression manipulation GAN, which puts more effort into preserving the details of the original image during the editing process. Firstly, we take advantage of the wavelet transform technique and combine it with our generator, which has a U-net autoencoder backbone, in order to improve the generator's ability to preserve more details of facial features. Secondly, we also implement a high-frequency component discriminator and use a high-frequency domain adversarial loss to further constrain the optimization of our model, providing the generated face image with more abundant details. Additionally, in order to narrow the gap between generated facial expressions and target expressions, we use residual connections between the encoder and decoder, while also using relative action units (AUs) several times. Extensive qualitative and quantitative experiments have demonstrated that our model performs better in preserving identity features, editing capability, and image generation quality on the AffectNet dataset. It also shows superior performance in metrics such as Average Content Distance (ACD) and Expression Distance (ED).
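A minimal PyWavelets sketch of the wavelet idea referenced above: a single-level 2D DWT splits a face image into a low-frequency approximation and three high-frequency detail bands, the latter being the kind of signal a high-frequency discriminator would examine. The Haar wavelet and random input are placeholders.

```python
import numpy as np
import pywt

# Minimal sketch of the wavelet idea above: a single-level 2D DWT splits a face image
# into a low-frequency approximation and three high-frequency detail bands. The Haar
# wavelet and the random image below are placeholders for illustration.

face = np.random.rand(128, 128).astype(np.float32)   # stand-in for a grayscale face crop

cA, (cH, cV, cD) = pywt.dwt2(face, "haar")
high_freq = np.stack([cH, cV, cD])                   # detail bands: what a high-frequency
                                                     # discriminator would look at
reconstructed = pywt.idwt2((cA, (cH, cV, cD)), "haar")
```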
... One important research framework is the Facial Action Coding System (FACS) proposed by Ekman and Friesen in 1978, which utilizes Facial Action Units (FAUs) to decompose facial expressions into smaller components for analysis. In addition, Ekman and Friesen identified six basic facial expressions: anger, disgust, fear, happiness, sadness, and surprise, laying the foundation for subsequent research on FER (Ekman 1978). ...
Article
Full-text available
This study enhances the generalization ability and recognition accuracy of convolutional neural networks (CNNs) in educational settings, particularly in the task of Facial Expression Recognition (FER), to support effective analysis of classroom teaching behaviors. A novel multimodal teaching behavior analysis model is proposed to achieve this goal, combining the Adaptive Deep Convolutional Neural Network (ADCNN), an Adaptive Entropy Minimization algorithm, and Transfer Learning. The aim is to improve teaching quality through more precise behavior recognition. A Bayesian Network based on a Hybrid Deep Restricted Boltzmann Machine is introduced to enhance the model’s performance, enabling effective processing and analysis of multimodal data. The model’s performance in the FER task has been extensively validated through experiments, encompassing precise analysis of training and testing data. These methods allow the model to adapt to different datasets automatically and improve overall performance. In addition, the study collects data on classroom engagement and learning interest from Chinese university students through questionnaires, further supporting the model’s application in real educational environments. Experimental results show that the proposed model performs exceptionally well in the FER task, with an average training precision of approximately 99.79% and a testing precision of about 99.60%. The maximum difference between training and testing precision is 0.19%. The questionnaire results reveal issues such as low English proficiency and insufficient classroom engagement among students at Heilongjiang Bayi Agricultural University, providing data to support the improvement of teaching methods. The proposed multimodal teaching behavior analysis model provides an effective tool for optimizing classroom teaching by accurately recognizing students’ facial expressions and classroom behaviors.
... Despite searching for articles published from 2012 onwards, most of the studies were published in the last 5 years (n = 28, 84.8%; 26, 28, 53, 55, 57-59, 61-65, 67-71, 73, 75-84). The majority of studies were conducted in North America (n = 10, 30.3%; 53,57,60,64,66,72,75,78,80,84) or Europe (n = 9, 27.3%; 26,28,55,63,65,67,69,71,74), followed by those conducted in Asia (n = 8, 24.2%; 59, 61-62, 68, 70, 73, 76-77) and Australia (n = 1, 3.0%; 82). The remaining studies used international samples (n = 2, 6.1%; 56, 81) or did not contain information on geographical location (n = 3, 9.1%; 58,79,83). ...
Article
Full-text available
Background While anxiety disorders are one of the most prevalent mental diseases, they are often overlooked due to shortcomings of the existing diagnostic procedures, which predominantly rely on self-reporting. Due to recent technological advances, this source of information could be complemented by the so-called observable cues – indicators that are displayed spontaneously through individuals’ physiological responses or behaviour and can be detected by modern devices. However, while there are several individual studies on such cues, this research area lacks a synthesis. In line with this, our scoping review aimed to identify observable cues that offer meaningful insight into individuals’ anxiety and to determine how these cues can be measured. Methods We followed the PRISMA guidelines for scoping reviews. The search string containing terms related to anxiety and observable cues was entered into four databases (Web of Science, MEDLINE, ERIC, IEEE). While the search – limited to English peer-reviewed records published from 2012 onwards – initially yielded 2311 records, only 33 articles fit our selection criteria and were included in the final synthesis. Results The scoping review unravelled various categories of observable cues of anxiety, specifically those related to facial expressions, speech and language, breathing, skin, heart, cognitive control, sleep, activity and motion, location data and smartphone use. Moreover, we identified various approaches for measuring these cues, including wearable devices, and analysing smartphone usage and social media activity. Conclusions Our scoping review points to several physiological and behavioural cues associated with anxiety and highlights how these can be measured. These novel insights may be helpful for healthcare practitioners and fuel future research and technology development. However, as many cues were investigated only in a single study, more evidence is needed to generalise these findings and implement them into practice with greater confidence.
... Each registration contains a mesh with N_V vertices and estimated 3DMM parameters Θ. See Fig. 2 for an example. Our expression basis E is FACS-like [EF78] and was authored by an artist; specifically, we purchased a set of blendshapes from Polywink [Pol]. Each basis element controls a localized area and guarantees a stable skull. ...
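A minimal sketch of evaluating such a FACS-like blendshape basis: the deformed mesh is the neutral mesh plus a weighted sum of per-expression vertex offsets. The vertex count, basis size, and weights below are placeholders rather than the purchased rig.

```python
import numpy as np

# Minimal sketch of evaluating a FACS-like blendshape basis as described above: the
# deformed mesh is the neutral mesh plus a weighted sum of per-expression vertex offsets.
# Vertex count, basis size, and weights are placeholders.

n_vertices, n_shapes = 5000, 52
neutral = np.zeros((n_vertices, 3))                        # neutral face vertices
basis = np.random.randn(n_shapes, n_vertices, 3) * 1e-3    # per-expression vertex offsets

weights = np.zeros(n_shapes)
weights[10] = 0.8                                          # partially activate one shape

deformed = neutral + np.tensordot(weights, basis, axes=1)  # resulting mesh, (n_vertices, 3)
```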
Preprint
Full-text available
Nowadays, it is possible to scan faces and automatically register them with high quality. However, the resulting face meshes often need further processing: we need to stabilize them to remove unwanted head movement. Stabilization is important for tasks like game development or movie making which require facial expressions to be cleanly separated from rigid head motion. Since manual stabilization is labor-intensive, there have been attempts to automate it. However, previous methods remain impractical: they either still require some manual input, produce imprecise alignments, rely on dubious heuristics and slow optimization, or assume a temporally ordered input. Instead, we present a new learning-based approach that is simple and fully automatic. We treat stabilization as a regression problem: given two face meshes, our network directly predicts the rigid transform between them that brings their skulls into alignment. We generate synthetic training data using a 3D Morphable Model (3DMM), exploiting the fact that 3DMM parameters separate skull motion from facial skin motion. Through extensive experiments we show that our approach outperforms the state-of-the-art both quantitatively and qualitatively on the tasks of stabilizing discrete sets of facial expressions as well as dynamic facial performances. Furthermore, we provide an ablation study detailing the design choices and best practices to help others adopt our approach for their own uses. Supplementary videos can be found on the project webpage syntec-research.github.io/FaceStab.
... The Facial Action Coding System (FACS) is a valuable quantitative tool to describe facial expressions in humans (Human FACS, Ekman & Friesen, 1978) and other species (e.g., MaqFACS, Parr et al. 2010). Via FACS, it is possible to associate an unambiguous code (an AU) with a specific morphological change in the face. ...
Article
Facial communication regulates many aspects of social life in human and nonhuman primates. Empirically identifying distinct facial expressions and their underlying functions can help illuminate the evolution of species' communicative complexity. We focused on bared‐teeth faces (BTFs), a highly versatile facial expression in the tolerant macaque Macaca tonkeana . By employing a diverse array of techniques (MaqFACS, unsupervised cluster analysis, Levenshtein distance, NetFACS), we quantitatively discriminated two distinct BTFs: bared‐teeth (BT) and open mouth bared‐teeth (OMBT), and evaluated their distribution across peaceful, playful, and agonistic contexts. Neither BT nor OMBT were context‐specific, although BT frequently occurred during peaceful interactions and with low levels of stereotypy. OMBT was highly stereotyped during play, a context involving strong unpredictability. The presence of tongue‐protrusion during OMBT was exclusive to peaceful contexts whereas the presence of glabella‐lowering during BT and OMBT was specific to agonistic contexts. Hence, BT and OMBT per se are not context‐specific, but their contextual relevance hinges on the inclusion of specific key elements. Moving forward, concurrent analyses of stereotypy and specificity should extend beyond our study to encompass other primate and non‐primate species, facilitating direct comparisons and revealing how communicative and social complexity coevolve.
... Additionally, some extract emotion features from latent spaces [25,34] leading to an implicit transformation of emotion. Although Facial Action Units (AUs) effectively describe emotions, various methods [13,14,16] use different combinations of AUs for the same emotion. Moreover, AUs alone remain insufficient. ...
Preprint
While existing one-shot talking head generation models have achieved progress in coarse-grained emotion editing, there is still a lack of fine-grained emotion editing models with high interpretability. We argue that for an approach to be considered fine-grained, it needs to provide clear definitions and sufficiently detailed differentiation. We present LES-Talker, a novel one-shot talking head generation model with high interpretability, to achieve fine-grained emotion editing across emotion types, emotion levels, and facial units. We propose a Linear Emotion Space (LES) definition based on Facial Action Units to characterize emotion transformations as vector transformations. We design the Cross-Dimension Attention Net (CDAN) to deeply mine the correlation between LES representation and 3D model representation. Through mining multiple relationships across different feature and structure dimensions, we enable LES representation to guide the controllable deformation of 3D model. In order to adapt the multimodal data with deviations to the LES and enhance visual quality, we utilize specialized network design and training strategies. Experiments show that our method provides high visual quality along with multilevel and interpretable fine-grained emotion editing, outperforming mainstream methods.
Article
Full-text available
Affective computing is an emerging area of education research and has the potential to enhance educational outcomes. Despite the growing number of literature studies, there are still deficiencies and gaps in the domain of affective computing in education. In this study, we systematically review affective computing in the education domain. Methods: We queried four well-known research databases, namely the Web of Science Core Collection, IEEE Xplore, ACM Digital Library, and PubMed, using specific keywords for papers published between January 2010 and July 2023. Various relevant data items are extracted and classified based on a set of 15 extensive research questions. Following the PRISMA 2020 guidelines, a total of 175 studies were selected and reviewed in this work from among 3102 articles screened. The data show an increasing trend in publications within this domain. The most common research purpose involves designing emotion recognition/expression systems. Conventional textual questionnaires remain the most popular channels for affective measurement. Classrooms are identified as the primary research environments; the largest research sample group is university students. Learning domains are mainly associated with science, technology, engineering, and mathematics (STEM) courses. The bibliometric analysis reveals that most publications are affiliated with the USA. The studies are primarily published in journals, with the majority appearing in the Frontiers in Psychology journal. Research gaps, challenges, and potential directions for future research are explored. This review synthesizes current knowledge regarding the application of affective computing in the education sector. This knowledge is useful for future directions to help educational researchers, policymakers, and practitioners deploy affective computing technology to broaden educational practices.
Article
Full-text available
Autism spectrum disorder (ASD) is characterized by impairments in social affective engagement. The present study uses a mild social stressor task to add to inconclusive past literature concerning differences in affective expressivity between autistic young adults and non-autistic individuals from the general population (GP). Young adults (mean age = 21.5) diagnosed with ASD (n = 18) and a non-autistic comparison group (n = 17) participated in the novel social stress task. Valence (positive/negative) and intensity of facial affect were coded across four observational episodes that alternated between engagement and disengagement of social conversational partner. Results indicated an overall attenuation in expressivity in the ASD group in comparison to the non-autistic group. Mean affect differed between groups, especially in the amount of affective expression. Both groups responded with increased positive expressions during social engagement episodes. The affect difference was driven by a smaller proportion of positive and a greater proportion of neutral affect displays in the ASD group compared to the non-autistic group during these episodes, and less so by negative affect differences. The results suggest that friendly, non-threatening social interactions should not be assumed to be aversive to autistic individuals, and that these individuals may respond to such situations with muted positive valence. These findings are consistent with past reports of decreased expressivity in autistic individuals compared to individuals from the general population, specifically in an ecologically valid social context.
Chapter
The relatively scarce research on emotional meaning in expressive body movement has been enacted via a variety of different displays, including but not limited to dance, gait, posture, and dynamic movement cues. Collectively, this research demonstrates that emotion communication utilizing static and dynamic cues based upon body configuration and movement dynamics is a viable nonverbal channel in its own right, capable of demonstrating accuracy at levels equal to and developmentally aligned with other nonverbal channels such as the face, voice, and music. The goal of this chapter is to provide a brief overview of the expressive body movement research efforts, articulate the specific challenges of research in this medium, identify overlooked theoretical connections, and map out future directions to further extend findings.
Chapter
Dating back to Charles Darwin’s observations of how humans express certain basic emotions, such as happiness, fear, and anger, the prevailing view among social scientists has been on the correspondence between certain emotions and discrete patterns of facial cues that communicate these emotions. Receiving far less attention is the question of whether there is any similar correspondence between certain emotions and the vocal and/or bodily expression of these emotions. In this chapter, we explore the research on this fundamental question. We begin by addressing the nature of emotions and the range of emotions we experience, including those emotions that are not associated with any corresponding facial activity. Additionally, we consider the role that the context plays in the bodily and vocal communication of emotion. For instance, we focus on how and with what effect we use our bodies and voices to express emotions most prevalent in the context of interpersonal conflict. We extend that analysis to questions of how non-facial expressions of self-conscious emotions impact the need for and process of conflict management interventions.
Chapter
This chapter discusses how nonverbal behaviors signal credibility, relational messages such as trust and affection or liking, and deception. The face is the most widely studied channel for credibility, trust, and deception, but other useful codes, including the voice, head movements, hand gestures, and eye movements, are introduced as well. Using three illustrative investigations, group conversations while playing a game, a mock hiring task, and asynchronous job interviews, we demonstrate the utility of a wide variety of cues for assessments of credibility, composure, affection, and trustworthiness. The chapter summarizes the results of the three studies and provides insights into other applications, including interview training, business negotiations, and interdisciplinary collaboration.
Chapter
Faces are capable of accurately indicating information not only about the identity, age, and sex of the individuals to whom they belong, but also about their emotional states and intentions. Their recognition is essential in social animals, such as primates, which share the ability to express and perceive emotions through this mechanism of great adaptive importance. Humans are considered experts in recognizing conspecific faces and can identify and differentiate innumerable faces across contexts throughout life. Face recognition and discrimination are thus highly relevant skills: they are essential in multiple aspects of everyday life, such as socialization, and can be impaired by neurological and mental disorders (e.g., autism, prosopagnosia, and schizophrenia). The natural processing of faces occurs through an integrated analysis of facial features based on specific aspects of the face, such as the eyes, nose, and mouth, as well as their spatial organization, which researchers term holistic face processing. This chapter addresses face recognition from a neuropsychiatric perspective, reviewing the state of the art, the brain areas involved, and the processing paradigms used for research purposes, together with results of studies carried out with humans and other nonhuman primate species.
Book
Full-text available
This book argues that our success in navigating the social world depends heavily on scripts. Scripts play a central role in our ability to understand social interactions shaped by different contextual factors. In philosophy of social cognition, scholars have asked what mechanisms we employ when interacting with other people or when cognizing about other people. Recent approaches acknowledge that social cognition and interaction depend heavily on contextual, cultural, and social factors that contribute to the way individuals make sense of the social interactions they take part in. This book offers the first integrative account of scripts in social cognition and interaction. It argues that we need to make contextual factors and social identity central when trying to explain how social interaction works, and that this is possible via scripts. Additionally, scripts can help us understand bias and injustice in social interaction. The author’s approach combines several different areas of philosophy – philosophy of mind, social epistemology, feminist philosophy – as well as sociology and psychology to show why paying attention to injustice in interaction is much needed in social cognition research, and in philosophy of mind more generally. Scripts and Social Cognition: How We Interact with Others will appeal to scholars and graduate students working in philosophy of mind, philosophy of psychology, social epistemology, social ontology, sociology, and social psychology.
Article
We introduce VOODOO XP: a 3D-aware one-shot head reenactment method that can generate highly expressive facial expressions driven by an input video from a single 2D portrait. Our approach is real-time, view-consistent, and can be instantly used without calibration or fine-tuning. We demonstrate our solution in a monocular video setting and an end-to-end VR telepresence system for two-way communication. Compared to 2D head reenactment methods, 3D-aware approaches aim to preserve the identity of the subject and ensure view-consistent facial geometry for novel camera poses, which makes them suitable for immersive applications. While various facial disentanglement techniques have been introduced, cutting-edge 3D-aware neural reenactment techniques still lack expressiveness and fail to reproduce complex and fine-scale facial expressions. We present a novel cross-reenactment architecture that directly transfers the driver's facial expressions to transformer blocks of the input source's 3D lifting module. We show that highly effective disentanglement is possible using a new multi-stage self-supervision approach. It relies on a coarse-to-fine training strategy, which is combined with explicit face neutralization and 3D lifted frontalization during its initial training stage. We further integrate our novel head reenactment solution into an accessible high-fidelity VR telepresence system, where any person can instantly build a personalized neural head avatar from any photo and bring it to life using the headset. Furthermore, our proposed method demonstrates state-of-the-art expressiveness and likeness preservation on diverse subjects and capture conditions.
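The cross-reenactment idea summarized above, injecting the driver's expression features into the transformer blocks of the source's 3D lifting module, can be illustrated with a minimal sketch. This is not the authors' implementation; the class name ExpressionInjectionBlock, the token shapes, and the dimensions are hypothetical assumptions chosen only to show the general pattern of cross-attention from source identity tokens to driver expression tokens.

```python
# Hypothetical sketch of cross-attention-based expression injection,
# loosely inspired by the cross-reenactment idea described above.
# Names and dimensions are illustrative assumptions, not the VOODOO XP code.
import torch
import torch.nn as nn


class ExpressionInjectionBlock(nn.Module):
    """One transformer block whose tokens also attend to driver expression tokens."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, src_tokens: torch.Tensor, drv_expr_tokens: torch.Tensor) -> torch.Tensor:
        # Self-attention over the source (identity/geometry) tokens.
        x = src_tokens
        q = self.norm1(x)
        x = x + self.self_attn(q, q, q)[0]
        # Cross-attention: source tokens query the driver's expression tokens,
        # which is one way to "inject" expression into a lifting module.
        x = x + self.cross_attn(self.norm2(x), drv_expr_tokens, drv_expr_tokens)[0]
        return x + self.ff(self.norm3(x))


if __name__ == "__main__":
    block = ExpressionInjectionBlock()
    src = torch.randn(1, 196, 256)   # tokens from the source portrait (assumed shape)
    drv = torch.randn(1, 32, 256)    # expression tokens from the driver frame (assumed shape)
    print(block(src, drv).shape)     # torch.Size([1, 196, 256])
```

Cross-attention is used here because it lets the source tokens keep their own geometry while selectively reading expression information from the driver; the actual architecture, training objectives, and disentanglement stages in the cited work are considerably more involved.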
Article
This study analyzed changes in facial expressions over time to examine how rapport is formed when discussing various topics. Changes in facial expression were captured by action units (AUs) and 55 landmark locations in the 2D eye region. The topics included (a) introduction and greeting, which are conventional conversations; (b) favorite food; (c) journeys and watching movies; and (d) a future in which humans and AI interact. The data used in this study are 15–20-min recorded videos (.MP4 format) of conversations between 29 human participants and a virtual agent named Hazumi1902, operated by the Wizard-of-Oz method. Multimodal information generated by the participants during the dialogue was recorded using video and a Microsoft Kinect sensor. The AUs and the 2D eye region landmark locations were detected with OpenFace 2.0. The intensity of the AUs and the locations of the 2D eye region landmarks were analyzed using the Kruskal–Wallis test, combining conversation type with rapport criterion-type measurements. The study found that AU intensity alone was insufficient, and that the 2D eye region landmarks were necessary, for analyzing how facial expressions were affected by perceptions of rapport and conversation type. One of the criterion measurements, "cold," was not reflected in changes of AU intensity. It was concluded that the AUs were not universal and that the locations of the 2D eye region landmarks played a crucial role in complementing the analysis of their intensity. Changes in AU intensity and in the 2D eye region landmarks were observed in harmonious conversations. Factors hindering rapport appeared in the eyes, whereas those promoting rapport appeared in the AUs. These insights could be valuable in various fields, from human–computer interaction to non-verbal communication.
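As a rough illustration of the kind of analysis described above, Kruskal–Wallis tests on OpenFace 2.0 AU intensities grouped by conversation topic, the following sketch assumes an OpenFace-style CSV with per-frame AU intensity columns (e.g., AU06_r, AU12_r) plus a hypothetical topic column added during annotation. It is not the study's analysis code, and the filename and grouping column are assumptions.

```python
# Hedged sketch: Kruskal-Wallis test on OpenFace-style AU intensities per topic.
# Assumes a CSV with per-frame AU intensity columns (e.g. "AU06_r", "AU12_r")
# plus a hand-added "topic" column; illustrative only, not the study's code.
import pandas as pd
from scipy.stats import kruskal

df = pd.read_csv("openface_output_with_topics.csv")  # hypothetical file
df.columns = df.columns.str.strip()                   # OpenFace CSV headers are often padded with spaces

# All AU intensity columns follow the "AUxx_r" naming convention in OpenFace output.
au_columns = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]

for au in au_columns:
    # One sample of intensity values per conversation topic.
    groups = [g[au].dropna().values for _, g in df.groupby("topic")]
    stat, p = kruskal(*groups)
    print(f"{au}: H = {stat:.2f}, p = {p:.4f}")
```

The same pattern could be repeated with eye-region landmark coordinates in place of AU intensities, which is the complementary analysis the abstract emphasizes.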
Article
The emotional aspect of divination has rarely been systematically addressed in the history of Chinese divination. Focusing on divination manuals from Dunhuang 敦煌, this article examines the interaction between emotions, especially worry, and divination in medieval China. It explores, on the one hand, how worry in the medieval context drove individuals to seek divinatory practices. On the other hand, it examines how the divination manuals used worry and other emotive words as part of their technical language to measure uncertainty. Surveying different usages of the terms for worry in these manuals, the article argues that the language of worry was part of the terminology of Dunhuang divination, serving as a heuristic to help the inquirers sense the level of auspiciousness and uncertainty of future scenarios. The article will also show that emotional words form a unique layer of terminology that conveys to inquirers a specific sense of auspiciousness.