
Real vs. Fake Emotion Challenge: Learning to Rank Authenticity From Facial Activity Descriptors

Frerk Saxen, Philipp Werner, Ayoub Al-Hamadi
Otto von Guericke University, Magdeburg, Germany
{Frerk.Saxen, Philipp.Werner, Ayoub.Al-Hamadi}@ovgu.de
Abstract
Distinguishing real from fake expressions is an emerging research topic. We propose a new method to rank the authenticity of multiple videos from facial activity descriptors, which won the ChaLearn real vs. fake emotion challenge. Two studies with 22 human observers show that our method outperforms humans by a large margin. Further, they show that our proposed ranking method is superior to direct classification. However, when humans are asked to compare two videos of the same subject and emotion before deciding which is fake and which is real, there is no significant increase in performance compared to classifying each video individually. This suggests that our computer vision model is able to exploit facial attributes that are invisible to humans. The code is available at https://github.com/fsaxen/NIT-ICCV17Challenge.
1. Introduction
Human faces convey important information for social interaction, including expressions of emotions [5]. Spontaneous facial movements are driven by the subcortical extrapyramidal motor system, whereas voluntary facial expressions are controlled by a cortical pyramidal motor system [18,1]. This pyramidal system also allows humans to fake facial expressions. The simulation of emotions and pain is so convincing that most observers are deceived [4,1]. However, computer vision systems have proven able to distinguish deceptive facial expressions from genuine expressions for some tasks. Hoque et al. [8] enabled a computer vision system to distinguish frustrated from delighted smiles, a task at which humans performed much worse. Littlewort et al. [15] and Bartlett et al. [1] presented computer vision approaches that distinguish real from faked facial expressions of pain based on the dynamics of action units.
Humans, however, cannot reliably distinguish between real and faked expressions of pain [7,15,1].

Figure 1. Official challenge results (accuracy) on the test set for the participating teams faceall Xlabs, BNU CIST, TUBITAK UZAY-METU, HCILab, and NIT-OVGU. NIT-OVGU is our proposed method and, together with team HCILab, winner of the real versus fake expressed emotion challenge.

Here, we show
that human observers could discriminate real expressions of
emotions from faked expressions of emotions slightly bet-
ter than chance. However, our computer vision system (also
based on dynamics of action units) achieved significantly
higher accuracy than humans and won the ChaLearn LAP
Real vs. Fake Emotion challenge with 67% accuracy on the
test set (see figure 1).
1.1. Contributions
We propose a new computer vision approach, largely based on existing methods, that distinguishes fake from real emotions. We provide a general model that classifies a single input video and a more specific model that ranks a pair or sequence of videos with respect to their estimated authenticity. The source code to train and validate our models is publicly available online. We conducted a human performance study and compared the results with our computer vision approach.
2. Real vs. Fake Emotion Challenge
This section introduces the ChaLearn Looking At People Real Versus Fake Expressed Emotion Challenge [20], which took place from April 20 until July 2, 2017. In total, 55 teams registered for the challenge, of which 9 teams submitted results.
                                             Labels
Subset       Subjects   Emotions   Videos    Emotion   Subject   Real/Fake
Training     40         6          480       X         X         X
Validation   5          6          60        X
Testing      5          6          60        X

Table 1. SASE-FE database split with labels that were provided to participants during the challenge. Each subject and emotion provides two videos: one authentic (real) and one fake emotional display (see Section 2.1). There is no subject overlap between subsets.
2.1. Dataset
The challenge was run on the newly recorded SASE-FE database. For each of 50 subjects, the challenge database comprises 12 videos: 6 with authentic emotional reactions to video clips and 6 with faked emotional displays. The 6 genuine and 6 acted emotion videos correspond to the 6 basic emotions anger, happiness, sadness, disgust, contempt, and surprise. The videos have been recorded with a high-resolution GoPro Hero camera at 100 frames per second, are about 3-4 seconds long, and show the emotional display starting from and returning to a neutral expression. More details on the SASE-FE database can be found in [17].
The dataset has been split by subject into three subsets as
detailed in Table 1. 80% of the videos form the training set,
for which emotion, subject, and the real-or-fake labels are
given. 10%, which are 60 videos, belong to the validation
set. The test set is the same size. Subject and real-or-fake
labels were not provided with the validation and test set dur-
ing the challenge.
2.2. Task
The challenge task was to classify each video of the test set into real or fake emotion (binary classification). Performance was evaluated with the accuracy measure, i.e., the percentage of correctly classified videos. The challenge was divided into two phases: a validation and a test phase. In the validation phase, 100 evaluations on the validation set were granted (via the submission system on CodaLab.org). The test phase started on 23 June 2017, and the validation labels were published by the challenge organizers, but participants were not allowed to use them for training. In the test phase, 12 evaluations were granted (via the submission system on CodaLab.org) before the organization committee verified the results.
3. Recognition Approach
Figure 2 shows an overview of our method. From a pair of videos we automatically estimate Action Unit intensities (see Section 3.1) and compute facial activity descriptors (see Section 3.2). The descriptors of both videos are jointly classified with a Rank SVM Ensemble (see Section 3.3), which ranks the input videos with respect to authenticity, i.e., the descriptors of both videos are combined and classified to detect the more authentic (real) of the two videos. Source code and trained models are available online at https://github.com/fsaxen/NIT-ICCV17Challenge.

Figure 2. Overview of our method. Two videos of the same subject and emotion are compared by individually calculating the action unit intensities (see Section 3.1) and facial activity descriptors (see Section 3.2). Both descriptors are then passed to a Rank SVM Ensemble (see Section 3.3), which outputs an authenticity score indicating which video is more authentic (real).
3.1. Action Unit Intensity Estimation
As the first step in our recognition pipeline we estimate
the intensity of facial action units (AUs) as described in [24].
For each frame of the video the method applies face de-
tection, facial landmark localization, face registration, LBP
feature extraction, and finally predicts AU intensities with
Support Vector Regression (SVR) ensembles. We apply a
model that was trained on the DISFA dataset [16] to pre-
dict 7 AUs: Inner Brow Raiser (AU 1), Outer Brow Raiser
(AU 2), Brow Lowerer (AU 4), Cheek Raiser (AU 6), Nose
Wrinkler (AU 9), Lip Corner Puller (AU 12), and Lips part
(AU 25).
The face detection and landmark localization that we em-
ploy differ from [24]. The faces are detected through a mul-
tiscale CNN resnet model that comes with dlib and is pub-
licly available online [12]. For landmark localization we
use the method by Kazemi and Sullivan [10] (an ensemble
of regression trees) as implemented in dlib [11], but with our own model that we trained on multiple datasets (Multi-PIE
[6], afw [26], helen [14], ibug, 300-W [19], 300-VW [3],
and lfpw [2]).
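As an illustration, the per-frame detection and landmark step can be sketched with dlib's Python bindings roughly as follows. This is not our training code: the CNN detector file is the one distributed with dlib [12], while the shape predictor file name is a hypothetical placeholder for our own landmark model.

```python
# Minimal sketch of per-frame face detection and landmark localization with dlib.
# "our_landmark_model.dat" is a hypothetical placeholder for the custom model.
import dlib
import cv2

cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
shape_predictor = dlib.shape_predictor("our_landmark_model.dat")  # hypothetical file

def detect_landmarks(frame_bgr):
    """Return (x, y) landmark tuples for the first detected face, or None."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    detections = cnn_detector(rgb, 1)          # upsample once to catch small faces
    if not detections:
        return None
    shape = shape_predictor(rgb, detections[0].rect)  # MMOD detection -> rectangle
    return [(p.x, p.y) for p in shape.parts()]
```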
As in [24], we only use the inner 49 landmarks (excluding the chin line and additional mouth points) for the following steps. Landmarks and texture are registered with an average face through an affine transform by minimizing point distances. Further, we extract uniform local binary pattern (LBP) histogram features in a regular 10 × 10 grid from the aligned texture. Finally, the LBP features and the registered landmarks are standardized and fed into the regression models to predict AU intensities. We use an ensemble of 10 linear SVRs for each AU (see [24] for details).
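A simplified sketch of this feature extraction and regression step is given below, assuming scikit-image and scikit-learn. The 10 × 10 grid follows the text; the LBP radius, the SVR parameters, and the way the 10 regressors are diversified here (random halves of the data) are illustrative stand-ins for the settings of [24].

```python
# Sketch of uniform-LBP grid features and a small linear-SVR ensemble for one AU.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import LinearSVR

def lbp_grid_features(aligned_face, grid=10, p=8, r=1):
    """Uniform LBP histograms from a regular grid over the aligned face texture."""
    lbp = local_binary_pattern(aligned_face, P=p, R=r, method="uniform")
    n_bins = p + 2                      # number of uniform patterns
    h, w = lbp.shape
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            cell = lbp[gy * h // grid:(gy + 1) * h // grid,
                       gx * w // grid:(gx + 1) * w // grid]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins), density=True)
            feats.append(hist)
    return np.concatenate(feats)

def train_svr_ensemble(X, y, n_models=10, seed=0):
    """Ensemble of linear SVRs, each fit on a random half of the training data."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        models.append(LinearSVR(C=1.0, max_iter=10000).fit(X[idx], y[idx]))
    return models

def predict_intensity(models, x):
    """Average the ensemble predictions for one feature vector."""
    return float(np.mean([m.predict(x[None, :])[0] for m in models]))
```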
Initially, we tried an alternative to [24] for estimating the AU intensities using a CNN Resnet-29 architecture, which has been very successful in other application domains. The performance, however, was significantly worse, especially for subtle action units. We believe that small facial details in subregions of the face are hardly captured by state-of-the-art resnet architectures, which were introduced for coarse-grained detection tasks.
3.2. Facial Activity Descriptor
The method described in Section 3.1 yields 7 AU intensity time series per video. We condense these time series, which differ in length, into descriptors as proposed in [21]. Each time series is first smoothed with a Butterworth filter (first order, cutoff 1 Hz). Second, we calculate the first and second derivative of the smoothed signal. In contrast to [21], we also smooth the two derivative time series to decrease the influence of high variation in the AU intensity estimation. Third, we extract 17 statistics from each of the 3 smoothed time series per AU, among others: mean, maximum, standard deviation, time of the maximum value, and the duration in which the time series values are above their mean. Compared to [21], which proposed 16 statistics, we added the difference between the time of maximum AU intensity and the time at which the mean AU intensity value is first crossed. This was done to provide more time-related information to the classifier, because we believe that timing is crucial to distinguish between fake and real emotions. Further, we squared some selected statistic values and added them as additional features. This allows the model to capture some nonlinear effects without losing the benefits of the linear SVM and without increasing the feature dimensionality too much. Since we chose to learn a common model for all emotions, we decided to include the emotion category in the feature space by adding a 6-dimensional one-hot coding of the emotion. In total, this yields a 440-dimensional feature space. In the following sections, we refer to the "std. descriptors" from [21] and to the described changes as "add. descriptors" or simply "+".
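The descriptor computation can be sketched as follows, assuming SciPy/NumPy and the 100 fps frame rate of the SASE-FE videos. Only a subset of the 17 statistics is shown, and the choice of which statistics are squared is illustrative.

```python
# Sketch of the facial activity descriptor for one video: smooth each AU time
# series, add smoothed derivatives, and pool statistics per series.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100.0                                        # SASE-FE videos run at 100 fps
B, A = butter(N=1, Wn=1.0, fs=FS, btype="low")    # 1st-order Butterworth, 1 Hz cutoff

def series_stats(x):
    """A subset of the per-series statistics described in the text."""
    t = np.arange(len(x)) / FS
    above = x > x.mean()
    first_cross = t[above][0] if above.any() else 0.0
    t_max = t[np.argmax(x)]
    return [x.mean(), x.max(), x.std(), t_max,
            above.mean(),               # fraction of time spent above the mean
            t_max - first_cross]        # added time feature (see text)

def activity_descriptor(au_series, emotion_id, n_emotions=6):
    feats = []
    for au in au_series:                        # 7 AU intensity time series
        s = filtfilt(B, A, au)
        d1 = filtfilt(B, A, np.gradient(s))     # smoothed 1st derivative
        d2 = filtfilt(B, A, np.gradient(d1))    # smoothed 2nd derivative
        for sig in (s, d1, d2):
            feats.extend(series_stats(sig))
    feats = np.asarray(feats)
    squared = feats[:10] ** 2                   # squaring selected statistics (illustrative)
    one_hot = np.eye(n_emotions)[emotion_id]    # emotion category as one-hot code
    return np.concatenate([feats, squared, one_hot])
```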
3.3. Classification
We follow the idea of comparative learning [22,23]: it is
easier to decide based on comparison with a similar refer-
ence than to decide individually. In the context of this chal-
lenge we believe that it is easier to select the real and the
fake emotion by comparing a set of two videos rather than
classifying each video individually. For this purpose we in-
troduce a virtual authenticity scale in which a real emotion
has a greater value than a fake emotion. We compare videos
of the same emotion and subject, since they are very similar and only differ regarding the aspect of interest (whether they are real or fake).

Figure 3. 10-fold cross-validation results (accuracy) on the training set for different classifier setups and descriptors (One SVM per Emotion, SVM, Rank SVM, Rank SVM Ensemble, Rank SVM Ensemble + add. descriptors). See Section 4.1 for details.
We train a variant of the SVM which predicts pairwise
rankings and is called Rank SVM [9]. We use a common
model for all emotions, since this performed better than us-
ing individual models for each emotion, probably due to the
difference in training sample counts per model (480 for gen-
eral model vs. 80 for an emotion-specific model). Further,
a linear SVM performed better than an SVM with an RBF kernel, probably because the latter overfits the limited amount of training data. Instead of a single Rank SVM, we train an ensemble of n = 75 Rank SVMs, each with a randomly selected subset of the training sample pairs (m = 50% of samples). We investigated the number of ensemble members n and the ratio of samples per model m, but obtained very similar results for n > 50 and m > 0.3. Ensemble model predictions are aggregated by counting the votes for a video to be more authentic. The decisions for multiple pairs are fused by averaging the vote counts. This way, a ranking can be established for more than two videos (e.g., if subject or emotion labels are erroneous). The ranking is transformed into real/fake labels by thresholding the authenticity scores at their median value. If there is only one sample, ranking cannot be applied. For this case, we also train a fallback standard SVM that predicts real/fake labels directly from the feature vector, which is less accurate than the ranking model.
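As a sketch, a Rank SVM ensemble of this kind can be approximated with linear SVMs trained on pairwise difference vectors, a common reduction of the pairwise ranking problem [9]. The classifier settings below are illustrative, not the ones used in our released code.

```python
# Simplified Rank SVM ensemble: each ranker is a linear SVM trained on feature
# differences of (real, fake) pairs; votes over pairs yield authenticity scores.
import numpy as np
from sklearn.svm import LinearSVC

def train_rank_ensemble(pairs, n_models=75, frac=0.5, seed=0):
    """pairs: list of (x_real, x_fake) descriptor tuples from the training set."""
    # Symmetric difference samples: label +1 if the first video is the real one.
    X = np.array([a - b for a, b in pairs] + [b - a for a, b in pairs])
    y = np.array([1] * len(pairs) + [-1] * len(pairs))
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        models.append(LinearSVC(C=1.0, max_iter=20000).fit(X[idx], y[idx]))
    return models

def authenticity_votes(models, x_a, x_b):
    """Fraction of ensemble members voting that video A is more authentic than B."""
    preds = np.array([m.predict((x_a - x_b)[None, :])[0] for m in models])
    return float(np.mean(preds == 1))

def label_by_median(scores):
    """Threshold authenticity scores at their median to obtain real/fake labels."""
    med = np.median(scores)
    return ["real" if s >= med else "fake" for s in scores]
```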
Since subject labels are not available for validation and
test set, we apply face recognition to automatically partition
the videos by subject and find the pairs of videos for rank-
ing. The face recognition model comes with dlib [13] and
performs deep metric learning with a CNN resnet architec-
ture. On the test and validation set it runs without error in
subject assignment.
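A minimal sketch of this subject partitioning, assuming dlib's publicly distributed face recognition model and the distance threshold commonly used with it, is shown below; the grouping logic is simplified compared to our implementation.

```python
# Sketch of automatic subject partitioning: one 128-D dlib face embedding per
# video (from a single frame here, for brevity), then greedy grouping by
# embedding distance. The 0.6 threshold is the value commonly used with this
# dlib model, not a parameter tuned in our experiments.
import numpy as np
import dlib

face_rec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def video_embedding(frame_rgb, landmarks):
    """landmarks: dlib full_object_detection from the shape predictor above."""
    return np.array(face_rec.compute_face_descriptor(frame_rgb, landmarks))

def group_by_subject(embeddings, threshold=0.6):
    """Greedy clustering of video embeddings into subject groups."""
    groups = []          # list of (representative embedding, [video indices])
    for i, e in enumerate(embeddings):
        for g in groups:
            if np.linalg.norm(g[0] - e) < threshold:
                g[1].append(i)
                break
        else:
            groups.append((e, [i]))
    return [g[1] for g in groups]
```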
4. Experiments
We conduct several experiments to gain insights into fake vs. real emotion classification. Section 4.1 discusses the influence of several choices in classification and feature extraction. Section 4.2 compares our method with human performance on the validation set.
4.1. Our Method
Figure 3 shows a sequence of improvements for several key changes in our model. We report the results obtained through 10-fold leave-subjects-out cross-validation, i.e., samples from the same subject do not appear in a training and test fold simultaneously. Cross-validation is preferred over the validation set because it yields a better estimate of the generalization performance, as it uses significantly more samples for prediction (see Table 1). The cross-validation provided much more stable results than the validation set estimate.
First, we trained one SVM per emotion with the std.
descriptors from [21] and obtained 57% accuracy. Train-
ing a common model for all emotions (“SVM” in figure 3)
increased the accuracy to 61%, probably due to the very
limited number of training samples per emotion. We ex-
pect the individual model to outperform the common model
for significantly larger training sets. We also trained a
nonlinear RBF SVM (including parameter selection) and
gained worse performance compared to linear SVM. The in-
creased model complexity suffered from the limited amount
of training data and resulted in an overfitted model.
Second, we trained a Rank SVM [9] to compare pairs of videos and gained a significant boost in classification performance (67% accuracy). This improvement has a downside. The ordinary SVM can classify a single video, whereas the Rank SVM (1) needs two videos of the same subject and emotion and (2) only indicates which of the two is more authentic. We compensate for (1) by providing a fallback to the original SVM if no video pair is available. For (2), to obtain real/fake labels from the authenticity ranking, we assume that both categories occur equally often in the test data, which holds for the challenge data. The performance benefit of ranking shows the importance of subject adaptation.
Third, we improved the classification performance by
training an ensemble of Rank SVMs (70% accuracy). Each
model is trained with a random subset of the training set.
Finally, we included additional descriptors (see sec-
tion 3.2) and increased the cross-validation performance to 73% accuracy. This improvement is mainly caused by the additional time features and the one-hot coding of the emotion, which allows the model to learn emotion-specific representations.
4.2. Comparison With Human Performance
To compare our computer vision system with human
observers regarding their ability to discriminate real ver-
sus faked emotional expressions, we conducted two experiments, each with 22 participants. We compare the human performance with our SVM approach (common model for all emotions, see Section 4.1) and with our proposed Rank SVM Ensemble + add. descriptors, both trained on the full training set.

Figure 4. Human vs. computer vision approach on the validation set for different emotions (conditions: Human (Exp. 1), SVM, Human Rank (Exp. 2), Rank SVM Ensemble+). The dotted black line shows the average across all emotions. The computer vision models are trained on the training set. See Section 4.2 for details.

To analyze statistical significance, we
the full training set. To analyze statistical significance we
conducted Student’s t-tests. We report the test decision and
statistics for the null hypothesis that the detection accura-
cies of the human observers comes from a normal distri-
bution with mean accuracy µand unknown variance. The
alternative hypothesis is that the mean is not µ. The result
is significant if the test rejects the null hypothesis at the 1%
significance level, and not significant otherwise.
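Such a one-sample t-test can be reproduced with SciPy as sketched below; the accuracy values in the example are hypothetical placeholders, not our measured data.

```python
# One-sample t-test of observer accuracies against a fixed mean (e.g., chance
# level 0.5 or a model's accuracy). The accuracies below are placeholders.
import numpy as np
from scipy import stats

observer_acc = np.array([0.52, 0.55, 0.58, 0.50, 0.57])   # hypothetical per-observer accuracies

def test_against(mu, acc, alpha=0.01):
    """Test H0: acc comes from a normal distribution with mean mu."""
    t, p = stats.ttest_1samp(acc, popmean=mu)
    return {"t": float(t), "p": float(p), "significant": p < alpha}

print(test_against(0.50, observer_acc))   # vs. chance accuracy
```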
In experiment 1, “Human (Exp. 1)”, we showed each
participant one validation set video at a time (all 60 clips
in the already existing randomized order). The observers
judged whether the expression shown in the video clip was
real or faked before continuing with the next clip. The ob-
servers distinguished real emotions from faked emotions at
rates slightly greater than guessing (accuracy = 54.5%; SD = 4.97; chance accuracy µ = 50%; t[21] = 4.22, p < 0.01). We compare the human performance to our SVM approach because both classify each video individually. The SVM performs significantly better than humans (accuracy µ = 61.3%; t[21] = 6.79, p < 0.01). Figure 4 shows the average performance of the observers and the computer vision system for the validation set (for each emotion and averaged across all emotions).
Experiment 2, “Human Rank (Exp. 2)”, examined
whether the comparison between a pair of video clips im-
proves the human performance. This ranking procedure is
similar to our Rank SVM approach. We showed each partic-
ipant two validation set videos at a time, both from the same
subject and emotion (thus, one real and one fake emotion).
The observers judged which of the two video clips appeared
more authentic before continuing with the next pair of clips.
Figure 5. Average classification performance of humans (Exp. 1 and Exp. 2) and computer vision models (SVM, Rank SVM Ensemble+) for the different subjects in the validation set. See Section 4.2 for details.

In this second experiment, the observers distinguished real emotions from faked emotions at rates slightly greater than guessing (accuracy = 55.8%; SD = 8.13; chance accuracy µ = 50%; t[21] = 3.37, p < 0.01). Our proposed Rank SVM Ensemble, however, performs significantly better than humans (accuracy µ = 73.3%; t[21] = 10.1, p < 0.01), also see Figure 4.
Our Rank SVM Ensemble outperforms the SVM approach significantly (t[9] = 40.9, p < 0.01). This seems to support the hypothesis that it is easier to compare two video clips than to classify each clip individually. For humans, however, this hypothesis is not supported by experiment 2 (Human Rank, Exp. 2), because there is no significant difference between the performance in experiment 1 and experiment 2 (t[21] = 0.72, not significant). This suggests that our computer vision approach is able to exploit fine-grained details that are inaccessible to humans. This result is consistent with prior research on the detection of pain expressions [15,1] and the classification of frustrated and delighted smiles [8].
Figure 4 shows superior classification of happiness, sadness, and surprise for the Rank SVM Ensemble. This might suggest that individual models for disgust and contempt could further improve the classification performance. However, we do not observe such high variance between emotions during cross-validation on the training set. Thus, we believe that this is a random effect caused by the low sample count of the validation set (only 5 pairs of videos per emotion).
Figure 5 shows the classification performance of humans and our computer vision models with respect to each individual subject in the validation set. It shows that the performance of the Rank SVM Ensemble varies significantly across subjects. Although it is reasonable that some subjects show facial expressions that are easier to classify, each subject only provides 6 pairs of videos, which causes large jumps in accuracy (about 17% per video pair) if one pair is classified differently. Thus, the variance might be a random effect caused by the small validation set.
4.3. Challenge Results on Test Set
Figure 1 shows the results of our proposed Rank SVM Ensemble method (NIT-OVGU) on the test set along with the official results of the other participants of the challenge (also see [20]). Like the validation set, the test set is very small. As a consequence, we experienced high variance in the performance of models that had previously performed very similarly on the cross-validated training set. We believe that a larger test set is necessary to properly distinguish between the top-performing methods.
5. Conclusion
We propose a state-of-the-art computer vision approach that distinguishes videos of real emotions from videos of fake emotions. Although our methods are old-fashioned in terms of feature extraction, our approach was able to win the ChaLearn real vs. fake expressed emotion challenge. Nevertheless, we believe that there is plenty of room for improvement, especially in the automatic estimation of action unit intensities, e.g., based on recent research with deep transfer learning [25].
We initially assumed that it is easier for humans to compare two videos from the same subject and emotion and decide which is more authentic than to classify each video individually. Our findings do not support this assumption. However, the accuracy of our computer vision approach improved significantly when estimating the authenticity of two videos from the same subject and emotion compared to classifying each video individually. This is particularly interesting because it shows that real and fake expressions are subject-dependent, but the differences between them have subject-independent attributes that can be learned. Humans, however, are not capable of exploiting these attributes.
Automatically distinguishing real from fake emotions remains a challenging research topic. We believe that more training and evaluation data will be necessary, since we observed high variance among very similar models on the small validation and test sets. This also indicates how challenging the classification task is. More training data would also allow training individual models for each emotion.
Acknowledgments
This work is part of the project “Ergonomics Assistance
Systems for Contactless Human-Machine-Operation” (no.
03ZZ0443G) and is funded by the Federal Ministry of Edu-
cation and Research (BMBF) within Zwanzig20 - Alliance
3D Sensation. We thank all 22 participants who volunteered
in our study.
References
[1] M. Bartlett, G. Littlewort, M. Frank, and K. Lee. Automatic decoding of facial movements reveals deceptive pain expressions. Current Biology, 24(7):738–743, 2014.
[2] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In CVPR 2011, pages 545–552, 2011.
[3] G. G. Chrysos, E. Antonakos, S. Zafeiriou, and P. Snape. Offline deformable face tracking in arbitrary videos. In 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pages 954–962, 2015.
[4] B. M. DePaulo, S. E. Kirkendol, D. A. Kashy, M. M. Wyer, and J. A. Epstein. Lying in everyday life. Journal of Personality and Social Psychology, 70(5):979–995, 1996.
[5] P. Ekman. The argument and evidence about universals in facial expressions of emotion. In Handbook of Social Psychophysiology, pages 143–164. John Wiley & Sons, Ltd., 1989.
[6] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
[7] M. L. Hill and K. D. Craig. Detecting deception in pain expressions: the structure of genuine and deceptive facial displays. Pain, 98(1):135–144, 2002.
[8] M. E. Hoque, D. J. McDuff, and R. W. Picard. Exploring temporal patterns in classifying frustrated and delighted smiles. IEEE Transactions on Affective Computing, 3(3):323–334, 2012.
[9] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2002.
[10] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, 2014.
[11] D. King. Real-time face pose estimation, 2014. http://blog.dlib.net/2014/08/real-time-face-pose-estimation.html.
[12] D. King. Easily create high quality object detectors with deep learning, 2016. http://blog.dlib.net/2016/10/easily-create-high-quality-object.html.
[13] D. King. High quality face recognition with deep metric learning, 2017. http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html.
[14] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In Computer Vision - ECCV 2012, Lecture Notes in Computer Science, pages 679–692. Springer, Berlin, Heidelberg, 2012.
[15] G. C. Littlewort, M. S. Bartlett, and K. Lee. Automatic coding of facial expressions displayed during posed and genuine pain. Image and Vision Computing, 27(12):1797–1803, 2009.
[16] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn. DISFA: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing, 4(2):151–160, 2013.
[17] I. Ofodile, K. Kulkarni, C. A. Corneanu, S. Escalera, X. Baro, S. Hyniewska, J. Allik, and G. Anbarjafari. Automatic recognition of deceptive facial expressions of emotion. arXiv preprint arXiv:1707.04061 [cs], 2017.
[18] W. E. Rinn. The neuropsychology of facial expression: A review of the neurological and psychological mechanisms for producing facial expressions. Psychological Bulletin, 95(1):52–77, 1984.
[19] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: database and results. Image and Vision Computing, 47:3–18, 2016.
[20] J. Wan, S. Escalera, X. Baro, H. J. Escalante, I. Guyon, M. Madadi, J. Allik, J. Gorbova, and G. Anbarjafari. Results and analysis of ChaLearn LAP multi-modal isolated and continuous gesture recognition, and real versus fake expressed emotions challenges. In ChaLearn LaP, Action, Gesture, and Emotion Recognition Workshop and Competitions: Large Scale Multimodal Gesture Recognition and Real versus Fake Expressed Emotions, ICCVW, 2017.
[21] P. Werner, A. Al-Hamadi, K. Limbrecht-Ecklundt, S. Walter, S. Gruss, and H. Traue. Automatic pain assessment with facial activity descriptors. IEEE Transactions on Affective Computing, PP(99):1–1, 2016.
[22] P. Werner, A. Al-Hamadi, and R. Niese. Pain recognition and intensity rating based on comparative learning. In 2012 19th IEEE International Conference on Image Processing (ICIP), pages 2313–2316, 2012.
[23] P. Werner, A. Al-Hamadi, and R. Niese. Comparative learning applied to intensity rating of facial expressions of pain. International Journal of Pattern Recognition and Artificial Intelligence, 28(5):1451008, 2014.
[24] P. Werner, F. Saxen, and A. Al-Hamadi. Handling data imbalance in automatic facial action intensity estimation. In X. Xie, M. W. Jones, and G. K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 124.1–124.12. BMVA Press, 2015.
[25] Y. Zhou, J. Pi, and B. E. Shi. Pose-independent facial action unit intensity regression based on multi-task deep transfer learning. In 2017 12th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017), pages 872–877, 2017.
[26] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886, 2012.
Computer Vision has recently witnessed great research advance towards automatic facial points detection. Numerous methodologies have been proposed during the last few years that achieve accurate and efficient performance. However, fair comparison between these methodologies is infeasible mainly due to two issues. (a) Most existing databases, captured under both constrained and unconstrained (in-the-wild) conditions have been annotated using different mark-ups and, in most cases, the accuracy of the annotations is low. (b) Most published works report experimental results using different training/testing sets, different error metrics and, of course, landmark points with semantically different locations. In this paper, we aim to overcome the aforementioned problems by (a) proposing a semi-automatic annotation technique that was employed to re-annotate most existing facial databases under a unified protocol, and (b) presenting the 300 Faces In-The-Wild Challenge (300-W), the first facial landmark localization challenge that was organized twice, in 2013 and 2015. To the best of our knowledge, this is the first effort towards a unified annotation scheme of massive databases and a fair experimental comparison of existing facial landmark localization systems. The images and annotations of the new testing database that was used in the 300-W challenge are available from http://ibug.doc.ic.ac.uk/resources/facial-point-annotations/