Facial Action Unit Recognition in the Wild with Multi-Task CNN Self-Training
for the EmotioNet Challenge
Philipp Werner, Frerk Saxen, and Ayoub Al-Hamadi
Neuro-Information Technology Group, Otto von Guericke University Magdeburg, Germany
{Philipp.Werner, Frerk.Saxen, Ayoub.Al-Hamadi}@ovgu.de
Abstract
Automatic understanding of facial behavior is hampered
by factors such as occlusion, illumination, non-frontal head
pose, low image resolution, or limitations in labeled train-
ing data. The EmotioNet 2020 Challenge addresses these
issues through a competition on recognizing facial action
units on in-the-wild data. We propose to combine multi-task
and self-training to make best use of the small manually /
fully labeled and the large weakly / partially labeled train-
ing datasets provided by the challenge organizers. With our
approach (and without using additional data) we achieve
the second place in the 2020 challenge – with a perfor-
mance gap of only 0.05% to the challenge winner and of
5.9% to the third place. On the 2018 challenge evaluation
data our method outperforms all other known results.
1. Introduction
The challenge was run on the EmotioNet database [2],
which comprises (1) a training set of about 944k samples,
which were automatically labeled with 12 facial Action
Units (AUs), (2) an optimization set (opt set) of about 25k
samples, which were manually labeled with 23 AUs – the
same AUs that appear in the test set (listed in Section 3) –,
and (3) a validation and a test set of about 107k and 218k
images respectively, which were manually labeled with the
23 AUs and used to evaluate the approaches of the challenge
participants. Each participant was allowed five submissions on the validation set and one submission on the test set. The performance measure used, called the final ranking score, is the mean of accuracy and F1-score.
Our approach for recognizing AUs involves two ideas
that are novel in this context: (1) Multi-task learning,
which here means using two output neurons per AU, one
for each of the training subsets. Even if labels of two training subsets have the same intended meaning, like a specific
AU, there may be labeling biases or differences in labeling
quality, especially if some data have been labeled by an al-
gorithm. Using multi-task learning may help to better cope
with these issues and still benefit from all available data. (2)
Self-training [9] means that a teacher model is trained on a
labeled dataset and used to predict pseudo-labels on a larger
unlabeled (or in our case weakly / partially labeled) dataset.
Afterwards, a student model is trained using both datasets
(with manual labels and pseudo-labels). Introducing noise into the training of the student model (e.g. by data augmentation and dropout) helps the student learn beyond the teacher's knowledge [9].

This work was funded by the German Federal Ministry of Education and Research (BMBF), projects HuBA (03ZZ0470) and Easy Cohmo (03ZZ0443G). The sole responsibility for the content lies with the authors.

Figure 1. Example images of the EmotioNet dataset. Left: Alignment for expression AUs. Right: Alignment for pose AUs.
2. Methods
Preprocessing: We use the face detection, landmark lo-
calization, and head pose estimation of OpenFace [1] (fol-
lowing suggestions of [8]). To reduce the number of faces
not detected (for which we output AU absence in the chal-
lenge evaluation), we additionally run RetinaFace [4] and
the landmark localization of Bulat and Tzimiropoulos [3]
on the images for which OpenFace failed. We then apply
the OpenFace face registration approach, which is based on
a stable subset of 68 landmarks, without masking out con-
text. We use two different “zooms” as illustrated in Fig. 1:
The one with more facial details is used for the expression
AUs; the other, with more context, is used for the head pose AUs (which are defined relative to the body, not the camera view). Both have a resolution of 240×240 but are fed into two distinct CNN models.
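The detection fallback can be summarized as in the following sketch. The three callables are hypothetical wrappers around OpenFace, RetinaFace [4], and the landmark model of [3]; their names and signatures are assumptions, not real APIs of those tools.

```python
# Sketch of the detector fallback cascade described above. All wrapper functions
# passed in are hypothetical placeholders for OpenFace, RetinaFace and the
# landmark localization of Bulat and Tzimiropoulos, respectively.
def locate_face(image, openface_detect, retinaface_detect, fallback_landmarks):
    """Return landmarks (and head pose, if available) or None if no face is found."""
    result = openface_detect(image)          # primary: OpenFace detection + landmarks + pose
    if result is not None:
        return result
    box = retinaface_detect(image)           # fallback face detector
    if box is None:
        return None                          # no face found: AU absence is output in the challenge
    return fallback_landmarks(image, box)    # fallback landmark localization
```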
As we will see in the experiments, the manually labeled
optimization dataset is a better source for supervised learn-
ing than the larger but automatically (weakly) and incom-
pletely labeled training set. Thus, and to get more validation
attempts for selecting and tuning models, we split the opt
set: 20% of the images are used as a validation set (called
opt-val set) and the remaining 80% are used as training data
(called opt-train set).
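A minimal sketch of this 80/20 split; the use of scikit-learn and a fixed random seed are assumptions, since the paper does not state how the split was drawn.

```python
# Hypothetical split of the opt set into opt-train (80%) and opt-val (20%).
from sklearn.model_selection import train_test_split

def split_opt_set(opt_samples, seed=0):
    """opt_samples: list of (image_path, label_vector) pairs from the opt set."""
    opt_train, opt_val = train_test_split(opt_samples, test_size=0.2, random_state=seed)
    return opt_train, opt_val
```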
Convolutional Neural Network (CNN) Architectures:
We use three architectures: (1) a self-designed straightforward CNN, which we call OwnNet, (2) MobileNetV3 (large) [5], and (3) EfficientNet-B0 [6]. With all networks we use an input resolution of 224×224×3. OwnNet-w has a variable width factor w and seven 2D convolution layers (Conv), each followed by batch normalization and ReLU. The first Conv has 4w output channels. After each Conv up to the fifth, we apply 3×3 max-pooling with stride 2 and double the output channels of the next Conv. The last two Conv layers have 128w output channels and are followed by global average pooling. In all networks we use a dropout of 0.5 in
front of the final dense output layer, which is activated with
the sigmoid function. For the head pose networks (AU 51-
56) we additionally feed the three head orientation angles
from OpenFace (or the mean pose to fill in missing values)
into a dense layer (1024 neurons, ReLU) and concatenate
its outputs with the CNN outputs before the final dense
layer. The MobileNetV3 and EfficientNet-B0 models are fine-tuned starting from the pretrained models provided in the TensorFlow SLIM and TPU repositories, respectively. The
OwnNet models are trained from scratch using the default
Xavier initialization.
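To make the OwnNet-w description concrete, the following tf.keras sketch reconstructs the architecture from the text; details the paper does not specify (convolution kernel size, padding, bias usage) are assumptions.

```python
# Sketch of OwnNet-w reconstructed from the textual description above;
# 3x3 conv kernels, "same" padding and bias-free convs are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def own_net(width=8, num_outputs=17, use_pose_angles=False):
    image = layers.Input(shape=(224, 224, 3), name="image")
    x = image
    channels = 4 * width                       # first Conv has 4w output channels
    for i in range(7):                         # seven Conv layers, each with BN and ReLU
        x = layers.Conv2D(channels, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        if i < 5:                              # 3x3 max-pool (stride 2) after the first five Convs
            x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
            channels *= 2                      # double the channels of the next Conv
    x = layers.GlobalAveragePooling2D()(x)     # last two Convs (128w channels) end in global pooling

    inputs = [image]
    if use_pose_angles:                        # head pose variant (AU 51-56)
        angles = layers.Input(shape=(3,), name="head_pose_angles")
        inputs.append(angles)
        x = layers.Concatenate()([x, layers.Dense(1024, activation="relu")(angles)])

    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_outputs, activation="sigmoid")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```

With multi-task training as described below, num_outputs would be doubled, giving one output per AU and dataset.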
Training: We only train multi-label CNNs, basically with
17 AU labels / output neurons for the expression CNNs and
6 for the pose CNNs. However, in what we call multi-task learning in the following, we have one output per AU and dataset, e.g., one output for AU 1 of the opt-train set and one for AU 1 of the training set if we train with both datasets. For prediction we always use the outputs trained with the opt-train set, which is labeled more accurately; following the multi-task idea, however, performance can still benefit from adding the huge 944k-sample training set, because it helps to learn better features. With a batch of N samples and a
to learn better features. With a batch of Nsamples and a
CNN with Moutputs the loss is calculated as:
L(y, ˆy) =
N
X
n=1
M
X
m=1
λm·wm(yn,m)·l(yn,m ,ˆyn,m),(1)
with ybeing the target label, ˆythe prediction, λma label-
specific weight, and l(y, ˆy)the binary cross-entropy. The
λm-values are tuned to adjust the training speed of the dif-
ferent AUs in order to avoid that some AUs are already
overfitting while others are still underfitted. For each la-
bel there is a class-dependent weighting function wm(y),
which zeros the loss for missing labels (unknown class) and
reduces the negative impact of the class imbalance, which
is common in AU recognition [7]. For this purpose, wm(y)
weights down the majority class samples and weights up the
minority class samples following the imbalancing damp-
ing idea of [7] (with α= 0.5for the expression AUs and
α= 0.7for the pose AUs). All weights are normalized to
not increase the average gradient length. The loss is opti-
mized with stochastic gradient descent. We use a batch size
of 16 and assemble the batches by equally sampling from
the datasets used for training. We apply early stopping with
a fixed number of epochs (OwnNet 500k, MobileNetV3
150k, EfficientNet-B0 300k) and start with a learning rate
of 0.1 (expression CNN) / 0.01 (pose CNN), reducing it by a
factor of 0.33 after half and three quarters of the iterations.
For data augmentation we use random cropping, horizon-
tal flipping, brightness and contrast adjustments, cutout, as
well as occasional downscaling and grayscale conversion.
Additionally, label smoothing (0.2) and weight decay (4e-
6) are used.
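As an illustration of Eq. (1), the sketch below implements a masked, label- and class-weighted binary cross-entropy in TensorFlow. Encoding missing labels as -1 and passing precomputed per-class weights are assumptions, and the exact imbalance damping of [7] is not reproduced here.

```python
# Sketch of the loss in Eq. (1): label-weighted, class-weighted, masked binary
# cross-entropy. Assumptions: missing labels are encoded as -1, and pos_w / neg_w
# are per-output class weights precomputed following the damping idea of [7].
import tensorflow as tf

def multi_label_au_loss(y_true, y_pred, lambda_m, pos_w, neg_w, eps=1e-7):
    """y_true: (N, M) in {1, 0, -1}; y_pred: (N, M) sigmoid scores; weights: shape (M,)."""
    y_true = tf.cast(y_true, tf.float32)
    known = tf.cast(y_true >= 0.0, tf.float32)          # w_m(y) = 0 for missing labels
    y = tf.clip_by_value(y_true, 0.0, 1.0)
    w = known * (y * pos_w + (1.0 - y) * neg_w)         # class-dependent weighting w_m(y)
    bce = -(y * tf.math.log(y_pred + eps) + (1.0 - y) * tf.math.log(1.0 - y_pred + eps))
    return tf.reduce_sum(lambda_m * w * bce)            # sum over samples n and outputs m
```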
Self-Training: Inspired by [9] we use self-training to ben-
efit from the large 944k weakly / partly labeled training set:
We first train a model on the fully labeled opt-train set and
apply it to predict pseudo-labels on the training set. Af-
terwards we train a second model using both the opt-train
set (with manual labels) and the training set (with pseudo-
labels). This way, the first model acts as a teacher and the
second as a student. In a second iteration, the student model
can be used to update the pseudo-labels and train a second
generation student model.
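The overall procedure can be summarized by the following sketch; build_model, train_model and predict_labels are hypothetical placeholders for the training and inference code, not functions from any released implementation.

```python
# High-level sketch of the self-training loop described above; all callables are
# hypothetical placeholders. The teacher shown here is trained on manual labels
# only (the T.0/T.1 variant); later teachers also used the weakly labeled set.
def self_training(opt_train, training_images, build_model, train_model, predict_labels,
                  generations=2):
    model = train_model(build_model(), datasets=[opt_train])         # teacher on manual labels
    for _ in range(generations):
        pseudo = predict_labels(model, training_images)               # pseudo-label the 944k images
        student = train_model(build_model(),                          # noisy student: augmentation,
                              datasets=[opt_train, pseudo])           # dropout, multi-task outputs
        model = student                                               # student becomes the next teacher
    return model
```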
Ensemble: To improve results further we combine the
predictions of several well-performing student models in
heterogeneous ensembles. The models differ regarding the
pseudo-labels used for training and the CNN architectures.
We fuse the predictions by calculating the mean of the mod-
els’ output scores before rounding the resulting scores for
the final decision.
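This score-level fusion amounts to averaging the members' sigmoid outputs and thresholding at 0.5, as in the short sketch below.

```python
# Minimal sketch of the ensemble fusion: mean of the members' output scores,
# then rounding (i.e. thresholding at 0.5) for the final binary AU decisions.
import numpy as np

def fuse_predictions(member_scores):
    """member_scores: list of arrays of shape (num_samples, num_aus) with values in [0, 1]."""
    mean_scores = np.mean(np.stack(member_scores, axis=0), axis=0)
    return (mean_scores >= 0.5).astype(np.int32)
```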
3. Experiments
Image Alignment: To analyze the impact of more de-
tails vs more context, we trained one common OwnNet8
model for all 23 AUs with the close-up view alignment and
one with the more-context view (left and right in Fig. 1).
Figure 2. Per-AU final ranking scores on the opt-val set: OpenFace-Baseline and OwnNet8-Baseline (see text), the best teacher model T.1 (trained only on the optimization set), the best student model A.2, and the fusion model A+B (mean score of all A and B models).
#      Model                                                 Score
T.0    OwnNet3 teacher (opt-train only)                      0.7531
-        OwnNet8 student                                     0.7647
-        EfficientNet-B0 student                             0.7661
T.1    OwnNet8 teacher (opt-train only)                      0.7637
B.1      MobileNetV3 student                                 0.7680
A.1      EfficientNet-B0 student                             0.7699
-          EfficientNet-B0 student                           0.7694
-          MobileNetV3 student                               0.7671
-          OwnNet8 student                                   0.7666
-      MobileNetV3 teacher (opt-train only)                  0.7518
T.2    MobileNetV3 teacher (opt-train and training set)      0.7603
-        MobileNetV3 student                                 0.7623
-        EfficientNet-B0 student                             0.7666
B.2      EfficientNet-B0 student                             0.7674
A.2      OwnNet8 student                                     0.7706
-        OwnNet8 student (no multi-task)                     0.7634
-      EfficientNet-B0 teacher (opt-train only)              0.7602
T.3    EfficientNet-B0 teacher (opt-train and training set)  0.7609
A.3      OwnNet8 student                                     0.7684
B.3      MobileNetV3 student                                 0.7680
-      Fusion A (mean score of A.1, A.2, A.3)                0.7767
-      Fusion B (mean score of B.1, B.2, B.3)                0.7754
-      Fusion A+B (mean score of all A and B)                0.7800
Table 1. Final ranking scores on the opt-val set. Indentation shows the teacher-student relation. T.* are the identifiers of the teacher models, A.* of the best student models in each category, and B.* of the second-best student models. All student models have been trained on the opt-train set (with manual labels) and the training set (with pseudo-labels generated by the teacher model) with multi-task learning (unless denoted otherwise).
The expression AUs performed better with the close-up view (mean: 0.784 vs. 0.772) and the head pose AUs with the more-context view (mean: 0.554 vs. 0.528). Thus, in the following we trained two CNNs as mentioned in Sec. 2: one with close-up view images for the expression AUs and one with more-context view images for the pose AUs.
Multi-Task Self-Training: Table 1 shows validation re-
sults obtained on the opt-val set. Some early teacher models
(OwnNet T.0 and T.1) were trained on the opt-train set only,
without using the 944k samples of the official training set.
After using multi-task learning for the student models, we
also trained teacher models with multi-task learning (using
the opt-train set and the training set with the labels provided
by [2]). These performed better than the respective teacher
models trained on only the opt-train set (compare T.2 and
T.3 with the respective line above). All student models out-
perform their respective teacher models, except the second
generation student models learning from the pseudo-labels
provided by the first generation student A.1. So the self-
training generally improves the results at least for the first
iteration. Comparing A.2 with the row below, which has
been trained without multi-task using the same output neu-
rons for the pseudo-labels of the 944k training set and the
manually labeled opt-train set, we see that multi-task learn-
ing is beneficial in combination with self-training, as the
pseudo-labels are still less accurate than the manual labels.
Ensemble: The last three rows of Table 1 list the results
of combining the outputs of several models. All individual
models are outperformed by the three tested ensembles. The
fusion of all A and B models performs best.
Per-AU comparison: The challenge task was to recog-
nize 23 Action Units (AUs). Fig. 2 shows the per-AU
results of several models, including two baselines: The
OpenFace-Baseline tests the expression AU output as pro-
vided by OpenFace [1]. The pose AUs (51-56) were pre-
dicted with an RBF-SVM trained on the head orientation
angles provided by OpenFace. The OwnNet8-Baseline is
similar to T.1, but trained with the 944k training set and the
labels automatically created by [2] (instead of the smaller
opt-train set with manual labels). The comparison (1) of
OpenFace with the others shows the benefit of our CNN
approach compared to OpenFace’s classical approach con-
sisting of feature extraction (HOG + landmarks / head pose)
followed by SVM; (2) of OwnNet8-Baseline with T.1 shows
that training with fewer but high-quality labels is in this case better than relying only on many more but weakly la-
beled samples; and (3) of T.1 and A.2 shows that especially
the head pose AUs (51-56) significantly benefit from self-
training. Fusion A+B consistently improves results com-
pared to A.2, but the benefits differ significantly between
AUs.
EmotioNet Challenge Results: Table 2 summarizes the final ranking scores obtained on the official EmotioNet 2020 and 2018 Challenge validation and test sets.
                          Challenge 2020        Challenge 2018
Model / Participant       Val. Set  Test Set    Val. Set  Test Set
Our models
- Teacher T.0             0.7143    -           0.7754    -
- Teacher T.1             0.7213    -           0.7828    -
- Student A.1             0.7324    -           0.7873    -
- Fusion A+B              0.7448    -           0.8011    -
- Fusion A+B*             0.7452    0.7301      0.8014    0.7734
Competitors 2020
- TAL                     0.7460    0.7306      -         0.7722
- UCAS-NTU                0.6363    0.6711      -         0.7377
- UCAS-alibaba            -         0.6053      -         0.6428
Best results 2018
- PingAn-GammaLab         -         -           0.7855    0.7553
- VisionLabs              -         -           0.6788    0.6718
- MIT                     -         -           0.5995    0.6711
- Univ. of Washington     -         -           0.6645    0.6300
Table 2. Final ranking scores obtained on the validation and test
set of the EmotioNet 2020 Challenge (and its 2018 predecessor):
Our results, results of the best 2020 competitors, and of the best
2018 challenge participants.
                          Challenge 2020        Challenge 2018
Participant               Accuracy  F1          Accuracy  F1
TAL                       0.9147    0.5465      0.9499    0.5945
Univ. of Magdeburg        0.9124    0.5478      0.9458    0.6009
UCAS-NTU                  0.9013    0.4410      0.9485    0.5268
PingAn-GammaLab           -         -           0.9446    0.5659
VisionLabs                -         -           0.9207    0.4229
MIT                       -         -           0.9298    0.4125
Table 3. Accuracies and F1-scores of the best participants on the
test set of the EmotioNet 2020 and 2018 Challenges. We (Univ. of
Magdeburg) perform better in F1, TAL is better in accuracy.
For our final submission, we retrained all student models involved in the A+B fusion with the whole opt set instead of the opt-train subset, to benefit from 5k additional manually labeled samples. This model is denoted "Fusion A+B*" in Table 2.
The table also includes our prior validation attempts and the
best results of the other challenge participants. Our “Fu-
sion A+B*” model achieved the second place in the 2020
challenge, with a test performance that is only 0.05% worse
than the winner (TAL) but 5.9% better than the third place
(UCAS-NTU). On the 2018 challenge data (12 AUs; with-
out AU 10, 15, 18, 24, 28, and 51-56), we outperform
all other known results, including the 2018 challenge win-
ner (PingAn-GammaLab) and the 2020 challenge winner
(TAL). Table 3 compares the mean accuracies and F1 per-
formances of the best 2020 and 2018 challenge participants.
4. Conclusion
In this paper we described our approach for facial Ac-
tion Unit (AU) recognition in the wild, which uses: (1) self-
training, (2) multi-task learning, (3) a heterogeneous en-
semble involving three CNN architectures, (4) a weighted
loss for handling data imbalance, and (5) a more-detail
face alignment for expression AUs and a more-context face
alignment for head pose AUs. With this approach we
reached the second place in the EmotioNet 2020 Challenge
(with only a small margin of 0.05% to the winner), with-
out using additional training data beyond the EmotioNet data provided by the challenge organizers. Further, we achieved
the best result reported so far on the EmotioNet 2018 Chal-
lenge data.
Several experiments showed that self-training can im-
prove AU recognition if a large amount of unlabeled data is
available. This is promising for future works, since acquir-
ing FACS labels is expensive and unlabeled data is avail-
able virtually indefinitely. We can also recommend to apply
multi-task learning if multiple datasets are combined and
the datasets’ AU labels differ regarding their quality.
References
[1] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency. OpenFace 2.0: Facial behavior analysis toolkit. In FG, 2018.
[2] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez. EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In CVPR, 2016.
[3] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In ICCV, 2017.
[4] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou. RetinaFace: Single-stage dense face localisation in the wild. arXiv:1905.00641 [cs.CV], May 2019.
[5] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam. Searching for MobileNetV3. In ICCV, 2019.
[6] M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[7] P. Werner, F. Saxen, and A. Al-Hamadi. Handling data imbalance in automatic facial action intensity estimation. In BMVC, 2015.
[8] P. Werner, F. Saxen, and A. Al-Hamadi. Landmark based head pose estimation benchmark and method. In ICIP, 2017.
[9] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with Noisy Student improves ImageNet classification. arXiv:1911.04252v2 [cs.LG], 2020.