Generalizing to Unseen Head Poses in Facial Expression Recognition
and Action Unit Intensity Estimation
Philipp Werner1, Frerk Saxen1, Ayoub Al-Hamadi1, and Hui Yu2
1Neuro-Information Technology Group, Otto-von-Guericke University Magdeburg, Germany
2Visual Computing Group, University of Portsmouth, UK
Abstract: Facial expression analysis is challenged by the
numerous degrees of freedom regarding head pose, identity,
illumination, occlusions, and the expressions themselves. It currently
seems hardly possible to densely cover this enormous space
with data for training a universal well-performing expression
recognition system. In this paper we address the sub-challenge
of generalizing to head poses that were not seen in the training
data, aiming at getting along with sparse coverage of the
pose subspace. For this purpose we (1) propose a novel face
normalization method called FaNC that massively reduces
pose-induced image variance; (2) we compare the impact of
the proposed and other normalization methods on (a) action
unit intensity estimation with the FERA 2017 challenge data
(achieving new state of the art) and (b) facial expression
recognition with the Multi-PIE dataset; and (3) we discuss the
head pose distribution needed to train a pose-invariant CNN-
based recognition system. The proposed FaNC method nor-
malizes pose and facial proportions while retaining expression
information and runs in less than 2 ms. When comparing results
achieved by training a CNN on the output images of FaNC and
other normalization methods, FaNC generalizes significantly
better than others to unseen poses if they deviate more than
20° from the poses available during training. Code and data
are available.
I. INTRODUCTION
Face normalization has been proven to be beneficial across
several domains of face analysis including facial expression
recognition [29], [9], face recognition [17], [52], [45], or gen-
der recognition [17]. In its simplest form, face normalization
(also called face registration or frontalization) compensates
variation in face position, scale, and in-plane rotation. More
advanced methods aim to remove the effects caused by out-
of-plane rotations (head turned away), different facial propor-
tions, expression [52], illumination [54], [45], occlusion [32],
or background. The basic idea is to gain invariance regarding
such nuisance factors by reducing their influence on the
extracted features; this can improve discriminative power for
the recognition task at hand. In facial expression analysis
both head pose and individual differences in facial shape
and texture are a challenge [11]; normalizing these factors
is beneficial if the expression information is preserved, as
it reduces within-class variance. Previous face normalization
approaches, which we discuss in Sec. III, have at least one of
the following limitations: (1) they do not frontalize out-of-
plane poses, (2) they lose expression information or introduce
artifacts, (3) they require training data covering all degrees
This work was funded by the German Federal Ministry of Education and
Research (BMBF), grants 03ZZ0470 and 03ZZ0443G. The sole responsi-
bility for the content lies with the authors.
Fig. 1. Processing chain of the proposed method FaNC: input image with landmarks, in-plane normalization, prediction of correspondence points and visibilities (source and target domain), and texture warping to produce the output.
of freedom (see Abstract), (4) they are too slow for real-time
expression recognition or require heavy GPU computation.
Aiming at head-pose-invariant real-time facial expression
recognition systems, we contribute anovel face normal-
ization method called FaNC (Sec. II). It learns to predict
coordinates and visibilities of correspondence points from
facial landmarks. The predicted information is used to gen-
erate a face image that is normalized regarding pose and
facial proportions. FaNC can be learned and applied on
top of any landmark localizer, also without facial contour
landmarks, and runs in less than 2 ms even on cheap on-
board GPUs. We review related work (Sec. III) and compare
normalization methods’ impact on deep learning based
facial action unit intensity estimation and expression recog-
nition (Sec. IV). To the authors' best knowledge, we present
(1) the first extensive analysis of generalization to unseen
head poses and individuals and (2) the first cross database
evaluation in which frontalization was developed and trained
completely on another dataset than the datasets used for
evaluation. We conduct experiments on the FERA 2017
challenge dataset and the Multi-PIE dataset, in which our
proposed FaNC normalization method outperforms others on
previously unseen head poses and individuals. Further, we
discuss which poses are needed in training data to perform
well across others. Data and code are available for research
at http://iikt.ovgu.de/FaNC.html.
II. FACE NORMALIZATION BASED ON LEARNING CORRESPONDENCES
In this Section we propose Face Normalization based
on learning Correspondences (FaNC), a method that can
be applied on top of any facial landmark localizer. The
core component is the prediction of correspondence point
coordinates and visibilities from automatically detected land-
marks. This mapping can be learned to handle different
face normalization tasks, such as pure frontalization (pose
Fig. 2. SyLaFaN database: 3 degrees of freedom are varied systematically: 30 identities, 30 facial expressions, and 82 head poses (yaw ±45°, pitch ±45°), yielding 73,800 images, each with correspondence points, visibilities, and automatically detected landmarks.
compensation), normalization of pose and expression, or
normalization of pose and identity-related factors (facial
proportions). In this paper, we target the latter, since both
pose and identity can be considered nuisance factors for
recognizing facial expression. Fig. 1 gives an overview of the
method. An arbitrary image $F$ with facial landmarks $l$ is the
input of the algorithm. Landmarks are normalized through an
in-plane transformation (Sec. II-B), followed by prediction
of correspondence point coordinates in both source domain
(arbitrary image) and target domain (frontal image), see
Sec. II-C, and by prediction of the correspondence points’
visibility (Sec. II-D). Finally, the normalized image is created
from the input image by piecewise affine warping based on
the predicted coordinates, whereas disocclusion is handled by
blending and mirroring (Sec. II-E). For training the method,
we create a synthetic dataset, which is described in the
following section.
A. SyLaFaN Database
We introduce the Synthetic dataset for Landmark based
Face Normalization (SyLaFaN). It contains 73,800 images
rendered using the FaceGen 3D morphable model (3D-
MM similar to [8], https://facegen.com/). Identity,
facial expression, and head poses are varied systematically
(see Fig. 2). Illumination and occlusion, which are other
challenging factors for face normalization, are not varied in
the dataset, since they are increasingly well handled by modern
landmark localizers and do not change facial shape.
Each of 30 subjects (with varying ethnicity, age, and
gender) is combined with 30 facial expressions (including
basic emotions and phonemes), resulting in 900 meshes (all
created from 3D-MM). Each mesh is rendered in 82 different
head poses, including the frontal pose (0° rotation angles)
and 81 other poses covering the angle range of ±45° in yaw
(turn right/left) and pitch (turn up/down). The roll angle is
not varied in the dataset, as it can be compensated by in-plane
rotation. For each image, a previously defined subset of the
3D-MM mesh points was projected to the image coordinate
system, yielding a set of correspondence points. Due to self-
occlusions in out-of-plane head poses several of them might
be invisible. So, along with the coordinates we provide a
binary visibility flag for each point.
Formally, the database comprises $N$ samples with index $i \in \mathcal{I} = \{1, 2, \ldots, N\}$, each with an image frame $F_i \in \mathbb{R}^{a_1 \times a_2 \times c}$, with $a_1 \times a_2$ being the number of pixels and $c$ the number of channels. For each sample $i$ we have a set of $M_p$ correspondence points $p_{i,j} \in \mathbb{R}^2$ with $j = 1, \ldots, M_p$, which can be summarized in a vector $p_i \in \mathbb{R}^{2 M_p}$. Each point $j$ is semantically equivalent throughout all samples $i$. For each correspondence point there is an associated binary visibility $v_{i,j} \in \{0, 1\}$. The visibilities of sample $i$ are summarized in the vector $v_i \in \{0, 1\}^{M_p}$. Further, we have $M_l$ facial landmark points $l_{i,j} \in \mathbb{R}^2$ with $j = 1, \ldots, M_l$, which can be summarized in a vector $l_i \in \mathbb{R}^{2 M_l}$.
The landmarks can be automatically localized with one
of numerous methods, but we include our automatically
detected landmarks in the dataset. See Sec. IV for more
details on the landmarks.
We decided to use our own synthetic database instead of
Multi-PIE [16], FERA17 [39], or BP4D [47] due to the
following reasons: (1) accurate correspondence point coordi-
nates and visibilities are easy to obtain when rendering from
a 3D-MM, (2) we can generate more head pose variation, (3)
we are mainly interested in landmarks and correspondence
points, so low detail in texture, lack of occlusions, and low
variability in lighting are not a problem, because those are
handled well by landmark detectors.
B. In-Plane Point Normalization
We register the facial landmarks and correspondence
points with a non-reflective similarity transformation to com-
pensate for in-plane rotation, translation, and scale. The eye
center points of the landmarks $l$, which we calculate from the
eye corners, are used to estimate the transformation $s(x)$. It
is applied to all $p$ and $l$ coordinates, $\hat{p} = s(p)$ and $\hat{l} = s(l)$.
Sec. II-C and II-D only work in this normalized coordinate
system.
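For illustration, the following minimal NumPy sketch estimates such a non-reflective similarity transform from the two eye centers and applies it to a point set. It is not the authors' implementation; the canonical eye positions, the placeholder landmarks, and the iBUG 68-point eye-corner indices are assumptions chosen for the example.

```python
import numpy as np

def similarity_from_eyes(left_eye, right_eye,
                         ref_left=(100.0, 120.0), ref_right=(156.0, 120.0)):
    """Estimate a non-reflective similarity transform s(x) = A x + t that maps
    the two eye centers onto canonical reference positions.
    The reference positions are illustrative assumptions."""
    src = complex(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])
    dst = complex(ref_right[0] - ref_left[0], ref_right[1] - ref_left[1])
    z = dst / src                      # combined scale and rotation
    A = np.array([[z.real, -z.imag],
                  [z.imag,  z.real]])  # scaled rotation matrix
    t = np.asarray(ref_left) - A @ np.asarray(left_eye)
    return A, t

def apply_transform(A, t, points):
    """Apply s(x) to an (N, 2) array of 2D points."""
    return points @ A.T + t

# usage sketch: eye centers from eye-corner landmarks (iBUG 68-point indices)
landmarks = np.random.rand(68, 2) * 200            # placeholder detections
left_eye = landmarks[[36, 39]].mean(axis=0)        # outer/inner corner
right_eye = landmarks[[42, 45]].mean(axis=0)
A, t = similarity_from_eyes(left_eye, right_eye)
l_hat = apply_transform(A, t, landmarks)           # normalized landmarks
```

The same transform $s(x)$ would be applied to both the landmark vector $l$ and the correspondence points $p$.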
C. Correspondence Point Prediction
The task of mapping arbitrary faces (source domain) to the desired normalized faces (target domain) is defined by an index mapping function $t(i): \mathbb{N} \mapsto \mathbb{N}$ that associates each sample in our dataset with a corresponding frontal target sample. The image $F_i$ is associated with the frontal image $F_{t(i)}$ and the correspondence points $p_i$ with the frontal correspondence points $p_{t(i)}$. For the task of facial expression recognition, $t(i)$ selects the sample with frontal pose and the same expression, but from an average identity. That is, it aims to normalize geometric differences between individuals, such as facial proportions, and reduces inter-person variability, which is beneficial for facial expression analysis.
We learn to predict correspondence point coordinates $\hat{p}$ from the normalized landmarks $\hat{l}$. More precisely, the ground truth response vector $y_i$ of sample $i$ is constructed by concatenating the correspondence points from the source domain $\hat{p}_i$ (arbitrary pose) with those from the target domain $\hat{p}_{t(i)}$ (associated frontal pose), i.e. $y_i = [\hat{p}_i, \hat{p}_{t(i)}]$.
We use a linear model $y = Wx + b$ to learn the mapping, because it facilitates very fast prediction and has lower potential for overfitting to our synthetic training dataset. To cope with the non-linearity of the problem, we use non-linear features. In addition to the normalized landmarks $\hat{l}$ we also use the landmarks $\check{l}$ after being aligned based on the mouth corner points (instead of the eye centers). Further, we include the element-wise squares $\hat{l}^2$ and $\check{l}^2$, i.e. $x_i = [\hat{l}_i, \check{l}_i, \hat{l}_i^2, \check{l}_i^2]$. For training we decompose $W \in \mathbb{R}^{4M_p \times 8M_l}$ and $b \in \mathbb{R}^{4M_p}$ into $4M_p$ models (one for each response dimension). The model parameters are selected by optimizing the L2-regularized L2-loss for support vector regression with LIBLINEAR [15]. We standardize the feature vectors $x$ before training. The source domain coordinate regressors are only trained with those images in which the respective correspondence point is visible, since our warping only uses the coordinates of visible points.
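A sketch of this per-dimension training, using scikit-learn's LinearSVR (which wraps LIBLINEAR) as a stand-in for the original LIBLINEAR setup, could look as follows. The data layout and helper names are assumptions; the $C$ and $\epsilon$ values follow the FaNC training settings reported later in Sec. IV.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR   # wraps LIBLINEAR's linear SVR solvers

def build_features(l_hat, l_check):
    """x = [l_hat, l_check, l_hat^2, l_check^2]; l_hat / l_check are the
    eye-aligned and mouth-aligned landmark vectors of one sample."""
    return np.concatenate([l_hat, l_check, l_hat ** 2, l_check ** 2])

def train_coordinate_regressors(X, Y, train_mask):
    """Train one L2-regularized L2-loss SVR per response dimension.
    X: (N, 8*Ml) features; Y: (N, 4*Mp) stacked source and target coordinates;
    train_mask: (N, 4*Mp) bool, True where a sample may be used for a given
    output (all True for target-domain dimensions, visibility-based for
    source-domain dimensions). The array layout is an assumption."""
    scaler = StandardScaler().fit(X)
    Xs = scaler.transform(X)
    models = []
    for d in range(Y.shape[1]):
        keep = train_mask[:, d]
        reg = LinearSVR(C=0.25, epsilon=0.005,
                        loss="squared_epsilon_insensitive")  # L2 loss
        reg.fit(Xs[keep], Y[keep, d])
        models.append(reg)
    return scaler, models

def predict_coordinates(scaler, models, x):
    """Predict all 4*Mp coordinates for one feature vector x."""
    xs = scaler.transform(x.reshape(1, -1))
    return np.array([m.predict(xs)[0] for m in models])
```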
D. Visibility Prediction
Similar to the previous section, we learn to predict correspondence point visibilities $v$ from the normalized landmarks $\hat{l}$, i.e. from the features $x$ described in the previous section. Again we use a linear model; this time the parameter matrices are $W \in \mathbb{R}^{M_p \times 8M_l}$ and $b \in \mathbb{R}^{M_p}$, since we have only one response per correspondence point. Further, the visibility is binary, so we threshold the responses to get the final predictions. We optimize the parameters by learning $M_p$ support vector classifier models with LIBLINEAR [15] using the L2-regularized L2-loss. To avoid the imbalanced data problem [26], we apply random undersampling to balance the class distributions before training.
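Analogously, the visibility classifiers can be sketched with scikit-learn's LinearSVC (also LIBLINEAR-backed); the undersampling helper and the data layout are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.svm import LinearSVC   # LIBLINEAR-backed linear SVM

def undersample(X, y, seed=0):
    """Randomly undersample the majority class so that visible and invisible
    examples are balanced before training (illustrative implementation)."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    return X[keep], y[keep]

def train_visibility_classifiers(Xs, V, C=1.0):
    """Train one L2-regularized L2-loss (squared hinge) linear SVM per
    correspondence point. Xs: standardized features (N, 8*Ml);
    V: binary visibilities (N, Mp)."""
    models = []
    for j in range(V.shape[1]):
        Xb, yb = undersample(Xs, V[:, j])
        clf = LinearSVC(C=C, loss="squared_hinge")
        clf.fit(Xb, yb)
        models.append(clf)
    return models
```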
E. Texture Warping
Basically, we apply piecewise affine warping based on a
triangle mesh to create the output image. The mesh (see
Fig. 1) has been obtained once by Delaunay triangulation
of the correspondence points from a frontal pose image
of the SyLaFaN database. In contrast to typical piecewise
affine warping, the vertex coordinates not only vary for the
input, but also for the output image space. Further, we use
the predicted correspondence points instead of landmarks.
Disocclusion is handled by blending and mirroring from the
visible facial side.
To warp an image, the predicted source domain correspon-
dence points (see Sec. II-C) are transformed back to the input
image space; the target domain points are transformed to the
output image space. The predicted binary visibilities (see
Sec. II-D) are post-processed as follows: (1) In triangles with one or two invisible vertices ($v_{i,j} = 0$), all vertices are set invisible ($v_{i,j} := 0$). (2) In the neighboring triangles, visible vertices ($v_{i,j} = 1$) are set to be half-visible ($v_{i,j} := 0.5$). (3) On the side of the face that has more visible vertices, all vertices are set visible. After that we warp the texture. In the first run, the input coordinates of each triangle with any vertex visibility $v_{i,j} < 1$ are set to the coordinates of the corresponding triangle from the other side, i.e. the texture is mirrored from the other facial side for those triangles. We do a second run with alpha blending to avoid strong edges at the boundaries of the mirrored triangles. Each triangle with any vertex $0 < v_{i,j} < 1$ is blended on top of the first-run image with $\alpha_{i,j} = 1 - v_{i,j}$. The blending factor $\alpha$ is linearly interpolated, eliminating strong edges between visible and mirrored parts.
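One plausible reading of these post-processing rules is sketched below. The neighborhood definition (sharing a vertex), the left/right side labeling, and the tie-breaking are assumptions, since the paper does not spell them out in code-level detail.

```python
import numpy as np

def postprocess_visibility(v, triangles, side):
    """Post-process predicted binary visibilities before warping.
    v: (Mp,) predicted visibilities in {0, 1}; triangles: (T, 3) vertex
    indices of the Delaunay mesh; side: (Mp,) in {-1, +1} labeling the
    left/right facial half of each vertex (assumed to be precomputed).
    Returns continuous visibilities in {0, 0.5, 1}."""
    v = v.astype(float).copy()
    v0 = v.copy()

    # (1) triangles with one or two invisible vertices become fully invisible
    for tri in triangles:
        if 0 < np.count_nonzero(v0[tri] == 0) < 3:
            v[tri] = 0.0

    # (2) visible vertices of triangles neighboring an invisible triangle
    #     (here: sharing a vertex with it) become half-visible
    invisible_verts = {p for tri in triangles if np.all(v[tri] == 0) for p in tri}
    for tri in triangles:
        if any(p in invisible_verts for p in tri):
            for p in tri:
                if v[p] == 1.0:
                    v[p] = 0.5

    # (3) the facial side with more visible vertices is set fully visible
    visible_left = np.sum(v[side == -1] > 0)
    visible_right = np.sum(v[side == +1] > 0)
    v[side == (-1 if visible_left >= visible_right else +1)] = 1.0

    # Downstream, any triangle with a vertex visibility < 1 is filled by
    # mirroring from the other facial side, and triangles with 0 < v < 1 are
    # alpha-blended on top with alpha = 1 - v.
    return v
```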
III. RELATED WORK
Facial expression recognition has been surveyed recently
by Sariyanidi et al. [34]. A variety of methods are used
for normalization. The simplest form is cropping the face
bounding box obtained by face detection and rescaling it to
a canonical size [19], [5] (we later refer to this as FaceDet).
When landmarks are known, another easy option is to only
scale the image [7], which may be sufficient for using local
descriptors around the landmarks. More advanced landmark
based normalization methods are summarized in Table I.
They are based on different landmarks, such as only eye
landmarks, inner landmarks (excluding the facial contour), or
landmarks with facial contour. For some landmark localizers,
facial contour landmarks are not available; further, they are
often less accurate than the inner landmarks. Our proposed
FaNC method can be trained on top of any number of
landmarks. Most methods register the landmarks with a static
reference shape (usually an average face), but they differ
regarding the used transformation: non-reflective similarity
and affine transformations are very common choices.
The first five methods in Table I create the normalized im-
age by warping with a single transformation, which registers
the images to a certain degree, but does not generate a frontal
view. In contrast, the other methods in the table use piecewise
warping or 3D rendering to synthesize a frontal view. Piece-
wise affine warping (PieceAff) to a reference shape is widely
used for frontalization. It offers accurate registration for a
wide range of poses, but has the following limitations: (1)
It removes facial shape information, i.e. differences in facial
proportions and deformations due to expression are lost, (2)
the warping might also drop relevant texture information or
fill large areas from a few pixels, and (3) the method does not
handle occlusions, which leads to artifacts for extreme poses
(see Fig. 4 for examples). Hassner et al. [17] (3dStatic) use a
static 3D model with corresponding 3D landmark positions.
They assume the intrinsic camera parameters to be known
and estimate the extrinsic camera parameters to find the head
pose. Next, they texturize the model with the input image and
render it in frontal pose. Occlusions are handled by blending
with the mirrored version of the model. Wang et al. [40] learn
to map the detected landmarks from arbitrary views to the
frontal view, apply piecewise affine warping to generate a
frontal texture, and handle disocclusions and other artifacts
by synthesizing an appearance image from a pre-defined
Eigen-face space by minimizing the pixel-wise mean squared
error. The first part is similar to our approach, but we not only
map detected landmarks to the target domain, but also predict a
denser set of correspondence points in both source and target
domain. Further, our method is fully discriminative and does
not require an optimization for a query image, making it
TABLE I
OVERVIEW OF STATE-OF-THE-ART LANDMARK-BASED REAL-TIME CAPABLE FACE NORMALIZATION METHODS.

| Abbreviation | Registration input | Registration target | Texture warping | OH | Applications |
| SimEye | eye landmarks | reference shape | NR similarity transf. | × | [38], [25], [36], [50], [12] |
| SimInner | inner landmarks | reference shape | NR similarity transf. | × | [48], [13], [49], [27] |
| SimStable | landmarks stable under expression [3] | reference shape | NR similarity transf. | × | [3], [4], [28] |
| AffInner | inner landmarks | reference shape | affine transf. | × | [44], [30], [14], [2], [35] |
| AffStable | stable inner landmarks (eye/nose) | reference shape | affine transf. | × | [37], [23], [1] |
| PieceAff | landmarks with facial contour | reference shape | piecewise affine transf. | × | [9], [41], [20], [42] |
| 3dStatic [17] | inner landmarks | static 3D model | 3D rendering | ✓ | [17] |
| FaNC (ours) | predicted corresp. points | predicted corresp. points | piecewise affine + blending | ✓ | Sec. IV-B and IV-C |

OH: occlusion handling. NR: non-reflective.
usable for online expression analysis at high frame rates
– in contrast, the optimization part of Wang et al. [40] runs for
more than one minute per image. Besides the landmark-
based methods, there are purely texture-based approaches
to normalize faces [54], [45], [46], which are not in the
focus here. They require expensive hardware to run at high
frame-rates (if possible at all) and huge training datasets with
variation in all degrees of freedom (for generalizing well
across datasets).
IV. EXPERIMENTS
In several experiments, we compare the proposed FaNC
with other face normalization methods, analyze generaliza-
tion to unseen poses (and individuals), and analyze the
impact of the poses available in training data. Sec. IV-
A compares qualitative results and runtime of face nor-
malization methods. In Sec. IV-B we experiment with the
FERA17 dataset [39] and compare the results we achieve
in facial action unit intensity estimation when changing the
normalization used for preprocessing the recognition CNN
input. Similarly, Sec. IV-C addresses expression recognition
on the Multi-PIE dataset [16].
Landmark Localization: To localize facial landmarks
(68 points) across a wide range of poses, we train an
ensemble of regression trees based on the method by Kazemi
and Sullivan [21] using the implementation from dlib [22].
The model is trained on multiple datasets (Multi-PIE [16],
afw [53], helen [24], ibug, 300-W [31], 300-VW [10], and
lfpw [6]). The point annotations for ibug, afw, helen, 300-
W, and lfpw are provided by Sagonas et al. [33]. From
the 300-VW dataset we selected the hardest 10 frames of
each video based on the point-to-point error (normalized by
the interocular distance) with a previously trained model. From
the Multi-PIE dataset we used all fully annotated samples
from the camera poses 080 and 190. The resulting model
performed significantly better than the model coming with
dlib [22]. An advantage of our method is that it can benefit
from advances in landmark localization and that using a more
recent approach may improve face normalization results.
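A minimal sketch of such a training run with dlib's Python API is given below; the XML file, output file names, and option values are illustrative assumptions, not the authors' actual settings.

```python
import dlib

# Training images and 68-point annotations are assumed to be merged into a
# single dlib-style XML file (paths and option values are illustrative).
options = dlib.shape_predictor_training_options()
options.oversampling_amount = 20   # augment with perturbed initializations
options.tree_depth = 4             # depth of each regression tree
options.cascade_depth = 10         # number of cascaded regressor stages
options.nu = 0.1                   # shrinkage / learning rate
options.num_threads = 8
options.be_verbose = True

dlib.train_shape_predictor("merged_landmark_training.xml",
                           "ert_landmarks_68.dat", options)

# At run time the trained model is applied to a detected face box:
predictor = dlib.shape_predictor("ert_landmarks_68.dat")
detector = dlib.get_frontal_face_detector()
img = dlib.load_rgb_image("example.jpg")           # illustrative input
for box in detector(img, 1):
    shape = predictor(img, box)                    # 68 landmark points
    landmarks = [(p.x, p.y) for p in shape.parts()]
```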
SyLaFaN Dataset: Despite the improved model, there
are still moderate to severe landmark localization errors,
especially in extreme head poses. For our experiments we
only use a subset of the SyLaFaN database with lower errors.
To find this set, we calculate the mean distance between landmark
points and their associated correspondence points for each sample
Fig. 3. Normalized images for facial expressions smile (top row), mouth open (middle), and half closed eyes (bottom). Columns: mean results of SimEye, SimInner, SimStable, AffInner, AffStable, PieceAff, 3dStatic, FaNC51, and FaNC68, plus one example image before normalization.
$i$, sort the samples by this distance, and choose the 75% with the
lowest error.
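A compact sketch of this selection step (with an assumed landmark-to-correspondence-point index map, which the paper does not specify) could look like this:

```python
import numpy as np

def select_low_error_subset(landmarks, corr_points, lm_to_cp, keep=0.75):
    """Keep the fraction of samples with the lowest mean distance between
    detected landmarks and their associated correspondence points.
    landmarks: (N, Ml, 2); corr_points: (N, Mp, 2); lm_to_cp: (Ml,) index
    map from each landmark to its associated correspondence point
    (an assumption for this sketch)."""
    dists = np.linalg.norm(landmarks - corr_points[:, lm_to_cp], axis=-1)
    mean_err = dists.mean(axis=1)                 # per-sample mean error
    order = np.argsort(mean_err)
    return order[: int(keep * len(order))]        # indices of kept samples
```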
FaNC Training: We trained FaNC with the $M_p = 153$ correspondence points provided with the SyLaFaN dataset. Regarding landmarks, we use two variants: FaNC68 with all $M_l = 68$ landmarks and FaNC51 with the 51 inner landmarks (excluding the facial contour points along the jaw and chin). The coordinate prediction is trained with $\epsilon = 0.005$ and $C = 0.25$, the visibility prediction with $C = 1$. We render the normalized images to a resolution of 180×200 pixels for Sec. IV-A and 256×256 pixels for Sec. IV-B and IV-C (same for all other methods). For the cross-dataset experiments in Sec. IV-B and IV-C we augment the SyLaFaN training set by mirroring the asymmetric expressions and train on 30,000 randomly selected samples with $M_l = 68$.
A. Face Normalization
We qualitatively compare face normalization results of the
methods listed in Table I and briefly discuss runtime. The
3dStatic method was applied with the inner 51 landmarks,
as this performed better than using all 68 landmarks.
Qualitative Results on SyLaFaN: We applied the nor-
malization methods on all images of the SyLaFaN dataset
and calculated the pixel-wise mean images for each expres-
sion (across all combinations of head poses and identities).
Fig. 3 shows the resulting mean images for three facial
expressions (rows). Blur indicates high within-class variation
in the respective region, which is generally undesirable.
The single-transformation methods (first five columns) can
at most register parts of the images accurately – the parts
around the used landmarks if they are few and planar as
in SimEye and AffStable. PieceAff achieves an accurate
Fig. 4. Normalized face images and input images from the LFW database. See Table I for acronyms.
Fig. 5. Normalized face images (AffInner, 3dStatic, FaNC) and input images from the FERA 2017 database. The bottom row shows FaNC failure cases, see text.
registration, but most of the expression-induced shape de-
formation is lost. The more advanced methods, 3dStatic and
FaNC, yield accurate registration and retain the expression in-
formation at the same time. There is no qualitative difference
between FaNC51, which only uses the 51 inner landmarks,
and FaNC68, which also uses facial contour landmarks.
Qualitative Results on LFW and FERA: Fig. 4 depicts
examples of the Labeled Faces in the Wild (LFW) database
[18]. SimEye is sensitive to the foreshortening effect in
out-of-plane poses, which may significantly alter scale as
in the third row. SimInner and SimStable yield similar
results, whereas SimInner tends to have higher registration
accuracy at the landmarks and SimStable tends to yield
more upright and centered faces. AffInner has more po-
tential to compensate differences in facial proportions, but
may cause unrealistic looking shearing of the image. The
latter effect is even more pronounced in AffStable. PieceAff
suffers from disocclusion artifacts and removes expression
information. 3dStatic and FaNC both handle occlusions by
exploiting symmetry, but FaNC causes fewer artifacts. Note
that 3dStatic has been developed with the LFW database,
so it is “optimized” for this database. Our FaNC method has
been developed and trained on the SyLaFaN database and we
did not optimize it towards any other database. Fig. 5 shows
examples from the FERA 2017 challenge dataset [39]. If
landmarks are localized well (see top row), FaNC is able to
synthesize high quality frontal views in most of the cases.
If landmarks are inaccurate (bottom row), FaNC’s frontal
images suffer from more artifacts. Further, FaNC is not
able to recover occlusions caused by the nose at larger pitch angles (see
bottom right), yielding a long nose and deformations at the
lip. However, 3dStatic generally suffers from more artifacts
(although we resized images to the expected resolution and
tried some other adaptations for improvement). AffInner is
not able to reduce variance between head poses, but does not
cause any artifacts (except some shearing).
Runtime: The FaNC normalization method is designed
to be fast. Essentially, it only needs two matrix multipli-
cations and warping with blending, which can be efficiently
done with any (even a very old) GPU. With our unoptimized
OpenGL 2.0 implementation, warping into a 256×256 image
takes about 1.5 ms with an Intel HD 4000 GPU (integrated
in Intel i7-3770, launched 2012) including data transfers,
similar to PieceAff and all single transformation methods.
For an image of the same size, 3dStatic runs for about 100 ms [17].
State of the art texture-based methods require heavy GPU
computation and can achieve high frame rates only with
expensive hardware (if at all).
B. Action Unit Intensity Estimation
We evaluate the effect of face normalization on facial
action unit (AU) intensity estimation and the generalization
to unseen poses with the FG 2017 Facial Expression Recog-
nition and Analysis challenge (FERA 2017) dataset [39],
which is intended to raise the bar for expression recognition
for different view angles of the face. The dataset provides
a training and validation set, with 41 and 20 different
participants, respectively. Each participant was stimulated in
8 different scenarios and each scenario is captured from
9 different viewing angles (in total 2,952 training and
1,431 validation videos). 7 different Action Units (AUs) are
manually labeled for each frame.
Training: We use the NASNet-A architecture [55] and
fine-tune the pretrained NASNet-A Mobile 224 model avail-
able with the tensorflow/slim implementation. Due to the
limited variability in the data (compared to ImageNet), we
Fig. 6. AU intensity estimation results (mean ICC) on unseen poses (solid). Training was done with view 6 only (top) and with views 3, 6, and 9 (bottom). Compared methods: FaNC (proposed), 3dStatic, PieceAff, SimStable, AffInner, and Raw.
Fig. 7. AU intensity estimation results compared to the state of the art. Mean ICC over all views and AUs depending on the number of views used for training (outer plot) and the number of model parameters (inner plot). Compared: NASNet with FaNC, 3dStatic, PieceAff, SimStable, AffInner, and Raw, as well as Valstar [39], Zhou [51] (Raw), Batista [5], Werner [43], and Amirian [2].
cut the network after the 6’th of 12 cells. We append a
fully connected layer with 7neurons, one for each AU (with
linear activation). Further, the stem weights (first part of the
network) are kept fixed to speed up the training. Standard
gradient descent is used to minimize MSE loss for 50,000
iterations (with a mini-batch size of 32 samples). The initial
learning rate is set to 0.1and reduced according to the single
period cosine decay [55] down to 108. For regularization
we set the drop path keep probability to 0.9 and L2 weight
decay to 4·105. To avoid divergence due to huge gradients,
local gradient clipping is applied (max. L2 norm value of 5)
during the first 2,000 iterations. Similar to Zhou et al. [51],
we randomly under-sample the training set for each view
by selecting 6,000 samples per AU (3k with intensity label
0 and 3k with label 1-5). We augment the training data
(42,000 samples per view) by randomly changing brightness,
contrast, and saturation and by randomly flipping the image.
Each trained model is tested on every frame of the entire
validation set to calculate the ICC(3,1) measure (per view
and AU). Training and evaluation is repeated 5 times for
each normalization method and results are averaged.
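For reference, a NumPy implementation of the standard ICC(3,1) formula (two-way mixed, consistency, single rater) applied to ground-truth versus predicted intensities might look as follows; this is an illustrative sketch with hypothetical data, not the challenge's official evaluation script.

```python
import numpy as np

def icc_3_1(ratings):
    """ICC(3,1): two-way mixed, consistency, single rater.
    ratings: (n_targets, k_raters) array, e.g. column 0 = ground-truth AU
    intensity per frame, column 1 = predicted intensity."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()     # between targets
    ss_cols = n * ((col_means - grand) ** 2).sum()     # between raters
    ss_err = ss_total - ss_rows - ss_cols              # residual

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# usage sketch with hypothetical intensities:
y_true = np.array([0, 1, 3, 2, 0, 4], dtype=float)
y_pred = np.array([0, 1, 2, 2, 1, 4], dtype=float)
print(icc_3_1(np.stack([y_true, y_pred], axis=1)))
```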
Generalization to unseen poses: We investigate to
which degree different face normalization methods help with
generalizing to unseen head poses and which views are
needed for training to achieve good results. For this purpose,
we vary the subset of views used for training. The compared
methods include our FaNC, 3dStatic, PieceAff, SimStable,
AffInner, and Raw (using original images depicted in Fig. 6).
Fig. 6 (top) shows the results (mean ICC across all AUs)
of training with only the frontal samples (view 6). We
can observe that all methods generalize well to view 5,
which differs 20° from the training view in the yaw angle.
However, performance drops significantly for all other views.
On average and in most cases, our proposed FaNC method
facilitates the best generalization to unseen poses. Changes in
pitch (±40°) yield the lowest performance due to the change
in appearance (e.g. occlusion by nose) that cannot be fully
compensated by any of the methods, but our FaNC method
outperforms the others clearly in all top views (7, 8, and 9).
In Fig. 6 (bottom) we show the results of training with one
view per pitch angle (3, 6, and 9). Compared to training with
the frontal view only, the overall performance improves due
to more training samples and more variability. But enormous
performance drops remain for views that differ 40° from the
training data in yaw angle (views 1, 4, and 7). Our FaNC
method still outperforms the others on those and the other
unseen views.
Comparison with state of the art: In Fig. 7 we compare
the results we obtain with NASNet to those reported in other
works that address AU intensity estimation on the FERA
2017 dataset. Valstar et al. [39] are the only ones who tried to
generalize to unseen views (they trained on view 5 and 6),
but their simple challenge baseline system performed poorly
compared to all other works. Amirian et al. [2] and Werner
et al. [43] both greatly outperform the baseline while training
with all views, but the deep learning based approaches by
Batista et al. [5] and Zhou et al. [51] perform significantly
better. Batista et al. [5] fed the cropped face bounding boxes
to a custom network architecture. The FERA 2017 challenge
winners Zhou et al. [51] used the original images (as our
“NASNet - Raw”) and fine-tuned one VGG16-based network
per action unit. Our NASNet yields similar results with
3dStatic and PieceAff face normalization, but outperforms
all related works for the other face normalization methods.
Even at the same performance level, NASNet has the advantage of
fewer model parameters (and a smaller memory footprint); both other
networks [5], [51] have 300 times more parameters than
NASNet (see inner plot in Fig. 7). Fig. 7 also shows the
overall results we obtain with different number of views
Fig. 8. Facial expression recognition results (accuracy) on unseen poses (without shading). Training was done with view 0° only (top) and with views −45°, 0°, and +45° (bottom). Mean accuracies across all views in brackets. Top: FaNC (0.864), SimStable (0.831), 3dStatic (0.823), PieceAff (0.789), FaceDet (0.777), AffInner (0.714). Bottom: FaNC (0.899), 3dStatic (0.898), AffInner (0.897), SimStable (0.895), FaceDet (0.894), PieceAff (0.853). The trivial classifier achieves 0.277.
used for training (view 6; views 3, 6, 9; views 1, 3, 4, 6,
7, 9; all views). Our proposed FaNC with NASNet trained
on only frontal images performs better than Batista et al. [5]
(challenge’s second place), who trained on all views. Further,
we observe that there is no improvement between training
with all nine views and six views (combinations of yaw
∈ {−40°, 0°} and pitch ∈ {−40°, 0°, +40°}). To analyze
whether the results with all views would benefit from longer training,
we tried to train for 100k instead of 50k iterations, but found
no significant difference. So we conclude that the additional
views with intermediate yaw angles do not add much and the
model already generalizes well to the intermediate views. See
supplementary material for detailed result tables.
C. Facial Expression Recognition
To further evaluate the effect of face normalization, we
conduct experiments on the Multi-PIE dataset [16]. We use
the data of all 337 subjects in homogeneous illumination
(no. 00) recorded from seven views provided in the dataset
(yaw angles 0°, ±15°, ±30°, ±45°). In total these are about
18k images of the following six facial expressions, which
we aim to recognize: neutral expression, smile, surprise,
squint, disgust, and scream. We train NASNet as described
in the previous section. The only differences are the number
of outputs (6, one per class), the loss function (soft-max
cross entropy), the number of iterations (20k), and the initial
learning rate (0.01). We run 5-fold cross validation without
subject overlap between training and test sets and the results
are averaged. The normalization methods are the same as
above, except that we use the face detection bounding box
(FaceDet) instead of full database images (Raw).
Fig. 8 (top) shows the results of training with only
frontal faces. Similar to the results on the FERA dataset,
performance drops significantly if the pose deviates 30° or
more from the data seen during training. But again, FaNC
generalizes best to those unseen poses. If we additionally
include the ±45° views in the training set, the network is able to
generalize to the intermediate views without significant per-
formance drops, see Fig. 8 (bottom). In this case, the face
normalization has minor influence on the performance. Only
PieceAff performs significantly worse, probably because it
suffers from artifacts due to a lack of occlusion handling.
V. DISCUSSION AND CONCLUSION
With the advent of deep learning, the limited amount of
data with high-quality annotations has become one of the major
issues. The previous sections addressed the question of
how to achieve head pose invariance with limited training
data. For this purpose we developed the FaNC method to
normalize arbitrary faces to frontal views. In contrast to
most other works in face normalization [17], [52], [45],
[54], [40], we tested our method cross-database, i.e. FaNC
was evaluated on data that was completely unseen during
the development of the method. Normalization of those data
shows that FaNC generalizes well to new data, generating
realistic frontal images without significant artifacts in most
of the cases. Based on our experiments on the FERA 2017
and Multi-PIE database, we can clearly recommend to use
FaNC if most of the available training data for the task at
hand is frontal, because it generalizes best to unseen views.
We observed that AU intensity estimation and expression
recognition performance degrades if the tested poses deviate
more than 20° from the poses available during training, but
less so with our proposed FaNC method.
The experiments indicated that CNNs are able to gen-
eralize well to unseen poses without sophisticated face
normalization methods if training data is available that covers
the pose space in steps of about 40°. We expect that a less
systematic, high-variance coverage of the pose space would
have a similar or even better effect on generalization. How-
ever, for training a robust universal expression recognition
system, pose is not the only nuisance factor we need to
vary (but also identity, illumination, occlusion, background,
resolution, sharpness, noise etc.). Generalizing across head
poses with less need for variation in the training data
may help to also address the other factors. Gathering huge
amounts of suitable data for expression recognition is still
challenging, because (1) annotation with high quality labels
is expensive and (2) it is hard to avoid dataset biases and
cover rare events/conditions sufficiently. Gathering 3D data
and rendering in several poses seems to be an alternative to
gathering multiple views, but 3D data are usually incomplete
(e.g. occluded part of head is missing) and inaccurate (at least
in convex parts), impairing realism in out-of-plane views.
Further, in contrast to face normalization, the pose augmen-
tation approach cannot benefit from existing 2D data. So we
believe that improving face normalization is still promising.
A direction to advance FaNC is training with more realistic
3D morphable models and/or arbitrary 3D datasets that cover
more variation of identity and expression. Further, there is
room for improvement in handling pitch variation, in which
symmetry does not help for filling disocclusions.
REFERENCES
[1] T. Almaev, B. Martinez, and M. Valstar. Learning to transfer:
transferring latent task structures and its application to person-specific
facial action unit detection. In ICCV, 2015.
[2] M. Amirian, M. Kächele, G. Palm, and F. Schwenker. Support vector
regression of sparse dictionary-based features for view-independent
action unit intensity estimation. In FG, 2017.
[3] T. Baltrusaitis, M. Mahmoud, and P. Robinson. Cross-dataset learning
and person-specific normalisation for automatic action unit detection.
In FG, 2015.
[4] T. Baltrusaitis, P. Robinson, and L.-P. Morency. OpenFace: an open
source facial behavior analysis toolkit. In IEEE Winter Conference on
Applications of Computer Vision (WACV), 2016.
[5] J. C. Batista, V. Albiero, O. R. P. Bellon, and L. Silva. Aumpnet:
Simultaneous action units detection and intensity estimation on mul-
tipose facial images using a single convolutional neural network. In
FG, 2017.
[6] P. Belhumeur, D. Jacobs, D. Kriegman, and N. Kumar. Localizing
parts of faces using a consensus of exemplars. In CVPR, 2011.
[7] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez. EmotioNet:
An Accurate, Real-Time Algorithm for the Automatic Annotation of
a Million Facial Expressions in the Wild. In CVPR, 2016.
[8] V. Blanz and T. Vetter. A Morphable Model for the Synthesis of 3d
Faces. In Proc. of the 26th Annual Conf. on Computer Graphics and
Interactive Techniques, SIGGRAPH, 1999.
[9] S. W. Chew, P. Lucey, S. Lucey, J. Saragih, J. F. Cohn, I. Matthews,
and S. Sridharan. In the pursuit of effective affective computing:
The relationship between features and registration. IEEE Trans. on
Systems, Man, and Cybernetics, Part B, 42(4):1006–1016, 2012.
[10] G. G. Chrysos, E. Antonakos, S. Zafeiriou, and P. Snape. Offline
deformable face tracking in arbitrary videos. In ICCVW, 2015.
[11] W.-S. Chu, F. D. L. Torre, and J. F. Cohfcn. Selective transfer machine
for personalized facial action unit detection. In CVPR, 2013.
[12] A. Dapogny, K. Bailly, and S. Dubuisson. Pairwise conditional random
forests for facial expression recognition. In ICCV, 2015.
[13] X. Ding, W.-S. Chu, F. De la Torre, J. F. Cohn, and Q. Wang. Facial
action unit event detection by cascade of tasks. In ICCV, 2013.
[14] S. Eleftheriadis, O. Rudovic, and M. Pantic. Multi-conditional latent
variable model for joint facial action unit detection. In ICCV, 2015.
[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin.
LIBLINEAR: A library for large linear classification. Journal of
machine learning research, 9(Aug):1871–1874, 2008.
[16] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE.
Image Vision Comput., 28(5):807–813, May 2010.
[17] T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective Face Frontaliza-
tion in Unconstrained Images. In CVPR, 2015.
[18] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. La-
beled faces in the wild: A database for studying face recognition in
unconstrained environments. Technical Report 07-49, University of
Massachusetts, Amherst, October 2007.
[19] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint fine-tuning in deep
neural networks for facial expression recognition. In ICCV, 2015.
[20] S. Kaltwang, S. Todorovic, and M. Pantic. Latent Trees for Estimating
Intensity of Facial Action Units. In CVPR, 2015.
[21] V. Kazemi and J. Sullivan. One millisecond face alignment with an
ensemble of regression trees. In CVPR, 2014.
[22] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine
Learning Research, 10:1755–1758, 2009.
[23] S. Koelstra, M. Pantic, and I. Y. Patras. A Dynamic Texture-
Based Approach to Recognition of Facial Actions and Their Temporal
Models. IEEE Trans. Pattern Anal. Mach. Intell., 32(11):1940–1954,
2010.
[24] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive
facial feature localization. In ECCV, 2012.
[25] M. Liu, S. Shan, R. Wang, and X. Chen. Learning Expressionlets on
Spatio-temporal Manifold for Dynamic Facial Expression Recognition.
In CVPR, 2014.
[26] V. López, A. Fernández, S. García, V. Palade, and F. Herrera. An
insight into classification with imbalanced data: Empirical results and
current trends on using data intrinsic characteristics. Information
Sciences, 250:113–141, Nov. 2013.
[27] I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face
recognition in the wild. In CVPR, 2016.
[28] F. Ringeval, M. Pantic, et al. AVEC 2017: Real-life Depression, and
Affect Recognition Workshop and Challenge. In Proc. Workshop on
Audio/Visual Emotion Challenge (AVEC), 2017.
[29] O. Rudovic, M. Pantic, and I. Patras. Coupled Gaussian processes
for pose-invariant facial expression recognition. IEEE Trans. Pattern
Anal. Mach. Intell., 35(6):1357–1369, 2013.
[30] O. Rudovic, V. Pavlovic, and M. Pantic. Context-Sensitive Dynamic
Ordinal Regression for Intensity Estimation of Facial Action Units.
IEEE Trans. Pattern Anal. Mach. Intell., 37(5):944–958, 2015.
[31] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and
M. Pantic. 300 faces in-the-wild challenge: database and results. Image
and Vision Computing, 47:3–18, 2016.
[32] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic. Robust
statistical face frontalization. In ICCV, 2015.
[33] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. A semi-
automatic methodology for facial landmark annotation. In CVPRW,
2013.
[34] E. Sariyanidi, H. Gunes, and A. Cavallaro. Automatic Analysis of
Facial Affect: A Survey of Registration, Representation, and Recogni-
tion. IEEE Trans. Pattern Anal. Mach. Intell., 37(6):1113–1133, 2015.
[35] F. Saxen, P. Werner, and A. Al-Hamadi. Real vs. Fake Emotion Chal-
lenge: Learning to Rank Authenticity from Facial Activity Descriptors.
In ICCVW, 2017.
[36] T. Senechal, V. Rapp, H. Salam, R. Seguier, K. Bailly, and L. Prevost.
Facial action recognition combining heterogeneous features via mul-
tikernel learning. IEEE Trans. Systems, Man, and Cybernetics, Part
B, 42(4):993–1005, 2012.
[37] M. Valstar, J. Girard, T. Almaev, G. McKeown, M. Mehu, L. Yin,
M. Pantic, and J. Cohn. FERA 2015 - second facial expression
recognition and analysis challenge. In FG, 2015.
[38] M. F. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. Scherer. The first
facial expression recognition and analysis challenge. In FG, 2011.
[39] M. F. Valstar, E. Sánchez-Lozano, J. F. Cohn, L. A. Jeni, J. M. Girard,
Z. Zhang, L. Yin, and M. Pantic. FERA 2017 - Addressing Head Pose
in the Third Facial Expression Recognition and Analysis Challenge.
In FG, 2017.
[40] Y. Wang, H. Yu, J. Dong, M. Jian, and H. Liu. Cascade support vector
regression-based facial expression-aware face frontalization. In IEEE
Int. Conf. on Image Processing (ICIP), 2017.
[41] Z. Wang, Y. Li, S. Wang, and Q. Ji. Capturing global semantic
relationships for facial action unit recognition. In ICCV, 2013.
[42] P. Werner, A. Al-Hamadi, K. Limbrecht-Ecklundt, S. Walter, S. Gruss,
and H. Traue. Automatic Pain Assessment with Facial Activity
Descriptors. IEEE Trans. on Affective Computing, 8(3):286–299, 2017.
[43] P. Werner, S. Handrich, and A. Al-Hamadi. Facial action unit intensity
estimation and feature relevance visualization with random regression
forests. In Int. Conf. Affective Computing Intelligent Interaction, 2017.
[44] P. Werner, F. Saxen, and A. Al-Hamadi. Handling Data Imbalance
in Automatic Facial Action Intensity Estimation. In British Machine
Vision Conf. (BMVC), 2015.
[45] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim. Rotating Your
Face Using Multi-Task Deep Neural Network. In CVPR, 2015.
[46] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-
pose face frontalization in the wild. In ICCV, 2017.
[47] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz,
P. Liu, and J. M. Girard. BP4D-Spontaneous: a high-resolution
spontaneous 3d dynamic facial expression database. Image and Vision
Computing, 32(10):692–706, Oct. 2014.
[48] K. Zhao, W.-S. Chu, F. De la Torre, J. F. Cohn, and H. Zhang. Joint
Patch and Multi-Label Learning for Facial Action Unit Detection. In
CVPR, 2015.
[49] K. Zhao, W.-S. Chu, and H. Zhang. Deep Region and Multi-Label
Learning for Facial Action Unit Detection. In CVPR, 2016.
[50] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. N. Metaxas.
Learning active facial patches for expression analysis. In CVPR, 2012.
[51] Y. Zhou, J. Pi, and B. E. Shi. Pose-independent facial action unit
intensity regression based on multi-task deep transfer learning. In
FG, 2017.
[52] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-Fidelity Pose and
Expression Normalization for Face Recognition in the Wild. In CVPR,
2015.
[53] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark
localization in the wild. In CVPR, 2012.
[54] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity-
preserving face space. In ICCV, 2013.
[55] B. Zoph, V. Vasudevan, J. Shlens, and Q. Le. Learning Transferable
Architectures for Scalable Image Recognition. In CVPR, 2018.
... Recently, great progress has been made in expression intensity estimation. Some works [9][10][11] have attempted to estimate the absolute value of expression intensity by using the intensity of individual action units (AUs). However, the manual labeling of AU intensity is a professional and laborious task that results in insufficient labeled data. ...
Article
Full-text available
Emotional understanding and expression plays a critical role in social interaction. To analyze children’s emotional interaction automatically, this study focuses on developing a novel network architecture and a reliable algorithm for expression intensity estimation to measure children’s facial expression responses to emotional stimuli. The facial expression intensity variation provides temporal dynamic information of facial behavior, which is critical to interpreting the meaning of expression. In order to avoid laborious manual annotations for expression intensity, existing unsupervised methods attempt to identify relative intensity using ordinal information within a facial expression sequence; however, they fail to estimate absolute intensity accurately. Moreover, appropriate features are needed to represent the continuous appearance changes caused by expression intensity to improve the model’s ability to distinguish subtle differences in expression. This study therefore presents a novel semi-supervised method to estimate expression intensity using salient deep learning features. First, the facial expression is represented by the difference response of the convolutional neural network backbone between the target expression and its responding neutral expression, with the goal of suppressing the effects of expression-unrelated features on expression intensity estimation. Then, the pairwise data constructed with ordinal information is input into a Siamese network with a combined hinge loss that guides learning the relative intensity on unlabeled pairwise frames, the absolute intensity of a few labeled key frames, and the intensity range of most unlabeled frames. The average pearson correlation coefficient, intraclass correlation coefficient, and mean absolute error are 0.7683, 0.7405, and 0.1698 on the extended Cohn-Kanade dataset (CK+), and 0.7804, 0.6684, and 0.1864 on the Binghamton University 4D Facial Expression Dataset using the proposed method, results that are superior to the state of the art. The cross-dataset experiment indicates that the proposed method is promising for the analysis of children’s emotional interactions.
... For example, they can be used in the field of human-computer interaction (HCI) to detect possible interaction partners, in autonomous driving to perceive road users such as pedestrians, or in mobile robot navigation to identify moving obstacles. Furthermore, they are the first component for a large number of recognition systems in many applications, such as face recognition [1], facial expression analysis [2,3], body pose estimation [4], face attribute detection [5], human action recognition [6] and others. In such systems, face and/or person detection are often a prerequisite for the following processing steps; so, their detection rate is crucial for the performance of the overall system. ...
Article
Full-text available
Face and person detection are important tasks in computer vision, as they represent the first component in many recognition systems, such as face recognition, facial expression analysis, body pose estimation, face attribute detection, or human action recognition. Thereby, their detection rate and runtime are crucial for the performance of the overall system. In this paper, we combine both face and person detection in one framework with the goal of reaching a detection performance that is competitive to the state of the art of lightweight object-specific networks while maintaining real-time processing speed for both detection tasks together. In order to combine face and person detection in one network, we applied multi-task learning. The difficulty lies in the fact that no datasets are available that contain both face as well as person annotations. Since we did not have the resources to manually annotate the datasets, as it is very time-consuming and automatic generation of ground truths results in annotations of poor quality, we solve this issue algorithmically by applying a special training procedure and network architecture without the need of creating new labels. Our newly developed method called Simultaneous Face and Person Detection (SFPD) is able to detect persons and faces with 40 frames per second. Because of this good trade-off between detection performance and inference time, SFPD represents a useful and valuable real-time framework especially for a multitude of real-world applications such as, e.g., human–robot interaction.
... e nonfrontal faces are caused by head turning and pitching and camera viewpoints changing, and these would be easy to deform face shape significantly and cause FER errors [5]. In deformed facial images, the important features for expression recognition, including the appearance and position of brows, eyes, cheeks, and mouth, are much different from those of front face and cause FER error seriously. ...
Article
Full-text available
Nonfrontal facial expression recognition in the wild is the key for artificial intelligence and human-computer interaction. However, it is easy to be disturbed when changing head pose. Therefore, this paper presents a face rebuilding method to solve this problem based on PRNet, which can build 3D frontal face for 2D head photo with any pose. However, expression is still difficult to be recognized, because facial features weakened after frontalization, which had been widely reported by previous studies. It can be proved that all muscle parameters in frontalization face are more weakened than those of real face, except muscle moving direction on each small area. Thus, this paper also designed muscle movement rebuilding and intensifying method, and through 3D face contours and Fréchet distance, muscular moving directions on each muscle area are extracted and muscle movement is strengthened following these moving directions to intensify the whole face expression. Through this way, nonfrontal facial expression can be recognized effectively.
... In the case of large-out-of-plane head-rotations, the majority of studies attempt face frontalization -effectively mapping the non-frontal faces to a frontal reference frame. For instance, Werner et al. [45] proposed a face normalization method which also depends on the quality of the facial landmark detection and texture-warping. The authors showed that with their face-normalization method they could train betterperforming CNN models for facial expression recognition and AU detection, compared to when no face-normalization was applied. ...
Preprint
Facial action unit recognition has many applications from market research to psychotherapy and from image captioning to entertainment. Despite its recent progress, deployment of these models has been impeded due to their limited generalization to unseen people and demographics. This work conducts an in-depth analysis of performance across several dimensions: individuals(40 subjects), genders (male and female), skin types (darker and lighter), and databases (BP4D and DISFA). To help suppress the variance in data, we use the notion of self-supervised denoising autoencoders to design a method for deep face normalization(DeepFN) that transfers facial expressions of different people onto a common facial template which is then used to train and evaluate facial action recognition models. We show that person-independent models yield significantly lower performance (55% average F1 and accuracy across 40 subjects) than person-dependent models (60.3%), leading to a generalization gap of 5.3%. However, normalizing the data with the newly introduced DeepFN significantly increased the performance of person-independent models (59.6%), effectively reducing the gap. Similarly, we observed generalization gaps when considering gender (2.4%), skin type (5.3%), and dataset (9.4%), which were significantly reduced with the use of DeepFN. These findings represent an important step towards the creation of more generalizable facial action unit recognition systems.
... The ideal method should recognize expressions in the wild, which means the recognition method should adapt to multiview human faces [2] and, specifically, nonfrontal faces [3]; however, this has not yet been achieved [4]. Nonfrontal poses can seriously distort face images [5], [6]; the eyes, cheeks, and mouth, which are very important for expression recognition, are all distorted. Not only their shapes but also their positional relationships are changed and distorted [7], [8]. ...
Article
Full-text available
Expression recognition in the wild is easily distorted by nonfrontal and asymmetric faces. In nonfrontal faces, some areas are compressed and distorted, and even after frontalization these compressed areas may remain blurred and distort expression recognition. Additionally, asymmetrical expressions are common on half or local face areas and produce incorrect expression features. Therefore, this paper presents a half-face frontalization and pyramid Fourier frequency conversion method. Although the location, range, and intensity of incorrect expressions in nonfrontal faces are unknown, it can be shown via the discrete Fourier transform that the frequency band of the correct expression is much larger than that of the incorrect expression on the same face. This is exploited by pyramid frequency conversion, which is designed based on Fourier frequency conversion: it adjusts the frequencies of incorrect expressions at multiple scales so that they fall outside the band-pass of the convolution operations of deep learning and are eliminated, whereas correct expression information is preserved. Thus, expressions can be recognized effectively.
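As a generic illustration of the kind of frequency-domain manipulation described above (not the paper's pyramid Fourier frequency conversion itself), a radial band filter on a face image can be written with NumPy as follows:

import numpy as np

def bandpass(img, low=0.0, high=0.25):
    # Keep spatial frequencies whose normalized radius lies in [low, high).
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    r = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    mask = (r >= low) & (r < high)
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))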
... Regularization is one of the key elements of deep learning, allowing models to generalize well to unseen data even when training on a limited training set or with an imperfect optimization procedure [8]. Some widely and successfully used regularization techniques are data augmentation, dropout, batch normalization, and weight decay, which are also common in expression recognition [11,22]. In addition to these methods, this paper proposes an occlusion-based regularization technique, which consistently improves performance in facial expression recognition and can be combined with any existing regularization technique and network architecture. ...
Chapter
In computer vision, occlusions are mainly known as a challenge to cope with. For instance, partial occlusions of the face may lower the performance of facial expression recognition systems. However, when incorporated into the training, occlusions can also be helpful in improving the overall performance. In this paper, we propose and evaluate occlusion augmentation as a simple but effective regularizing tool for improving the general performance of deep-learning-based facial expression and action unit recognition systems, even if no occlusion is present in the test data. In our experiments we consistently found significant performance improvements on three databases (Bosphorus, RAF-DB, and AffectNet) and three CNN architectures (Xception, MobileNet, and a custom model), suggesting that occlusion regularization works independently of the dataset and architecture. Based on our clear results, we strongly recommend integrating occlusion regularization into the training of all CNN-based facial expression recognition systems, because it promises performance gains at very low cost.
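A minimal sketch of occlusion augmentation in the spirit described above (random-erasing style; the occluder placement and appearance used in the chapter may differ):

import numpy as np

def random_occlusion(img, max_frac=0.4, rng=None):
    # Black out a random rectangle covering up to max_frac of each image side.
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    oh = int(rng.integers(1, max(2, int(h * max_frac))))
    ow = int(rng.integers(1, max(2, int(w * max_frac))))
    y = int(rng.integers(0, h - oh + 1))
    x = int(rng.integers(0, w - ow + 1))
    out = img.copy()
    out[y:y + oh, x:x + ow] = 0
    return out

Applied with some probability to each training image, this acts purely as a regularizer, since the test data need not contain any occlusions.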
... It can be seen that many efforts based on different data or signal sources have been made toward automatic pain assessment. Some more recent efforts approach the problem from the perspective of feature selection [223], developing personalized machine learning models (e.g., [217]), and exploring new behaviors or cues of pain [224,225]. Most of the studies were conducted within a single database and study design and therefore lack generalization of methods and results. ...
Thesis
Full-text available
Accurate pain assessment plays an important role in proper pain management, especially among hospitalized people experiencing acute pain. Pain is subjective in nature: it is not only a sensory feeling but can also involve affective factors. Therefore, self-report pain scales are the main assessment tools as long as patients are able to self-report. However, it remains a challenge to assess pain in patients who cannot self-report. In clinical practice, physiological parameters like heart rate and pain behaviors including facial expressions are observed as empirical references to infer pain objectively. The main aim of this study is to automate this process by leveraging machine learning methods and biosignal processing. To achieve this goal, biopotentials reflecting autonomic nervous system activity, including the electrocardiogram and galvanic skin response, as well as facial expressions measured with facial electromyograms, were recorded from healthy volunteers undergoing experimental pain stimuli. IoT-enabled biopotential acquisition systems were developed to build the database, aiming to provide compact and wearable solutions. Using the database, a biosignal processing flow was developed for continuous pain estimation. Signal features were extracted with customized time window lengths and updated every second. The extracted features were visualized and fed into multiple classifiers trained to estimate the presence of pain and pain intensity separately. Among the tested classifiers, the best sensitivity for estimating pain presence was 90% (specificity 84%) and the best pain intensity estimation accuracy was 62.5%. The results show the validity of the proposed processing flow, especially for pain presence estimation at window level. This study adds one more piece of evidence on the feasibility of developing an automatic pain assessment tool from biopotentials, thus providing the confidence to move forward to real pain cases. In addition to the method development, the similarities and differences between automatic pain assessment studies were compared and summarized. It was found that in addition to the diversity of signals, the estimation goals also differed as a result of different study designs, which made cross-dataset comparison challenging. We also discuss which parts of the classical processing flow limit or boost the prediction performance and whether optimization can bring a breakthrough from the system's perspective.
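The windowed feature extraction described in the abstract (features computed over a trailing window and updated every second) can be sketched as follows; window length, sampling rate, and the feature set are placeholders, not the thesis settings:

import numpy as np

def window_features(signal, fs=250, win_s=10, hop_s=1):
    # signal: 1-D biosignal array sampled at fs Hz.
    win, hop, feats = int(win_s * fs), int(hop_s * fs), []
    for start in range(0, len(signal) - win + 1, hop):
        seg = signal[start:start + win]
        feats.append([seg.mean(), seg.std(), seg.min(), seg.max()])
    return np.asarray(feats)  # one feature row per hop (here: per second)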
... see [11]. Additionally, we have shown that similarity alignment can improve landmark localization, which, e.g., may be used to gain further head pose invariance through advanced face frontalization [24]. ...
Conference Paper
Full-text available
Current face detection concentrates on detecting tiny faces and severely occluded faces. Face analysis methods, however, require good localization and would benefit greatly from rotation information. We propose to predict a face direction vector (FDV), which provides the face size and orientation and can be learned by a common object detection architecture better than the traditional bounding box. It provides a more consistent definition of face location and size. Using the FDV is promising for all subsequent face analysis methods. As an example, we show that facial landmark detection benefits greatly from pre-aligned faces.
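If the face direction vector is understood as a 2-D vector whose length encodes face size and whose angle encodes in-plane rotation (an assumption about its exact definition, made only for illustration), it could be decoded like this:

import numpy as np

def decode_fdv(fdv):
    # fdv: (dx, dy) in image coordinates, assumed to point from chin to forehead.
    dx, dy = fdv
    size = float(np.hypot(dx, dy))                   # face size from vector length
    angle = float(np.degrees(np.arctan2(dx, -dy)))   # in-plane rotation, 0 = upright
    return size, angle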
Article
Facial expression recognition (FER) plays a vital role in affective cognition. However, there are limitations when performing FER with single facial image data alone. Considering that extra data contains more information for modeling, facial action units (AUs) can be adopted as privileged information (PI) to assist the FER task. This paper integrates AU information into an end-to-end deep network to support FER training. The proposed privileged action unit network (PAU-Net) offers two ways of integrating AU information: from the input side (type I) and from the output side (type II). Type I of PAU-Net takes AUs as input to guide the facial image network learning, which provides an AU-based emotion recognition result for the image-based FER model. Type II of PAU-Net utilizes AUs as output labels for shallow layers of the network, which helps the model learn AU-related features and further assists advanced facial expression feature learning in subsequent layers. Note that the PI enhances the network during training and is not required during testing. Therefore, the network can still perform robustly with the original input data in practice. Experiments are based on the CK+, MMI, and Oulu-CASIA datasets. The experimental results demonstrate the effectiveness of the proposed PAU-Net in FER tasks.
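A rough sketch of the type II idea, i.e., supervising a shallow layer with AU labels as a training-only auxiliary loss (layer choice, head design, and loss weighting are assumptions, not the PAU-Net specification):

import torch.nn as nn

class AuxAUHead(nn.Module):
    def __init__(self, in_ch, num_aus=12):
        super().__init__()
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, num_aus))

    def forward(self, shallow_features):
        return self.head(shallow_features)  # AU logits for the auxiliary loss

# Training-only objective (illustrative):
# total_loss = ce(expr_logits, expr_gt) + lam * bce(aux_au_logits, au_gt)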
Article
Emotion analysis of students plays an important role in evaluating teaching effectiveness. To develop robust algorithms for emotion analysis of students, a database from real classrooms is required. However, most existing databases were collected from adults and constructed in laboratory settings. In this article, we present a manually annotated facial action unit database from juveniles in real classrooms. Our database has three main characteristics: (1) it provides numerous education-related action unit data from primary and high schools, filling a gap among publicly available educational action unit databases; (2) it contains 256,220 manually annotated facial images of 1,796 juveniles, annotated frame by frame with 12 action units and 6-level intensities for each action unit; (3) it covers many challenges in the wild, including various head poses, low facial resolution, illumination changes, and occlusions, supplementing in-the-wild action unit databases for research. Baselines for action unit detection and action unit intensity estimation are provided for future reference. In particular, we apply a weighted balance loss to address imbalances within and between labels. Our database will be available to the research community: http://www.dlc.sjtu.edu.cn/rfau .
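The weighted balance loss itself is defined in the article; as a generic illustration of per-AU weighting against label imbalance, positive-class weights for a binary cross-entropy loss can be derived from label frequencies like this:

import torch

def pos_weights(labels):
    # labels: (N, num_aus) binary tensor -> weight = #negatives / #positives per AU.
    labels = labels.float()
    pos = labels.sum(dim=0).clamp(min=1)
    neg = labels.shape[0] - pos
    return neg / pos

# criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weights(train_labels))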
Conference Paper
Full-text available
The Audio/Visual Emotion Challenge and Workshop (AVEC 2017) "Real-life depression, and affect" will be the seventh competition event aimed at comparison of multimedia processing and machine learning methods for automatic audiovisual depression and emotion analysis, with all participants competing under strictly the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the depression and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of the various approaches to depression and emotion recognition from real-life data. This paper presents the novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline system on the two proposed tasks: dimensional emotion recognition (time and value-continuous), and dimensional depression estimation (value-continuous).
Conference Paper
Full-text available
Automatic facial action unit intensity estimation can be useful for various applications in affective computing. In this paper, we apply random regression forests to this task and propose modifications that improve predictive performance compared to the original random forest. Further, we introduce a way to estimate and visualize the relevance of the features for an individual prediction and for the forest in general. We conduct experiments on the FERA 2017 challenge dataset (outperforming the FERA baseline results), show the performance gain obtained by the modifications, and illustrate feature relevance.
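For orientation, a plain random forest regressor for AU intensity on precomputed features looks as follows in scikit-learn; the paper's forest modifications and feature-relevance visualization are not reproduced here:

from sklearn.ensemble import RandomForestRegressor

def train_au_intensity(X_train, y_train):
    # X_train: (N, D) feature matrix; y_train: (N,) AU intensities in [0, 5].
    model = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                                  n_jobs=-1, random_state=0)
    return model.fit(X_train, y_train)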
Article
Full-text available
Despite recent advances in face recognition using deep learning, severe accuracy drops are observed for large pose variations in unconstrained environments. Learning pose-invariant features is one solution, but needs expensively labeled large-scale data and carefully designed feature learning algorithms. In this work, we focus on frontalizing faces in the wild under various head poses, including extreme profile views. We propose a novel deep 3D Morphable Model (3DMM) conditioned Face Frontalization Generative Adversarial Network (GAN), termed FF-GAN, to generate neutral head pose face images. Our framework differs from both traditional GANs and 3DMM-based modeling. Incorporating 3DMM into the GAN structure provides shape and appearance priors for fast convergence with less training data, while also supporting end-to-end training. The 3DMM-conditioned GAN employs not only the discriminator and generator loss but also a new masked symmetry loss to retain visual quality under occlusions, besides an identity loss to recover high-frequency information. Experiments on face recognition, landmark localization, and 3D reconstruction consistently show the advantage of our frontalization method on in-the-wild face datasets. Detailed results can be found at: http://cvlab.cse.msu.edu/project-face-frontalization.hmtl.
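One plausible reading of the masked symmetry loss mentioned above (an interpretation for illustration, not FF-GAN's released code) penalizes asymmetry of the frontalized face only where a visibility mask allows it:

import torch

def masked_symmetry_loss(frontalized, mask):
    # frontalized: (B, C, H, W); mask: (B, 1, H, W) with 1 = pixel is usable.
    flipped = torch.flip(frontalized, dims=[3])  # mirror along the width axis
    return (mask * (frontalized - flipped).abs()).mean()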
Article
Developing state-of-the-art image classification models often requires significant architecture engineering and tuning. In this paper, we attempt to reduce the amount of architecture engineering by using Neural Architecture Search to learn an architectural building block on a small dataset that can be transferred to a large dataset. This approach is similar to learning the structure of a recurrent cell within a recurrent network. In our experiments, we search for the best convolutional cell on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more of this cell. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves state-of-the-art accuracy of 82.3% top-1 and 96.0% top-5 on ImageNet, which is 0.8% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS. This cell can also be scaled down two orders of magnitude: a smaller network constructed from the best cell also achieves 74% top-1 accuracy, which is 3.1% better than the equivalently-sized, state-of-the-art models for mobile platforms.
Conference Paper
In this paper, a robust system for view-independent action unit intensity estimation is presented. Based on the theory of sparse coding, region-specific dictionaries are trained to approximate the characteristics of the individual action units. The system incorporates landmark detection, face alignment, and contrast normalization to handle a large variety of different scenes. Coupled with head pose estimation, an ensemble of large margin classifiers is used to detect the individual action units. The experimental validation shows that our system is robust against pose variations and able to outperform the challenge baseline by more than 35%.
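A sketch of learning a region-specific dictionary and sparse codes with scikit-learn (assuming patch-based features per facial region; this is not the cited system's implementation):

from sklearn.decomposition import DictionaryLearning

def region_dictionary(region_patches, n_atoms=64):
    # region_patches: (N, D) feature vectors extracted from one facial region.
    dl = DictionaryLearning(n_components=n_atoms,
                            transform_algorithm='lasso_lars', random_state=0)
    codes = dl.fit_transform(region_patches)  # sparse codes of the patches
    return dl, codes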