2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII)
Facial Action Unit Intensity Estimation and Feature Relevance Visualization
with Random Regression Forests
Philipp Werner Sebastian Handrich Ayoub Al-Hamadi
Institute for Information Technology and Communications, University of Magdeburg, Germany
Email: {Philipp.Werner, Ayoub.Al-Hamadi}@ovgu.de
Abstract—Automatic facial action unit intensity estimation can
be useful for various applications in affective computing. In this
paper, we apply random regression forests for this task and
propose modifications that improve predictive performance
compared to the original random forest. Further, we introduce
a way to estimate and visualize the relevance of the features
for an individual prediction and the forest in general. We conduct experiments on the FERA 2017 challenge dataset (outperforming the FERA baseline results), show the performance gain obtained by the modifications, and illustrate feature relevance.
1. Introduction
Facial expression reveals information on the affective state of an observed person, which can be used, e.g., in pain assessment [1], drowsy driver detection, marketing, or human-robot interfaces [2]. The Facial Action Coding System (FACS) [3] provides a systematic way to describe and analyze facial expressions. Based on the facial muscles, FACS defines a set of expression building blocks called action units (AU), each of which is either absent or present in one of five intensities.
Manual FACS coding requires a trained coder and is very
time-consuming. Thus, an automatic coding solution based
on computer vision is desirable.
This work presents an approach for facial action unit
intensity estimation that employs random regression forests.
Random forests [4] offer high predictive performance, a low number of parameters (i.e. they are easy to use), very fast prediction and training (which can be easily parallelized and is fast without a GPU), and an easy trade-off between accuracy and speed. Further, random forests can model arbitrary non-linear dependencies and provide implicit feature selection, which can be used to learn problem-specific features during training [5], [6], [7], [8] (similar to deep learning).
We also propose an approach to estimate and visualize the features' relevance for an individual prediction. It can help to understand the learned models and why they output a certain prediction. This not only supports identifying problems and improving the recognition systems, but also facilitates applications that do not allow for black-box solutions. For example, safety-critical or medical applications require understandable systems, and visualizing what influences a decision can improve trust in them. Further, relevance measures can be useful for feature selection.
Related Work: Research on facial action unit inten-
sity estimation has been stimulated by the release of
datasets with AU coding, such as Bosphorus [9], UNBC-
McMaster [10], DISFA [11], and BP4D [12], and by
the Facial Expression Recognition and Analysis challenges
(FERA) in 2015 [13] and 2017 [14]. Due to limited space,
we can only mention a few recent works here. Werner et
al. [15] discussed the inherent imbalance problem in AU
intensity estimation (the absence of an AU is much more
common than the presence, which usually leads to a bias
in favor of the absence class and poor performance on the
underrepresented intensities) and suggested handling it by altering the class distribution and training Support Vector Regression ensembles to predict AU intensities. Kaltwang
et al. [16] proposed to build a latent tree, a probabilistic
graphical model to jointly predict the intensities of multiple
AUs. Another graphical model, the context sensitive Condi-
tional Ordinal Random Field, was introduced and applied for
AU intensity estimation by Rudovic et al. [17]. Following
the current trend, Zhao et al. [18] and Benitez-Quiroz et
al. [19] applied deep learning to recognize facial action
units. Zhao et al. [18], [20] also visualized which facial regions are important for a trained model, but they neither estimate AU intensities nor use tree-based methods. In general, trees and especially random forests are rarely used in the facial expression domain. A few exceptions are works on the recognition of pain [1], [21], basic emotions [22], [23], and micro-expressions [24].
Contributions: To the best of our knowledge, this is the first work that applies random forests to AU intensity estimation. We propose three modifications of the original random
regression forest training and prediction [4] (Sec. 2): (1)
for training each individual tree, a sampling strategy is
applied that considers the high imbalance in the class distri-
bution of the training data, (2) tree depth is restricted, and
(3) continuous intensities are predicted by applying kernel
density estimation. Further, we propose a way to estimate
and visualize the relevance of the features for an individual
prediction and the regression forest in general. Experiments
with the challenging FERA 2017 database (Sec. 3) show
that all proposed modifications improve the intensity estima-
tion results significantly. We compare the feature relevance
measure with the random forest variable importance [4] and
visualize it for several examples.
P. Werner, S. Handrich, A. Al-Hamadi, "Facial Action Unit Intensity Estimation and Feature Relevance Visualization with Random Regression Forests", in
International Conference on Affective Computing and Intelligent Interaction (ACII), 2017.
This is the accepted manuscript. The final, published version is available on IEEE Xplore.
Figure 1. Random regression forest. The forest is a set of $T$ regression trees. Starting at the root node, each tree is traversed by repeatedly evaluating a weak learner at each split node (black circles) and branching either to the left or right child node. This is continued until a leaf node (green circle) is reached.
2. Regression Forest for Action Unit Intensity
Estimation
2.1. Original Random Forest for Regression
For readers not familiar with random forests, we briefly
describe the concept of classical regression trees and forests.
A regression forest is an ensemble of regression trees. As
shown in Fig. 1, a tree is a structure consisting of split and
leaf nodes. Each split node $n_s$ in a trained tree represents a weak learner, defined by the parameter $\theta = (\phi, \tau)$, where $\phi$ is the split attribute and $\tau$ is a scalar threshold value. Given a set $Q = \{(x_i, y_i)\}_N$ of $N$ $d$-dimensional feature vectors $x_i \in \mathbb{R}^d$ and their (continuous) labels $y_i \in \mathbb{R}$, a prediction for a particular $x_i$ can be made by starting at the root node and evaluating at each split node the function

$w(\theta, x_i) := f(\phi, x_i) \le \tau$.   (1)

If $w(\cdot)$ evaluates to true, the path through the tree branches to the left child node, or to the right otherwise. This is repeated until a leaf node $n_l(x_i)$ is reached. At each leaf node, an expected value $\bar{y}$ is stored. For regression, this is typically the mean of all samples $Q_{n_l} \subset Q$ that fell into that node during the training phase (Eq. 2). When using a forest, the same algorithm is applied to each tree, resulting in a set of leaf nodes $L(x_i) = \{n_l(x_i)\}$, and the prediction is typically computed as the mean of the individual predictions (Eq. 3).

$\bar{y}_{n_l} = \frac{1}{|Q_{n_l}|} \sum_{y_i \in Q_{n_l}} y_i$   (2)

$\bar{y} = \frac{1}{|L|} \sum_{n_l \in L} \bar{y}_{n_l}$   (3)
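To make the traversal and averaging concrete, the following minimal sketch (hypothetical data structures, not the implementation used in this work) routes a feature vector through one tree according to Eq. (1) and averages the tree outputs as in Eq. (3).

```python
import numpy as np

class Node:
    """Split node if `feature`/`threshold` are set; leaf node if `value` is set."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # split attribute phi (a feature index)
        self.threshold = threshold  # scalar threshold tau
        self.left = left            # child visited if f(phi, x) <= tau
        self.right = right          # child visited otherwise
        self.value = value          # leaf prediction, e.g. the mean label (Eq. 2)

def predict_tree(root, x):
    """Traverse one tree until a leaf is reached (weak learner of Eq. 1)."""
    node = root
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

def predict_forest(trees, x):
    """Forest prediction: mean of the individual tree predictions (Eq. 3)."""
    return np.mean([predict_tree(root, x) for root in trees])
```

In the modified forest described in Sec. 2.3, the scalar leaf value is replaced by a histogram over the discrete AU intensities.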
A single tree does not generalize well from the training
data and can easily overfit. Therefore, multiple trees are
trained, each learning a different subset of the training data.
There are two major approaches to train these forests: Tree
bagging and random forest [4]. In tree bagging, each tree is
grown on a bootstrapped sample with the same size as the
whole data set by sampling with replacement from the whole
training data. At each split node all attributes are considered
to find the best split attribute, i.e. ˆ
φ=φ. In contrast, for
random forests at each split node only a random subset of
split attributes ˆ
φ⊂φis considered to find the best split
attribute. This prevents the growth of highly correlated trees.
The original random forest as proposed by Breiman [4] also applies bagging.

Figure 2. Example for altering the class distribution with MIDRUS (class frequencies $n_i$ per intensity, damped frequencies $n_i^-$, and selected counts $n_i^\star$; shown for $\alpha = 0.5$).
During training, the goal is to find partitioning binary
trees that minimize the loss between predicted and ground
truth labels. For regression problems, this loss is typically
given by the mean squared error:
$L = \sum_{Q_{n_l} \in Q} \sum_{y_i \in Q_{n_l}} \| y_i - \bar{y}_{n_l} \|^2$   (4)

To minimize $L$, the samples are hierarchically partitioned into left and right partitions, denoted as

$Q_l(\hat{\phi}) = \{ f(\hat{\phi}, x_i) \le \tau \}$   (5)

$Q_r(\hat{\phi}) = Q \setminus Q_l(\hat{\phi})$,   (6)

by finding the split attribute $\hat{\phi}$ that minimizes the variance within the left and right partitions:

$\phi^* = \arg\min_{\hat{\phi} \subset \phi} \left( \sum_{y_i \in Q_l(\hat{\phi})} \| y_i - \bar{y} \|^2 + \sum_{y_i \in Q_r(\hat{\phi})} \| y_i - \bar{y} \|^2 \right)$   (7)
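For illustration, a brute-force search over the candidate split attributes and thresholds corresponding to Eq. (5)-(7) could look as follows; this is a sketch under the assumption that the features are stored in a NumPy matrix, not the authors' implementation.

```python
import numpy as np

def best_split(X, y, candidate_features):
    """Find the split (phi, tau) that minimizes the summed squared deviation
    within the resulting left and right partitions (Eq. 7)."""
    best_phi, best_tau, best_loss = None, None, np.inf
    for phi in candidate_features:                  # random subset of attributes
        for tau in np.unique(X[:, phi]):            # candidate thresholds
            left = y[X[:, phi] <= tau]              # labels of Q_l, Eq. (5)
            right = y[X[:, phi] > tau]              # labels of Q_r, Eq. (6)
            if len(left) == 0 or len(right) == 0:
                continue
            loss = (np.sum((left - left.mean()) ** 2) +
                    np.sum((right - right.mean()) ** 2))
            if loss < best_loss:
                best_phi, best_tau, best_loss = phi, tau, loss
    return best_phi, best_tau
```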
2.2. Tree Training Modifications
Facial action unit intensity estimation is an imbalanced
classification problem, i.e. an AU is much more often absent
than present in one of the 5 intensities. To handle the
problem, Werner et al. [15] proposed Multiclass Imbalance
Damping Random Under-Sampling (MIDRUS). We apply it to the random forest training, i.e. instead of training each tree on a bootstrapped sample (as in the original random forest), we sample without replacement and with an altered class distribution.
If $n_i$ is the absolute frequency of class $i$ in the dataset, then we calculate the number of samples $n_i^\star$ to select from class $i$ as follows:

$n_i^- = \lceil s \cdot (n_i)^{1-\alpha} \rceil$, with $s = \frac{n_{f(2)}}{(n_{f(2)})^{1-\alpha}}$,   (8)

$n_i^\star = \min\{ n_i, n_i^- \}$.   (9)
In (8), $\alpha \in [0, 1]$ is the imbalance damping parameter. It controls to which extent the imbalance is reduced, i.e. $\alpha = 1$ aims at total balancing of the classes, $\alpha = 0$ keeps the imbalance, and an $\alpha$ in between reduces it to a certain degree. With $\alpha > 0$, the term $(n_i)^{1-\alpha}$ calculates new and more balanced class ratios. Next, these are scaled by a common factor $s$, which ensures that all samples of the second most frequent class will be selected ($f(k)$ is a sorting function returning the $k$-th most frequent class). Fig. 2 illustrates the imbalance damping with an exemplary class distribution and $\alpha = 0.5$. Eq. (8) and (9) are taken from [15] and simplified with $\beta = 1$ and $k = 2$.
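As an illustration of Eq. (8) and (9), a short sketch (assuming integer intensity labels 0 to 5; a hypothetical helper, not the original code) that returns the number of samples to draw per class:

```python
import numpy as np

def midrus_counts(labels, alpha=0.5):
    """Per-class sample counts n_i* according to Eq. (8) and (9)."""
    classes, n = np.unique(labels, return_counts=True)      # absolute frequencies n_i
    n_f2 = np.sort(n)[::-1][1]                               # 2nd most frequent class
    s = n_f2 / n_f2 ** (1.0 - alpha)                         # scaling factor s
    n_damped = np.ceil(s * n ** (1.0 - alpha)).astype(int)   # n_i^-, Eq. (8)
    n_star = np.minimum(n, n_damped)                         # n_i*, Eq. (9)
    return dict(zip(classes, n_star))
```

Each tree is then grown on a sample drawn without replacement, taking the returned number of samples from each class.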
In addition to the MIDRUS sampling, we also propose to limit the depth of the trees in order to (1) reduce training time and model size, (2) improve the estimate of the leaf node distributions (described in the next section), and (3) avoid overfitting. The original random forest algorithm as proposed by Breiman [4] grows the trees to full size without pruning and does not restrict the tree depth.
2.3. Leaf Nodes and Prediction
Each leaf node in a regression tree stores a prediction model. In the classical approach, this is the mean of all samples that fell into this node during the training phase (Eq. 2). Since training the tree structure is independent of the prediction model, however, we are free to use another model. As described in Sec. 2.2, we limit the size of each tree during the training phase. We do this partly for computational reasons, but mainly in order to train trees that generalize well. This, however, means that samples with different AU intensities fall into the same leaf node $n_l$. Using only the mean value of all samples in $Q_{n_l}$ results in a loss of information that could otherwise have been used when the individual predictions are combined. Instead, we follow the suggestion made in [7] and store at each leaf node the probability distribution of the action unit (AU) intensities. This allows us to model the uncertainty in the prediction. Since the training data contains only discrete intensities (0...5), the probability distribution at leaf node $n_l$ containing the samples $(x_i, y_i) \in Q_{n_l}$ can be described as a normalized histogram (Eq. 10):

$p_{n_l}(y = j \mid x) = \frac{|\{ (x_i, y_i) \in Q_{n_l} \mid y_i = j \}|}{|Q_{n_l}|}$, with $j = 0, 1, \ldots, 5$   (10)
Fig. 3 illustrates how these probability distributions are combined into an overall prediction. We first calculate the total probability distribution of a sample by averaging the distributions in all predicted leaf nodes (Eq. 11):

$\bar{p}(y \mid x) = \frac{1}{|L|} \sum_{n_l \in L} p_{n_l}(y \mid x)$   (11)

The final prediction for the action unit intensity could now be obtained as $y^* = \arg\max \bar{p}(y \mid x)$ or by computing the mean of the distribution $\bar{p}(y \mid x)$. Experimental evaluation, however, showed that a better estimate can be obtained using kernel density estimation [25] as follows:

$f_b(y \mid x) = \sum_{j=0}^{5} \bar{p}(y = j \mid x) \, K_b(y - j)$,   (12)

where $K_b(\cdot)$ is the scaled kernel (we used the Gaussian kernel) and $b$ is the kernel bandwidth. We thus transfer the discrete histogram into a continuous probability density function (Fig. 3), allowing for a more precise intensity estimation. The final predicted intensity of the action unit is then obtained as the value that maximizes $f_b(y \mid x)$:

$y^* = \arg\max_y f_b(y \mid x)$   (13)

This approach transfers the discrete AU labels into a continuous output scale.

Figure 3. Continuous AU intensity estimation. Each leaf node in a trained regression tree stores a distribution of the discrete action unit intensities of the training samples (not shown). The predictions of all trees in the regression forest are combined by computing the mean distribution $\bar{p}(y = j \mid x)$ (blue bars). Kernel density estimation with kernel $K_b$ is applied to transfer the discrete labels into a continuous probability density function $f_b$ (red graph). The predicted AU intensity is obtained as the argument that maximizes $f_b$ (dotted line).
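A compact sketch of the prediction step of Eq. (11)-(13), assuming each tree returns a 6-bin leaf histogram; the dense evaluation grid is an implementation choice for the argmax and not prescribed by the method description.

```python
import numpy as np

def predict_intensity(leaf_histograms, bandwidth=1.5):
    """leaf_histograms: one length-6 array p_nl(y=j|x) per tree in the forest."""
    p_bar = np.mean(leaf_histograms, axis=0)        # mean distribution, Eq. (11)
    grid = np.linspace(0.0, 5.0, 501)               # continuous intensity scale
    j = np.arange(6)
    # Gaussian kernel K_b evaluated at (y - j) for all grid points y, Eq. (12)
    kernel = np.exp(-0.5 * ((grid[:, None] - j[None, :]) / bandwidth) ** 2)
    kernel /= bandwidth * np.sqrt(2.0 * np.pi)
    f_b = kernel @ p_bar                            # f_b(y|x)
    return grid[np.argmax(f_b)]                     # y*, Eq. (13)
```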
2.4. Feature Relevance Estimate and Visualization
We estimate the features' relevance for each individual prediction by counting feature queries. That is, when traversing the trees to predict the AU intensity of a specific sample, we count how often each feature occurs as a split criterion. This builds on the assumption that relevant features are selected as a split criterion more often than irrelevant features, which is reasonable if the forest has a sufficient number of trees. In contrast to the random forest variable importance [4] (which is calculated during training), this measure can be calculated for each individual test sample.
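In terms of the traversal sketched in Sec. 2.1, counting feature queries only requires incrementing a counter for every split attribute visited on the sample's path (again a hypothetical sketch that reuses the Node structure from above):

```python
import numpy as np

def feature_query_counts(trees, x, n_features):
    """Count how often each feature is queried while predicting sample x."""
    counts = np.zeros(n_features)
    for root in trees:
        node = root
        while node.value is None:           # descend until a leaf is reached
            counts[node.feature] += 1       # feature used as split criterion
            node = node.left if x[node.feature] <= node.threshold else node.right
    return counts
```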
To visualize the relevance for an individual sample, we convert the respective image that was used for feature extraction to gray-scale. Using the HSV color space, we create a color overlay in which the saturation represents the relevance, i.e. the more colorful a region is, the more relevant it is according to the feature query measure. For this purpose, we normalize the feature query counts by dividing by the maximum of the values and project them back to the image regions from which the features were extracted.
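One possible realization of the overlay with OpenCV is sketched below; it assumes the query counts have already been projected back to a per-pixel relevance map, and the exact color mapping is our assumption rather than a detail given in the text.

```python
import cv2
import numpy as np

def relevance_overlay(image_bgr, relevance_map, hue=0):
    """Gray-scale image with color saturation encoding feature relevance.
    relevance_map: non-negative float array with the same height/width as the image."""
    rel = relevance_map / (relevance_map.max() + 1e-8)   # normalize to [0, 1]
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    hsv = np.zeros(image_bgr.shape, dtype=np.uint8)
    hsv[..., 0] = hue                                    # constant hue (0 = red)
    hsv[..., 1] = (rel * 255).astype(np.uint8)           # saturation = relevance
    hsv[..., 2] = gray                                   # value = gray-scale image
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```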
3. Experiments
In this section, we evaluate the effect of our random forest modifications on the AU intensity estimation performance (Sec. 3.1), show detailed results on the FERA 2017 challenge benchmark (Sec. 3.2), compare the proposed feature relevance measure with the random forest variable importance (Sec. 3.3), and visualize and discuss the feature relevance for some exemplary samples (Sec. 3.4).

Figure 4. Influence of Random Forest (RF) modifications and their parameters on predictive performance (mean ICC across 7 AUs): (a) varying tree sample selection (imbalance damping $\alpha$), (b) varying tree depth limit $D$, (c) varying kernel bandwidth $b$; each panel compares the original RF with combinations of the MIDRUS, DEPTH, and KDE modifications. Trained and tested on the frontal subsets (view 6) of the BP4D-FERA17 dataset.
Dataset: We conduct our experiments on the FG 2017
Facial Expression Recognition and Analysis challenge [14]
dataset (BP4D-FERA17). The dataset provides a training
and validation set, with 41 and 20 different participants,
respectively. Each participant was stimulated in several tasks, which were designed to elicit different natural facial expressions. Videos are available from 9 viewing angles, in total 2,952 training and 1,431 validation videos with about 1,322k and 680k frames, respectively. Intensities of 7 Action Units (AUs) are manually labeled for each frame.
Features: Features are extracted fully automatically as follows: (1) We detect the faces with dlib [26] and localize 68 facial landmarks, but only use the inner 51, i.e. we exclude the chin line. To support a wide range of poses, we retrained the landmark localizer that comes with dlib [26] (an ensemble of regression trees [5]) with more datasets: Multi-PIE [27], afw [28], helen [29], ibug, and 300-W [30]. The resulting model performed significantly better than the original dlib model [26]. (2) The facial landmarks are registered with a mean face by applying an affine transform and minimizing the mean squared error. We register the texture with the same affine transform into a 180 x 200 pixel image with a between-eye distance of about 100 pixels. (3) As features we use the aligned landmark coordinates (102 dim.) and the concatenated uniform local binary pattern histograms [31] ($\mathrm{LBP}^{u2}_{8,1}$; 5,900 dim.) extracted from the patches of a regular 10 x 10 grid.
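To give an impression of step (3), the sketch below computes the concatenated $\mathrm{LBP}^{u2}_{8,1}$ grid histograms with scikit-image (its 'nri_uniform' mode yields the 59-bin uniform histogram); it assumes face detection, landmark localization, and the affine registration have already been performed, and it is an illustrative reimplementation rather than the pipeline used for the experiments.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_grid_features(aligned_face, grid=(10, 10), P=8, R=1, n_bins=59):
    """Concatenated uniform LBP_{8,1} histograms from a regular grid of patches.
    aligned_face: registered gray-scale face image (e.g. 200 x 180 pixels)."""
    lbp = local_binary_pattern(aligned_face, P, R, method='nri_uniform')
    h, w = aligned_face.shape
    ph, pw = h // grid[0], w // grid[1]
    hists = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            patch = lbp[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins), density=True)
            hists.append(hist)
    return np.concatenate(hists)          # 10 * 10 * 59 = 5,900 dimensions

def feature_vector(aligned_landmarks, aligned_face):
    """51 inner landmarks (x, y) -> 102 dims, plus the 5,900 LBP dims."""
    return np.concatenate([np.asarray(aligned_landmarks).reshape(-1),
                           lbp_grid_features(aligned_face)])
```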
Training and Testing: We train 7 forests, one for each coded AU. Training is done on the BP4D-FERA17 training set and testing on the BP4D-FERA17 validation set. To keep the training time manageable, we restrict the training and testing set to the frontal videos (view 6) in Sec. 3.1 and 3.3, which reduces the sample counts to one ninth. In Sec. 3.2 and 3.4 we use all views, but only train with every second frame (which halves the training set, but does not lose much information due to the temporal correlation in the videos). We compare the original Random Forest [4] (RF) and apply our proposed modifications: sampling without replacement with altered class distributions as described in Sec. 2.2 (MIDRUS), limiting the depth of the trees (DEPTH), and applying kernel density estimation for prediction as described in Sec. 2.3 (KDE).

Unless specified otherwise, the training parameters were: $K = 10$ trees, 2k split candidate features, $\alpha = 0.5$ for MIDRUS, a maximum tree depth of $D = 10$ for DEPTH, and bandwidth $b = 1.5$ for KDE.
Performance measure: We report the Intraclass Correlation Coefficient (ICC), which is the primary performance measure in the FERA 2017 challenge [14] and ranges from 0 (worst) to 1 (best). Following the recommendations of [15], we round the continuous prediction scale to integers (the granularity of the ground truth), i.e. we calculate ICC(3,1)_d.
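For reference, a minimal sketch of ICC(3,1) between the rounded predictions and the ground truth (two-way mixed model, single measures, consistency, following Shrout and Fleiss); this is our own illustration and not necessarily the challenge's scoring script.

```python
import numpy as np

def icc_3_1(ground_truth, prediction):
    """ICC(3,1) between ground truth and predictions rounded to integer intensities."""
    Y = np.column_stack([ground_truth, np.rint(prediction)])  # n samples x k=2 "raters"
    n, k = Y.shape
    grand = Y.mean()
    ss_total = ((Y - grand) ** 2).sum()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()   # between-target variation
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()   # between-rater variation
    bms = ss_rows / (n - 1)                               # between-target mean square
    ems = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)
```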
3.1. Random Forest Modifications
Fig. 4 shows the intensity estimation performance we obtain when varying some aspects of our proposed random forest modifications. In Fig. 4a we compare the original random forest (RF) [4] with some modifications while varying the imbalance damping parameter $\alpha$ of the MIDRUS sampling (see Sec. 2.2). RF+MIDRUS improves results compared to RF (due to sampling without replacement), but does not show any stable benefit from altering the class distribution ($\alpha > 0$). RF+MIDRUS+DEPTH, which also includes the tree depth restriction, improves performance in most cases and tends to be best with a moderate degree of damping ($0.2 \le \alpha \le 0.6$). If we also apply KDE, the performance is boosted significantly and the benefit of MIDRUS damping ($\alpha > 0$) becomes more apparent and stable. Fig. 4b shows the effect of limiting the tree depth. Maximum performance is reached for a maximum tree depth of 10 or 12. With deeper trees the forest overfits to the training data, because many training samples are strongly correlated (due to their extraction from video). In contrast, too shallow trees are not sufficient for the problem, i.e. the model capacity is too low. Fig. 4c illustrates the benefit of the kernel density estimation (KDE). If a (reasonable) minimum bandwidth is exceeded, RF+KDE outperforms RF, and RF+MIDRUS+DEPTH+KDE outperforms the variant without KDE.
3.2. AU Intensity Estimation Results (All Views)
To compare the RF and our modified RF with the FERA
baseline results [14], we train and test the models on the data
of all views (not only frontal). As this makes the recognition
problem more difficult, we increased the maximum tree
depth Dfrom 10 to 13. The results are shown in Fig. 5.
For all AUs, RF with modifications performs best, followed
by RF, which also outperforms the baseline results [14].
Results per view are provided in the supplemental material
(including additional performance measures).
3.3. Feature Query Frequency as Variable Relevance Measure
To show that the feature query frequency (see Sec. 2.4) is a valid variable relevance measure, we compare it with Breiman's variable importance [4], which is widely accepted. For this purpose, we train 100 trees to obtain more stable results for both measures. We calculate the mean feature query frequency over all test samples as well as the variable importance for each AU. Table 1 reports the Pearson correlation coefficients (PCC) of the two measures. The correlations reach high statistical significance for all AUs, which suggests that both measures give similar results.

TABLE 1. Correlation between mean feature query frequency and variable importance [4].

       AU 1    AU 4    AU 6    AU 10   AU 12   AU 14   AU 17
PCC    0.299*  0.253*  0.281*  0.262*  0.272*  0.185*  0.198*

PCC: Pearson Correlation Coefficient; * p < 0.001
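The comparison itself boils down to one correlation per AU; with SciPy it could be computed as sketched below (the two input vectors, one value per feature, are hypothetical names):

```python
import numpy as np
from scipy.stats import pearsonr

def relevance_agreement(query_freq, var_importance):
    """Pearson correlation (and p-value) between mean feature query frequency
    and Breiman's variable importance, both given per feature."""
    return pearsonr(np.asarray(query_freq), np.asarray(var_importance))

# Hypothetical example with random vectors of the feature dimensionality (102 + 5,900):
r, p = relevance_agreement(np.random.rand(6002), np.random.rand(6002))
```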
3.4. Visualizing Feature Relevance for Individual
Predictions
Fig. 6 visualizes the feature relevance calculated from
query frequency (see Sec. 2.4) for some exemplary frames
and AUs. The first row depicts examples of AU 12 (mouth corner puller) and AU 6 (cheek raiser). Mouth landmarks are highly relevant for all images, which is reasonable for AU 12. For AU 6, which does not affect the mouth region, this shows that the model exploits the high correlation of AU 12 and AU 6 in the data (PCC = 0.69 in the training data). The areas below the eyes are also considered (see columns 5 and 6), but features related to AU 12 seem to be better predictors for AU 6.
In the second row we show challenging examples of the poorly performing AU 1 (inner brow raiser) and AU 4 (brow lowerer). The relevant features concentrate on the upper face, which is correct. In columns 4, 5, and 7 some wrinkles were correctly marked as relevant, but the predictions are erroneous. Reasons may be the poor landmark localization quality (see the eyebrows in columns 1, 6, and 7) or that the LBP features drop too much information. As can be seen in columns 1 and 6, the homogeneous background regions are also marked as relevant. This indicates that the classifier learns to distinguish the views/head poses based on the background. This is an issue of the database, which probably leads to poor cross-database performance when trained models are applied to in-the-wild databases with non-homogeneous backgrounds.
4. Conclusion
In this paper we applied and improved random regression forests for facial action unit intensity estimation. However, quantitative results and visualizations suggest that the precomputed features are not sufficient to properly estimate the intensity of some AUs. Future work should exploit the potential of random forests to learn problem-specific feature representations, which has led to big improvements in other computer vision tasks [5], [6], [7], [8].
Acknowledgments
Funded by German Research Foundation proj. AL 638/3-2.
References
[1] P. Werner, A. Al-Hamadi, K. Limbrecht-Ecklundt, S. Walter, S. Gruss,
and H. Traue, “Automatic Pain Assessment with Facial Activity
Descriptors,” IEEE Trans. on Affective Computing, no. 99, 2016.
[2] F. De la Torre and J. F. Cohn, “Facial Expression Analysis,” in Visual Analysis of Humans, T. B. Moeslund, A. Hilton, V. Krüger, and L. Sigal, Eds. Springer London, Jan. 2011, pp. 377–409.
[3] P. Ekman, W. Friesen, and J. Hager, Facial Action Coding System:
The Manual on CD ROM. A Human Face, 2002.
[4] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp.
5–32, 2001.
[5] V. Kazemi and J. Sullivan, “One millisecond face alignment with
an ensemble of regression trees,” in Computer Vision and Pattern
Recognition (CVPR), 2014, pp. 1867–1874.
Figure 6. Visualization of feature relevance by color saturation (columns 1-7). Red arrow-like structures centered at blue circles illustrate the relevance of the landmark's $x$ and $y$ components. Other red structures illustrate the relevance of Local Binary Pattern features. Best viewed in color.
[6] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. V. Gool, “Random Forests for Real Time 3D Face Analysis,” International Journal of Computer Vision, vol. 101, no. 3, pp. 437–458, 2013.
[7] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio,
A. Blake, M. Cook, and R. Moore, “Real-time human pose recogni-
tion in parts from single depth images,” Communications of the ACM,
vol. 56, no. 1, pp. 116–124, 2013.
[8] S. Handrich and A. Al-Hamadi, “Localizing body joints from single
depth images using geodetic distances and random tree walk,” in Int.
Conf. on Image Processing (ICIP), 2017, accepted.
[9] A. Savran, N. Alyüz, H. Dibeklioğlu, O. Çeliktutan, B. Gökberk, B. Sankur, and L. Akarun, “Bosphorus Database for 3D Face Analysis,” in Biometrics and Identity Management, 2008, no. 5372, pp. 47–56.
[10] P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, and I. Matthews,
“Painful Data: The UNBC-McMaster Shoulder Pain Expression
Archive Database,” in Automatic Face & Gesture Recognition and
Workshops (FG), 2011, pp. 57–64.
[11] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, “DISFA: A spontaneous facial action intensity database,” IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151–160, 2013.
[12] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard, “BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database,” Image and Vision Computing, vol. 32, no. 10, pp. 692–706, 2014.
[13] M. Valstar, J. Girard, T. Almaev, G. McKeown, M. Mehu, L. Yin, M. Pantic, and J. Cohn, “FERA 2015 - Second Facial Expression Recognition and Analysis Challenge,” IEEE International Conference on Automatic Face and Gesture Recognition, 2015.
[14] M. F. Valstar, E. Sánchez-Lozano, J. F. Cohn, L. A. Jeni, J. M. Girard, Z. Zhang, L. Yin, and M. Pantic, “FERA 2017 - Addressing Head Pose in the Third Facial Expression Recognition and Analysis Challenge,” arXiv:1702.04174 [cs], 2017.
[15] P. Werner, F. Saxen, and A. Al-Hamadi, “Handling Data Imbalance
in Automatic Facial Action Intensity Estimation,” in British Machine
Vision Conference (BMVC), 2015, pp. 124.1–124.12.
[16] S. Kaltwang, S. Todorovic, and M. Pantic, “Latent Trees for Es-
timating Intensity of Facial Action Units,” in IEEE Conference on
Computer Vision and Pattern Recognition, 2015.
[17] O. Rudovic, V. Pavlovic, and M. Pantic, “Context-Sensitive Dynamic
Ordinal Regression for Intensity Estimation of Facial Action Units,”
IEEE Trans. Pattern A. & Machine Int., vol. 37/5, pp. 944–958, 2015.
[18] K. Zhao, W.-S. Chu, and H. Zhang, “Deep Region and Multi-Label
Learning for Facial Action Unit Detection,” in Computer Vision and
Pattern Recognition, 2016, pp. 3391–3399.
[19] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, “EmotioNet:
An Accurate, Real-Time Algorithm for the Automatic Annotation of
a Million Facial Expressions in the Wild,” in Computer Vision and
Pattern Recognition, 2016, pp. 5562–5570.
[20] K. Zhao, W.-S. Chu, F. De la Torre, J. F. Cohn, and H. Zhang, “Joint
Patch and Multi-Label Learning for Facial Action Unit Detection,”
in Computer Vision and Pattern Recognition, 2015, pp. 2207–2216.
[21] M. Kächele, M. Amirian, P. Thiam, P. Werner, S. Walter, G. Palm, and F. Schwenker, “Adaptive confidence learning for the personalization of pain intensity estimation systems,” Evolving Systems, vol. 8, no. 1, pp. 71–83, 2017.
[22] M. K. A. E. Meguid and M. D. Levine, “Fully automated recognition
of spontaneous facial expressions in videos using random forest
classifiers,” IEEE Transactions on Affective Computing, vol. 5, no. 2,
pp. 141–154, Apr. 2014.
[23] A. Dapogny, K. Bailly, and S. Dubuisson, “Pairwise conditional
random forests for facial expression recognition,” in IEEE Int. Conf.
on Computer Vision, 2015, pp. 3783–3791.
[24] T. Pfister, X. Li, G. Zhao, and M. Pietikäinen, “Recognising spontaneous facial micro-expressions,” in 2011 International Conference on Computer Vision, Nov. 2011, pp. 1449–1456.
[25] E. Parzen, “On estimation of a probability density function and
mode,” Ann. Math. Statist., vol. 33, no. 3, pp. 1065–1076, 1962.
[26] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Ma-
chine Learning Research, vol. 10, pp. 1755–1758, 2009.
[27] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-PIE,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, May 2010.
[28] X. Zhu and D. Ramanan, “Face detection, pose estimation and
landmark localization in the wild,” in CVPR, 2012.
[29] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, “Interactive
facial feature localization,” in ECCV, 2012, pp. 679–692.
[30] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and
M. Pantic, “300 faces in-the-wild challenge: database and results,”
Image and Vision Computing, vol. 47, pp. 3 – 18, 2016.
[31] T. Ahonen, A. Hadid, and M. Pietikäinen, “Face description with local binary patterns: Application to face recognition,” IEEE Trans. Pattern Analysis & Machine Intelligence (PAMI), vol. 28, no. 12, pp. 2037–2041, 2006.