Cross-Database Facial Expression Recognition Based on Fine-Tuned Deep Convolutional Network



Marcus Vinicius Zavarez, Rodrigo F. Berriel and Thiago Oliveira-Santos
Universidade Federal do Espirito Santo, Brazil
Abstract—Facial expression recognition is a very important
research field to understand human emotions. Many facial ex-
pression recognition systems have been proposed in the literature
over the years. Some of these methods use neural network
approaches with deep architectures to address the problem.
Although it seems that the facial expression recognition problem
has been solved, there is a large difference between the results
achieved using the same database to train and test the network
and the cross-database protocol. In this paper, we extensively
investigate the influence of fine-tuning on performance under the
cross-database protocol. To perform the study, the VGG-Face
Deep Convolutional Network model (pre-trained for face
recognition) was fine-tuned to recognize facial expressions con-
sidering different well-established databases in the literature:
CK+, JAFFE, MMI, RaFD, KDEF, BU3DFE, and AR Face. The
cross-database experiments were organized so that one of the
databases was held out as the test set and the others were used
for training, and each experiment was run multiple times to
ensure the robustness of the results. Our results show a significant
improvement from the use of pre-trained models over randomly
initialized Convolutional
Neural Networks on the facial expression recognition problem,
for example achieving 88.58%, 67.03%, 85.97%, and 72.55%
average accuracy testing in the CK+, MMI, RaFD, and KDEF,
respectively. Additionally, in absolute terms, the results show an
improvement in the literature for cross-database facial expression
recognition with the use of pre-trained models.
Facial expression recognition is a very important research
field to understand human emotions. The human brain can
recognize facial expressions only by the face characteristics.
Although recognition of facial expressions seems to be a
simple task for humans, it is quite difficult to be performed
by computers.
In the facial expression recognition (FER) problem, there
are six basic universal [2] expressions that are recognized in
several different cultures and are widely used in the literature:
fear, sad, angry, disgust, surprise and happy. Some works in
the literature also take the neutral expression in consideration,
which sums up to the seven expressions also widely used in
the literature. Many facial expression recognition systems have
been proposed in the literature over the years [3]–[9]. Although
it seems that the facial expression recognition problem has
been solved, there is a large difference between the results
achieved using the intra and the cross-database protocols. In
the intra-database protocol (i.e., training in one database and
testing in a subject-independent set of the same database),
the current methods already achieve high accuracies, reaching
around 95% [3], [4], [6]. On the other hand, methods evaluated
using the cross-database protocol (i.e., training in one or
more databases and evaluating in different databases) do not
report high accuracies, ranging between 40% and 66% [3]–[9].
In this context, the evaluation of a method using the intra-
database protocol seems to provide limited insight into the
generalization capability of such method.
Some of the methods employed nowadays for facial expres-
sion recognition [6], [8], [10] use neural network approaches
with deep architectures to address the problem. One specific
type of deep network is the Convolutional Neural Networks
(CNNs) proposed by [11]. A usual constraint of CNNs is the
need for large amounts of data to ensure the convergence of their
training algorithms in order to achieve good accuracies. However,
not all applications have the necessary amount of data to
train these models, either due to costs of data acquisition or
other application constraints. To deal with this limitation, some
works apply fine-tuning techniques to transfer learning from
one problem to the other. In these cases, instead of randomly
initializing the weights of a CNN, these procedures use the
weights of a CNN that has been previously trained with an
extensive set of images to speed up the convergence of the
training algorithm. Some methods [12]–[14] employed fine-
tuning techniques for the facial expression recognition problem
with intra-database protocol.
Nevertheless, facial expression recognition methods (even
those using deep neural networks) struggle to achieve high
accuracies when evaluated using the cross-database protocol.
Moreover, most of the related works (especially those using
cross-database protocol) do not perform an extensive exper-
imentation (usually training in a single database and testing
in another one, i.e., using few databases). This also limits the
evaluation of the generalization of these methods. In addition,
there are many publicly available databases in the literature,
but there is no consensus for the cross-database protocol
evaluation. Some of these works use databases that are not
freely available. Besides that, it is even hard to ensure that the
very same database is being used in the same manner. As some
databases are converted from video files, different authors end
up using different techniques to extract the images of interest,
which may mislead the comparisons. Given these facts, there
is still a need for more extensive investigation using the cross-
database protocol. Even though the environment is controlled
within each database (frontal face images, same ethnicity subjects,
similar light conditions, no occurrence of occlusion, among
others), it is not controlled across databases.
2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or
reuse of any copyrighted component of this work in other works.
In this paper, we propose an extensive experimentation
to evaluate the performance of a fine-tuned deep neural
network in the facial expression recognition problem us-
ing the cross-database protocol. To perform this study, we
fine-tuned the VGG network pre-trained in a face recogni-
tion database (VGG-Face, as originally proposed) to recog-
nize facial expressions considering different well-established
databases in the literature: AR Face Database [15], Extended
Cohn-Kanade Database (CK+) [16], Binghamton University
3D Facial Expression (BU3DFE) [17], The Japanese Female
Facial Expression (JAFFE) [18], MMI [19], Radboud Faces
Database (RaFD) [20] and Karolinska Directed Emotional
Faces (KDEF) [21] databases. These datasets altogether com-
prise more than 6,200 images from subjects of different
ethnicities, genders, and ages in a variety of environments
and in both spontaneous and posed facial expressions. The
experiments were organized so that one of the databases
was separated as test set and the others as training sets
(i.e. leave-one-out). In addition, the experiments were run
multiple times to account for the randomness of the proposed
method and to allow for a more robust analysis of the results.
Indeed, results showed variation among different seeds (e.g.
the accuracy of the 10 runs of the VGG-Random with MMI
varied from 46.67% to 59.74%), which indicates the need for
multiple runs when evaluating Deep Neural Networks. When
comparing fine-tuned models with randomly initialized ones,
results showed that, in general, fine-tuned models perform
better. However, there are some cases where randomly initial-
ized models performed better (e.g. evaluating in the JAFFE
database) than fine-tuned ones. Finally, in absolute terms, our
models achieved state-of-the-art results for most of the databases:
88.58%, 67.03%, 85.97%, and 72.55% of average accuracy
(CK+, MMI, RaFD, and KDEF, respectively). The only case in which
our models did not outperform the literature was the JAFFE
database (44.32%), which is a highly biased database in terms
of gender and ethnicity, and also yields the worst results in
the literature.
There are several methods proposed to address the facial
expression recognition problem in the literature. Many of them focus
their evaluation within the same database and therefore can only
prove their effectiveness within the same conditions of the
training database (e.g. [6], [22], [23]). Among these works,
there are still some that do not ensure a separation of subjects
in the training and test sets, and therefore they cannot even
ensure effectiveness within the same database. Instead, they
give a false impression of high accuracy (e.g. [24]–[27]). In
this section, focus is given to related works performing cross-
database evaluations. It is important to note that some of the
works in the literature exclude the neutral expression from the
problem, but these are not the focus of this study.
Shan et al. [3] proposed a support vector machine (SVM)
combined with other machine learning methods, and evaluated
their method using both intra and cross-database protocols.
The authors used CK+, MMI, and JAFFE databases in a 10-
fold cross-validation using the intra-database protocol. Their
method achieved an average of 91.4%, 86.9%, and 81.0%
of accuracy, respectively. Moreover, they performed cross-
database experiments training with CK+ database and testing
in the MMI and JAFFE databases. Their method achieved an
accuracy of 51.1% and 41.3%, respectively. Zhang et al. [4]
use a multiclass support vector machine (SVM) based on a
multiple kernel learning (MKL) to perform facial expression
recognition and they evaluated their system with the CK+
and MMI databases. Using the intra-database protocol, they
achieved 93.6% and 92.8% of accuracy, respectively. For
cross-database experiments, when training with MMI database
and testing in the CK+ database, their system achieved 61.2%
of accuracy. In addition, their method achieved 66.9% of
accuracy when training with CK+ database and testing in the
MMI. This indicates that intra-database experiments are easier
than cross-database, and they cannot be used to generalize the
performance of a method.
Lopes et al. [6] proposed a method that uses a combination
of image pre-processing and Convolutional Neural Networks.
These pre-processing steps include spatial normalization, im-
age cropping, and intensity normalization, and help to extract
specific features for expression recognition. The authors also
employed a data augmentation procedure that included the
generation of synthetic samples to cope with the lack of data.
Cross-database experiments were also performed, but they
were not the focus of their work. They trained their model
only with CK+ and evaluated on the JAFFE and BU3DFE
databases, achieving an accuracy of 37.36% and 42.25%,
respectively. A Boosted Deep Belief Network (BDBN) was
proposed by [8]. Their method proposed a composition of
weak binary classifiers having each of them responsible for
classifying one expression. The BDBN combines feature
learning, feature selection, and classifier construction in a
unified framework. A cross-database approach was used to
evaluate the generalization of their method, i.e. their BDBN
was trained with the CK+ database and tested in the JAFFE
database, achieving a performance rate of 68.00%. However,
this performance rate is not directly comparable with the other
methods in the literature, because they used a composition of
binary accuracies to derive this metric. Mayer et al. [23] use
a support vector machine to classify the expressions and also
adopt the cross-database protocol in their evaluation. Their experiments
comprise one-versus-one comparisons of the CK, MMI, and
FEEDTUM databases (i.e. only one database used to train and
only another one to test). When training with MMI and testing
in the CK+, their system achieved 60.8%; and when training
with CK+ and testing in the MMI, their system
achieved 53.2% of accuracy.
Fig. 1. The proposed method starts with pre-processing to generate the input images for the CNN. Therefore, the original image is firstly transformed into
gray scale. Subsequently, a data augmentation step is performed to increase the number of images in the database (this step is only performed during training).
Later, a spatial normalization is performed to correct major rotation, translation, and scale problems using the position of the center of the eyes. Finally, the
CNN can be trained to generate a model and be later used to predict the facial expression of a given image.
In recent studies, [7] proposed a Convolutional Neural
Network that consists of two convolutional layers followed
by max pooling and four Inception modules. The authors
used seven standard databases (Multi-PIE, MMI, CK+, DISFA,
FERA, SFEW and FER2013) to perform a leave-one-out
experiment having each of the databases as testing set. Their
method achieved an accuracy of 64.2% when testing in the
CK+ database and training with the other six databases, and
55.6% when testing in the MMI database and training with
other databases. It is important to note that the Multi-PIE
database is not publicly available, therefore it is hard to
reproduce their results. Hasani et al. [28] proposed a spatio-
temporal two-parts network that uses a deep neural network
and a conditional random fields (CRF) module to recognize
facial expressions in sequence of images (i.e. videos). The
DNN-based network contains three InceptionResNet modules
and two fully-connected layers that capture spatial relations
of facial expression on images. The CRF module captures the
temporal relation between the frames. The authors use CK+,
MMI, and FERA databases with leave-one-out approach. The
method reported an accuracy of 73.91% and 68.51% to the
CK+ and MMI databases respectively. However, considering
single-frame evaluation (i.e. without the CRF module) such
as approached in this work, their system reported 64.81% and
52.83% of accuracy for the CK+ and MMI, respectively.
As can be seen, most of the works in the literature do not
provide a way to reproduce their results. Moreover, for the
video databases, it is even hard to ensure the same set of
images are used across different works. In addition, many of
these works only employ the cross-database protocol using
only one database during training and only one for test.
Regardless of the differences in the evaluation protocols, the
cross-database accuracies are still very low when compared
to the intra-database. In this context, there is still a need for a
more extensive experimentation, using multiple databases, and
allowing for reproducibility.
The proposed facial expression recognition system com-
prises two main modules (Figure 1): a pre-processing step
and the Convolutional Neural Network. The first module is
responsible for preparing the input image, augmenting the
samples, and performing a spatial normalization. The second
module is responsible for performing the training of the model
and later classifying the image in one of the seven allowed
expressions. The system receives an image of a face with the
position of the center of each eye as input, and outputs the
expression with the highest confidence.
A. Pre-processing
The pre-processing begins with the conversion of the input
image to grayscale. This step is performed to minimize the
variation of the images between the databases, given that
some of them are already in grayscale. As the Convolutional
Neural Network (ConvNet) described later expects a 3-channel
input image, this grayscale image is replicated in the three
channels. Subsequently, an offline data augmentation step is
performed to increase the number of images in the database.
The number of samples generated varies according to each
combination of databases used in the training phase. Later, a
spatial normalization is performed to correct major rotation,
translation, and scale problems using the position of the center
of the eyes. This step is also performed for all the images
of every database (both training and test sets). The following
subsections describe each of these steps in detail.
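The grayscale conversion with channel replication can be sketched as follows. This is a minimal illustration in plain Python, not the authors' OpenCV/C++ code; the function name `to_gray_3ch` and the nested-list image layout are assumptions made for the example, and the luma weights are the standard ones OpenCV uses for BGR-to-gray conversion.

```python
def to_gray_3ch(image_bgr):
    """Convert rows of (b, g, r) pixels into rows of (y, y, y) pixels.

    ASSUMPTION: the image is a nested list of 8-bit BGR tuples; a real
    pipeline would operate on an image array instead.
    """
    out = []
    for row in image_bgr:
        new_row = []
        for b, g, r in row:
            # Standard luma weights (as used by OpenCV's BGR-to-gray conversion).
            y = int(round(0.299 * r + 0.587 * g + 0.114 * b))
            # Replicate the gray value so the network still sees 3 channels.
            new_row.append((y, y, y))
        out.append(new_row)
    return out
```

Every output pixel carries the same value in its three channels, so a network that expects 3-channel input can consume color and grayscale databases alike.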
1) Offline Data Augmentation: ConvNets require large sets
of data in order to be able to generalize to a given prob-
lem. However, publicly available databases for facial expres-
sion recognition do not have enough images to address this
problem. Simard et al. [29] proposed a data augmentation
technique to increase the database through generation of
synthetic samples for each original image. Inspired by this
technique, two operations were applied offline as
data augmentation: adding random noise to the position
of the eyes, and skewing the image. For each image in the original database,
10 synthetic images were generated, where a random noise
was added to the position of the eyes on 70% of the synthetic
images and a random skew was applied on the remaining 30%.
The rotation, translation, and scale procedures consist of
adding random noise to the position of the eyes before
performing the spatial normalization. Therefore, the spatial
normalization with the random noise added to the position
of the eyes is equivalent to performing a small rotation,
translation, and/or scaling. The noise is randomly generated by
sampling from a Gaussian distribution with standard deviation
equal to 10% of the distance between the eyes.
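A minimal sketch of the eye-position noise and of the 70%/30% split between the two offline operations is given next. The function names are hypothetical, and the sketch assumes that the noise is drawn independently for each coordinate of each eye, which the text does not state.

```python
import math
import random


def jitter_eyes(left_eye, right_eye, rng, sigma_ratio=0.10):
    """Add Gaussian noise to both eye positions.

    The standard deviation is sigma_ratio times the inter-eye distance.
    ASSUMPTION: noise is sampled independently per coordinate and per eye.
    """
    sigma = sigma_ratio * math.dist(left_eye, right_eye)

    def noisy(point):
        return (point[0] + rng.gauss(0.0, sigma),
                point[1] + rng.gauss(0.0, sigma))

    return noisy(left_eye), noisy(right_eye)


def augmentation_plan(n_synthetic=10, noise_fraction=0.7):
    """List which operation produces each synthetic image of one original."""
    n_noise = round(n_synthetic * noise_fraction)
    return ["eye_noise"] * n_noise + ["skew"] * (n_synthetic - n_noise)
```

Feeding the jittered eye positions to the spatial normalization then yields a slightly rotated, translated, and/or scaled version of the face, which is the effect described above.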
The skew procedure consists of changing the corners of
the image to generate a distortion. Firstly, the side of the
image (left, right, top, bottom) in which the skew will be
applied is randomly chosen. Secondly, the amount of skew
applied to the image is selected by sampling from a uniform
distribution varying between 2% and 15% of the length of the
side. Finally, the two corner points are changed to generate a
distorted image. It is important to note that the skew operation
is performed after the correction of rotation, translation, and
scale described in the next section.
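The three steps of the skew procedure can be sketched as the computation of four destination corners, which a perspective warp would then map the image onto. The text does not say in which direction the two corner points are moved; the sketch assumes that both corners of the chosen side move toward each other along that side. The function name and the corner labels are hypothetical.

```python
import random

SIDES = ("left", "right", "top", "bottom")


def skew_corners(width, height, rng, low=0.02, high=0.15):
    """Return (side, amount, corners) describing one random skew.

    corners maps 'tl', 'tr', 'br', 'bl' to destination (x, y) points.
    """
    w, h = float(width), float(height)
    corners = {"tl": (0.0, 0.0), "tr": (w, 0.0), "br": (w, h), "bl": (0.0, h)}
    # Step 1: pick the side on which the skew is applied.
    side = rng.choice(SIDES)
    # Step 2: sample the amount as 2% to 15% of the length of that side.
    side_length = h if side in ("left", "right") else w
    amount = rng.uniform(low, high) * side_length
    # Step 3: move the two corners of that side.
    # ASSUMPTION: the corners move toward each other along the chosen side.
    if side == "top":
        corners["tl"], corners["tr"] = (amount, 0.0), (w - amount, 0.0)
    elif side == "bottom":
        corners["bl"], corners["br"] = (amount, h), (w - amount, h)
    elif side == "left":
        corners["tl"], corners["bl"] = (0.0, amount), (0.0, h - amount)
    else:
        corners["tr"], corners["br"] = (w, amount), (w, h - amount)
    return side, amount, corners
```

Only two of the four corners change in each call, so the result is a trapezoid-like distortion of the already normalized face image.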
Figure 3 shows the distribution of all expressions of the
databases used in this study. As it can be seen, some ex-
pressions have more samples than the others. This difference
between the amount of samples of each expression can make
the training of the network to give more weight to the
expression with more samples. To minimize the problem of
having a biased trained model, the databases of each training
configuration were balanced before training. Therefore, the
appropriate number of extra synthetic images was generated
for each expression in each database. This extra number of
synthetic images is referred to as the complement. The expressions
were balanced proportionally according to the one with the
most samples considering all databases as in Equation 1:
$$M = \max_{e \in E} T_e \tag{2}$$
where $C_e^d$ is the amount of complementary data of the
expression $e$ in the database $d$, $n_e^d$ is the number of images of
the expression $e$ in the database $d$, $M$ is the number of images
of the expression with the most samples, and $T_e$ is the number of
images of the expression $e$ considering all databases used
during training, $D$.
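As a rough illustration of proportional balancing, the sketch below assumes that the shortfall of each expression, M - T_e, is shared among the training databases in proportion to n_e^d / T_e. This allocation rule, the function name, and the image counts in the usage example are assumptions made for illustration; they are not taken from Equation 1.

```python
def complements(counts):
    """Compute totals T_e, the maximum M, and complements C_e^d.

    counts[d][e] is n_e^d, the number of images of expression e in database d.
    ASSUMPTION: C_e^d = round((M - T_e) * n_e^d / T_e).
    """
    expressions = sorted({e for per_db in counts.values() for e in per_db})
    totals = {e: sum(per_db.get(e, 0) for per_db in counts.values())
              for e in expressions}
    m = max(totals.values())
    result = {}
    for d, per_db in counts.items():
        result[d] = {}
        for e in expressions:
            n = per_db.get(e, 0)
            result[d][e] = round((m - totals[e]) * n / totals[e]) if totals[e] else 0
    return totals, m, result


# Hypothetical counts for two databases and two expressions.
example = {"A": {"happy": 30, "sad": 10}, "B": {"happy": 70, "sad": 30}}
totals, m, extra = complements(example)
```

With these hypothetical counts, "happy" already has the most samples and receives no complement, while the 60 missing "sad" images are split 15/45 between the two databases, matching their 10/30 share of the original "sad" images.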
2) Spatial Normalization: The spatial normalization com-
prises three steps: rotation correction, cropping, and resiz-
ing of the image to the expected input of the network.
The rotation correction used in this work follows the same
procedure explained in [30]. Essentially, it aligns all faces
with the horizon based on the position of both eyes. After
the rotation correction, all images are cropped in order to
spatially normalize them. The cropping operation reduces the
background part of the image to achieve the same aspect of the
images expected by our models (given the employment of fine-
tuning, as explained later). In the spatial normalization, each
image is cropped according to the distance between the eyes,
centralized in the midpoint between the eyes. The cropping
area is determined by four boundaries: on the sides (left and
right), the images are cropped based on 2.2 times the distance
between the eyes for each side from the center; on the top,
the boundary is determined by 2 times the distance between
the eyes from the center; and on the bottom, it is 2.5 times
the distance between the eyes from the center. These factors
were empirically determined in order to have the input image
similar to the images that were used to train the original VGG-Face.
The resize was performed on the cropped images using
a bilinear interpolation, resulting in down-sampled images of
256 × 256 pixels.
Fig. 2. Example of image from the CK+ database and its synthetic samples (panels: Original, Spatial Normalization, Random Noise, Skew).
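The geometry of the spatial normalization can be sketched as two small computations: the in-plane angle of the line joining the eyes, and the crop box derived from the inter-eye distance. The function names are hypothetical, coordinates are assumed to be (x, y) pixels with y growing downward, and the handling of boxes that extend past the image border is left out.

```python
import math


def rotation_angle_deg(left_eye, right_eye):
    """Angle of the eye line with respect to the horizontal, in degrees.

    Rotating the image about the eye midpoint by this angle (with the sign
    convention of the chosen image library) aligns the eyes with the horizon.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def crop_box(left_eye, right_eye):
    """Return (x_min, y_min, x_max, y_max) around the eye midpoint.

    Uses the factors from the text: 2.2 d to each side, 2 d above and
    2.5 d below the midpoint, where d is the inter-eye distance.
    """
    d = math.dist(left_eye, right_eye)
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    return (cx - 2.2 * d, cy - 2.0 * d, cx + 2.2 * d, cy + 2.5 * d)
```

The resulting box measures 4.4 d by 4.5 d, so it is nearly square before being resized to 256 × 256 pixels.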
3) On-the-fly Data Augmentation: In addition to the offline
data augmentation, some procedures are also applied on-the-
fly during training phase: random crops and mirroring. These
procedures are applied after the spatial normalization, where
the input image is normalized to a 256×256 image. Then, this
image is randomly cropped into a 227 ×227 image. The final
image also can be randomly horizontally mirrored. In the test
phase there is no data augmentation, but it is necessary to crop
the image according to the trained input size. A centralized
crop of 227 ×227 is performed, instead of random crops, and
no mirroring is applied.
B. Convolutional Neural Network
The proposed system uses a Convolutional Neural Network
to perform facial expression recognition: the VGG network,
proposed by Simonyan and Zisserman [31]. They proposed
the VGG CNN architecture that comes in two versions: VGG-
16 and VGG-19 (i.e. 16 and 19 layers, respectively). In
this work, the VGG-16 is used and referred to as VGG
only. It has about 138 million parameters and comprises 13
convolutional layers, followed by 3 fully-connected layers. The
first two fully-connected layers have 4,096 outputs and the last
has 2,622 outputs. Since this architecture was not originally
proposed for the facial expression recognition problem, it is
necessary to adapt the output layer to have 7 outputs (one for
each expression) instead of the original 2,622 units.
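The effect of the output-layer size on the parameter count can be checked with a short calculation. The sketch assumes the standard VGG-16 configuration (3 × 3 kernels with biases, and a 7 × 7 × 512 feature map feeding the first fully-connected layer); the function name is hypothetical.

```python
# (input channels, output channels) of the 13 convolutional layers of VGG-16.
CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def vgg16_parameters(num_outputs, fc_width=4096, pooled_features=7 * 7 * 512):
    """Count weights and biases of VGG-16 for a given output-layer size.

    ASSUMPTION: 3x3 kernels and a 7x7x512 input to the first FC layer.
    """
    conv = sum(3 * 3 * c_in * c_out + c_out for c_in, c_out in CONV_CHANNELS)
    fc = (pooled_features * fc_width + fc_width)      # first FC layer
    fc += fc_width * fc_width + fc_width              # second FC layer
    fc += fc_width * num_outputs + num_outputs        # output layer
    return conv + fc


for outputs in (1000, 2622, 7):
    print(outputs, vgg16_parameters(outputs))
```

Under these assumptions, the commonly quoted figure of about 138 million corresponds to a 1,000-way output layer; the 2,622-way face recognition head gives about 145 million parameters, and replacing it with 7 outputs gives about 134 million.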
For this network, two different initializations were evalu-
ated: i) with random values (using Xavier algorithm [32]),
and ii) with pre-trained weights. Pre-trained and randomly
initialized networks usually differ in terms of the size of the
initial learning rate. Random weights usually require higher
base learning rate values to allow gradient descent to find a good
minimum. Pre-trained weights usually require lower base
learning rate values because they are already in the direction
of the minimum. The latter training procedure is referred to as
fine-tuning.
For the fine-tuned models, the weights of the pre-trained
VGG-Face model were loaded into the network. VGG-Face
is the name of the model publicly released by the authors of
the VGG network. This model was originally trained for the
facial recognition task using a database of celebrity faces. As
already explained, the last layer was changed from 2,622 to 7
units, therefore their weights had to be randomly initialized for
all cases. In addition to this difference between the randomly
initialized and fine-tuned models, the base learning rate is also
different. For the models with randomly initialized weights,
all layers of the architecture were trained with the same initial
base learning rate value. On the other hand, in the case of
fine-tuning, the learning rate of the output layer (i.e. the 7 outputs)
is set to 10 times the base learning rate that is used in the
preceding pre-trained layers. The randomly initialized network uses a
base learning rate of $10^{-2}$, and the fine-tuned one uses as base
learning rate the value at which the original VGG-Face training stopped, i.e.
$10^{-4}$. More details about the experimental setup are presented
in Section IV.
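The two learning-rate configurations can be summarized with a small helper that returns the effective learning rate of each layer. This sketch assumes base learning rates of 10^-2 (random initialization) and 10^-4 (fine-tuning); the layer names are hypothetical Caffe-style labels and only a few layers are listed.

```python
def effective_learning_rates(layer_names, base_lr, output_layer, output_mult=1.0):
    """Map each layer to base_lr times its multiplier.

    Only the output layer may use a multiplier other than 1.
    """
    return {
        name: base_lr * (output_mult if name == output_layer else 1.0)
        for name in layer_names
    }


# Hypothetical subset of layer names, for illustration only.
LAYERS = ["conv1_1", "conv5_3", "fc6", "fc7", "fc8"]

random_init = effective_learning_rates(LAYERS, base_lr=1e-2, output_layer="fc8")
fine_tuned = effective_learning_rates(LAYERS, base_lr=1e-4, output_layer="fc8",
                                      output_mult=10.0)
```

In the fine-tuned configuration the freshly initialized output layer learns at about 10^-3, ten times faster than the pre-trained layers, while in the randomly initialized configuration every layer shares the same rate.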
In this work, an extensive experimentation with cross-
database facial expression recognition is presented and the
influence of fine-tuning a CNN pre-trained for a different task
in a similar domain (face recognition) is investigated. For
this investigation, seven widely used databases were chosen
to train and test the models using the cross-database leave-
one-out approach. The experimental methodology is detailed
in the next subsections. Firstly, the databases used in the
experimentation are presented. Subsequently, the experiments
are described in detail. After that, the metrics used in the
experimentation are shown. Finally, the setup used during the
experimentation is presented.
A. Databases
To achieve our goal, seven databases widely used in the liter-
ature were selected to perform the cross-database experiments:
AR Face [15], CK+ [16], BU3DFE [17], JAFFE [18], MMI
[19], RaFD [20], and KDEF [21] databases. Figure 3 shows
the distribution of the expressions in each database. Moreover,
Figure 2 shows examples of one subject of each database and
the result of the offline data augmentation. The CK+,
JAFFE, and AR Face databases are distributed with files that
contain the position of the eyes. The positions of the eyes in the
MMI, RaFD, KDEF, and BU3DFE databases were manually
annotated with the aid of a face tracker algorithm.
1) CK+: The Extended Cohn–Kanade (CK+) database [16]
consists of 100 university students aged from 18 to 30 years,
resulting in 1,236 images. From those, 65% were female, 15%
were African-American and 3% were Asian or Latino. The
database comes from videos and each subject was instructed
to perform expressions that begin and end with the neutral,
i.e. expressions are posed.
Fig. 3. Distribution of the expressions for all databases used.
2) JAFFE: The Japanese Female Facial Expression
(JAFFE) database [18] contains 213 images from 10 Japanese
female subjects. In this database, there are about 4 images in
each one of the six basic expressions and one image of the
neutral expression from each subject.
3) MMI: The MMI database [19] contains video sessions
with people showing emotions. In total, 32 subjects from the
235 sessions with labeled emotions were selected. From these
sessions, frames with the frontal face of the subject showing
the emotion (one of the six expressions or the neutral) were
extracted, resulting in the 390 images used in the experiments.
4) RaFD: The Radboud Faces Database (RaFD) [20] con-
sists of 67 models (including Caucasian males and females,
Caucasian children, both boys and girls, and Moroccan Dutch
males). In total, 1,407 images from this database in the six
expressions plus the neutral were used in the experiments.
5) KDEF: The Karolinska Directed Emotional Faces
(KDEF) [21] database consists of 70 actors (35 male, 35
female), aged from 20 to 30 years. In total, 980 pictures of
human facial expressions from six emotions plus the neutral were used.
Only the frontal face images were used in the experiments,
and the eye positions were annotated manually with the aid of a face tracker.
6) BU3DFE: The Binghamton University 3D Facial Ex-
pression (BU-3DFE) database [17] contains 1,191 images from
58 female subjects, from several ethnicities, including White,
Black, East-Asian, Indian and Hispanic Latino. This database
was used only in the training phase because it is not complete
(only with females).
7) AR Face: The AR Face database [15] contains frontal
face images of 126 people (55.6% are male and 44.4% are
female) over different facial expressions, lighting conditions
and occlusion. No restrictions were applied to the participants
in relation to clothing, hairstyles, makeup, etc. Unlike the other
databases, AR Face does not contain all expressions. Only
three expressions (angry, happy, surprise) and the neutral were
used, which leaves 1,018 images from this database.
B. Experiment
Two methods were evaluated in this experiment: i) VGG
with randomly initialized weights (henceforth VGG-Random), and ii)
fine-tuned VGG (henceforth VGG-FineTuning). All models were
trained using Stochastic Gradient Descent (SGD) with a Step
Down policy for the learning rate, decreasing it three times
during the training (once for each of the three epochs). The
maximum of three epochs was chosen because the models
graphically showed convergence after this number of epochs
in empirical experiments.
An extensive experimentation was designed to evaluate
the different models trained in the proposed system. The
experimentation comprises a cross-database leave-one-out ap-
proach. For this experiment, groups were created to extensively
train and test each model. Each training set comprises a
combination of six databases, always leaving one out. As
BU3DFE and AR Face are always in the training sets, 5
groups were created in total. For each training set, the data
augmentation is performed and the expressions are balanced.
The number of samples in each test set remains unchanged,
i.e. the models are evaluated in the original database only with
spatial normalization.
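The construction of the five leave-one-out groups can be sketched as follows. The database names come from the text; the function and constant names are hypothetical.

```python
ALL_DATABASES = ["CK+", "JAFFE", "MMI", "RaFD", "KDEF", "BU3DFE", "AR Face"]
TRAIN_ONLY = {"BU3DFE", "AR Face"}  # never used as a test set


def leave_one_out_groups(databases, train_only):
    """Map each test database to the list of databases used for training."""
    groups = {}
    for test_db in databases:
        if test_db in train_only:
            continue
        groups[test_db] = [db for db in databases if db != test_db]
    return groups


groups = leave_one_out_groups(ALL_DATABASES, TRAIN_ONLY)
for test_db, train_dbs in groups.items():
    print(f"test on {test_db}: train on {', '.join(train_dbs)}")
```

Each group is named after its test database and trains on the remaining six, so BU3DFE and AR Face appear in every training set.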
To reduce the influence of random factors in the exper-
iments, each model combination was run 10 times with a
different seed. The seeds were kept fixed among different
methods within the same run. To reduce the randomness of the
evaluation process, the algorithms used in the backpropagation
of the weights were chosen to be deterministic. Note that even
for fine-tuning there is some randomness in the process, for
example in the initial weights of the output layer. The performance
metrics of each method are presented in the results.
C. Metrics
Two performance metrics are reported: micro-averaged ac-
curacy and macro-averaged accuracy. These two metrics were
chosen because the micro-averaged accuracy is more com-
monly used in the literature, but it does not account for the
imbalance of the classes in the databases. On the other hand,
the macro-averaged accuracy takes the imbalance into account,
which is relevant here because the classes are imbalanced, as can be seen in Figure 3.
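The difference between the two metrics can be illustrated with a small example. The sketch assumes that the micro-averaged accuracy is the fraction of all samples classified correctly and that the macro-averaged accuracy is the unweighted mean of the per-expression accuracies; the labels in the usage example are hypothetical.

```python
from collections import defaultdict


def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy for paired label sequences.

    ASSUMPTION: macro accuracy is the mean of per-class accuracies.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == prediction)
    micro = sum(correct.values()) / sum(total.values())
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro


# Hypothetical imbalanced test set: 8 "happy" samples and 2 "sad" samples.
y_true = ["happy"] * 8 + ["sad"] * 2
y_pred = ["happy"] * 8 + ["happy", "sad"]
micro, macro = micro_macro_accuracy(y_true, y_pred)
```

Here one of the two "sad" samples is misclassified: the micro-averaged accuracy is 0.90 because the majority class dominates, whereas the macro-averaged accuracy is 0.75 because each expression counts equally.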
In addition to the accuracy, a statistical analysis is performed
to verify if the improvements were statistically significant.
The paired t-test was used to estimate the significance of the
pairwise comparisons considering the 10 runs. This test was
performed for all methods used in the experiment. Differences
were considered statistically significant for p-value <0.01.
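A paired t-test over 10 runs can be sketched with the standard library alone. The accuracies in the usage example are hypothetical, and the decision uses the two-sided critical value of Student's t distribution for 9 degrees of freedom at the 0.01 level (approximately 3.250) instead of computing the p-value explicitly.

```python
import math
import statistics

T_CRITICAL_DF9_ALPHA_001 = 3.250  # two-sided, 9 degrees of freedom


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Hypothetical accuracies (%) of two methods over 10 runs with shared seeds.
method_a = [71.0, 72.0, 71.0, 72.0, 71.0, 72.0, 71.0, 72.0, 71.0, 72.0]
method_b = [70.0] * 10
t_value, dof = paired_t_statistic(method_a, method_b)
significant = abs(t_value) > T_CRITICAL_DF9_ALPHA_001
```

Pairing the runs by seed removes the variation shared by both methods within a run, which is why the seeds were kept fixed among methods.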
D. Setup
All the experiments were carried out using an Intel Core i7
4770 3.4 GHz with 16GB of RAM and a NVIDIA Tesla K40
with 12GB of memory. The environment of the experiments
was Linux Ubuntu 14.04, with the NVIDIA CUDA Framework
7.5 and the cuDNN library 5.1.
The pre-processing (color conversion, data augmentation,
and spatial normalization) was implemented using OpenCV
and C++. The training and test phases were done using the
NVIDIA fork of Caffe framework [33]. Some modifications
were made in this version of the Caffe framework to ensure
deterministic behavior of the backpropagation of the weights
in order to be able to compare the results with the same seed.
Basically, the convolutional layer of the Caffe framework was
changed to ensure that deterministic algorithms were chosen
during the backpropagation of the weights when using cuDNN
(the default behavior is non-deterministic).
Most of the works in the literature do not provide a way
to reproduce their results, nor a detailed explanation of
the conversion of the video databases. Therefore, in order to
allow for reproducibility, we publicly released: i) a script to
automatically convert video databases (e.g., CK+ and MMI)
into the samples we used in this work, ii) a script to preprocess
all databases in order to reproduce the same samples we used
in the experimentation, iii) pre-trained models (the best of
each method), and iv) a script to perform the inference and
reproduce the results hereby reported.¹ Given these releases,
we also expect to allow for fairer comparisons in the future.
This experiment evaluates two methods (VGG-FineTuning
and VGG-Random) through extensive experimentation with
different groups of databases. Each group uses 6 databases
for training and the remaining one for testing. As two of these
databases were used only for training, five of the most
commonly used databases in the literature were used for testing.
Each group is named after the database used as test set. The
results of this experiment are summarized in Table I. As
can be seen, VGG-FineTuning achieved, on average, the
best performance, reporting up to 88.58% (CK+ database).
TABLE I

Group    VGG Model (%)
         FineTuning      Min    Max    Random          Min    Max
CK+      88.58 ±0.43     87.8   89.1   78.09 ±2.26     76.0   83.0
JAFFE    44.32 ±2.45     40.4   50.2   49.62 ±3.71     42.3   54.9
MMI      67.03 ±1.66     64.1   69.5   55.64 ±4.06     46.7   59.7
RaFD     85.97 ±0.43     85.3   86.8   83.68 ±2.74     78.4   87.1
KDEF     72.55 ±0.72     71.3   73.7   69.03 ±1.97     65.2   72.1
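The grouping used in this experiment (train on six databases, test on the remaining one) can be sketched as follows. The database names come from the abstract and from the test groups listed in Table I; treating BU3DFE and AR Face as the two training-only databases is inferred from those lists, and the function name is illustrative.

```python
# The seven databases named in the abstract.
ALL_DATABASES = ["CK+", "JAFFE", "MMI", "RaFD", "KDEF", "BU3DFE", "AR Face"]
# The five databases that appear as test groups in Table I; the other two
# (BU3DFE and AR Face) are assumed to be the training-only databases.
TEST_DATABASES = ["CK+", "JAFFE", "MMI", "RaFD", "KDEF"]


def build_groups(all_dbs, test_dbs):
    """Map each test database to the list of databases used for training."""
    return {
        test_db: [db for db in all_dbs if db != test_db]
        for test_db in test_dbs
    }


groups = build_groups(ALL_DATABASES, TEST_DATABASES)
for test_db, train_dbs in groups.items():
    print(f"Group {test_db}: train on {', '.join(train_dbs)}")
```

Each group therefore never sees images from its test database during training, which is what distinguishes the cross-database protocol from the usual same-database split.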
The results show that VGG-FineTuning achieved better average
accuracy in 4 out of 5 groups: CK+, MMI, KDEF, and RaFD.
As can be seen in Table I, the results vary significantly
between the groups, i.e., the performance of these models under
the cross-database protocol depends on how similar the test
set conditions are to those of the training set. This variation
indicates that testing on a single database is not enough to
assess the performance of a given model in the facial expression
recognition task. There was only one database on which
VGG-FineTuning did not perform better: JAFFE. It is important to
note that JAFFE is a highly biased database in terms of gender
and ethnicity, i.e., it comprises only Japanese female subjects.
In addition, the results on the JAFFE database were the worst
among the test groups. The result for the JAFFE database
indicates that fine-tuning may not always lead to the best
results. In addition to the variation between databases,
there is another source of variation, widely ignored in the
literature, caused by the random initialization. To measure this
effect, 10 runs were performed. Table I also shows that the
fine-tuned models presented lower standard deviation than
VGG-Random (e.g., 0.43 and 2.26, respectively, for CK+),
which indicates that they tend to be more stable. Moreover,
among the fine-tuned models, the JAFFE database presented
the highest standard deviation, with a variation of 9.85%
between the worst and the best runs. The MMI database showed
the highest variation for the VGG-Random model: 13.07%. The
statistical analysis performed between these two models shows
that only for the RaFD database was the difference not
significant; all the others presented a p-value lower than
0.01, which shows that, on average, VGG-FineTuning is
better. Finally, the results of VGG-Random for CK+
were computed using only 8 out of 10 runs, because two of
them did not converge at all (with the same seeds used in the
other methods for these runs). Therefore, in order to be fairer,
this specific result (VGG-Random on the CK+ database) was
calculated ignoring these 2 outliers. This shows that random
initialization, as mentioned, tends to be more unstable.
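As an illustration of how per-group statistics of this kind can be derived from a set of runs, the sketch below summarizes a list of made-up accuracies. It assumes the sample standard deviation (the text does not state which estimator was used) and marks a hypothetical non-converged run with `None` so that it can be dropped before summarizing; none of the numbers are results from this work.

```python
import statistics


def summarize_runs(accuracies):
    """Mean, sample standard deviation, min, max and max-min spread."""
    return {
        "mean": statistics.mean(accuracies),
        "std": statistics.stdev(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
        "spread": max(accuracies) - min(accuracies),
    }


# Made-up accuracies (%); None marks a hypothetical run that did not converge.
runs = [70.0, 72.0, None, 71.0, 73.0, 74.0]
converged = [acc for acc in runs if acc is not None]

summary = summarize_runs(converged)
print(
    f"{summary['mean']:.2f} ± {summary['std']:.2f} "
    f"(min {summary['min']:.1f}, max {summary['max']:.1f}, "
    f"spread {summary['spread']:.1f})"
)
```

Reporting the spread alongside the standard deviation makes the effect of the random initialization visible even when the mean looks stable.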
Additionally, the VGG model was fine-tuned with the freezing
method (i.e., keeping the weights of the convolutional layers
fixed and optimizing the last three fully connected layers), but
the results were about the same as with random initialization
and are therefore not reported here.
Comparing results with the literature on cross-database
facial expression recognition is very difficult, mostly
because different authors use different sets of databases
and, in some cases (e.g., databases originally in video), it
is hard to ensure that the same images are being used across
comparisons. Even so, some authors [23], [28] do perform
such comparisons. Therefore, we present a comparison to
show that, in absolute terms, our method outperforms the
results reported in the literature for most of the databases. The
results presented in this section correspond to micro-averaged
accuracy, which is commonly used in the literature. Our
results computed with macro-averaged accuracy did not differ
much from these, and the difference between the two metrics
was smaller for the VGG-FineTuning model. It is important to
note that the RaFD and KDEF databases, as can be seen in
Figure 3, are balanced and therefore showed no difference
between the two metrics. As can be seen in Table II, our models
achieved state-of-the-art results for most of the databases used
in our evaluation.
Table II shows that our VGG-FineTuning models (pre-trained
using VGG-Face) present state-of-the-art results
for 4 out of 5 databases: CK+, MMI, RaFD, and KDEF.
For the CK+ test set, our models reported 88.58% ±0.43
accuracy (best run with 89.1%), which represents a significant
improvement (+23.77 percentage points) over the literature
(64.81% [28]). For the MMI test set, our models reported
67.03% ±1.66 accuracy, which is 14.20 percentage points
higher than [28] (using single-frame results). For the RaFD
test set, our models reported 85.97% ±0.43 accuracy (best run
with 86.8%), which is 30.12 percentage points higher than the
best result in the literature (55.85% [34]). Finally, although
KDEF has been extensively used in the intra-database protocol
[35], there is no use of this database under the cross-database
protocol. Therefore, our results can serve as a baseline for
future comparisons. Our models achieved 72.55% ±0.72
accuracy (best run with 73.7%).
Looking deeper into the results for the MMI database,
there are two facts worth mentioning. Firstly, the best results
([28] with CRF) were achieved using a method that considers
the temporal relation between frames (MMI is originally
a video database). As our models process a single frame
instead of a temporal sequence, it is fairer to compare
them with the other method proposed by the same authors, which
processes a single frame at a time. That model achieves
52.83% accuracy (15.68 percentage points less than the CRF
variant), which is 14.20 percentage points worse than ours.
Secondly, from another perspective, considering our average
and standard deviation, our model yields 67.03% ±1.66 for
MMI, which is on par with the best results of [28]. In this
context, we can state that our model also achieves
state-of-the-art results for MMI, especially considering a
single frame at a time. For the JAFFE database, the best run
of our model reported 50.23% accuracy, which is higher (+1.56
percentage points) than the best result in the literature
(48.67%). It is worth noting that JAFFE also shows the worst
results across the literature.

TABLE II

Group   Method                      Train Database   Accuracy
CK+     da Silva and Pedrini [5]    JAFFE            48.20%
        da Silva and Pedrini [5]    BOSPHORUS        57.60%
        Zhang et al. [4]            MMI              61.20%
        Hasani et al. [28]          FERA/MMI         64.81%
        Hasani et al. [28] (CRF)    FERA/MMI         73.91%
        Mollahosseini et al. [7]    6 databases*     64.20%
        Our                         6 databases†     88.58% ±0.43
JAFFE   Shan et al. [3]             CK               41.30%
        Ali et al. [34]             RaFD             48.67%
        da Silva and Pedrini [5]    CK               42.30%
        Our                         6 databases†     44.32% ±2.45
MMI     Shan et al. [3]             CK               51.10%
        Zhang et al. [4]            CK               66.90%
        Hasani et al. [28]          FERA/CK          52.83%
        Hasani et al. [28] (CRF)    FERA/CK          68.51%
        Mollahosseini et al. [7]    6 databases*     55.60%
        Our                         6 databases†     67.03% ±1.66
RaFD    Ali et al. [34]             JAFFE            52.15%
        Ali et al. [34]             TFEID            55.85%
        Our                         6 databases†     85.97% ±0.43
KDEF    Our                         6 databases†     72.55% ±0.72

*Trained with all (MultiPIE, MMI, CK+, DISFA, FERA, SFEW, and
FER2013) except the test set.
†Trained with all (CK+, JAFFE, MMI, RaFD, KDEF, BU3DFE, ARFace)
except the test set.
The result reported is not directly comparable with our results.
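The differences quoted in the text are absolute differences in percentage points between our accuracy and the reference result from the literature. As a quick arithmetic check, the sketch below recomputes them from the figures given in the text and in Table II; the dictionary layout and variable names are illustrative only.

```python
# (our accuracy, reference accuracy from the literature), both in %,
# exactly as quoted in the text and in Table II.
comparisons = {
    "CK+": (88.58, 64.81),               # mean of our runs vs. [28], single frame
    "MMI": (67.03, 52.83),               # mean of our runs vs. [28], single frame
    "RaFD": (85.97, 55.85),              # mean of our runs vs. [34]
    "JAFFE (best run)": (50.23, 48.67),  # our best run vs. [34]
}

# Absolute gain in percentage points, rounded to two decimals.
gains = {
    name: round(ours - reference, 2)
    for name, (ours, reference) in comparisons.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

Note that the JAFFE entry compares a best run against a single reported result, whereas the other entries use the mean over our runs, so it is a weaker statement than the other three.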
The results of the literature (Table II) also confirm that the
performance of models under the cross-database protocol
varies across databases; it therefore seems inappropriate
to generalize the performance of a model from a single
database. All these comparisons should be made carefully,
since most of the methods in the literature release neither
their models nor a way to reproduce their results.
In this paper, we extensively investigated the use of a
fine-tuned CNN architecture for the facial expression recognition
problem under the cross-database protocol. In the proposed
system, we fine-tuned the VGG network, pre-trained on a
face recognition database, to recognize facial expressions
considering seven different well-established databases in the
literature. These datasets comprise more than 6,200 images
of subjects of different ethnicities, genders, and ages in a
variety of environments and with both spontaneous and posed
facial expressions. Our results showed that fine-tuning a CNN
pre-trained on a similar domain is, on average, better than
training from scratch. Moreover, fine-tuned models also yield
more stable performance, i.e., lower variance with respect to
the intrinsic randomness of the initialization. In the
comparison with the literature, the VGG-FineTuning models
achieved state-of-the-art results for most of the test sets in
absolute terms, especially for the CK+ and RaFD databases
(88.58% and 85.97%, respectively), where there was a
significant improvement over the literature (+23.77 and +30.12
percentage points, respectively). Although we cannot conclude
that the evaluated method is better than the methods of the
literature, due to the lack of standardization in the evaluation
protocols, we can conclude that the presented results are the
state of the art for cross-database facial expression recognition.
We would like to thank Fundo de Apoio a Pesquisa (FAP)
of UFES for the support and CAPES for the scholarships. We
gratefully acknowledge the support of NVIDIA Corporation
with the donation of the Tesla K40 GPU used for this research.
[1] A. Mehrabian, “Communication without words,” Communication The-
ory, pp. 193–200, 2008.
[2] P. Ekman and W. V. Friesen, “Constants across cultures in the face and
emotion.” Journal of personality and social psychology, vol. 17, no. 2,
p. 124, 1971.
[3] C. Shan, S. Gong, and P. W. McOwan, “Facial expression recognition
based on local binary patterns: A comprehensive study,” Image and
Vision Computing, vol. 27, no. 6, pp. 803–816, 2009.
[4] X. Zhang, M. H. Mahoor, and S. M. Mavadati, “Facial expression
recognition using lp-norm mkl multiclass-svm,” Machine Vision and
Applications, vol. 26, no. 4, pp. 467–483, 2015.
[5] F. A. M. da Silva and H. Pedrini, “Effects of cultural characteristics
on building an emotion classifier through facial expression analysis,”
Journal of Electronic Imaging, vol. 24, no. 2, p. 023015, 2015.
[6] A. T. Lopes, E. de Aguiar, A. F. D. Souza, and T. Oliveira-Santos, “Facial
expression recognition with convolutional neural networks: Coping with
few data and the training sample order,” Pattern Recognition, 2016.
[7] A. Mollahosseini, D. Chan, and M. H. Mahoor, “Going deeper in facial
expression recognition using deep neural networks,” in Applications of
Computer Vision (WACV), IEEE Winter Conference on, 2016, pp. 1–10.
[8] P. Liu, S. Han, Z. Meng, and Y. Tong, “Facial expression recognition via
a boosted deep belief network,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
[9] P. Burkert, F. Trier, M. Z. Afzal, A. Dengel, and M. Liwicki,
“Dexpression: Deep convolutional neural network for expression
recognition,” arXiv preprint arXiv:1509.05371, 2015.
[10] Z. Yu and C. Zhang, “Image based static facial expression recognition
with multiple deep network learning,” in Proceedings of the 2015 ACM
on International Conference on Multimodal Interaction. ACM, 2015,
pp. 435–442.
[11] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E.
Hubbard, and L. D. Jackel, “Handwritten Digit Recognition with a Back-
Propagation Network,” in Advances in Neural Information Processing
Systems 2, D. S. Touretzky, Ed. Morgan-Kaufmann, 1990, pp. 396–404.
[12] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan,
“Peak-piloted deep network for facial expression recognition,” in ECCV.
Springer, 2016, pp. 425–442.
[13] G. Levi and T. Hassner, “Emotion recognition in the wild via convolu-
tional neural networks and mapped binary patterns,” in Proceedings of
the 2015 ACM on International Conference on Multimodal Interaction.
ACM, 2015, pp. 503–510.
[14] H. Ding, S. K. Zhou, and R. Chellappa, “Facenet2expnet: Regularizing
a deep face recognition net for expression recognition,” arXiv preprint
arXiv:1609.06591, 2016.
[15] A. M. Martinez, “The AR face database,” CVC technical report, vol. 24,
[16] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews,
“The Extended Cohn-Kanade Dataset (CK+): A complete dataset for
action unit and emotion-specified expression,” in 2010 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition -
Workshops, June 2010, pp. 94–101.
[17] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3d facial
expression database for facial behavior research,” in Automatic face and
gesture recognition, 2006. FGR 2006. 7th international conference on.
IEEE, 2006, pp. 211–216.
[18] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, “Coding facial
expressions with gabor wavelets,” in Automatic Face and Gesture
Recognition, 1998. Proceedings. Third IEEE International Conference
on. IEEE, 1998, pp. 200–205.
[19] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database
for facial expression analysis,” in Multimedia and Expo, 2005. ICME
2005. IEEE International Conference on. IEEE, 2005, pp. 5–pp.
[20] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and
A. van Knippenberg, “Presentation and validation of the radboud faces
database,” Cognition and emotion, vol. 24, no. 8, pp. 1377–1388, 2010.
[21] D. Lundqvist, A. Flykt, and A. Öhman, “The karolinska directed
emotional faces (kdef),” CD ROM from Department of Clinical
Neuroscience, Psychology section, Karolinska Institutet, pp. 91–630, 1998.
[22] Y. Lv, Z. Feng, and C. Xu, “Facial expression recognition via deep
learning,” in Smart Computing (SMARTCOMP), 2014 International
Conference on. IEEE, 2014, pp. 303–308.
[23] C. Mayer, M. Eggers, and B. Radig, “Cross-database evaluation for facial
expression recognition,” Pattern recognition and image analysis, vol. 24,
no. 1, pp. 124–132, 2014.
[24] I. Song, H.-J. Kim, and P. B. Jeon, “Deep learning for real-time robust
facial expression recognition on a smartphone,” in Consumer Electronics
(ICCE), 2014 IEEE International Conference on, 2014, pp. 564–567.
[25] H. Y. Patil, A. G. Kothari, and K. M. Bhurchandi, “Expression invariant
face recognition using local binary patterns and contourlet transform,”
Optik-International Journal for Light and Electron Optics, vol. 127,
no. 5, pp. 2670–2678, 2016.
[26] Z. Wang, Q. Ruan, and G. An, “Facial expression recognition using
sparse local fisher discriminant analysis,” Neurocomputing, vol. 174, pp.
756–766, 2016.
[27] S. Arivazhagan, R. A. Priyadharshini, and S. Sowmiya, “Facial expres-
sion recognition based on local directional number pattern and anfis
classifier,” in 2014 International Conference on Communication and
Network Technologies, Dec 2014, pp. 62–67.
[28] B. Hasani and M. H. Mahoor, “Spatio-temporal facial expression
recognition using convolutional neural networks and conditional random
fields,” arXiv preprint arXiv:1703.06995, 2017.
[29] P. Y. Simard, D. Steinkraus, J. C. Platt et al., “Best practices for
convolutional neural networks applied to visual document analysis.” in
ICDAR, vol. 3. Citeseer, 2003, pp. 958–962.
[30] A. T. Lopes, E. de Aguiar, and T. Oliveira-Santos, “A facial expression
recognition system using convolutional networks,” in Graphics, Patterns
and Images (SIBGRAPI), 28th Conference on, 2015, pp. 273–280.
[31] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[32] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep
feedforward neural networks.” in Aistats, vol. 9, 2010, pp. 249–256.
[33] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[34] G. Ali, M. A. Iqbal, and T.-S. Choi, “Boosted nne collections for
multicultural facial expression recognition,” Pattern Recognition, vol. 55,
pp. 14–27, 2016.
[35] M. Xu, W. Cheng, Q. Zhao, L. Ma, and F. Xu, “Facial expression recog-
nition based on transfer learning from deep convolutional networks,” in
Natural Computation (ICNC), 2015 11th International Conference on.
IEEE, 2015, pp. 702–708.
... Endeavours addressing the cross-dataset evaluation can be found in [33,54]. Mollahosseini et al. [33] proposed a deep neural network architecture to address the face FER problem across multiple well-known standard face datasets. ...
... The CK+ dataset was one of the datasets that obtained more accurate results in both experiments: 93.2% and 64.2% respectively. In the work by Zavarez et al. [54], the performance influence of fine-tuning with the cross-dataset approach was investigated. In order to perform this study, the VGGFace Deep Convolutional Network model (pre-trained for face recognition) was fine-tuned to recognize facial expressions. ...
... Most works focus on proposing a CNN and using a single-dataset approach (Table 2), and only a few works deepen in a cross-dataset approach ( Table 3). In the case of cross-dataset, we find a similar approach to our work, Zavarez et al. [54] mixed six well-known datasets into one training dataset to test with another well-known dataset. Table 1 Accuracy results of recent models in the literature. ...
Full-text available
Recognizing facial expressions is a challenging task both for computers and humans. Although recent deep learning-based approaches are achieving high accuracy results in this task, research in this area is mainly focused on improving results using a single dataset for training and testing. This approach lacks generality when applied to new images or when using it in in-the-wild contexts due to diversity in humans (e.g., age, ethnicity) and differences in capture conditions (e.g., lighting or background). The cross-datasets approach can overcome these limitations. In this work we present a method to combine multiple datasets and we conduct an exhaustive evaluation of a proposed system based on a CNN analyzing and comparing performance using single and cross-dataset approaches with other architectures. Results using the proposed system ranged from 31.56% to 61.78% when used in a single-dataset approach with different well-known datasets and improved up to 73.05% when using a cross-dataset approach. Finally, to study the system and humans’ performance in facial expressions classification, we compare the results of 253 participants with the system. Results show an 83.53% accuracy for humans and a correlation exists between the results obtained by the participants and the CNN.
... The other advantage of fine-tuning is the reduced count of epochs required to converge the error. Zavarez et al. [44] have compared the differences with and without fine-tuning on the VGG-Face model, and the results demonstrated that fine-tuning the VGG-Face that has already been trained on a similar domain is superior to training from scratch. The authors have also performed cross-database facial expression recognition and the fine-tuned VGG-Face has given state-of-the-art results on various cross-database tests on well-established datasets like CK+ [2], MMI [45], The Radboud Faces Database (RaFD) [46], KDEF [4], JAFFE [47], and AR Face [48]. ...
Full-text available
Facial expressions are a prevalent way to recognize human emotions, and automatic facial expression recognition (FER) has been a significant task in cognitive science, artificial intelligence, and computer vision. The critical issue with the design of the FER model is the strong intra-class correlation of different emotions. The accuracy of the FER model is reduced due to other problems such as the variations in expressing the emotions, variations in lighting, and different ethnic biases. The latest convolutional neural network-based FER models have shown significant improvement in accuracy score but lack distinguishing the micro-expressions. This paper proposed a multi-input hybrid FER model that considers both hand-engineered and self-learnt features to classify facial expressions. The VGG-Face and the histogram of oriented gradients (HOG) features are derived from the faces to distinguish various facial expression patterns. The fusion of deep (VGG-Face) and hand-engineered (HOG) features has shown improved accuracy compared to the conventional CNN models. The results obtained showed that the proposed model’s accuracy scores outperformed the accuracy scores of the other popular FER models on three facial expression datasets. Extended Cohn–Kanade (CK+), Yale-Face, and Karolinska directed emotional faces (KDEF) datasets are used to determine the model’s classification efficiency. The proposed model scored 98.12%, 95.26%, and 96.36% accuracy using a fivefold cross-validation process on the CK+, Yale-Face and KDEF datasets.
... The research was conducted using the FER-2013 (35887 face images), SFEW (1766 face images), and RAF (15339 face images) databases, and the proposed model achieves an accuracy of 73.7%, 55.7%, and 86.3%, respectively. In a study, Zavarez et al. 51 developed a¯ne-tuned VGGFace-based CNN model for facial expression recognition using a variety of well-established datasets. They used a cross-dataset approach to investigate the extensive e®ect of¯ne-tuning models on performance. ...
Human facial expressions are an essential and fundamental component for expressing the state of the human mind. The automatic analysis of these nonverbal facial expressions has become a fascinating and quite challenging problem in computer vision, with its application in different areas, such as psychology, human–machine interaction, health, and augmented reality. Recently, deep learning (DL) has become a widespread technique for studying human nonverbal facial sentiment expressions, and some research attempts have been made to propose a certain model on this topic. The purpose of this paper is to apply the appropriate convolutional neural network (CNN) approach by adding several layers of different dimensions, which allows the CNN approach to efficiently classify human facial sentiment expressions with data augmentation capable of recognizing seven basic human facial expressions: anger, sadness, fear, disgust, happiness, surprise, and neutral. In particular, this study mainly proposes a convolution neural network architecture, as well as learning factors that minimise the memory space and total training time of the proposed network due to the shallow architecture of the model. Following that, we demonstrated our proposed model’s network complexity, computational cost, and classification accuracy on the three benchmark datasets: FER2013, KDEF, and JAFFE. As a result, our proposed approach achieves accuracy of [Formula: see text], [Formula: see text], [Formula: see text] in the FER2013, KDEF, and JAFFE, respectively, which is better compared to other state-of-the-art approaches.
... As a noninvasive and less costly instrument, FERS with AI-based algorithms also reduced the concerns of inter-and intrarater variation in the interview-based manually scored NPI [10,13]. The accuracy of the customized, video-based FERS was 86% in the Karolinska Directed Emotional Faces (KDEF) dataset; the results were comparable to those developed by histograms of oriented gradients (HOG) of facial feature extraction and other convolutional neural network (CNN) models (Table 6) [10,[21][22][23][24][25][26]. ...
Background: Behavioral and psychological symptoms of dementia (BPSD) affect 90% of persons with dementia (PwD), resulting in various adverse outcomes and aggravating care burdens among their caretakers. This study aimed to explore the potential of artificial intelligence-based facial expression recognition systems (FERS) in predicting BPSDs among PwD. Methods: A hybrid of human labeling and a preconstructed deep learning model was used to differentiate basic facial expressions of individuals to predict the results of Neuropsychiatric Inventory (NPI) assessments by stepwise linear regression (LR), random forest (RF) with importance ranking, and ensemble method (EM) of equal importance, while the accuracy was determined by mean absolute error (MAE) and root-mean-square error (RMSE) methods. Results: Twenty-three PwD from an adult day care center were enrolled with ≥ 11,500 FERS data series and 38 comparative NPI scores. The overall accuracy was 86% on facial expression recognition. Negative facial expressions and variance in emotional switches were important features of BPSDs. A strong positive correlation was identified in each model (EM: r = 0.834, LR: r = 0.821, RF: r = 0.798 by the patientwise method; EM: r = 0.891, LR: r = 0.870, RF: r = 0.886 by the MinimPy method), and EM exhibited the lowest MAE and RMSE. Conclusions: FERS successfully predicted the BPSD of PwD by negative emotions and the variance in emotional switches. This finding enables early detection and management of BPSDs, thus improving the quality of dementia care.
Full-text available
The development of a Social Intelligence System based on artificial intelligence is one of the cutting edge technologies in Assistive Robotics. Such systems need to create an empathic interaction with the users; therefore, it os required to include an Emotion Recognition (ER) framework which has to run, in near real-time, together with several other intelligent services. Most of the low-cost commercial robots, however, although more accessible by users and healthcare facilities, have to balance costs and effectiveness, resulting in under-performing hardware in terms of memory and processing unit. This aspect makes the design of the systems challenging, requiring a trade-off between the accuracy and the complexity of the adopted models. This paper proposes a compact and robust service for Assistive Robotics, called Lightweight EMotion recognitiON (LEMON), which uses image processing, Computer Vision and Deep Learning (DL) algorithms to recognize facial expressions. Specifically, the proposed DL model is based on Residual Convolutional Neural Networks with the combination of Dilated and Standard Convolution Layers. The first remarkable result is the few numbers (i.e., 1.6 Million) of parameters characterizing our model. In addition, Dilated Convolutions expand receptive fields exponentially with preserving resolution, less computation and memory cost to recognize the distinction among facial expressions by capturing the displacement of the pixels. Finally, to reduce the dying ReLU problem and improve the stability of the model, we apply an Exponential Linear Unit (ELU) activation function in the initial layers of the model. We have performed training and evaluation (via one-and five-fold cross validation) of the model with five datasets available in the community and one mixed dataset created by taking samples from all of them. With respect to the other approaches, our model achieves comparable results with a significant reduction in terms of the number of parameters.
Full-text available
Among the main characteristics of an individual with autism spectrum disorder are repetitive behavioral patterns, deficiencies in social interaction and both verbal and nonverbal communication present since childhood. The ability to recognize mental states from facial expressions plays a vital role in social interaction and interpersonal communication. In recent years, several studies have been carried out with the aim of motivating individuals to use computer technologies to learn emotions in order to improve social interactions. In this paper, a game that can support the development of emotional and social skills is presented for people with autism spectrum disorder. Our game allows people to develop the ability to recognize and express basic emotions: joy, sadness, anger, disgust, surprise and fear. Experiments were performed on a public domain image database and with a group of individuals from an educational institution, in order to evaluate the performance of the proposed tool. The results showed that the use of our approach improved these capabilities in individuals with autism spectrum disorder.
Full-text available
Although real-time facial emotion recognition is a hot topic research domain in the field of human-computer interaction, state-of-the-art available datasets still suffer from various problems such as some unrelated photos like document photos, unbalanced number of photos in each class, and some misleading images that can affect negatively on correct classification. 3RL dataset was created which contains about 24K images and will be publically available, to overcome previously available datasets problems. 3RL dataset is labelled with five basic emotions: happiness, fear, sadness, disgust, and anger. Moreover, we compared 3RL dataset with other most famous state-of-the-art datasets (FER dataset, CK+ dataset), we have applied the most common used algorithms in previous works, SVM and CNN. Results have shown a noticeable improvement of generalization on 3RL dataset. Experiments have shown an accuracy of up to 91.4% on 3RL dataset using CNN where results on FER2013, CK+ are respectively (approximately from 60% to 85%).
Although real-time facial emotion recognition is a hot topic research domain in the field of human-computer interaction, state-of-the-art available datasets still suffer from various problems such as some unrelated photos like document photos, unbalanced number of photos in each class, and some misleading images that can affect negatively on correct classification. 3RL dataset was created which contains about 24K images and will be publically available, to overcome previously available datasets problems. 3RL dataset is labelled with five basic emotions: happiness, fear, sadness, disgust, and anger. Moreover, we compared 3RL dataset with other most famous state-of-the-art datasets (FER dataset, CK+ dataset), we have applied the most common used algorithms in previous works, SVM and CNN. Results have shown a noticeable improvement of generalization on 3RL dataset. Experiments have shown an accuracy of up to 91.4% on 3RL dataset using CNN where results on FER2013, CK+ are respectively (approximately from 60% to 85%).
The research objectives of this thesis concern the development of new concepts for image segmentation and region classification for image analysis. This involves implementing new descriptors, whether color, texture, or shape, to characterize regions and propose new deep learning architectures for the various applications linked to facial analysis. We restrict our focus on face recognition and person-independent facial expressions classification tasks, which are more challenging, especially in unconstrained environments. Our thesis lead to the proposal of many contributions related to facial analysis based on handcrafted and deep architecture.We contributed to face recognition by an effective local features descriptor referred to as Mixed Neighborhood Topology Cross Decoded Patterns (MNTCDP). Our face descriptor relies on a new neighborhood topology and a sophisticated kernel function that help to effectively encode the person-related features. We evaluated the proposed MNTCDP-based face recognition system according to well-known and challenging benchmarks of the state-of-the-art, covering individuals' diversity, uncontrolled environment, variable background and lighting conditions. The achieved results outperformed several state-of-the-art ones.As a second contribution, we handled the challenge of pose-invariant face recognition (PIFR) by developing a Generative Adversarial Network (GAN) based image translation to generate a frontal image corresponding to a profile one. Hence, this translation makes the recognition much easier since most reference databases include only frontal face samples. We made an End-to-End deep architecture that contains the GAN for translating profile samples and a ResNet-based classifier to identify the person from its synthesized frontal image. 
The experiments, which we conducted on a dataset that satisfies person-independent constraints between training and testing, highlight a significant improvement in PIFR performance. Our contributions to the facial expression recognition (FER) task cover both static and dynamic scenarios. The static FER framework relies on extracting textural and shape features from specific face landmarks that carry enough information to detect the dominant emotion. We proposed a new descriptor referred to as Orthogonal and Parallel-based Directions Generic Query Map Binary Patterns (OPD-GQMBP) to efficiently extract emotion-related textural features from 49 landmarks (regions of 32 by 32 pixels). These features are combined with shape features computed by applying the Histogram of Oriented Gradients (HOG) descriptor to a binary mask representing the interpolation of the 49 landmarks. Classification is done with an SVM classifier. The person-independent performance achieved on five benchmarks under the Leave One Subject Out protocol demonstrated the effectiveness of the overall proposed framework against deep and handcrafted state-of-the-art ones. The dynamic FER contribution, on the other hand, incorporates a Long Short-Term Memory (LSTM) deep network to encode the temporal information efficiently, with a guiding attention map to focus on the emotion-related landmarks and guarantee the person-independent constraint. We considered four samples as inputs representing the evolution of the emotion to its peak. Each sample is encoded through a ResNet-based stream, and the four streams are joined by an LSTM block that predicts the dominant emotion. The experiments conducted on three datasets for dynamic FER showed that the proposed deep CNN-LSTM architecture outperforms the state-of-the-art.
Automatic facial expression recognition is an active research area that has attracted much attention from both academics and practitioners of different fields. In practice, however, noise interference and cross-dataset expression recognition generally degrade the performance of recognition methods. We investigate these problems and propose a facial expression recognition approach based on deep subspace filter learning combined with a discriminative classification criterion. Specifically, to derive an effective expression-related feature representation, we construct filtered dictionaries, based on a deep subspace filter learning structure, that correspond to different expressions. In addition, considering the similarities and differences among the filtered dictionaries, we present a flexible classification criterion that adopts a dynamic weight to increase the adaptation between filtered dictionaries. In summary, the proposed approach has more discriminative power in terms of both representation and classification. Comprehensive experiments carried out on several public datasets, including JAFFE, CK+, and KDEF, confirm that the proposed approach is superior to several state-of-the-art methods.
Automated Facial Expression Recognition (FER) has been a challenging task for decades. Many of the existing works use hand-crafted features such as LBP, HOG, LPQ, and Histogram of Optical Flow (HOF) combined with classifiers such as Support Vector Machines for expression recognition. These methods often require rigorous hyperparameter tuning to achieve good results. Recently, Deep Neural Networks (DNN) have been shown to outperform traditional methods in visual object recognition. In this paper, we propose a two-part network consisting of a DNN-based architecture followed by a Conditional Random Field (CRF) module for facial expression recognition in videos. The first part captures the spatial relation within facial images using convolutional layers followed by three Inception-ResNet modules and two fully-connected layers. To capture the temporal relation between the image frames, we use a linear-chain CRF in the second part of our network. We evaluate our proposed network on three publicly available databases, viz. CK+, MMI, and FERA. Experiments are performed in subject-independent and cross-database manners. Our experimental results show that cascading the deep network architecture with the CRF module considerably improves the recognition of facial expressions in videos; in particular, it outperforms the state-of-the-art methods in the cross-database experiments and yields comparable results in the subject-independent experiments.
Facial expression recognition has been an active research area in the past ten years, with growing application areas including avatar animation, neuromarketing and sociable robots. The recognition of facial expressions is not an easy problem for machine learning methods, since people can vary significantly in the way they show their expressions. Even images of the same person with the same facial expression can vary in brightness, background and pose, and these variations are emphasized when considering different subjects (because of variations in shape and ethnicity, among others). Although facial expression recognition is widely studied in the literature, few works perform a fair evaluation that avoids mixing subjects between training and testing. Hence, facial expression recognition is still a challenging problem in computer vision. In this work, we propose a simple solution for facial expression recognition that uses a combination of a Convolutional Neural Network and specific image pre-processing steps. Convolutional Neural Networks achieve better accuracy with big data. However, there are no publicly available datasets with sufficient data for facial expression recognition with deep architectures. Therefore, to tackle the problem, we apply some pre-processing techniques to extract only expression-specific features from a face image and explore the presentation order of the samples during training. The experiments employed to evaluate our technique were carried out using three widely used public databases (CK+, JAFFE and BU-3DFE). A study of the impact of each image pre-processing operation on the accuracy rate is presented. The proposed method achieves competitive results when compared with other facial expression recognition methods (96.76% accuracy on the CK+ database), is fast to train, and allows real-time facial expression recognition on standard computers.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
We present a novel method for classifying emotions from static facial images. Our approach leverages the recent success of Convolutional Neural Networks (CNN) on face recognition problems. Unlike the settings often assumed there, far less labeled data is typically available for training emotion classification systems. Our method is therefore designed with the goal of simplifying the problem domain by removing confounding factors from the input images, with an emphasis on image illumination variations, in an effort to reduce the amount of data required to effectively train deep CNN models. To this end, we propose novel transformations of image intensities to 3D spaces, designed to be invariant to monotonic photometric transformations. These are applied to CASIA Webface images, which are then used to train an ensemble of multiple-architecture CNNs on multiple representations. Each model is then fine-tuned with limited emotion-labeled training data to obtain the final classification models. Our method was tested on the Emotion Recognition in the Wild Challenge (EmotiW 2015), Static Facial Expression Recognition sub-challenge (SFEW), and shown to provide a substantial 15.36% improvement over baseline results (40% gain in performance).
Conference Paper
We report our image based static facial expression recognition method for the Emotion Recognition in the Wild Challenge (EmotiW) 2015. We focus on the sub-challenge of the SFEW 2.0 dataset, where one seeks to automatically classify a set of static images into 7 basic emotions. The proposed method contains a face detection module based on the ensemble of three state-of-the-art face detectors, followed by a classification module with the ensemble of multiple deep convolutional neural networks (CNN). Each CNN model is initialized randomly and pre-trained on a larger dataset provided by the Facial Expression Recognition (FER) Challenge 2013. The pre-trained models are then fine-tuned on the training set of SFEW 2.0. To combine multiple CNN models, we present two schemes for learning the ensemble weights of the network responses: by minimizing the log likelihood loss, and by minimizing the hinge loss. Our proposed method generates state-of-the-art result on the FER dataset. It also achieves 55.96% and 61.29% respectively on the validation and test set of SFEW 2.0, surpassing the challenge baseline of 35.96% and 39.13% with significant gains.
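As a rough illustration of the ensemble-weighting idea in the abstract above, the sketch below combines the class probabilities of two models with a convex weight and picks the weight that minimizes the average negative log likelihood on labeled samples. It is only a minimal sketch under stated assumptions: the function names (`ensemble_probs`, `neg_log_likelihood`, `best_weight_two_models`) are hypothetical, and the grid search stands in for whatever optimizer the cited paper actually uses.

```python
import math


def ensemble_probs(model_probs, weights):
    """Weighted average of per-model class probabilities for one sample."""
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[k] for w, probs in zip(weights, model_probs))
        for k in range(n_classes)
    ]


def neg_log_likelihood(samples, labels, weights):
    """Average negative log likelihood of the true labels under the ensemble."""
    total = 0.0
    for model_probs, label in zip(samples, labels):
        p_true = ensemble_probs(model_probs, weights)[label]
        total -= math.log(max(p_true, 1e-12))  # guard against log(0)
    return total / len(labels)


def best_weight_two_models(samples, labels, steps=20):
    """Grid-search w in [0, 1] for the weights [w, 1 - w] (assumed scheme)."""
    best_w, best_loss = None, float("inf")
    for i in range(steps + 1):
        w = i / steps
        loss = neg_log_likelihood(samples, labels, [w, 1.0 - w])
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Toy data: model A is confident and correct, model B is uninformative.
samples = [
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.1, 0.9], [0.5, 0.5]],
]
labels = [0, 1]
weight, loss = best_weight_two_models(samples, labels)
```

On this toy data the search assigns all of the weight to the informative model, which is the behaviour one would expect from any loss-driven weighting scheme.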
Conference Paper
In this work, an efficient algorithm for facial expression recognition is presented that uses two local feature descriptors, Local Binary Pattern (LBP) and Local Directional Number Pattern (LDN), together with a soft computing technique, Adaptive Neuro-Fuzzy Inference Systems (ANFIS). In the first experiment, the local binary pattern is computed from the input image. In the second experiment, the face image is subjected to a Kirsch compass mask that gives the directional information of the image, and the Local Directional Number Pattern (LDN) code is computed from the masked output. The resulting LBP and LDN images are divided into several regions, and the distributions of the LBP and LDN features are extracted from them. These features are then concatenated into a feature vector, which is used for ANFIS training and classification. The experimental evaluation of the presented method is carried out using the Japanese Female Facial Expression Database (JAFFE) and the Indian Face Database (IFD). The results obtained from the experiments show that the presented method successfully recognizes facial expression variations.
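The LBP step mentioned in the abstract above can be sketched as follows. This is a minimal sketch of the basic 3x3 LBP operator in plain Python, not the cited authors' implementation: the bit ordering, the `>=` comparison, and the function names (`lbp_code`, `lbp_image`, `lbp_histogram`) are assumptions made for illustration.

```python
def lbp_code(patch):
    """Return the 8-bit LBP code of a 3x3 patch given as a list of 3 rows."""
    center = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


def lbp_image(image):
    """LBP codes for every interior pixel of a 2D list of grey levels."""
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            patch = [image[r + dr][c - 1:c + 2] for dr in (-1, 0, 1)]
            row_codes.append(lbp_code(patch))
        codes.append(row_codes)
    return codes


def lbp_histogram(codes, bins=256):
    """Histogram of LBP codes, i.e. the feature distribution of one region."""
    hist = [0] * bins
    for row in codes:
        for code in row:
            hist[code] += 1
    return hist


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
codes = lbp_image(image)
hist = lbp_histogram(codes)
```

In a region-based pipeline like the one described, a histogram of this kind would be computed per region and the per-region histograms concatenated into a single feature vector.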
Relatively small data sets available for expression recognition research make the training of deep networks for expression recognition very challenging. Although fine-tuning can partially alleviate the issue, the performance is still below acceptable levels as the deep features probably contain redundant information from the pre-trained domain. In this paper, we present FaceNet2ExpNet, a novel idea to train an expression recognition network based on static images. We first propose a new distribution function to model the high-level neurons of the expression network. Based on this, a two-stage training algorithm is carefully designed. In the pre-training stage, we train the convolutional layers of the expression net, regularized by the face net; in the refining stage, we append fully-connected layers to the pre-trained convolutional layers and train the whole network jointly. Visualization shows that the model trained with our method captures improved high-level expression semantics. Evaluations on four public expression databases, CK+, Oulu-CASIA, TFD, and SFEW, demonstrate that our method achieves better results than state-of-the-art.
Conference Paper
Objective functions for training of deep networks for face-related recognition tasks, such as facial expression recognition (FER), usually consider each sample independently. In this work, we present a novel peak-piloted deep network (PPDN) that uses a sample with peak expression (easy sample) to supervise the intermediate feature responses for a sample of non-peak expression (hard sample) of the same type and from the same subject. The expression evolving process from non-peak expression to peak expression can thus be implicitly embedded in the network to achieve the invariance to expression intensities. A special-purpose back-propagation procedure, peak gradient suppression (PGS), is proposed for network training. It drives the intermediate-layer feature responses of non-peak expression samples towards those of the corresponding peak expression samples, while avoiding the inverse. This avoids degrading the recognition capability for samples of peak expression due to interference from their non-peak expression counterparts. Extensive comparisons on two popular FER datasets, Oulu-CASIA and CK+, demonstrate the superiority of the PPDN over state-of-the-art FER methods, as well as the advantages of both the network structure and the optimization strategy. Moreover, it is shown that PPDN is a general architecture, extensible to other tasks by proper definition of peak and non-peak samples. This is validated by experiments that show state-of-the-art performance on pose-invariant face recognition, using the Multi-PIE dataset.
Conference Paper
It is well known that deep models can extract robust and abstract features. We propose an efficient facial expression recognition model based on features transferred from deep convolutional networks (ConvNets). We train the deep ConvNets on the task of 1580-class face identification on the MSRA-CFW database and transfer high-level features from the trained deep model to recognize expressions. To train and test the facial expression recognition model on a larger scale, we built a facial expression database of seven basic emotion states and 2062 imbalanced samples drawn from four facial expression databases (CK+, JAFFE, KDEF, and Pain expressions from PICS). Compared with a 50.65% recognition rate based on Gabor features with a seven-class SVM and a 78.84% recognition rate based on distance features with a seven-class SVM, we achieve an average recognition rate of 80.49% with the seven-class SVM classifier on the self-built facial expression database. Since faces are often occluded in practice, we test our model under occlusion and demonstrate that the model keeps its classification ability when the occlusion is small. To increase this ability further, we improve the facial expression recognition model. The modified model merges high-level features transferred from two trained deep ConvNets with the same structure but different training sets. The modified model clearly improves classification under occlusion and achieves an average accuracy of 81.50% on the self-built facial expression database.