Suppressing Uncertainties for Large-Scale Facial Expression Recognition
Kai Wang1,2, Xiaojiang Peng1, Jianfei Yang3, Shijian Lu3, and Yu Qiao1
1ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint
Lab,Shenzhen Institutes of Advanced Technology, Chinese Academy of Science
2University of Chinese Academy of Sciences, China
3Nanyang Technological University Singapore
Annotating a high-quality large-scale facial expression dataset is extremely difficult due to the uncertainties caused by ambiguous facial expressions, low-quality facial images, and the subjectiveness of annotators. These uncertainties pose a key challenge for large-scale Facial Expression Recognition (FER) in the deep learning era. To address this problem, this paper proposes a simple yet efficient Self-Cure Network (SCN), which suppresses the uncertainties efficiently and prevents deep networks from over-fitting uncertain facial images. Specifically, SCN suppresses the uncertainty from two different aspects: 1) a self-attention mechanism over the mini-batch to weight each training sample with a ranking regularization, and 2) a careful relabeling mechanism to modify the labels of the samples in the lowest-ranked group. Experiments on synthetic FER datasets and our collected WebEmotion dataset validate the effectiveness of our method. Results on public benchmarks demonstrate that our SCN outperforms current state-of-the-art methods with 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus. The code will be made available.
1. Introduction
Facial expression is one of the most natural, powerful
and universal signals for human beings to convey their emo-
tional states and intentions [7,38]. Automatically recogniz-
ing facial expression is also important to help the computer
understand human behavior and to interact with them. In
the past decades, researchers have made significant progress
on facial expression recognition (FER) with algorithms and
large-scale datasets, where datasets can be collected in the laboratory or in the wild, such as CK+ [29], MMI [39], Oulu-CASIA [47], SFEW/AFEW [10], FERPlus [4], AffectNet [32], EmotioNet [11], RAF-DB [22], etc.
Figure 1: Illustration of uncertainties on real-world facial images from RAF-DB. The samples on the right are extremely difficult for machines and even for humans, and are better suppressed in training.
However, for the large-scale FER datasets collected from
the Internet, it is extremely difficult to annotate with high
quality due to the uncertainties caused by the subjective-
ness of annotators as well as ambiguous in-the-wild facial
images. As illustrated in Figure 1, the uncertainties increase from high-quality and evident facial expressions to low-quality and micro expressions. These uncertainties usually lead to inconsistent and incorrect labels, which hinder the progress of large-scale Facial Expression Recognition (FER), especially for data-driven deep learning based FER. Generally, training with uncertainties in FER may lead to the following problems. First, it may result in over-fitting on the uncertain samples, which may be mislabeled. Second, it hampers a model from learning useful facial expression features. Third, a high ratio of incorrect labels can even prevent the model from converging in the early stage of optimization.
arXiv:2002.10392v2 [cs.CV] 6 Mar 2020
To address these issues, we propose a simple yet efficient method, termed Self-Cure Network (SCN), to suppress the uncertainties for large-scale facial expression recognition. The SCN consists of three crucial modules: self-attention importance weighting, ranking regularization, and noise relabeling. Given a batch of images, a backbone CNN is first used to extract facial features. Then the self-attention importance weighting module learns a weight for each image to capture the sample importance for loss weighting. It is expected that uncertain facial images are assigned low importance weights. Further, the ranking regularization module ranks these weights in descending order, splits them into two groups (i.e. high importance weights and low importance weights), and regularizes the two groups by enforcing a margin between their average weights. This regularization is implemented with a loss function, termed Rank Regularization loss (RR-Loss). The ranking regularization module ensures that the first module learns meaningful weights to highlight certain samples (e.g. reliable annotations) and to suppress uncertain samples (e.g. ambiguous annotations). The last module is a careful relabeling module that attempts to relabel the samples from the bottom group by comparing the maximum predicted probabilities to the probabilities of the given labels. A sample is assigned a pseudo label if the maximum predicted probability exceeds that of the given label by a margin threshold. In addition, since the main evidence of uncertainties is the incorrect/noisy annotation problem, we collect an extremely noisy FER dataset from the Internet, termed WebEmotion, to investigate the effect of SCN under extreme uncertainties.
Overall, our contributions can be summarized as follows:
• We innovatively pose the uncertainty problem in facial expression recognition, and propose a Self-Cure Network to reduce the impact of uncertainties.
• We elaborately design a rank regularization to supervise the SCN to learn meaningful importance weights, which also provides a reference for the relabeling module.
• We extensively validate our SCN on synthetic FER data and a new real-world uncertain emotion dataset (WebEmotion) collected from the Internet. Our SCN also achieves 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus, which set new records on these benchmarks.
2. Related Work
2.1. Facial Expression Recognition
Generally, a FER system mainly consists of three stages,
namely face detection, feature extraction, and expression
recognition. In the face detection stage, face detectors such as MTCNN [44] and Dlib [2] are used to locate faces in complex scenes. The detected faces can optionally be further aligned. For feature extraction, various methods are
designed to capture facial geometry and appearance features
caused by facial expressions. According to the feature type,
they can be grouped into engineered features and learning-
based features. For the engineered features, they can be
further divided into texture-based local features, geometry-
based global features, and hybrid features. The texture-
based features mainly include SIFT [34], HOG [6], His-
tograms of LBP [35], Gabor wavelet coefficients [26], etc.
The geometry-based global features are mainly based on the
landmark points around noses, eyes, and mouths. Combining two or more of the engineered features is referred to as hybrid feature extraction, which can further enrich the representation. For the learned features, Fasel [12] finds that a
shallow CNN is robust to face poses and scales. Tang [37]
and Kahou et al. [21] utilize deep CNNs for feature extraction, and win the FER2013 and EmotiW 2013 challenges, respectively. Liu et al. [27] propose a Facial Action Units
based CNN architecture for expression recognition. Re-
cently, both Li et al. [25] and Wang et al. [42] have de-
signed region-based attention networks for pose and occlu-
sion aware FER, where the regions are either cropped from
landmark points or fixed positions.
2.2. Learning with Uncertainties
Uncertainties in the FER task mainly come from ambiguous facial expressions, low-quality facial images, inconsistent annotations, and incorrect annotations (i.e. noisy labels). In particular, learning with noisy labels has been extensively studied in the computer vision community, while the other aspects are rarely explored. In order to handle
noisy labels, one intuitive idea is to leverage a small set of
clean data that can be used to assess the quality of the labels
during the training process [40,23,8], or to estimate the
noise distribution [36], or to train the feature extractors [3].
Li et al. [23] propose a unified distillation framework using
‘side’ information from a small clean dataset and label re-
lations in knowledge graph, to ‘hedge the risk’ of learning
from noisy labels. Veit et al. [41] use a multi-task network that jointly learns to clean noisy annotations and to classify images. Azadi et al. [3] select reliable images by an auxiliary image regularization for deep CNNs with noisy labels. Other methods do not need a small clean dataset but may assume extra constraints or distributions on the noisy samples [31], such as a specific loss for randomly
flipped labels [33], regularizing the deep networks on cor-
rupted labels by a MentorNet [20], and other approaches
that model the noise with a softmax layer by connecting
the latent correct labels to the noisy ones [13,43]. For the
FER task, Zeng et al. [43] first consider the inconsistent
annotation problem among different FER datasets, and pro-
pose to leverage these uncertainties to improve FER. In con-
trast, our work focuses on suppressing these uncertainties
to learn better facial expression features.
3. Self-Cure Network
To learn robust facial expression features with uncertain-
ties, we propose a simple yet efficient Self-Cure Network
(SCN). In this section, we first provide an overview of the
SCN, and then present its three modules. We finally present
the detailed implementation of SCN.
3.1. Overview of Self-Cure Network
Our SCN is built upon traditional CNNs and consists of
three crucial modules: i) self-attention importance weight-
ing, ii) ranking regularization, and iii) relabeling, as shown
in Figure 2.
Given a batch of face images with some uncertain sam-
ples, we first extract the deep features by a backbone
network. The self-attention importance weighting mod-
ule assigns an importance weight for each image using
a fully-connected (FC) layer and the sigmoid function.
These weights are multiplied by the logits for a sample re-
weighting scheme. To explicitly reduce the importance of
uncertain samples, a rank regularization module is further
introduced to regularize the attention weights. In the rank
regularization module, we first rank the learned attention
weights and then split them into two groups, i.e. high and
low importance groups. We then add a constraint between
the mean weights of these groups by a margin-based loss,
which is called rank regularization loss (RR-Loss). To further improve our SCN, the relabeling module is added to modify some of the uncertain samples in the low importance group. This relabeling operation aims to recover more clean samples and thereby enhance the final model. The whole SCN can be trained in an end-to-end manner and easily added to any CNN backbone.
3.2. Self-Attention Importance Weighting
We introduce the self-attention importance weighting module to capture the contributions of samples for training. It is expected that certain samples have high importance weights while uncertain ones have low importance. Let $F = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$ denote the facial features of $N$ images. The self-attention importance weighting module takes $F$ as input and outputs an importance weight for each feature. Specifically, the module is comprised of a linear fully-connected (FC) layer and a sigmoid activation function, which can be formulated as
$$\alpha_i = \sigma(W_a^\top x_i), \tag{1}$$
where $\alpha_i$ is the importance weight of the $i$-th sample, $W_a$ denotes the parameters of the FC layer used for attention, and $\sigma$ is the sigmoid function. This module also provides a reference for the other two modules.
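As a minimal illustration of the weighting module described above (one FC layer followed by a sigmoid), the following sketch computes one weight per feature vector in plain Python. The function names, the toy feature vectors, and the omission of a bias term are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def importance_weights(features, w_a):
    """Return one weight in (0, 1) per feature vector: sigmoid(w_a . x_i).

    `features` is a list of D-dimensional vectors and `w_a` is a
    D-dimensional weight vector (hypothetical values, no bias term).
    """
    return [sigmoid(sum(w * x for w, x in zip(w_a, x_i))) for x_i in features]


# Toy batch of three 3-dimensional features (illustrative values only).
features = [[0.5, -1.0, 2.0], [0.0, 0.0, 0.0], [-3.0, 1.0, -2.0]]
w_a = [1.0, 0.5, 1.0]
alphas = importance_weights(features, w_a)
```

In this toy batch the all-zero feature receives a weight of exactly 0.5, and features with a larger projection onto `w_a` receive larger weights.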
Logit-Weighted Cross-Entropy Loss. With the attention weights, we have two simple choices to perform loss weighting, inspired by [17]. The first choice is to multiply the loss of each sample by its weight. In our case, since the weights are optimized in an end-to-end manner and are learned from the CNN features, they tend to collapse to zero, because this trivial solution yields zero loss. MentorNet [20] and other self-paced learning methods [19,30] solve this problem by alternating minimization, i.e. optimizing one at a time while the other is held fixed. In this paper, we choose the logit-weighted variant of [17], which is shown to be more efficient. For a multi-class Cross-Entropy loss, we call our weighted loss the Logit-Weighted Cross-Entropy loss (WCE-Loss), which is formulated as
$$\mathcal{L}_{WCE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\alpha_i W_{y_i}^\top x_i}}{\sum_{j=1}^{C} e^{\alpha_i W_j^\top x_i}}, \tag{2}$$
where $W_j$ is the $j$-th classifier, $y_i$ is the given label of the $i$-th sample, and $C$ is the number of classes. As suggested in [28], $\mathcal{L}_{WCE}$ has a positive correlation with $\alpha$.
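The logit-weighted loss can be sketched in a few lines of plain Python: each sample's logits are scaled by its importance weight before the usual softmax cross-entropy is applied. This is an illustrative sketch under assumptions, not the authors' code; the function names and toy logits are hypothetical, and the logits stand in for the classifier outputs computed from the features.

```python
import math


def log_softmax(logits):
    """Log-softmax with the max-subtraction trick for numerical stability."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def wce_loss(logits, labels, alphas):
    """Mean cross-entropy where sample i's logits are multiplied by alphas[i]."""
    total = 0.0
    for z, y, a in zip(logits, labels, alphas):
        scaled = [a * v for v in z]
        total -= log_softmax(scaled)[y]
    return total / len(logits)


# One confidently correct 3-class sample (illustrative values only).
logits = [[2.0, 0.0, -1.0]]
labels = [0]
loss_alpha_zero = wce_loss(logits, labels, [0.0])
loss_alpha_one = wce_loss(logits, labels, [1.0])
```

With a weight of zero the scaled logits are uniform, so the loss equals log C rather than zero. In this sketch, driving all weights to zero is therefore not a zero-loss solution, unlike multiplying the per-sample loss by the weight.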
3.3. Rank Regularization
The self-attention weights in the above module can be arbitrary in (0, 1). To explicitly constrain the importance of uncertain samples, we design a rank regularization module to regularize the attention weights. In the rank regularization module, we first rank the learned attention weights in descending order and then split them into two groups with a ratio $\beta$. The rank regularization ensures that the mean attention weight of the high-importance group is higher than that of the low-importance group by a margin. Formally, we define a rank regularization loss (RR-Loss) for this purpose as follows,
$$\mathcal{L}_{RR} = \max\{0, \delta_1 - (\alpha_H - \alpha_L)\}, \tag{3}$$
$$\alpha_H = \frac{1}{M} \sum_{i=1}^{M} \alpha_i, \quad \alpha_L = \frac{1}{N-M} \sum_{i=M+1}^{N} \alpha_i, \tag{4}$$
where the weights $\alpha_i$ are indexed in descending order, $\delta_1$ is a margin which can be a fixed hyper-parameter or a learnable parameter, and $\alpha_H$ and $\alpha_L$ are the mean values of the high importance group with $\beta N = M$ samples and the low importance group with $N - M$ samples, respectively. In training, the total loss function is $\mathcal{L}_{all} = \gamma \mathcal{L}_{RR} + (1 - \gamma) \mathcal{L}_{WCE}$, where $\gamma$ is a trade-off ratio.
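The rank regularization step can be illustrated with a short plain-Python sketch: sort the weights, split them by the ratio of high-importance samples, and apply a hinge on the gap between the two group means. The function names and the rounding of the group size are assumptions for illustration; the default ratio 0.7 and margin 0.15 are the defaults reported in the implementation details.

```python
def rank_regularization_loss(alphas, beta=0.7, margin=0.15):
    """Hinge loss on the gap between the mean weights of the two groups."""
    n = len(alphas)
    ranked = sorted(alphas, reverse=True)
    # Size of the high-importance group; clamped so both groups are non-empty.
    m = min(max(round(beta * n), 1), n - 1)
    mean_high = sum(ranked[:m]) / m
    mean_low = sum(ranked[m:]) / (n - m)
    return max(0.0, margin - (mean_high - mean_low))


def total_loss(rr_loss, wce_loss, gamma=0.5):
    """Trade-off between the two losses: gamma * RR + (1 - gamma) * WCE."""
    return gamma * rr_loss + (1.0 - gamma) * wce_loss


# Well-separated weights: the gap between group means exceeds the margin.
separated = [0.9, 0.8, 0.85, 0.7, 0.75, 0.95, 0.6, 0.5, 0.55, 0.65]
# Identical weights: no gap at all, so the full margin is paid as loss.
flat = [0.5] * 10

loss_separated = rank_regularization_loss(separated)
loss_flat = rank_regularization_loss(flat)
```

When the group means already differ by more than the margin the loss vanishes, so the regularizer only acts on batches whose weights are not yet separated.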
3.4. Relabeling
In the rank regularization module, each mini-batch is di-
vided into two groups, i.e. the high-importance and the low-
importance groups. We experimentally find that the uncer-
tain samples usually have low importance weights, thus an
intuitive idea is to design a strategy to relabel these samples.
Figure 2: The pipeline of our Self-Cure Network. Face images are first fed into a backbone CNN for feature extraction. The self-attention importance weighting module learns sample weights from facial features for loss weighting. The rank regularization module takes the sample weights as input and constrains them with a ranking operation and a margin-based loss function. The relabeling module hunts reliable samples by comparing maximum predicted probabilities to the probabilities of given labels. Mislabeled samples are marked with red solid rectangles and ambiguous samples with green dashed ones. It is worth noting that SCN mainly resorts to the re-weighting operation to suppress these uncertainties and only modifies some of the uncertain samples.
Table 1: The statistics of our WebEmotion.
Category   Happy   Sad     Surprise  Fear    Angry   Disgust  Contempt  Neutral  Total
# Videos   4,231   5,670   4,573     5,328   5,668   5,197    5,266     5,406    41,339
# Clips    27,854  29,667  27,418    29,822  31,483  20,764   6,454     26,687   200,149
The main challenge to modify these annotations is to know
which annotation is incorrect.
Specifically, our relabeling module only considers the samples in the low-importance group and is performed on the Softmax probabilities. For each sample, we compare the maximum predicted probability to the probability of the given label. A sample is assigned a new pseudo label if the maximum predicted probability exceeds that of the given label by a threshold. Formally, the relabeling module can be defined as
$$y' = \begin{cases} l_{max} & \text{if } P_{max} - P_{gtInd} > \delta_2, \\ l_{org} & \text{otherwise,} \end{cases} \tag{5}$$
where $y'$ denotes the new label, $\delta_2$ is a threshold, $P_{max}$ is the maximum predicted probability, and $P_{gtInd}$ is the predicted probability of the given label. $l_{org}$ and $l_{max}$ are the original given label and the index of the maximum prediction, respectively.
In our system, uncertain samples are expected to obtain low importance weights, which reduces their negative impact through re-weighting; they then fall into the low-importance group and may finally be corrected into certain samples by relabeling. Those corrected samples may obtain high importance weights in the next epoch. We expect that the network can be cured by itself with either re-weighting or relabeling, which is why we call our method the Self-Cure Network.
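The relabeling rule described in this section can be sketched as follows in plain Python. Only samples in the low-importance group are considered, and a label is replaced only when the top predicted probability exceeds the probability of the given label by more than the threshold. The function names, the rounding of the group size, and the toy probabilities are assumptions for illustration.

```python
def relabel(probs, given_label, margin=0.2):
    """Return the arg-max class if it beats the given label by more than `margin`."""
    l_max = max(range(len(probs)), key=probs.__getitem__)
    if probs[l_max] - probs[given_label] > margin:
        return l_max
    return given_label


def relabel_low_group(probs, labels, alphas, beta=0.7, margin=0.2):
    """Apply `relabel` only to the samples ranked in the low-importance group."""
    n = len(labels)
    m = min(max(round(beta * n), 1), n - 1)  # size of the high-importance group
    order = sorted(range(n), key=lambda i: alphas[i], reverse=True)
    new_labels = list(labels)
    for i in order[m:]:
        new_labels[i] = relabel(probs[i], labels[i], margin)
    return new_labels


# Four two-class samples (illustrative softmax outputs and weights).
probs = [[0.1, 0.9], [0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]
labels = [0, 1, 1, 1]
alphas = [0.9, 0.2, 0.8, 0.1]
new_labels = relabel_low_group(probs, labels, alphas, beta=0.5)
```

In this toy batch, sample 0 disagrees with its label but sits in the high-importance group, so it is left untouched; sample 1 is relabeled; sample 3 is in the low group but its probability gap is below the threshold, so it keeps its label.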
3.5. Implementation
Pre-processing and facial features. In our SCN, face images are detected and aligned by MTCNN [45] and further resized to 224 × 224 pixels. The SCN is implemented with the PyTorch toolbox and the backbone network is ResNet-18 [16]. By default, the ResNet-18 is pre-trained on the MS-Celeb-1M face recognition dataset and the facial features are extracted from its last pooling layer.
Training. We train our SCN in an end-to-end manner with 8 Nvidia Titan 2080ti GPUs, and set the batch size to 1024. In each iteration, the training images are divided into two groups, with 70% high importance samples and 30% low importance samples by default. The margin $\delta_1$ between the mean values of the high and low importance groups can either be set to 0.15 by default or designed as a learnable parameter. Both strategies are evaluated in the Experiments section. The whole network is jointly optimized with RR-Loss and WCE-Loss. The ratio of the two losses is empirically set to 1:1, and its influence is studied in the ablation study of the Experiments section. The learning rate is initialized to 0.1 and is divided by 10 after 15 epochs and after 30 epochs. Training stops at 40 epochs. The relabeling module is included for optimization from the 10th epoch, where the relabeling margin $\delta_2$ is set to 0.2 by default.
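The training schedule above can be written down as a small helper, shown here as a sketch in plain Python. It assumes epochs are counted from 1 to 40, that the learning rate drops take effect from epochs 16 and 31, and that relabeling is active from epoch 10 onward; the text does not state the indexing convention, so these boundaries are an assumption, as are the function names.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(15, 30)):
    """Step schedule: divide the rate by 10 once each milestone epoch has passed.

    `epoch` is 1-indexed (assumption); epochs 1-15 use base_lr,
    16-30 use base_lr / 10, and 31-40 use base_lr / 100.
    """
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr / (10 ** drops)


def relabeling_active(epoch, start_epoch=10):
    """Relabeling is switched on from the 10th epoch (1-indexed, assumption)."""
    return epoch >= start_epoch


# Full 40-epoch schedule as (epoch, learning rate, relabeling on/off) tuples.
schedule = [(e, learning_rate(e), relabeling_active(e)) for e in range(1, 41)]
```

In a PyTorch training loop the same step schedule would typically be expressed with a multi-step learning rate scheduler; the helper above only makes the epoch boundaries explicit.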
4. Experiments
In this section, we first describe three public datasets and
our WebEmotion dataset. We then demonstrate the robust-
ness of our SCN under uncertainties of both synthetic and
real-world noisy facial expression annotations. Further, we
conduct ablation studies with qualitative and quantitative re-
sults to show the effectiveness of each module in SCN. Fi-
nally, we compare our SCN to the state-of-the-art methods
on public datasets.
4.1. Datasets
RAF-DB [22] contains 30,000 facial images annotated with basic or compound expressions by 40 trained human coders. In our experiment, only images with the six basic expressions (happiness, surprise, sadness, anger, disgust, fear) and the neutral expression are used, which leads to 12,271 images for training and 3,068 images for testing. The overall sample accuracy is used for measurement.
FERPlus [4] is extended from FER2013 as used in the
ICML 2013 Challenges. It is a large-scale dataset collected
by the Google search engine. It consists of 28,709 training
images, 3,589 validation images and 3,589 test images, all
of which are resized to 48×48 pixels. Contempt is included
which leads to 8 classes in this dataset. The overall sample
accuracy is used for measurement.
AffectNet [32] is by far the largest dataset that provides
both categorical and Valence-Arousal annotations. It con-
tains more than one million images from the Internet by
querying expression-related keywords in three search en-
gines, of which 450,000 images are manually annotated
with eight expression labels as in FERPlus. It has imbal-
anced training and test sets as well as a balanced validation
set. The mean class accuracy on the validation set is used
for measurement.
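The two metrics used across these datasets differ on imbalanced data, which the following plain-Python sketch makes concrete. Overall sample accuracy counts correct predictions over all samples, while mean class accuracy averages the per-class accuracies. The function names and the toy labels are illustrative assumptions.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so every class counts equally."""
    correct, total = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Imbalanced toy set: 8 samples of class 0, 2 of class 1,
# and a classifier that always predicts class 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
acc = overall_accuracy(y_true, y_pred)
mean_acc = mean_class_accuracy(y_true, y_pred)
```

Here the constant classifier reaches 0.8 overall accuracy but only 0.5 mean class accuracy, which is why the class-averaged metric is the more informative one when class frequencies are skewed.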
The collected WebEmotion. Since the main evidence of uncertainties is the incorrect/noisy annotation problem, we collect an extremely noisy FER dataset from the Internet, termed WebEmotion, to investigate the effect of SCN under extreme uncertainties. WebEmotion is a video dataset (though we use it as image data by assigning labels to frames) downloaded from YouTube with a set of keywords including 40 emotion-related words, 45 countries from Asia, Europe, Africa, and America, and 6 age-related words (i.e. baby, lady, woman, man, old man, old woman). It consists of the same 8 classes as FERPlus, where each class is connected to several emotion-related keywords, e.g. Happy is connected to the keywords happy, funny, ecstatic, smug, and kawaii. To obtain a meaningful correlation between the keywords and the searched videos, only the top 20 crawled videos shorter than 4 minutes are selected. This leads to around 41,000 videos, which are further segmented into 200,000 video clips with the constraint that a face (detected by MTCNN) appears for at least 5 seconds. For evaluation, we only use WebEmotion for pretraining since annotating it is extremely difficult. Table 1 shows the statistics of WebEmotion. The meta videos and video clips will be made public to the research community.
4.2. Evaluation of SCN on Synthetic Uncertainties
The uncertainties of FER mainly come from ambiguous facial expressions, low-quality facial images, inconsistent annotations, and incorrect annotations (i.e. noisy labels). Considering that only noisy labels can be analyzed quantitatively, we explore the robustness of SCN with three levels of label noise, with ratios of 10%, 20%, and 30%, on the RAF-DB, FERPlus, and AffectNet datasets. Specifically, we randomly choose 10%, 20%, and 30% of the training data for each category and randomly change their labels to others. In Table 2, we use ResNet-18 as the CNN backbone and compare our SCN to the baseline (traditional CNN training without considering label noise) with two training schemes: i) training from scratch and ii) fine-tuning with a model pretrained on MS-Celeb-1M [15]. We also compare our SCN to two state-of-the-art noise-tolerant methods on RAF-DB, namely CurriculumNet [14] and MetaCleaner [46].
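The synthetic noise protocol described above (pick a fixed fraction of each category and flip those labels to a different class) can be sketched in plain Python as follows. The function name, the seeding, and the uniform choice among the other classes are assumptions for illustration; the text only states that labels are randomly changed to others.

```python
import random
from collections import defaultdict


def inject_label_noise(labels, num_classes, ratio, seed=0):
    """Flip `ratio` of the labels in every class to a different random class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    noisy = list(labels)
    for y, indices in by_class.items():
        k = round(ratio * len(indices))  # number of samples to corrupt in class y
        others = [c for c in range(num_classes) if c != y]
        for idx in rng.sample(indices, k):
            noisy[idx] = rng.choice(others)  # always differs from the original
    return noisy


# Balanced toy label set: 7 classes with 100 samples each.
clean = [c for c in range(7) for _ in range(100)]
noisy = inject_label_noise(clean, num_classes=7, ratio=0.3)
num_changed = sum(a != b for a, b in zip(clean, noisy))
```

Because the replacement class is always drawn from the other classes, exactly the requested fraction of labels in every category ends up corrupted.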
As shown in Table 2, our SCN consistently improves the baseline by a large margin. For scheme i) with noise ratio 30%, our SCN outperforms the baseline by 13.80%, 1.07%, and 1.91% on RAF-DB, AffectNet, and FERPlus, respectively. For scheme ii) with noise ratio 30%, our SCN still gains improvements of 2.20%, 2.47%, and 3.13% on these datasets, although the baseline performance is already relatively high. For both schemes, the benefit from SCN becomes more obvious as the noise ratio increases. CurriculumNet designs a training curriculum by measuring data complexity using cluster density, which can avoid training on noisy-labeled data in early stages. MetaCleaner aggregates the features of several samples in each class into a weighted mean feature for classification, which can also weaken the noisy-labeled samples. Both CurriculumNet and MetaCleaner improve the baseline largely but are still inferior to the SCN, which is simpler. Another interesting finding is that the improvement of SCN on RAF-DB is much higher than on the other datasets. This may be explained by the following reasons. On the one hand, RAF-DB consists of compound facial expressions and is annotated by 40 people through crowdsourcing, which makes the data annotations more inconsistent. Thus, our SCN may also gain improvement on the original RAF-DB without synthetic label noise. On the other hand, AffectNet and FERPlus are annotated by experts, so fewer inconsistent labels are involved, leading to smaller improvements on these two datasets.
Figure 3: Visualization of the learned importance weights in our SCN. We show these weights on randomly selected images with original labels (1st row) and synthetic noisy labels before and after relabeling (2nd row and 3rd row).
Table 2: The evaluation of SCN on synthetic noisy FER datasets. 'Pretrain' means we use a model pretrained on face recognition; otherwise we train from scratch.
Pretrain  SCN                  Noise (%)  RAF-DB  AffectNet  FERPlus
×         CurriculumNet [14]   10         68.5    -          -
×         MetaCleaner [46]     10         68.45   -          -
×         ×                    10         61.43   44.68      77.15
×         ✓                    10         70.26   45.23      78.53
×         CurriculumNet [14]   20         61.23   -          -
×         MetaCleaner [46]     20         61.35   -          -
×         ×                    20         55.5    41.00      71.88
×         ✓                    20         63.50   41.63      72.46
×         CurriculumNet [14]   30         57.52   -          -
×         MetaCleaner [46]     30         58.89   -          -
×         ×                    30         46.81   38.35      68.54
×         ✓                    30         60.61   39.42      70.45
✓         ×                    10         80.81   57.18      83.39
✓         ✓                    10         82.18   58.58      84.28
✓         ×                    20         78.18   56.15      82.24
✓         ✓                    20         80.10   57.25      83.17
✓         ×                    30         75.26   52.58      79.34
✓         ✓                    30         77.46   55.05      82.47
Visualization of $\alpha$ in SCN. To further investigate the effectiveness of our SCN under noisy annotations, we visualize the importance weight $\alpha$ during the training phase of SCN on RAF-DB with noise ratio 10%. In Figure 3, the first row indicates the importance weights when SCN is trained with original labels. The images of the second row are annotated with synthetic corrupted labels, and we use SCN (without the relabeling module) to train on the synthetic noisy dataset. Indeed, the SCN regards those label-corrupted images as noise and automatically suppresses their weights. After sufficient training epochs, the relabeling module is added into SCN, and these noisy-labeled images are relabeled (of course, many others may not be relabeled because of the relabeling constraint). After several more epochs, their importance weights become high (the 3rd row), which demonstrates that our SCN can 'self-cure' the corrupted labels. It is worth noting that the new labels from the relabeling module may be inconsistent with the "ground-truth" labels (see the 1st, 4th, and 6th columns), but they are also visually reasonable.
Table 3: The effect of SCN on WebEmotion for pretraining. The 2nd column indicates finetuning with or without SCN.
WebEmotion  SCN  RAF-DB  AffectNet  FERPlus
×           ×    72.00   46.58      82.4
w/o SCN     ×    78.97   56.43      84.20
w/o SCN     ✓    80.42   57.23      85.13
SCN         ✓    82.45   58.45      85.97
4.3. Exploring SCN on Real-World Uncertainties
Synthetic noisy data demonstrates the 'self-curing' ability of SCN. In this section, we apply our SCN to real-world FER datasets, which can include all types of uncertainties.
SCN on WebEmotion for pretraining. Our collected WebEmotion dataset contains massive noise since the search keywords are regarded as labels. To better validate the effect of SCN on real-world noisy data, we apply our SCN to WebEmotion for pretraining and then finetune the model on the target datasets. We show the comparison experiments in Table 3. From the 1st and the 2nd rows, we can see that pretraining on WebEmotion without SCN improves the baseline by 6.97%, 9.85%, and 1.80% on RAF-DB, AffectNet, and FERPlus, respectively. Fine-tuning with SCN on the target datasets obtains gains ranging from about 1% to 2%. Pretraining on WebEmotion with SCN further boosts the performance from 80.42% to 82.45% on RAF-DB. This suggests that SCN learns robust features on WebEmotion, which benefits further fine-tuning.
Figure 4: Ten examples of RAF-DB (w/o synthetic noisy labels) with low importance weights. Each column corresponds to a basic emotion. One can guess their labels; the ground-truth labels of RAF-DB are given in the text.
Table 4: SCN on real-world FER datasets. The improvements of SCN suggest that these public datasets more or less suffer from uncertainties.
Pretrain  SCN                  RAF-DB  AffectNet  FERPlus
×         ×                    72.00   46.58      82.4
×         ✓                    78.31   47.28      83.42
×         CurriculumNet [14]   74.67   -          -
×         MetaCleaner [46]     77.18   -          -
✓         ×                    84.20   58.5       86.80
✓         ✓                    87.03   60.23      88.01
SCN on Original FER datasets. We further conduct experiments on the original FER datasets to evaluate our SCN, since these datasets inevitably suffer from uncertainties such as ambiguous facial expressions, low-quality facial images, etc. Results are shown in Table 4. When training from scratch, our proposed SCN improves the baseline consistently with gains of 6.31%, 0.7%, and 1.02% on RAF-DB, AffectNet, and FERPlus, respectively. MetaCleaner also boosts the baseline on RAF-DB but is slightly worse than our SCN. With pretraining, we still obtain improvements of 2.83%, 1.73%, and 1.21% on these datasets. The improvements of SCN and MetaCleaner suggest that uncertainties indeed exist in those datasets. To validate our speculation, we rank the importance weights of RAF-DB and show some examples with low importance weights in Figure 4. The ground-truth labels from top-left to bottom-right are surprise, neutral, neutral, sad, surprise, surprise, neutral, surprise, neutral, surprise. We find that images with low quality and occlusion are difficult to annotate and are more likely to have low importance weights in SCN.
Table 5: Evaluation of the three modules in SCN.
Weight  Rank  Relabel  RAF-DB  RAF-DB (pretrain)
×       ×     ×        72.00   84.20
×       ×     ✓        71.25   83.78
×       ✓     ×        74.15   85.14
✓       ×     ×        76.26   86.09
✓       ✓     ×        76.57   86.63
✓       ✓     ✓        78.31   87.03
Table 6: Evaluation of the ratio $\gamma$ between RR-Loss and WCE-Loss.
$\gamma$   0.2     0.3     0.5     0.6     0.8
Accuracy   76.12%  76.35%  78.31%  76.57%  71.75%
4.4. Ablation Studies
Evaluation of the three modules in SCN. To evaluate the effect of each module of SCN, we design an ablation study to investigate the WCE-Loss, RR-Loss, and relabeling modules on RAF-DB. We show the experimental results in Table 5. Several observations can be made. First, for both training schemes, a naive relabeling module (2nd row) added to the baseline (1st row) degrades performance slightly. This may be because many relabeling operations based on the baseline model are wrong. It indirectly indicates that our relabeling restricted to the low-importance group with rank regularization is more effective. Second, when adding one module, we obtain the highest improvement with WCE-Loss, which improves the baseline from 72.00% to 76.26% on RAF-DB. This suggests that re-weighting is the module that contributes most to our SCN. Third, the RR-Loss and the relabeling module can further boost WCE-Loss by 2.05%.
Evaluation of the ratio $\gamma$. In Table 6, we evaluate the effect of different ratios between the RR-Loss and WCE-Loss. We find that setting equal weights for the two losses achieves the best results. Increasing the weight of RR-Loss from 0.5 to 0.8 dramatically degrades performance (from 78.31% to 71.75%), which suggests that WCE-Loss is more important.
Evaluation of δ1and δ2.δ1is a margin parameter
to control the mean margin between the high- and low-
importance groups. For fixed setting, we evaluate it from
0 to 0.30. Figure 5(left) shows the results for both fixed
and learned δ1. The default δ1= 0.15 obtains the best per-
formance, which shows that the margin should be an ap-
propriate value. We also design a learnable paradigm of
Figure 5: Evaluation of the margin δ1and δ2, and the ratio βon the RAF-DB dataset.
Table 7: Comparison to the state-of-the-art results. These results are trained using label distributions. + Oversampling is used
since AffectNet is imbalanced. RAF-DB and AffectNet are jointly used for training. Note that IPA2LT tests with 7 classes
on AffectNet.

(a) Comparison on RAF-DB.
Method                      Acc.
DLP-CNN [22]                84.22
IPA2LT [43]                 86.77
gaCNN [24]                  85.07
RAN [42]                    86.90
Our SCN (ResNet18)          87.03
Our SCN (ResNet18)          88.14

(b) Comparison on AffectNet.
Method                      mean Acc.
Upsample [32]               47.00
Weighted loss [32]          58.00
IPA2LT [43] (7 cls)         55.71
RAN [42]                    52.97
RAN+ [42]                   59.5
Our SCN+ (ResNet18)         60.23

(c) Comparison on FERPlus.
Method                      Acc.
PLD [5]                     85.1
ResNet+VGG [18]             87.4
SeNet50 [1]                 88.8
RAN [42]                    88.55
RAN-VGG16 [42]              89.16
Our SCN (ResNet18/IR50)     88.01/89.35
δ1, and initialize it to 0.15. The learnable δ1 converges to
0.142 ± 0.05, and the accuracies are 77.76% and 69.45%
on the original and noisy RAF-DB datasets, respectively.
δ2 is a margin that determines when to relabel a sample.
The default δ2 is 0.2. We evaluate δ2 from 0 to 0.5 on the
original RAF-DB and show the results in Figure 5 (middle).
δ2 = 0 means that we relabel a sample whenever the maximum
prediction probability is larger than the probability of the
given label. A small δ2 leads to many incorrect relabeling
operations, which may hurt performance significantly. A large
δ2 leads to few relabeling operations, which in the limit is
equivalent to no relabeling. We obtain the best performance at
δ2 = 0.2.
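The role of δ2 can be illustrated with a minimal Python sketch. It assumes the relabeling rule implied above: a sample receives the label with the maximum predicted probability when that probability exceeds the probability of the given label by more than δ2. The function name relabel is hypothetical, and the sketch leaves out the restriction of relabeling to the low-importance group.

```python
from typing import Sequence


def relabel(probs: Sequence[float], given_label: int, delta2: float = 0.2) -> int:
    """Return the (possibly modified) label for one sample.

    probs       : predicted class probabilities for the sample
    given_label : index of the annotated label
    delta2      : relabeling margin (0.2 is the default quoted in the text)
    """
    # Index of the class with the maximum predicted probability.
    top = max(range(len(probs)), key=lambda i: probs[i])
    # Relabel only when the top class beats the given label by more than delta2.
    if probs[top] - probs[given_label] > delta2:
        return top
    return given_label


# Confident disagreement: the given label 2 is replaced by class 1.
new_a = relabel([0.1, 0.7, 0.2], given_label=2)
# Marginal disagreement: the given label 1 is kept at delta2 = 0.2 ...
new_b = relabel([0.45, 0.40, 0.15], given_label=1)
# ... but is replaced by class 0 when delta2 = 0.
new_c = relabel([0.45, 0.40, 0.15], given_label=1, delta2=0.0)
```

The last two calls show the trade-off described above: with δ2 = 0 even a marginal disagreement triggers a relabeling operation, whereas a larger margin keeps the given label.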
Evaluation of β. β is the ratio of high-importance
samples in a mini-batch. We study different ratios from 0.9
to 0.5 on both the synthetic noisy and the original RAF-DB
datasets. The results are shown in Figure 5 (right). Our
default ratio of 0.7 achieves the best performance. A large β
weakens SCN because it treats only a small fraction of the
data as uncertain. A small β over-considers uncertainties,
which decreases the training loss unreasonably.
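To make the interplay of β and δ1 concrete, the following Python sketch ranks the importance weights of one mini-batch, splits them into a high-importance group of ratio β and a low-importance group, and penalizes the batch when the gap between the group means falls below δ1. The hinge form max(0, δ1 − (mean_high − mean_low)) and the function name rank_regularization_loss are assumptions made for this example, chosen to be consistent with the description of δ1 as the mean margin between the two groups.

```python
from typing import Sequence


def rank_regularization_loss(
    weights: Sequence[float], beta: float = 0.7, delta1: float = 0.15
) -> float:
    """Hinge penalty on the mean-weight gap between two ranked groups.

    weights : importance weights of the samples in one mini-batch
    beta    : ratio of samples assigned to the high-importance group
    delta1  : required margin between the two group means
    """
    ranked = sorted(weights, reverse=True)
    n_high = round(beta * len(ranked))
    if n_high == 0 or n_high == len(ranked):
        raise ValueError("beta must leave both groups non-empty")
    high, low = ranked[:n_high], ranked[n_high:]
    mean_high = sum(high) / len(high)
    mean_low = sum(low) / len(low)
    # Zero loss once the high group leads the low group by at least delta1.
    return max(0.0, delta1 - (mean_high - mean_low))


# Well-separated weights incur no penalty; uniform weights incur delta1.
separated = rank_regularization_loss([0.9] * 7 + [0.2] * 3)
uniform = rank_regularization_loss([0.5] * 10)
```

In this sketch, β close to 1 leaves very few samples in the low-importance group, which mirrors the observation that a large β treats only a small fraction of the data as uncertain.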
4.5. Comparison to the State of the Art
Table 7 compares our method to several state-of-the-art
methods on RAF-DB, AffectNet, and FERPlus.
IPA2LT [43] introduces the latent ground-truth idea for
training with inconsistent annotations across different FER
datasets. gaCNN [24] leverages a patch-based attention
network and a global network. RAN [42] utilizes face regions
and the original face with a cascade attention network. gaCNN
and RAN are time-consuming because they process cropped
patches and regions. Our proposed SCN adds no cost at
inference. Our SCN outperforms these recent state-of-the-art
methods with 88.14%, 60.23%, and 89.35% (with IR50 [9]) on
RAF-DB, AffectNet, and FERPlus, respectively.
5. Conclusion
This paper presents a Self-Cure Network (SCN) that suppresses
the uncertainties of facial expression data in order to learn
robust features for FER. The SCN consists of three novel
modules: self-attention importance weighting, ranking
regularization, and relabeling. The first module learns a
weight for each facial image with self-attention to capture
the sample importance for training; this weight is used for
loss weighting. The ranking regularization ensures that the
first module learns meaningful weights that highlight certain
samples and suppress uncertain ones. The relabeling module
attempts to identify mislabeled samples and modify their
labels. Extensive experiments on three public datasets and our
collected WebEmotion show that our SCN achieves
state-of-the-art results and can handle both synthetic and
real-world uncertainties effectively.
6. Acknowledgments
This work is partially supported by the Science and Technology
Service Network Initiative of the Chinese Academy of Sciences
(KFJ-STS-QYZX-092), the Guangdong Special Support Program
(2016TX03X276), the National Natural Science Foundation of
China (U1813218, U1713208), the Shenzhen Basic Research
Program (JCYJ20170818164704758, CXB201104220032A), and the
Joint Lab of CAS-HK.
References
[1] Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, and Andrew
Zisserman. Emotion recognition in speech using cross-modal
transfer in the wild. arXiv preprint arXiv:1808.05561,
2018. 8
[2] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satya-
narayanan. Openface: A general-purpose face recognition
library with mobile applications. Technical report, CMU-
CS-16-118, CMU School of Computer Science, 2016. 2
[3] Samaneh Azadi, Jiashi Feng, Stefanie Jegelka, and Trevor
Darrell. Auxiliary image regularization for deep cnns with
noisy labels. arXiv preprint:1511.07069, 2015. 2
[4] Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and
Zhengyou Zhang. Training deep networks for facial expres-
sion recognition with crowd-sourced label distribution. In
ACM ICMI, 2016. 1,5
[5] Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and
Zhengyou Zhang. Training deep networks for facial expres-
sion recognition with crowd-sourced label distribution. In
ACM ICMI, pages 279–283, 2016. 8
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for
human detection. In CVPR, 2005. 2
[7] Charles Darwin and Phillip Prodger. The expression of the
emotions in man and animals. Oxford University Press,
USA, 1998. 1
[8] Mostafa Dehghani, Aliaksei Severyn, Sascha Rothe, and
Jaap Kamps. Avoiding your teacher’s mistakes: Training
neural networks with controlled weak supervision. arXiv
preprint 1711.00313, 2017. 2
[9] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos
Zafeiriou. Arcface: Additive angular margin loss for deep
face recognition. In CVPR, pages 4690–4699, 2019. 8
[10] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom
Gedeon. Static facial expression analysis in tough condi-
tions: Data, evaluation protocol and benchmark. In ICCV,
pages 2106–2112, 2011. 1
[11] C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and
Aleix M Martinez. Emotionet: An accurate, real-time al-
gorithm for the automatic annotation of a million facial ex-
pressions in the wild. In CVPR, pages 5562–5570, 2016. 1
[12] B. Fasel. Robust face analysis using convolutional neural
networks. In ICPR, pages 40–43, 2002. 2
[13] Jacob Goldberger and Ehud Ben-Reuven. Training deep
neural-networks using a noise adaptation layer. 2016. 2
[14] Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang,
Dengke Dong, Matthew R. Scott, and Dinglong Huang. Cur-
riculumnet: Weakly supervised learning from large-scale
web images. In ECCV, September 2018. 5,6,7
[15] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and
Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for
large-scale face recognition. CoRR, abs/1607.08221, 2016.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In CVPR,
pages 770–778, 2016. 4
[17] Wei Hu, Yangyu Huang, Fan Zhang, and Ruirui Li. Noise-
tolerant paradigm for training face recognition cnns. In
CVPR, pages 11887–11896, 2019. 3
[18] Christina Huang. Combining convolutional neural networks
for emotion recognition. In 2017 IEEE MIT Undergraduate
Research Technology Conference (URTC), pages 1–4, 2017.
[19] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan,
Shiguang Shan, and Alexander Hauptmann. Self-paced
learning with diversity. In NIPS, pages 2078–2086, 2014.
[20] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and
Li Fei-Fei. Mentornet: Learning data-driven curriculum
for very deep neural networks on corrupted labels. arXiv
preprint:1712.05055, 2017. 2,3
[21] Samira Ebrahimi Kahou, Christopher Pal, Xavier Bouthillier,
Pierre Froumenty, Roland Memisevic, Pascal Vincent, Aaron
Courville, Yoshua Bengio, and Raul Chandias Ferrari. Com-
bining modality specific deep neural networks for emotion
recognition in video. In International Conference on Multi-
modal Interaction, pages 543–550, 2013. 2
[22] Shan Li, Weihong Deng, and JunPing Du. Reliable crowd-
sourcing and deep locality-preserving learning for expres-
sion recognition in the wild. In CVPR, pages 2852–2861,
2017. 1,5,8
[23] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao,
Jiebo Luo, and Li-Jia Li. Learning from noisy labels with
distillation. In ICCV, pages 1910–1918, 2017. 2
[24] Yong Li, Jiabei Zeng, Shiguang Shan, and Xilin Chen. Oc-
clusion aware facial expression recognition using cnn with
attention mechanism. TIP, 28(5):2439–2450, 2018. 8
[25] Y. Li, J. Zeng, S. Shan, and X. Chen. Occlusion aware facial
expression recognition using cnn with attention mechanism.
IEEE Transactions on Image Processing, 28(5):2439–2450,
May 2019. 2
[26] Chengjun Liu and H. Wechsler. Gabor feature based classi-
fication using the enhanced fisher linear discriminant model
for face recognition. IEEE Transactions on Image Process-
ing, 11(4):467–476, April 2002. 2
[27] Mengyi Liu, Shaoxin Li, Shiguang Shan, and Xilin Chen.
Au-inspired deep networks for facial expression feature
learning. Neurocomputing, 159(C):126–136, 2015. 2
[28] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha
Raj, and Le Song. Sphereface: Deep hypersphere embedding
for face recognition. In CVPR, pages 212–220, 2017. 3
[29] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih,
Zara Ambadar, and Iain Matthews. The extended cohn-
kanade dataset (ck+): A complete dataset for action unit and
emotion-specified expression. In CVPRW, pages 94–101,
2010. 1
[30] Fan Ma, Deyu Meng, Qi Xie, Zina Li, and Xuanyi Dong.
Self-paced co-training. In ICML, pages 2275–2284, 2017. 3
[31] Volodymyr Mnih and Geoffrey E Hinton. Learning to label
aerial images from noisy data. In ICML, pages 567–574,
2012. 2
[32] Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor.
Affectnet: A database for facial expression, valence, and
arousal computing in the wild. TAC, 10(1):18–31, 2017. 1,5,8
[33] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Raviku-
mar, and Ambuj Tewari. Learning with noisy labels. In NIPS,
pages 1196–1204, 2013. 2
[34] Pauline C. Ng and Steven Henikoff. Sift: predicting amino
acid changes that affect protein function. Nucleic Acids Re-
search, 31(13):3812–3814, 2003. 2
[35] Caifeng Shan, Shaogang Gong, and Peter W. McOwan. Fa-
cial expression recognition based on local binary patterns:
A comprehensive study. Image and Vision Computing,
27(6):803 – 816, 2009. 2
[36] Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy
labels with deep neural networks. arXiv preprint:1406.2080,
2(3):4, 2014. 2
[37] Yichuan Tang. Deep learning using linear support vector ma-
chines. Computer Science, 2013. 2
[38] Y-I Tian, Takeo Kanade, and Jeffrey F Cohn. Recogniz-
ing action units for facial expression analysis. T-PAMI,
23(2):97–115, 2001. 1
[39] Michel Valstar and Maja Pantic. Induced disgust, happi-
ness and surprise: an addition to the mmi facial expres-
sion database. In Proc. 3rd Intern. Workshop on EMOTION
(satellite of LREC): Corpora for Research on Emotion and
Affect, page 65. Paris, France, 2010. 1
[40] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhi-
nav Gupta, and Serge Belongie. Learning from noisy large-
scale datasets with minimal supervision. In CVPR, pages
839–847, 2017. 2
[41] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhi-
nav Gupta, and Serge Belongie. Learning from noisy large-
scale datasets with minimal supervision. In CVPR, July
2017. 2
[42] Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng,
and Yu Qiao. Region attention networks for pose and
occlusion robust facial expression recognition. arXiv
preprint:1905.04075, 2019. 2,8
[43] Jiabei Zeng, Shiguang Shan, and Xilin Chen.
Facial expression recognition with inconsistently annotated
datasets. In ECCV, pages 222–237, 2018. 2,8
[44] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection
and alignment using multitask cascaded convolutional net-
works. IEEE Signal Processing Letters, 23(10):1499–1503,
Oct 2016. 2
[45] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao.
Joint face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Processing Letters,
23(10):1499–1503, 2016. 4
[46] Weihe Zhang, Yali Wang, and Yu Qiao. Metacleaner: Learn-
ing to hallucinate clean representations for noisy-labeled vi-
sual recognition. In CVPR, June 2019. 5,6,7
[47] Guoying Zhao, Xiaohua Huang, Matti Taini, Stan Z Li, and
Matti Pietikäinen. Facial expression recognition from
near-infrared videos. Image and Vision Computing,
29(9):607–619, 2011. 1