Dynamic Attention-controlled Cascaded Shape Regression Exploiting Training Data Augmentation and Fuzzy-set Sample Weighting

Zhen-Hua Feng¹, Josef Kittler¹, William Christmas¹, Patrik Huber¹, Xiao-Jun Wu²
¹Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, UK
²School of IoT Engineering, Jiangnan University, Wuxi 214122, China
{z.feng, j.kittler, w.christmas, p.huber}@surrey.ac.uk, wu_xiaojun@jiangnan.edu.cn
Abstract

We present a new Cascaded Shape Regression (CSR) architecture, namely Dynamic Attention-Controlled CSR (DAC-CSR), for robust facial landmark detection on unconstrained faces. Our DAC-CSR divides facial landmark detection into three cascaded sub-tasks: face bounding box refinement, general CSR and attention-controlled CSR. The first two stages refine initial face bounding boxes and output intermediate facial landmarks. Then, an online dynamic model selection method is used to choose appropriate domain-specific CSRs for further landmark refinement. The key innovation of our DAC-CSR is the fault-tolerant mechanism, using fuzzy set sample weighting, for attention-controlled domain-specific model training. Moreover, we advocate data augmentation with a simple but effective 2D profile face generator, and context-aware feature extraction for better facial feature representation. Experimental results obtained on challenging datasets demonstrate the merits of our DAC-CSR over the state-of-the-art methods.
1. Introduction
Facial Landmark Detection (FLD), also known as face alignment, is a prerequisite for many automatic face analysis systems, e.g. face recognition [3, 33, 34], expression analysis [13, 14] and 2D-3D inverse rendering [1, 20, 21, 23, 28, 48]. Facial landmarks provide accurate shape information with semantic meaning, enabling geometric image normalisation and feature extraction for use in the remaining stages of a face processing pipeline. This is crucial for high-fidelity face image analysis. As the technology of FLD for constrained faces has already been well developed, the current trend is to address FLD for unconstrained faces in the presence of extreme variations in pose, expression, illumination and partial occlusion [2, 4, 24, 25, 30].
More recently, unconstrained FLD has seen huge progress owing to the state-of-the-art Cascaded Shape Regression (CSR) architecture [6, 12, 15, 29, 46]. The key to the success of CSR is to construct a strong regressor from a set of weak regressors arranged in a cascade. This architecture greatly improves the performance of FLD in terms of generalisation capacity and accuracy. However, in the light of very recent studies [35, 39, 42, 46, 47], the capacity of CSR appears to be saturating, especially for unconstrained faces with extreme appearance variations. For example, the FLD error of state-of-the-art CSR-based methods increases from around 3% (error as a percentage of the inter-ocular distance) on the Labelled Face Parts in the Wild (LFPW) [2] dataset to 6.5% on the more challenging Caltech Occluded Faces in the Wild (COFW) [4] dataset. This degradation has three main causes: 1) the modelling capacity of the existing CSR architecture is limited; 2) CSR is sensitive to the positioning of the face bounding boxes used for landmark initialisation; and 3) the volume of available training data is insufficient. Can these limitations be overcome, especially for unconstrained faces exhibiting extreme appearance variations? We offer an encouraging answer by presenting a new Dynamic Attention-Controlled CSR (DAC-CSR) architecture with a dynamic domain selection mechanism and a novel training strategy which benefits from training data augmentation and fuzzy-set training sample weighting.

Figure 1. The pipeline of our proposed DAC-CSR: face box refinement, general CSR and domain-specific CSRs.
Fig. 1 depicts a simplified overview of the proposed DAC-CSR architecture. Its innovation is in linking three types of regressor cascades performing in succession: 1) face bounding box refinement for better landmark initialisation, 2) an initial landmark update using a general CSR, and 3) a final landmark refinement by dynamically selecting an attention-controlled domain-specific CSR that is optimised to improve landmark location estimates. The new architecture decomposes the task at hand into three cascaded sub-tasks that are easier to handle.
In contrast to previous multi-view models, e.g. [39, 46], the key innovation of our DAC-CSR is its in-built fault-tolerant mechanism. The fault tolerance is achieved by means of an innovative training strategy for attention-controlled model training of the set of domain-specific CSRs performing the final shape refinement. Rather than using samples from just a single domain, each domain-specific regressor cascade is trained using all the training samples. However, their influence is controlled by a domain-specific fuzzy membership function which weighs samples from the relevant domain more heavily than all the other training samples. An annealing schedule of domain-specific fuzzy membership functions progressively sharpens the relative weighting of in-domain and out-of-domain training samples in favour of the in-domain set for successive stages of each domain-specific cascade.
Each test sample progresses through the system of cascades. Prior to each of the domain-specific cascade stages, the domain of attention is selected dynamically based on the current shape estimate. The proposed training strategy guarantees that each domain-specific cascaded regressor can cope with out-of-domain test samples and is endowed with the capacity to update the shape in the correct direction even if the current domain has been selected subject to labelling error. This is the essence of the error tolerance of the proposed system.
An important contributing factor to the promising performance of our DAC-CSR is training data augmentation. Our innovation here is to use a 2D face model for synthesising extreme profile face poses (out-of-plane rotation) with realistic background. Furthermore, we propose a novel context-aware feature extraction method to extract rich local facial features in the context of a global face description.
The proposed framework has been evaluated on benchmarking databases using standard protocols. The results achieved on the database containing images with extreme poses (AFLW [24]) are significantly better than the state-of-the-art performance reported in the literature.
The paper is organised as follows. In the next section we present a brief review of the related literature. The preliminaries of CSR are presented in Section 3. The proposed DAC-CSR is introduced in Section 4.1. The discussion of its training is confined to Section 4.2, which defines the domain-specific fuzzy membership functions and their annealing schedule. Online dynamic domain selection is the subject of Section 4.3, and the proposed feature extraction scheme can be found in Section 4.4. Section 5 addresses the problem of training set augmentation. The experimental evaluation carried out and the results achieved are described in Section 6. The paper is concluded in Section 7.
2. Related Work
Facial landmark detection can trace its history back to the nineteen nineties. Representative FLD methods marking the early milestones include the Active Shape Model (ASM) [8], the Active Appearance Model (AAM) [7] and the Constrained Local Model (CLM) [10]. These algorithms and their extensions have achieved excellent FLD results in constrained scenarios [17]. As a result, the current trend is to develop more robust FLD methods for unconstrained faces that are rich in appearance variations. The leading algorithms for unconstrained FLD are CSR-based approaches [6, 12, 15, 29, 46]. In contrast to classical methods such as ASM, AAM and CLM that rely on a generative PCA-based shape model, CSR directly positions facial landmarks at their optimal locations based on image features. The shape update is achieved in a discriminative way by constructing a mapping function from robust shape-related local features to shape updates. The secret of the success of CSR is the architecture that cascades a set of weak regressors in series to form a strong regressor.
There have been a number of improvements to the performance of CSR-based FLD. One category of these improvements enhances components of the existing CSR architecture. For example, the use of more robust shape-related local features has been suggested, e.g. the Scale-Invariant Feature Transform (SIFT) [38, 42, 43], Histograms of Oriented Gradients (HOG) [11, 15, 21, 40], Sparse Auto-Encoders (SAE) [16], Local Binary Features (LBF) [6, 29] and Convolutional Neural Network (CNN) based features [35, 37]. Another example is the use of more powerful regression methods as weak regressors in CSR, such as random forests [6, 29] and deep neural networks [32, 35, 37, 42, 43, 44]. Lately, 3D face models have been shown to positively impact FLD on challenging benchmarking datasets, especially in relation to faces with extreme poses [15, 26, 47].
Multi-view models: Another important approach is to adopt advanced CSR architectures, such as the use of multiple CSR models or the construction of multi-view models. Feng et al. [16] constructed multiple CSR models by randomly selecting subsets from the original training set and fusing the multiple outputs to produce the final FLD result. A similar idea has also been used in [41]. As an alternative, a multi-view FLD system employs a set of view-specific models that are able to achieve more accurate landmark detection for faces exhibiting specific views [9, 36, 46].
However, the use of multiple models or multi-view models is not without difficulties. One has to either estimate the view of a test image to select an appropriate model, or apply all view-specific models to a test image and then choose the best result as the final output. For the former, implementing a model selection stage for unconstrained faces is hard in practice: an erroneously selected view-specific model may result in FLD failure. For the latter strategy, it is time-consuming to apply all the trained models to a test image; also, the ranking of the outputs of different view-based models is non-trivial. In contrast to previous studies, our DAC-CSR addresses these issues by improving the fault-tolerance properties of a trained domain-specific model and by using an online dynamic model selection strategy.
Data augmentation: For a learning-based approach such as CSR, the availability of a large volume of training samples is essential. However, it is a tedious task to manually annotate facial landmarks for a large quantity of training data. To address this problem, data augmentation is widely used in CSR-based FLD. Traditional methods include random perturbation of initial landmarks, image flipping, image rotation, image blurring and adding noise to the original face images. However, none of these methods is able to inject new out-of-plane rotated faces into an existing training dataset. Recently, to augment a training set with samples exhibiting rich pose variations, the use of 3D face models has been suggested. For instance, Feng et al. [15, 23, 31] used a 3D morphable face model to synthesise a large number of 2D faces. However, the synthesised virtual faces lack realistic appearance variations, especially in terms of background and expression changes. To mitigate this problem, they advocated a cascaded collaborative regression strategy to train a CSR from a mixture of real and synthesised faces. To generate realistic face images with pose variations, Zhu et al. fitted a 3D shape model to 2D face images and generated profile face views from the reconstructed 3D shape information [47]. However, these 3D-based methods [15, 23, 47] require 3D face scans for model construction, which are expensive to capture. Also, it is difficult in practice to fit a 3D face model to 2D images. In this paper, we propose a simple but efficient 2D-based method to generate virtual faces with out-of-plane pose variations.
3. Cascaded Shape Regression (CSR)
Given an input face image $I$ and the corresponding face bounding box $\mathbf{b} = [x_1, y_1, x_2, y_2]^T$ (coordinates of the upper-left and lower-right corners) of a detected face in the image, the goal of FLD is to output the face shape in the form of a vector, $\mathbf{s} = [x_1, y_1, ..., x_L, y_L]^T$, consisting of the coordinates of $L$ pre-defined facial landmarks with semantic meaning, such as eye centres and nose tip. To this end, we first initialise the face shape, $\mathbf{s}_0$, by putting the mean shape into the bounding box. Then a trained CSR $\Phi = \{\phi^{(1)}, \phi^{(2)}, ..., \phi^{(N)}\}$ is used to update the initial shape estimate, where $\Phi$ is a strong regressor consisting of $N$ weak regressors. A weak regressor can be obtained using any regression method, such as linear regression, random forests or neural networks. In this paper, we use ridge regression as a weak regressor, i.e. $\phi = \{\mathbf{A}, \mathbf{e}\}$:

$$\phi: \; \delta\mathbf{s} = \mathbf{A} \cdot f(I, \mathbf{s}_0) + \mathbf{e}, \quad (1)$$

where $\mathbf{A} \in \mathbb{R}^{2L \times N_f}$ is a projection matrix, $N_f$ is the dimensionality of the shape-related feature vector extracted by $f(I, \mathbf{s}_0)$, and $\mathbf{e} \in \mathbb{R}^{2L}$ is an offset. For the shape-related feature extraction, we apply local descriptors, e.g. HOG, to the neighbourhoods of all the facial landmarks of the current shape estimate and concatenate the extracted features into a long vector. The use of a weak regressor results in an update to the current shape estimate:

$$\mathbf{s}_0 \leftarrow \mathbf{s}_0 + \delta\mathbf{s}. \quad (2)$$

A trained CSR applies all the weak regressors in cascade to progressively update the shape estimate and obtain the final FLD result from an input image.
Given a training dataset $\mathcal{T} = \{I_i, \mathbf{b}_i, \mathbf{s}^*_i\}_{i=1}^{I}$ with $I$ samples, including face images, face bounding boxes and manually annotated facial landmarks, we first obtain the initial shape estimates, $\{\mathbf{s}^0_i\}_{i=1}^{I}$, of all the training samples using the face bounding boxes provided. Then the shape update between the current shape estimate and the ground-truth shape of the $i$th training sample can be calculated as $\delta\mathbf{s}^*_i = \mathbf{s}^*_i - \mathbf{s}^0_i$. The first weak regressor is obtained using ridge regression by minimising the loss:

$$\arg\min_{\mathbf{A}, \mathbf{e}} \sum_{i=1}^{I} ||\mathbf{A} \cdot f(I_i, \mathbf{s}^0_i) + \mathbf{e} - \delta\mathbf{s}^*_i||_2^2 + \lambda ||\mathbf{A}||_F^2, \quad (3)$$

where $\lambda$ is the weight of the regularisation term. This is a typical least-squares estimation problem with a closed-form solution [16, 38]. Last, this trained weak regressor is used to update the current shape estimates of all the training samples, which forms the training data for the second weak regressor. This procedure is repeated until all $N$ weak regressors are obtained.
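To make the closed-form step concrete, here is a minimal NumPy sketch of training and applying one weak regressor (Eqs. (1)-(3)). The feature matrix is assumed to be precomputed by the caller, and, for brevity, the offset $\mathbf{e}$ is regularised together with $\mathbf{A}$, which the formulation above does not strictly require.

```python
import numpy as np

def train_weak_regressor(F, dS, lam=1e4):
    """Closed-form ridge regression for one weak regressor (Eq. 3).

    F:  (Nf, I) matrix of shape-related features, one column per sample.
    dS: (2L, I) matrix of shape updates delta s*_i = s*_i - s0_i.
    Returns (A, e) such that delta s = A f + e.
    """
    Nf, I = F.shape
    Fb = np.vstack([F, np.ones((1, I))])      # append 1 for the offset e
    # W = [A e] minimises ||W Fb - dS||_F^2 + lam ||W||_F^2
    W = dS @ Fb.T @ np.linalg.inv(Fb @ Fb.T + lam * np.eye(Nf + 1))
    return W[:, :Nf], W[:, Nf]

def apply_weak_regressor(A, e, f, s0):
    """One CSR shape update: s0 <- s0 + A f + e (Eqs. 1-2)."""
    return s0 + A @ f + e
```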
4. Dynamic Attention-controlled CSR
4.1. Architecture
The architecture of the proposed DAC-CSR method has three cascaded stages: face bounding box refinement, general CSR and domain-specific CSR, as shown in Fig. 2. In fact, our DAC-CSR can be portrayed as a strong regressor $\Phi = \{\phi^b, \Phi^g, \Phi^d\}$, where $\phi^b$ is a weak regressor for face bounding box refinement, $\Phi^g = \{\phi^g(1), ..., \phi^g(N_g)\}$ is a classical CSR with $N_g$ weak regressors, and $\Phi^d = \{\Phi^d(1), ..., \Phi^d(M)\}$ is a strong regressor consisting of $M$ domain-specific CSRs, each of which has $N_d$ weak regressors, $\Phi^d(m) = \{\phi^d(m, 1), ..., \phi^d(m, N_d)\}$.
Figure 2. The proposed DAC-CSR has three stages in cascade: face bounding box refinement (dense face description), general CSR (dense and sparse face description) and domain-specific CSR, with domain prediction based on the current shape estimate before each domain-specific shape update.
Face bounding box refinement: We define the weak regressor for the first step as $\phi^b = \{\mathbf{A}^b, \mathbf{e}^b\}$:

$$\phi^b: \; \delta\mathbf{b} = \mathbf{A}^b \cdot f^b(I, \mathbf{b}) + \mathbf{e}^b, \quad (4)$$

where $f^b(I, \mathbf{b})$ extracts dense local features from the image region inside the original face bounding box and $\delta\mathbf{b}$ is used to adjust the original bounding box.

The training of this weak regressor is the same as the procedure introduced in Section 3 for classical CSR. The only difference here is that we use face bounding box differences instead of shape differences for the regressor learning in Eq. (3). The ground-truth face bounding box for a training sample is computed by taking the minimum enclosing rectangle around the ground-truth face shape.
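As a one-line illustration (assuming the shape vector layout $[x_1, y_1, ..., x_L, y_L]^T$ from Section 3), the ground-truth box is simply:

```python
import numpy as np

def gt_bounding_box(shape):
    """Minimum enclosing rectangle [x1, y1, x2, y2] around the landmarks."""
    xs, ys = shape[0::2], shape[1::2]
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])
```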
General CSR: The initial shape estimate, $\mathbf{s}_0$, for the general CSR is obtained by translating and scaling the mean shape so that it exactly fits into the refined bounding box, touching all four sides. Then the general CSR progressively updates the initial shape estimate, $\mathbf{s}_0 \leftarrow \mathbf{s}_0 + \delta\mathbf{s}$, using all the weak regressors in $\Phi^g = \{\phi^g(1), ..., \phi^g(N_g)\}$, as indicated in Algorithm 1. The $n$th weak regressor is defined as $\phi^g(n) = \{\mathbf{A}^g(n), \mathbf{e}^g(n)\}$:

$$\phi^g(n): \; \delta\mathbf{s} = \mathbf{A}^g(n) \cdot f^c(I, \mathbf{s}_0) + \mathbf{e}^g(n), \quad (5)$$

where $f^c(I, \mathbf{s}_0)$ is a context-aware feature extraction function that combines both a dense face description and shape-related sparse local features. The training of this stage is the same as for the classical CSR introduced in Section 3.
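A possible implementation of this initialisation, assuming the same shape vector layout; the mean shape is normalised to the unit square and then mapped onto the refined box:

```python
import numpy as np

def init_shape(mean_shape, box):
    """Translate and scale the mean shape so it exactly fits the box."""
    x1, y1, x2, y2 = box
    xs, ys = mean_shape[0::2], mean_shape[1::2]
    nx = (xs - xs.min()) / (xs.max() - xs.min())   # normalise to [0, 1]
    ny = (ys - ys.min()) / (ys.max() - ys.min())
    s0 = np.empty_like(mean_shape)
    s0[0::2] = x1 + nx * (x2 - x1)                 # map into the box
    s0[1::2] = y1 + ny * (y2 - y1)
    return s0
```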
Domain-specific CSR: Suppose this stage has $M$ domain-specific CSRs corresponding to $M$ sub-domains, each having $N_d$ weak regressors. The $n$th weak regressor of the $m$th domain-specific CSR is defined as:

$$\phi^d(m, n): \; \delta\mathbf{s} = \mathbf{A}^d(m, n) \cdot f^c(I, \mathbf{s}_0) + \mathbf{e}^d(m, n), \quad (6)$$

where $m = 1, ..., M$ and $n = 1, ..., N_d$. Given the current shape estimate $\mathbf{s}_0$ output by the previous general CSR, a domain predictor is used to select a domain-specific CSR for the current shape update (Section 4.3). It should be noted that we use a dynamic domain selection strategy, which updates the label for the domain-specific model selection after each shape update, as shown in Algorithm 1. As a result of the proposed domain-specific CSR training described in Section 4.2, this mechanism makes our DAC-CSR tolerant to domain prediction errors.

Algorithm 1: FLD using our DAC-CSR.
input : image $I$, face bounding box $\mathbf{b}$ and a trained DAC-CSR model $\Phi = \{\phi^b, \Phi^g, \Phi^d\}$
output: facial landmarks $\mathbf{s}_0$
1. refine the face bounding box $\mathbf{b}$ using $\phi^b$;
2. estimate the current face shape, $\mathbf{s}_0$, using the refined face bounding box;
3. for $n \leftarrow 1$ to $N_g$ do
4.   apply the $n$th general weak regressor $\phi^g(n)$ to update the current shape estimate;
5. end
6. for $n \leftarrow 1$ to $N_d$ do
7.   predict the label $m$ of the sub-domain of the current shape estimate using Eq. (11);
8.   apply the $n$th weak regressor $\phi^d(m, n)$ of the $m$th domain-specific CSR to update the current shape;
9. end
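Algorithm 1 translates into a short inference loop. The sketch below is illustrative only: the helper names (init_shape, the feature extractors f_dense and f_context, and the model fields) are assumptions, not the authors' API.

```python
def dac_csr_detect(image, box, model, f_dense, f_context):
    """FLD with a trained DAC-CSR (Algorithm 1). `model` is assumed to hold
    phi_b = (A_b, e_b), Phi_g = [(A, e), ...], Phi_d[m] = [(A, e), ...],
    a mean_shape, and a predict_domain(s) method implementing Eq. (11)."""
    A_b, e_b = model.phi_b
    box = box + A_b @ f_dense(image, box) + e_b      # stage 1: refine the box
    s = init_shape(model.mean_shape, box)            # mean shape into the box
    for A, e in model.Phi_g:                         # stage 2: general CSR
        s = s + A @ f_context(image, s) + e
    n_d = len(model.Phi_d[0])
    for n in range(n_d):                             # stage 3: domain-specific CSR
        m = model.predict_domain(s)                  # re-select after every update
        A, e = model.Phi_d[m][n]
        s = s + A @ f_context(image, s) + e
    return s
```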
4.2. Offline Domain-specific CSR Training
Given a training dataset $\mathcal{T}$ with $I$ samples, as introduced in Section 3, the first two stages, i.e. face bounding box refinement and general CSR, are trained directly on $\mathcal{T}$. To train a domain-specific CSR, we first create $M$ subsets $\{\mathcal{T}_1, ..., \mathcal{T}_M\}$ from the original training set, where $\mathcal{T}_m \subset \mathcal{T}$. To this end, we normalise all the current shape estimates, output by the previous general CSR, to the interval $[0, 1]$. Then PCA is used to obtain the first $K$ shape eigenvectors. All the current shape estimates are projected to the $K$-dimensional subspace to obtain the projected coefficients $\{\mathbf{c}_i\}_{i=1}^{I}$, where $\mathbf{c}_i = [c_{i,1}, ..., c_{i,K}]^T$. Then the original domain is partitioned into $M = 2^K + 1$ overlapping sub-domains, as demonstrated in Fig. 3 for $K = 2$.

Figure 3. The proposed domain split strategy ($K = 2$, $\bar{c}_k = 0$): four quadrant sub-domains 1 [00], 2 [10], 3 [01] and 4 [11], plus the central sub-domain 5.

The $M$th sub-domain includes the training samples satisfying $\sum_{k=1}^{K} \frac{(c_{i,k} - \bar{c}_k)^2}{(\sigma(k))^2} \leq 1$, where $\bar{c}_k$ and $\sigma(k)$ are the mean and standard deviation of the $k$th element of the coefficient vectors. Each of the other sub-domains includes the training samples in a specific region of a $K$-dimensional coordinate system. To be more specific, for each coefficient vector $\mathbf{c}_i$, a sub-domain membership word $g(\mathbf{c}_i)$ is generated by:

$$g(\mathbf{c}_i) = 1 + \sum_{k=1}^{K} b_c(c_{i,k}) \, 2^{k-1}, \quad (7)$$

where $b_c(c_{i,k})$ is a coding function that converts the $k$th element of a coefficient vector to a bit:

$$b_c(c_{i,k}) = \begin{cases} 1 & \text{if } c_{i,k} \geq \bar{c}(k) \\ 0 & \text{otherwise} \end{cases}. \quad (8)$$

Then the $m$th sub-domain, $1 \leq m \leq 2^K$, includes the training samples with membership word $g(\mathbf{c}_i) = m$. Our domain split strategy results in $M$ sub-domains with overlapping boundaries. This is different from previous studies using multi-view models, such as [39, 46], in which the intersection of any two different subsets is empty, i.e. $\mathcal{T}_i \cap \mathcal{T}_j = \emptyset$, $\forall i \neq j$.
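A sketch of this split, assuming the coding function of Eq. (8) assigns a 1-bit when a coefficient lies at or above its mean (consistent with the quadrant labels in Fig. 3):

```python
import numpy as np

def membership_word(c, c_mean):
    """Eq. (7): sub-domain membership word for a K-dim coefficient vector."""
    bits = (c >= c_mean).astype(int)              # Eq. (8), one bit per axis
    return 1 + int(bits @ (2 ** np.arange(len(c))))

def assign_training_subdomains(C, c_mean, c_std):
    """Overlapping training subsets: sample i joins subset g(c_i) and,
    if it falls inside the unit ellipsoid, also the central subset M."""
    K = C.shape[1]
    M = 2 ** K + 1
    subsets = {m: [] for m in range(1, M + 1)}
    for i, c in enumerate(C):
        subsets[membership_word(c, c_mean)].append(i)
        if np.sum(((c - c_mean) / c_std) ** 2) <= 1.0:
            subsets[M].append(i)
    return subsets
```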
The advantage of our domain split strategy is that it improves the fault tolerance of each trained domain-attention model, because different sub-domains overlap. For a test sample, a domain predictor may output an inaccurate label for model selection due to the rough shape estimate provided by the previous general CSR. However, the inaccurately selected domain-specific model is still able to refine the current shape estimate. To further improve this refinement capacity, we propose a fuzzy training strategy. For each domain-specific CSR, we use all the training samples from the original training set to train a specific regressor, but weight the training samples of the specific domain more heavily by increasing their fuzzy set membership values in the objective function. More specifically, to train the $n$th weak regressor of the $m$th domain-specific CSR, the objective function is defined as:

$$\arg\min_{\mathbf{A}^d, \mathbf{e}^d} \sum_{i=1}^{I} w_i ||\mathbf{A}^d \cdot f^c(I_i, \mathbf{s}^0_i) + \mathbf{e}^d - \delta\mathbf{s}^*_i||_2^2 + \lambda ||\mathbf{A}^d||_F^2, \quad (9)$$

where $w_i$ is a fuzzy set membership value defined by:

$$w_i = \begin{cases} 1 - h(n) & \text{if } \{I_i, \mathbf{b}_i, \mathbf{s}^*_i\} \in \mathcal{T}_m \\ h(n) & \text{otherwise} \end{cases}, \quad (10)$$

and $h(n)$ is a decreasing function which progressively reduces the weights of the training samples not belonging to the $m$th sub-domain and increases the weights of the training samples of the $m$th sub-domain. This is a standard weighted least-squares estimation problem with a closed-form solution. It should be noted that our fuzzy domain-specific model learns a weak regressor that is able to refine a face shape estimate from any sub-domain, with a better capacity to refine face shapes from its specific domain. This capability is exhibited even when using a domain split strategy without overlap.
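The weighted closed-form solution differs from Eq. (3) only by the per-sample weights $w_i$; a minimal sketch (again regularising the offset for brevity):

```python
import numpy as np

def train_fuzzy_weak_regressor(F, dS, in_domain, h_n, lam=1e4):
    """Weighted ridge regression of Eqs. (9)-(10).

    F:         (Nf, I) feature matrix; dS: (2L, I) target shape updates.
    in_domain: boolean mask marking samples of sub-domain m.
    h_n:       value of the annealing schedule h(n) at cascade stage n.
    """
    Nf, I = F.shape
    w = np.where(in_domain, 1.0 - h_n, h_n)       # Eq. (10)
    Fb = np.vstack([F, np.ones((1, I))])          # append 1 for the offset
    W = (dS * w) @ Fb.T @ np.linalg.inv((Fb * w) @ Fb.T + lam * np.eye(Nf + 1))
    return W[:, :Nf], W[:, Nf]
```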
4.3. Dynamic Domain Selection in Testing
Given a new test image with a detected face bounding box, the trained DAC-CSR model $\Phi = \{\phi^b, \Phi^g, \Phi^d\}$ first applies the face bounding box refiner $\phi^b$ and the general CSR $\Phi^g$ to obtain the intermediate face shape estimate $\mathbf{s}_0$. Then a specific domain-attention weak regressor is selected to further update the current shape estimate.

To select an appropriate weak regressor, the current shape estimate $\mathbf{s}_0$ is projected into the PCA space learned at training time to obtain the coefficient vector $\mathbf{c}$, and the label of the sub-domain is obtained using:

$$p(\mathbf{c}) = \begin{cases} 2^K + 1 & \text{if } \sum_{k=1}^{K} \frac{(c_k - \bar{c}_k)^2}{(\sigma(k))^2} \leq 1 \\ g(\mathbf{c}) & \text{otherwise} \end{cases}. \quad (11)$$

Note that, here, the sub-domains do not overlap. This is different from the domain split strategy used in the training stage. However, this domain prediction function is based only on the current shape information and may provide inaccurate labels for model selection. To address this issue and further improve the fault-tolerance capacity of our DAC-CSR, a dynamic domain selection strategy is used.
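A sketch of the predictor of Eq. (11); normalise_shape stands for the [0, 1] shape normalisation of Section 4.2 and membership_word for Eq. (7), both assumed given:

```python
import numpy as np

def predict_domain(s, pca_mean, pca_basis, c_mean, c_std):
    """Eq. (11): the central domain M = 2^K + 1 if the coefficients fall
    inside the unit ellipsoid, otherwise the membership-word domain."""
    c = pca_basis.T @ (normalise_shape(s) - pca_mean)   # first K coefficients
    if np.sum(((c - c_mean) / c_std) ** 2) <= 1.0:
        return 2 ** len(c) + 1
    return membership_word(c, c_mean)
```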
As discussed in the last section, a trained domain-specific CSR is able to improve the current shape estimate even if it is selected in error by the domain prediction mechanism. Hence the updated shape estimate produced by the $n$th weak regressor can be a basis for selecting a more appropriate domain in the next step of the shape updating process. We re-run the domain prediction before applying the next weak regressor and choose the $(n+1)$st weak regressor of the newly selected domain-specific model for the current shape update, as summarised in Algorithm 1. This dynamic model selection strategy is repeated after each shape update in our domain-specific CSR.
4.4. Context-aware Feature Extraction
Feature extraction is crucial for constructing a robust mapping from the feature space to shape updates. In classical CSR-based approaches, shape-related local features are created by concatenating all the local features extracted around each landmark into a long vector. Although this sparse shape-related feature extraction method provides a good description of the texture of different facial parts, it does not offer a good representation of the contextual information of faces. In our DAC-CSR, we use a context-aware feature extraction method. To be more specific, we use both a dense local description of the whole face region and sparse shape-related local features for weak regressor training (Fig. 2). Note that, for the first bounding box refinement step, we use only the dense local features.
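A rough sketch of $f^c$ with scikit-image HOG, using the parameter values reported in Section 6.1; the additional multi-scale HOG on the central 15 × 15 patch is omitted for brevity, and the exact descriptor configuration is an assumption:

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def context_aware_features(image, s, box):
    """Sketch of f_c(I, s): dense whole-face HOG concatenated with sparse
    per-landmark HOG patches. `image` is assumed to be grayscale."""
    x1, y1, x2, y2 = (int(v) for v in box)
    dense = hog(resize(image[y1:y2, x1:x2], (100, 100)),
                pixels_per_cell=(10, 10), cells_per_block=(2, 2))
    xs, ys = s[0::2], s[1::2]
    # patch radius: 1/7 of the larger side of the current shape estimate
    r = max(int(max(xs.max() - xs.min(), ys.max() - ys.min()) / 7), 1)
    feats = [dense]
    h, w = image.shape
    for x, y in zip(xs.astype(int), ys.astype(int)):
        patch = image[max(y - r, 0):min(y + r, h), max(x - r, 0):min(x + r, w)]
        feats.append(hog(resize(patch, (30, 30)),
                         pixels_per_cell=(10, 10), cells_per_block=(2, 2)))
    return np.concatenate(feats)
```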
5. 2D Profile Face Generation
For a learning-based approach, a large number of annotated face images is crucial for training. As discussed in Section 2, traditional data augmentation methods are not able to inject new out-of-plane pose variations, and the use of 3D face models is very expensive. To mitigate this issue, we propose a simple 2D-based method that can generate virtual faces with out-of-plane pose variations. A comparison between our proposed 2D-based profile face generator and two 3D-based methods [15, 47] is shown in Fig. 4.

Figure 4. A comparison of synthesised 2D faces using (a) a 3D morphable model [15], (b) 3D-based face profiling [47], and (c) our 2D-based method.

To warp a face image to another pose, we first build a PCA-based shape model that is equivalent to the shape model used in ASM [8] and AAM [7, 27]. Then we choose the shape eigenvector controlling yaw rotation (usually the first one) to change the pose of the current face shape. To this end, we first calculate the coefficient obtained by projecting the shape of a face image onto the selected eigenvector. A new face shape with pose variations is generated by adjusting this projected coefficient. The 2D shape model used here is constructed from a face dataset rich in pose variations. Note that we only generate pose-varying face shapes with the same rotation direction as the original shape, i.e. left or right. Then we expand the face shape with additional external facial landmarks and compute a 2D mesh of the original and new shapes using Delaunay triangulation, as shown in Fig. 5. Last, a piece-wise affine warp is used to map the texture from the original face shape to the new one [27]. Moreover, the synthesised faces can be flipped about their vertical axis to obtain further faces with pose variations in the other direction (right or left), which is similar to [47].

Figure 5. The mesh generated for 2D image warping.
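The generator can be sketched with scikit-image, which performs the triangulation internally; the expansion of the shape with external landmarks is assumed to have been applied already, and yaw_vec denotes the yaw-controlling shape eigenvector:

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def synthesise_pose(image, s, mean_shape, yaw_vec, new_coeff):
    """Shift the yaw-controlling PCA coefficient of shape `s` to `new_coeff`
    and warp the texture onto the new shape."""
    coeff = yaw_vec @ (s - mean_shape)         # current yaw coefficient
    s_new = s + (new_coeff - coeff) * yaw_vec  # shape with adjusted coefficient
    tform = PiecewiseAffineTransform()
    # warp() treats the transform as the output -> input (inverse) map, so we
    # estimate the mapping from the new landmarks back to the original ones
    tform.estimate(s_new.reshape(-1, 2), s.reshape(-1, 2))
    return warp(image, tform), s_new
```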
6. Experimental Results
6.1. Datasets and Implementation Details
Datasets: In our experiments, we use two challenging face datasets, the Annotated Facial Landmarks in the Wild (AFLW) dataset [24] and the Caltech Occluded Faces in the Wild (COFW) dataset [4], to evaluate the performance of our DAC-CSR architecture.

The AFLW dataset has 25993 unconstrained faces with large-scale pose variations of up to ±90°. Each AFLW face image has up to 21 landmarks on visible facial parts. AFLW does not have a standard protocol for FLD evaluation; hence we follow the protocol used in Cascaded Compositional Learning (CCL) [46]. This is the first work to use the whole AFLW dataset to benchmark an FLD algorithm, and it reports the best results on AFLW to date. CCL used 24386 images from AFLW and manually annotated all the missing landmarks in the original dataset. The annotation scheme opted for 19 landmarks per image, omitting the two ear landmarks (ID-13 and ID-17). CCL has two protocols: AFLW-full and AFLW-frontal, as shown in Table 1. AFLW-full splits the 24386 images into 20000/4386 for training/testing. The AFLW-frontal protocol selects 1165 frontal images from the 4386 test images to evaluate an FLD algorithm on frontal faces. (In our experiments, 1314 frontal faces were selected using the list provided by [46].)

Table 1. A summary of the evaluation protocols used in our experiments.

  Protocol       Training Set      Test Set         # Landmarks   Normalisation   Setting
  AFLW-full      20000 from AFLW   4386 from AFLW   19            face size       CCL [46]
  AFLW-frontal   20000 from AFLW   1165 from AFLW   19            face size       CCL [46]
  COFW           1345 from COFW    507 from COFW    29            eye distance    standard [4]
The COFW dataset has 1345 training and 507 test images, all of unconstrained faces. Each COFW face has 29 manually annotated landmarks. COFW is a challenging benchmark containing major occlusions.
Implementation Details: In our experiments, we used only one weak regressor for face bounding box refinement. The numbers of weak regressors for the general CSR and domain-specific CSRs were set to 2 and 3, respectively. We set the number of sub-domains to $M = 5$ using 2 PCA shape coefficients, i.e. $K = 2$. The regularisation weight in the ridge regression training was set to $\lambda = 10000$, and the decreasing schedule controlling the fuzzy membership values was set to $h(n) = (0.3, 0.2, 0.1)$ for $n = (1, 2, 3)$. To extract the dense face description, we resized the face region to 100 × 100 and extracted HOG features using a cell size of 10 and a block size of 2. To extract the sparse shape-related local features, we computed the HOG descriptor in the neighbourhood of each facial landmark. The radius was set to 1/7 of the maximum of the height and width of the current shape estimate. Each local image patch was resized to 30 × 30 and the cell size was set to 10. In addition, the central 15 × 15 image patch was used to extract multi-scale HOG features using a cell size of 5.

To augment the training data, we applied our 2D-based method to generate virtual face images with new poses. Each training image in COFW was augmented with 9 new poses. For AFLW, we only synthesised new faces for semi-frontal training images. We also flipped all the training images about the vertical axis, added Gaussian blur with σ = 1 pixel and performed random perturbations of the initial face bounding boxes.
6.2. Evaluation on AFLW
The Cumulative Error Distribution (CED) curve of our DAC-CSR under the AFLW-full protocol is shown in Fig. 6. The error was calculated as the Euclidean distance between the detected and ground-truth landmarks, normalised by the face size [46]. Our DAC-CSR achieves much better results under the AFLW-full protocol than the current best result reported for CCL [46].
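For reference, the error measure amounts to the following (with `norm` set to the face size here, and to the inter-ocular distance in Section 6.3):

```python
import numpy as np

def normalised_error(pred, gt, norm):
    """Mean point-to-point Euclidean error over the L landmarks, divided by
    a normalisation term (face size on AFLW, eye distance on COFW)."""
    d = np.linalg.norm(pred.reshape(-1, 2) - gt.reshape(-1, 2), axis=1)
    return d.mean() / norm
```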
Table 2 compares our DAC-CSR with state-of-the-art methods on AFLW using both the AFLW-full and AFLW-frontal protocols. The results obtained with our DAC-CSR show the best normalised average error on both the full test set and the frontal face subset.

Figure 6. A CED curve comparison of our DAC-CSR with state-of-the-art methods, including SDM [38], ERT [22], RCPR [4], CFSS [45], LBF [29], LBF + GRF [19] and CCL [46], on the AFLW dataset. In this experiment, 20000 images were used for training and 4386 for testing, following the AFLW-full protocol in [46].

Table 2. A comparison of our DAC-CSR with state-of-the-art methods on AFLW, measured in terms of the average error normalised by face size. The protocol is the same as in [46].

  Method           AFLW-full   AFLW-frontal
  SDM [38]         4.05%       2.94%
  RCPR [4]         3.73%       2.87%
  ERT [22]         4.35%       2.75%
  LBF [29]         4.25%       2.74%
  LBF + GRF [19]   3.15%       N.A.
  CFSS [45]        3.92%       2.68%
  CCL [46]         2.72%       2.17%
  Our DAC-CSR      2.27%       1.81%
6.3. Evaluation on COFW
6.3.1 Comparison to State-of-the-art
The CED curves of our DAC-CSR and a set of state-of-the-art methods on the COFW dataset are shown in Fig. 7. In addition, a more detailed comparison is presented in Table 3, reporting the average error, failure rate and speed. The failure rate is defined as the percentage of test images with more than 10% detection error.

Figure 7. A comparison between our DAC-CSR and state-of-the-art methods, including SDM [38], RCPR [4], RCRC [16], CDRN [42] and DRDA [42], on COFW.

Table 3. Comparison on COFW. The error was measured on 29 landmarks and normalised by the inter-ocular distance.

  Method         Error    Failure   Speed (FPS)
  ESR [5]        11.2%    36%       4
  RCPR [4]       8.5%     20%       3
  HPM [18]       7.5%     13%       0.03
  RCRC [16]      7.3%     12%       22
  CCR [15]       7.03%    10.9%     69
  DRDA [42]      6.46%    6%        N.A.
  RAR [37]       6.03%    4.14%     4 (GPU)
  Our DAC-CSR    6.03%    4.73%     10

Our DAC-CSR achieves competitive accuracy compared to the two cutting-edge deep-neural-network-based algorithms, DRDA [42] and RAR [37]. In addition, the speed of our DAC-CSR on an Intel i7-4790 CPU is up to 10 FPS, which is faster than RAR with GPU acceleration (NVIDIA Titan Z). As the current bottleneck for unconstrained FLD is not speed, e.g. LBF can perform FLD at up to 3000 FPS, the key aim of our DAC-CSR is to provide a more robust FLD algorithm for faces with extreme appearance variations, as exhibited in the AFLW-full evaluation.
6.3.2 Self Evaluation
In this part, we investigate the contributions of the proposed DAC-CSR architecture and our 2D-based data augmentation method to the accuracy of FLD on COFW. To this end, we compare the classical CSR method trained on the original training set (CSR) with the classical CSR trained on the dataset augmented with faces synthesised by our 2D-based face generation method (CSR+SYN), our DAC-CSR trained on the original dataset (DAC-CSR), and our DAC-CSR trained on the augmented dataset (DAC-CSR+SYN). The CED curves for these settings are shown in Fig. 8.

Figure 8. A self-evaluation of our proposed DAC-CSR on COFW. The meaning of each term is introduced in Section 6.3.2.

In fact, the architecture of the classical CSR is the same as that of SDM [38]; they also have similar CED curves (compare Fig. 7 with Fig. 8). As indicated by Fig. 8, the new DAC-CSR architecture trained on the original dataset performs better than CSR with our 2D-based data augmentation method (DAC-CSR vs CSR+SYN). However, the best result is achieved when the new DAC-CSR architecture is used jointly with our 2D-based data augmentation method.
7. Conclusion
We have presented a new DAC-CSR architecture for robust FLD in unconstrained faces. The proposed method achieved superior FLD results on the challenging AFLW dataset and delivered competitive performance on the COFW dataset. This is due to the proposed versatile fault-tolerant mechanism using fuzzy domain-specific model training and the online dynamic model selection strategy. In addition, a simple but effective data augmentation method based on 2D face synthesis was proposed. Compared with the classical CSR method, both the new DAC-CSR architecture and the 2D-based data augmentation method proved beneficial for the FLD performance on unconstrained faces.

We believe that our contributions can be further extended, e.g. using deep-neural-network-based approaches. We leave for future work the exploration of methods that combine our DAC-CSR architecture and data augmentation method with other FLD algorithms.
Acknowledgements
This work was supported in part by the EPSRC Programme Grant ‘FACER2VM’ (EP/N007743/1), the National Natural Science Foundation of China (61373055, 61672265) and the Natural Science Foundation of Jiangsu Province (BK20140419, BK20161135).
References
[1] O. Aldrian and W. A. P. Smith. Inverse rendering of faces
with a 3D morphable model. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 35(5):1080–1093, 2013.
[2] P. N. Belhumeur, D. W. Jacobs, D. Kriegman, and N. Kumar.
Localizing parts of faces using a consensus of exemplars. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 545–552, 2011.
[3] J. R. Beveridge, H. Zhang, B. A. Draper, P. J. Flynn, Z.-H.
Feng, P. Huber, J. Kittler, Z. Huang, S. Li, Y. Li, et al. Re-
port on the FG 2015 video person recognition evaluation. In
IEEE International Conference on Automatic Face and Ges-
ture Recognition (FG), volume 1, pages 1–8. IEEE, 2015.
[4] X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. In International Conference on Computer Vision, 2013.
[5] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by Ex-
plicit Shape Regression. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 2887–2894. IEEE,
2012.
[6] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by ex-
plicit shape regression. International Journal of Computer
Vision, 107(2):177–190, 2014.
[7] T. F. Cootes, G. Edwards, and C. J. Taylor. Active appear-
ance models. In European Conference on Computer Vision,
volume 1407, pages 484–498, 1998.
[8] T. F. Cootes and C. J. Taylor. Active shape models - ‘smart snakes’. In British Machine Vision Conference, pages 266–275, 1992.
[9] T. F. Cootes, G. V. Wheeler, K. N. Walker, and C. J. Tay-
lor. View-based active appearance models. Image and Vision
Computing, 20(9):657–664, 2002.
[10] D. Cristinacce and T. F. Cootes. Feature Detection and Tracking with Constrained Local Models. In British Machine Vision Conference, volume 3, pages 929–938, 2006.
[11] J. Deng, Q. Liu, J. Yang, and D. Tao. M3csr: Multi-view,
multi-scale and multi-component cascade shape regression.
Image and Vision Computing, 47:19–26, 2016.
[12] P. Dollár, P. Welinder, and P. Perona. Cascaded pose regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1078–1085. IEEE, 2010.
[13] S. Eleftheriadis, O. Rudovic, M. P. Deisenroth, and M. Pan-
tic. Variational gaussian process auto-encoder for ordinal
prediction of facial action units. In Asian Conference on
Computer Vision, Taipei, Taiwan. Oral, November 2016.
[14] S. Eleftheriadis, O. Rudovic, and M. Pantic. Discriminative
shared gaussian processes for multiview and view-invariant
facial expression recognition. IEEE transactions on Image
Processing, 24(1):189–204, 2015.
[15] Z.-H. Feng, G. Hu, J. Kittler, W. Christmas, and X.-J. Wu.
Cascaded collaborative regression for robust facial landmark
detection trained using a mixture of synthetic and real im-
ages with dynamic weighting. IEEE Transactions on Image
Processing, 24(11):3425–3440, 2015.
[16] Z.-H. Feng, P. Huber, J. Kittler, W. Christmas, and X. Wu.
Random Cascaded-Regression Copse for Robust Facial
Landmark Detection. IEEE Signal Processing Letters,
22(1):76–80, Jan 2015.
[17] Z.-H. Feng, J. Kittler, W. Christmas, X.-J. Wu, and S. Pfeif-
fer. Automatic face annotation by multilinear AAM with
missing values. In International Conference on Pattern
Recognition (ICPR), pages 2586–2589. IEEE, 2012.
[18] G. Ghiasi and C. C. Fowlkes. Occlusion Coherence: Local-
izing Occluded Faces with a Hierarchical Deformable Part
Model. In IEEE Conference on Computer Vision and Pat-
tern Recognition, June 2014.
[19] K. Hara and R. Chellappa. Growing regression forests by
classification: Applications to object pose estimation. In
European Conference on Computer Vision, (ECCV), pages
552–567. Springer, 2014.
[20] G. Hu, F. Yan, J. Kittler, W. Christmas, C.-H. Chan, Z.-H.
Feng, and P. Huber. Efficient 3D Morphable Face Model
Fitting. Pattern Recognition, 67:366–379, 2017.
[21] P. Huber, Z.-H. Feng, W. Christmas, J. Kittler, and M. Rätsch. Fitting 3D Morphable Face Models using local features. In IEEE International Conference on Image Processing (ICIP), pages 1195–1199. IEEE, 2015.
[22] V. Kazemi and J. Sullivan. One millisecond face alignment
with an ensemble of regression trees. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages
1867–1874, 2014.
[23] J. Kittler, P. Huber, Z.-H. Feng, G. Hu, and W. Christmas. 3D
Morphable Face Models and Their Applications. In Inter-
national Conference on Articulated Motion and Deformable
Objects, pages 185–206. Springer, 2016.
[24] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. An-
notated Facial Landmarks in the Wild: A Large-scale, Real-
world Database for Facial Landmark Localization. In First
IEEE International Workshop on Benchmarking Facial Im-
age Analysis Technologies, 2011.
[25] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. In-
teractive facial feature localization. In European Conference
on Computer Vision, pages 679–692. Springer, 2012.
[26] F. Liu, D. Zeng, Q. Zhao, and X. Liu. Joint face align-
ment and 3d face reconstruction. In European Conference on
Computer Vision (ECCV), pages 545–560. Springer, 2016.
[27] I. Matthews and S. Baker. Active Appearance Models Revis-
ited. International Journal of Computer Vision, 60(2):135–
164, 2004.
[28] M. Piotraschke and V. Blanz. Automated 3d face recon-
struction from multiple images using quality measures. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), June 2016.
[29] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000
fps via regressing local binary features. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages
1685–1692, 2014.
[30] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou,
and M. Pantic. 300 faces in-the-wild challenge: Database
and results. Image and Vision Computing, 47:3–18, 2016.
[31] X. Song, Z.-H. Feng, G. Hu, J. Kittler, W. Christmas, and X.-J. Wu. Dictionary Integration using 3D Morphable Face Models for Pose-invariant Collaborative-representation-based Classification. arXiv preprint arXiv:1611.00284, 2016.
[32] Y. Sun, X. Wang, and X. Tang. Deep Convolutional Network
Cascade for Facial Point Detection. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 3476–
3483, 2013.
[33] Y. Sun, X. Wang, and X. Tang. Deep learning face represen-
tation from predicting 10,000 classes. In The IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
June 2014.
[34] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface:
Closing the gap to human-level performance in face verifica-
tion. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1701–1708, 2014.
[35] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos,
and S. Zafeiriou. Mnemonic Descent Method: A Recur-
rent Process Applied for End-To-End Face Alignment. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), June 2016.
[36] O. Tuzel, T. K. Marks, and S. Tambe. Robust face align-
ment using a mixture of invariant experts. In B. Leibe,
J. Matas, N. Sebe, and M. Welling, editors, European Con-
ference on Computer Vision (ECCV), pages 825–841, Cham,
2016. Springer International Publishing.
[37] S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kas-
sim. Robust facial landmark detection via recurrent attentive-
refinement networks. In European Conference on Computer
Vision (ECCV), 2016.
[38] X. Xiong and F. De la Torre. Supervised descent method
and its applications to face alignment. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages
532–539, 2013.
[39] X. Xiong and F. De la Torre. Global supervised descent
method. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 2664–2673,
2015.
[40] J. Yan, Z. Lei, D. Yi, and S. Z. Li. Learn to Combine Multiple
Hypotheses for Accurate Face Alignment. In International
Conference of Computer Vision - Workshops, 2013.
[41] H. Yang, X. Jia, I. Patras, and K.-P. Chan. Random Sub-
space Supervised Descent Method for Regression Problems
in Computer Vision. Signal Processing Letters, IEEE,
22(10):1816–1820, 2015.
[42] J. Zhang, M. Kan, S. Shan, and X. Chen. Occlusion-Free
Face Alignment: Deep Regression Networks Coupled With
De-Corrupt AutoEncoders. In IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), June 2016.
[43] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-Fine
Auto-Encoder Networks (CFAN) for Real-Time Face Align-
ment. In European Conference on Computer Vision, volume
8690, pages 1–16. Springer International Publishing, 2014.
[44] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark
detection by deep multi-task learning. In European Confer-
ence on Computer Vision, pages 94–108. Springer, 2014.
[45] S. Zhu, C. Li, C. Change Loy, and X. Tang. Face align-
ment by coarse-to-fine shape searching. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages
4998–5006, 2015.
[46] S. Zhu, C. Li, C.-C. Loy, and X. Tang. Unconstrained
Face Alignment via Cascaded Compositional Learning. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), June 2016.
[47] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face Alignment
Across Large Poses: A 3D Solution. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
2016.
[48] X. Zhu, J. Yan, D. Yi, Z. Lei, and S. Z. Li. Discriminative 3D morphable model fitting. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, volume 1, pages 1–8, 2015.
  • ... The first one is to use multi- view models. There is a long history of the use of multi- view models in landmark localisation, from the earlier studies on ASM [47] and AAM [10] to recent work on cascaded-regression-based [64,75,21] and deep-learning- based approaches [12]. For example, Feng et al. train multi- view cascaded regression models using a fuzzy membership weighting strategy, which, interestingly, outperforms even some deep-learning-based approaches [21]. ...
    ... There is a long history of the use of multi- view models in landmark localisation, from the earlier studies on ASM [47] and AAM [10] to recent work on cascaded-regression-based [64,75,21] and deep-learning- based approaches [12]. For example, Feng et al. train multi- view cascaded regression models using a fuzzy membership weighting strategy, which, interestingly, outperforms even some deep-learning-based approaches [21]. The second strategy, which has become very popular in recent years, is to use 3D face models [76,28,2,38,29]. ...
    ... Each image has 19 facial landmarks. We use three state-of-the-art algorithms [75,21,39] as our baseline for comparison. The first one is the Cascaded Compositional Learning algorithm (CCL) [75], which is a multi-view cascaded regression model based on random forests. ...
    Conference Paper
    Full-text available
    We present a new loss function, namely Wing loss, for robust facial landmark localisation with Convolutional Neural Networks (CNNs). We first compare and analyse different loss functions including L2, L1 and smooth L1. The analysis of these loss functions suggests that, for the training of a CNN-based localisation model, more attention should be paid to small and medium range errors. To this end, we design a piece-wise loss function. The new loss amplifies the impact of errors from the interval (-w, w) by switching from L1 loss to a modified logarithm function. To address the problem of under-representation of samples with large out-of-plane head rotations in the training set, we propose a simple but effective boosting strategy, referred to as pose-based data balancing. In particular, we deal with the data imbalance problem by duplicating the minority training samples and perturbing them by injecting random image rotation, bounding box translation and other data augmentation approaches. Last, the proposed approach is extended to create a two-stage framework for robust facial landmark localisation. The experimental results obtained on AFLW and 300W demonstrate the merits of the Wing loss function, and prove the superiority of the proposed method over the state-of-the-art approaches.
  • ... It is essential to convert the data into a meaningful form for accurate data analysis, which requires pre-processing the data before it can be used to develop a prediction or classification model. To improve classification accuracy, dimensional reduction [25][26][27] and data augmentation [28][29][30] have been studied. Garcke et al. [25] proposed a method to reduce the dimension of nonlinear time-series data extracted from wind turbines, setting the baseline so as to distinguish normal turbines from abnormal turbines, and monitoring the state of the wind turbines. ...
    ... To compared to typical approaches based on sound analysis, we used three feature extraction methods (i.e., MFCC [28], spectrogram [29], and mel-spectrum [30]) and two classification methods (i.e., SVM [31] and K-NN [32]). Note that, since the intensity data is time series data, we applied the Shapelets [33] and LSTM-FCN [27] without the feature extraction methods. ...
    Article
    Full-text available
    The use of IoT (Internet of Things) technology for the management of pet dogs left alone at home is increasing. This includes tasks such as automatic feeding, operation of play equipment, and location detection. Classification of the vocalizations of pet dogs using information from a sound sensor is an important method to analyze the behavior or emotions of dogs that are left alone. These sounds should be acquired by attaching the IoT sound sensor to the dog, and then classifying the sound events (e.g., barking, growling, howling, and whining). However, sound sensors tend to transmit large amounts of data and consume considerable amounts of power, which presents issues in the case of resource-constrained IoT sensor devices. In this paper, we propose a way to classify pet dog sound events and improve resource efficiency without significant degradation of accuracy. To achieve this, we only acquire the intensity data of sounds by using a relatively resource-efficient noise sensor. This presents issues as well, since it is difficult to achieve sufficient classification accuracy using only intensity data due to the loss of information from the sound events. To address this problem and avoid significant degradation of classification accuracy, we apply long short-term memory-fully convolutional network (LSTM-FCN), which is a deep learning method, to analyze time-series data, and exploit bicubic interpolation. Based on experimental results, the proposed method based on noise sensors (i.e., Shapelet and LSTM-FCN for time-series) was found to improve energy efficiency by 10 times without significant degradation of accuracy compared to typical methods based on sound sensors (i.e., mel-frequency cepstrum coefficient (MFCC), spectrogram, and mel-spectrum for feature extraction, and support vector machine (SVM) and k-nearest neighbor (K-NN) for classification).
  • ... Face Alignment: The earliest approaches to regression- based face alignment trained a cascade of regressors to de- tect face landmarks [9,10,18,29,70]. More recently, deep convolutional neural networks (CNNs) have been used for both 2D and 3D facial landmark detection from 2D images [28,65]. ...
    Preprint
    While much progress has been made in capturing high-quality facial performances using motion capture markers and shape-from-shading, high-end systems typically also rely on rotoscope curves hand-drawn on the image. These curves are subjective and difficult to draw consistently; moreover, ad-hoc procedural methods are required for generating matching rotoscope curves on synthetic renders embedded in the optimization used to determine three-dimensional facial pose and expression. We propose an alternative approach whereby these curves and other keypoints are detected automatically on both the image and the synthetic renders using trained neural networks, eliminating artist subjectivity and the ad-hoc procedures meant to mimic it. More generally, we propose using machine learning networks to implicitly define deep energies which when minimized using classical optimization techniques lead to three-dimensional facial pose and expression estimation.
  • ... However, the faces of some low resolution images with extreme pose variations were missed by MTCNN. For those face images, a bounding box regression approach was used to obtain the face bounding box, as described in [19], [20]. ...
    Conference Paper
    Full-text available
    This paper investigates the evaluation of dense 3D face reconstruction from a single 2D image in the wild. To this end, we organise a competition that provides a new benchmark dataset that contains 2000 2D facial images of 135 subjects as well as their 3D ground truth face scans. In contrast to previous competitions or challenges, the aim of this new benchmark dataset is to evaluate the accuracy of a 3D dense face reconstruction algorithm using real, accurate and high-resolution 3D ground truth face scans. In addition to the dataset, we provide a standard protocol as well as a Python script for the evaluation. Last, we report the results obtained by three state-of-the-art 3D face reconstruction systems on the new benchmark dataset. The competition is organised along with the 2018 13th IEEE Conference on Automatic Face & Gesture Recognition.
  • ... However, the faces of some low resolution images with extreme pose variations were missed by MTCNN. For those face images, a bounding box regression approach was used to obtain the face bounding box, as described in [19], [20]. ...
    Conference Paper
    Full-text available
    This paper investigates the evaluation of dense 3D face reconstruction from a single 2D image in the wild. To this end, we organise a competition that provides a new benchmark dataset that contains 2000 2D facial images of 135 subjects as well as their 3D ground truth face scans. In contrast to previous competitions or challenges, the aim of this new benchmark dataset is to evaluate the accuracy of a 3D dense face reconstruction algorithm using real, accurate and high-resolution 3D ground truth face scans. In addition to the dataset, we provide a standard protocol as well as a Python script for the evaluation. Last, we report the results obtained by three state-of-the-art 3D face reconstruction systems on the new benchmark dataset. The competition is organised along with the 2018 13th IEEE Conference on Automatic Face & Gesture Recognition.
  • ... Fitting a cohort-specific 3D face model to faces within the same cohort is easier and more accurate than fitting a global/general model. This has also been investigated and demon- strated for other computer vision tasks in [12,17,20,53] . The main reason is that a fitting algorithm usually starts the optimisation from the mean of the model. ...
    Article
    Full-text available
    3D Morphable Face Models (3DMM) have been used in pattern recognition for some time now. They have been applied as a basis for 3D face recognition, as well as in an assistive role for 2D face recognition to perform geometric and photometric normalisation of the input image, or in 2D face recognition system training. The statistical distribution underlying 3DMM is Gaussian. However, the single-Gaussian model seems at odds with reality when we consider different cohorts of data, e.g. Black and Chinese faces. Their means are clearly different. This paper introduces the Gaussian Mixture 3DMM (GM-3DMM) which models the global population as a mixture of Gaussian subpopulations, each with its own mean. The proposed GM-3DMM extends the traditional 3DMM naturally, by adopting a shared covariance structure to mitigate small sample estimation problems associated with data in high dimensional spaces. We construct a GM-3DMM, the training of which involves a multiple cohort dataset, SURREY-JNU, comprising 942 3D face scans of people with mixed backgrounds. Experiments in fitting the GM-3DMM to 2D face images to facilitate their geometric and photometric normalisation for pose and illumination invariant face recognition demonstrate the merits of the proposed mixture of Gaussians 3D face model.
  • Chapter
    Recent state-of-the-art landmark localization methods are dominated by heatmap regression and fully convolutional networks. In spite of its superior performance in face alignment, heatmap regression has a few inherent drawbacks: it does not enforce a shape constraint and it is sensitive to partial occlusions. In this paper, we propose a score-guided face alignment network that simultaneously outputs a heatmap and a corresponding score map for each landmark. Rather than treating all predicted landmarks equally, a weight is assigned to each landmark based on the two relational maps. In this way, reliable landmarks with strong local information are assigned large weights, and the landmarks with small weights, which may be affected by occlusions, can be inferred with the help of the reliable ones. Meanwhile, an exemplar-based shape dictionary is designed to exploit the landmarks with high scores to infer those with small scores, so the shape constraint is applied implicitly. Our method thus demonstrates superior performance in detecting landmarks under extreme occlusions while improving overall performance. Experimental results on the 300-W and COFW datasets show the effectiveness of the proposed method.
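    A hedged sketch of the decoding step this abstract describes (tensor names and shapes are illustrative): each landmark's location is taken from its heatmap peak and its reliability weight from the corresponding score map, so that occluded landmarks receive little weight.

    import numpy as np

    def decode_landmarks(heatmaps, score_maps):
        """heatmaps, score_maps: (L, H, W) arrays, one channel per landmark.
        Returns (L, 2) x/y coordinates and (L,) normalised reliability weights."""
        L, H, W = heatmaps.shape
        peaks = heatmaps.reshape(L, -1).argmax(axis=1)      # flat index of each heatmap peak
        ys, xs = np.unravel_index(peaks, (H, W))
        coords = np.stack([xs, ys], axis=1).astype(float)
        weights = score_maps.reshape(L, -1).max(axis=1)     # high score = reliable landmark
        return coords, weights / (weights.sum() + 1e-8)     # normalise so the weights sum to one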
  • Article
    In this article, a two-stage refinement network is proposed for facial landmark detection under unconstrained conditions. Our model can be divided into two modules, namely the Head Attribute Classifier (HAC) module and the Domain-Specific Refinement (DSR) module. Given an input facial image, HAC adopts a multi-task learning mechanism to detect the head pose and obtain an initial shape. Based on the obtained head pose, DSR provides three different CNN-based refinement networks, each trained on a specific domain, and automatically selects the most appropriate network for landmark refinement. Different from existing two-stage models, HAC combines head pose prediction with facial landmark estimation to improve the accuracy of head pose prediction, as well as to obtain a robust initial shape. Moreover, an adaptive sub-network training strategy applied in the DSR module can effectively solve the issue of traditional multi-view methods whereby an improperly selected sub-network may result in alignment failure. Extensive experimental results on two public datasets, AFLW and 300W, confirm the validity of our model.
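    The control flow reduces to a small dispatch step, sketched below with placeholder callables (hac returns a pose label and an initial shape; refiners maps pose labels to trained refinement networks; all names are assumptions):

    def two_stage_refine(image, hac, refiners):
        """Pick the pose-specific refinement network indicated by the HAC module."""
        pose, init_shape = hac(image)              # multi-task head: pose label + initial shape
        return refiners[pose](image, init_shape)   # domain-specific landmark refinement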
  • Chapter
    Facial expression recognition is a topical task. However, very little research investigates subtle expression recognition, which is important for mental activity analysis, deception detection, etc. We address subtle expression recognition through convolutional neural networks (CNNs) by developing multi-task learning (MTL) methods to effectively leverage a side task: facial landmark detection. Existing MTL methods follow a design pattern of shared bottom CNN layers and task-specific top layers. However, the sharing architecture is usually heuristically chosen, as it is difficult to decide which layers should be shared. Our approach is composed of (1) a novel MTL framework that automatically learns which layers to share through optimisation under tensor trace norm regularisation and (2) an invariant representation learning approach that allows the CNN to leverage tasks defined on disjoint datasets without suffering from dataset distribution shift. To advance subtle expression recognition, we contribute a Large-scale Subtle Emotions and Mental States in the Wild database (LSEMSW). LSEMSW includes a variety of cognitive states as well as basic emotions. It contains 176K images, manually annotated with 13 emotions, and thus provides the first subtle expression dataset large enough for training deep CNNs. Evaluations on LSEMSW and 300-W (landmark) databases show the effectiveness of the proposed methods. In addition, we investigate transferring knowledge learned from LSEMSW database to traditional (non-subtle) expression recognition. We achieve very competitive performance on Oulu-Casia NIR&Vis and CK+ databases via transfer learning.
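    The layer-sharing regulariser can be sketched in a few lines: stack the weights of the same layer across tasks and penalise the trace (nuclear) norm of the stacked matrix, which pushes the per-task layers toward a shared low-rank structure. This is a simplified stand-in for the chapter's tensor formulation, not its exact objective.

    import numpy as np

    def trace_norm_penalty(task_weights):
        """task_weights: (T, D_in, D_out) -- the same layer's weights across T tasks.
        Returns the nuclear norm of the tasks-by-parameters unfolding."""
        unfolding = task_weights.reshape(task_weights.shape[0], -1)   # T x (D_in * D_out)
        return np.linalg.svd(unfolding, compute_uv=False).sum()       # sum of singular values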
  • Article
    Full-text available
    3D face reconstruction of shape and skin texture from a single 2D image can be performed using a 3D Morphable Model (3DMM) in an analysis-by-synthesis approach. However, performing this reconstruction (fitting) efficiently and accurately in a general imaging scenario is a challenge. Such a scenario would involve a perspective camera to describe the geometric projection from 3D to 2D, and the Phong model to characterise illumination. Under these imaging assumptions the reconstruction problem is nonlinear and, consequently, computationally very demanding. In this work, we present an efficient stepwise 3DMM-to-2D image-fitting procedure, which sequentially optimises the pose, shape, light direction, light strength and skin texture parameters in separate steps. By linearising each step of the fitting process we derive closed-form solutions for the recovery of the respective parameters, leading to efficient fitting. The proposed optimisation process involves all the pixels of the input image, rather than randomly selected subsets, which enhances the accuracy of the fitting. It is referred to as Efficient Stepwise Optimisation (ESO).
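    Each linearised step then admits a closed-form solution; with the other parameters fixed, it takes the generic form of a regularised linear least-squares update, sketched below (symbols are illustrative, not the paper's exact formulation):

    import numpy as np

    def closed_form_step(A, b, lam):
        """Solve min_a ||A a - b||^2 + lam ||a||^2, the generic form of one ESO-style step."""
        M = A.T @ A + lam * np.eye(A.shape[1])
        return np.linalg.solve(M, A.T @ b)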
  • Article
    Full-text available
    The paper presents a dictionary integration algorithm using 3D morphable face models (3DMM) for pose-invariant collaborative-representation-based face classification. To this end, we first fit a 3DMM to the 2D face images of a dictionary to reconstruct the 3D shape and texture of each image. The 3D faces are used to render a number of virtual 2D face images with arbitrary pose variations to augment the training data, by merging the original and rendered virtual samples to create an extended dictionary. Second, to reduce the information redundancy of the extended dictionary and improve the sparsity of reconstruction coefficient vectors using collaborative-representation-based classification (CRC), we exploit an on-line elimination scheme to optimise the extended dictionary by identifying the most representative training samples for a given query. The final goal is to perform pose-invariant face classification using the proposed dictionary integration method and the on-line pruning strategy under the CRC framework. Experimental results obtained for a set of well-known face datasets demonstrate the merits of the proposed method, especially its robustness to pose variations.
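    The CRC framework this article builds on has a well-known closed form, sketched here: code the query over the whole (extended) dictionary with a ridge penalty and assign the class whose columns best reconstruct it.

    import numpy as np

    def crc_classify(D, labels, y, lam=1e-3):
        """D: (d, n) dictionary with one training sample per column; labels: (n,); y: (d,) query."""
        x = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)  # closed-form collaborative coding
        best, best_res = None, np.inf
        for c in np.unique(labels):
            mask = labels == c
            res = np.linalg.norm(y - D[:, mask] @ x[mask])                # class-wise reconstruction residual
            if res < best_res:
                best, best_res = c, res
        return best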
  • Conference Paper
    In this work, we introduce a novel Recurrent Attentive-Refinement (RAR) network for facial landmark detection under unconstrained conditions, with challenges such as facial occlusions and/or pose variations. RAR follows the pipeline of cascaded regressions that refine landmark locations progressively. However, instead of updating all the landmark locations together, RAR refines the landmark locations sequentially at each recurrent stage. In this way, more reliable landmark points are refined earlier and help to infer the locations of other challenging landmarks that may be affected by occlusions and/or extreme poses. RAR can thus effectively control detection errors from those challenging landmarks and improve overall performance even in the presence of heavy occlusions and/or extreme conditions. To determine the sequence of landmarks, RAR employs an attentive-refinement mechanism: the attention LSTM (A-LSTM) and refinement LSTM (R-LSTM) models. At each recurrent stage, A-LSTM implicitly identifies a reliable landmark as the attention center. Following the sequence of attention centers, R-LSTM sequentially refines the landmarks near or correlated with the attention centers and finally provides the ultimate detection results. To further enhance robustness, instead of using the mean shape for initialization, RAR adaptively determines the initialization by selecting from a pool of shape centers clustered from all training shapes. As an end-to-end trainable model, RAR demonstrates superior performance in detecting challenging landmarks in comprehensive experiments, and it also establishes new state-of-the-art results on the 300-W, COFW and AFLW benchmark datasets.
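    Stripped of the LSTMs, the sequential-refinement idea can be sketched as a loop, with refine_fn standing in for the learned R-LSTM update and the confidence scores for the A-LSTM's attention choice (all names are assumptions):

    import numpy as np

    def sequential_refine(shape, confidences, refine_fn, steps=5):
        """shape: (L, 2) landmarks; confidences: (L,); refine_fn(center_idx, shape) -> shape."""
        visited = np.zeros(len(shape), dtype=bool)
        for _ in range(steps):
            scores = np.where(visited, -np.inf, confidences)  # most reliable landmark not yet used
            center = int(scores.argmax())
            shape = refine_fn(center, shape)                  # refine landmarks near/correlated with the center
            visited[center] = True
        return shape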
  • Conference Paper
    Face alignment, the task of finding the locations of a set of facial landmark points in an image of a face, is useful in a wide range of application areas. Face alignment is particularly challenging when there are large variations in pose (in-plane and out-of-plane rotations) and facial expression. To address this issue, we propose a cascade in which each stage consists of a mixture of regression experts. Each expert learns a customized regression model that is specialized to a different subset of the joint space of pose and expressions. The system is invariant to a predefined class of transformations (e.g., affine), because the input is transformed to match each expert’s prototype shape before the regression is applied. We also present a method to include deformation constraints within the discriminative alignment framework, which makes our algorithm more robust. Our algorithm significantly outperforms previous methods on publicly available face alignment datasets.
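    One stage of such a cascade can be sketched as follows: align the current shape to an expert's prototype with a least-squares similarity transform, apply the expert's regressor in that canonical frame, and map the result back. Reflections are ignored for brevity and the regressor is a placeholder; this is a sketch of the idea, not the paper's implementation.

    import numpy as np

    def similarity_fit(src, dst):
        """Least-squares similarity transform (s, R, t) mapping src (L,2) onto dst (L,2)."""
        mu_s, mu_d = src.mean(0), dst.mean(0)
        S, D = src - mu_s, dst - mu_d
        U, sig, Vt = np.linalg.svd(D.T @ S)
        R = U @ Vt                          # optimal rotation (reflection case ignored)
        s = sig.sum() / (S ** 2).sum()      # optimal isotropic scale
        return s, R, mu_d - s * (R @ mu_s)

    def expert_stage(shape, prototype, expert_regressor):
        s, R, t = similarity_fit(shape, prototype)
        canon = s * (shape @ R.T) + t           # shape in the expert's canonical frame
        canon = canon + expert_regressor(canon)  # expert's learned shape update
        return ((canon - t) / s) @ R            # map the refined shape back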
  • We present an approach to simultaneously solve the two problems of face alignment and 3D face reconstruction from an input 2D face image of arbitrary pose and expression. The proposed method iteratively and alternately applies two sets of cascaded regressors, one for updating the 2D landmarks and the other for updating the reconstructed pose-expression-normalized (PEN) 3D face shape. The 3D face shape and the landmarks are correlated via a 3D-to-2D mapping matrix. In each iteration, an adjustment to the landmarks is first estimated via a landmark regressor, and this landmark adjustment is also used to estimate a 3D face shape adjustment via a shape regressor. The 3D-to-2D mapping is then computed based on the adjusted 3D face shape and 2D landmarks, and it further refines the 2D landmarks. An effective algorithm is devised to learn these regressors from a training dataset of paired annotated 3D face shapes and 2D face images. Compared with existing methods, the proposed method can fully automatically generate PEN 3D face shapes in real time from a single 2D face image and locate both visible and invisible 2D landmarks. Extensive experiments show that the proposed method achieves state-of-the-art accuracy in both face alignment and 3D face reconstruction, and benefits face recognition owing to its reconstructed PEN 3D face shapes.
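    A hedged sketch of the alternating loop described above, with placeholder callables for the two learned regressor sets and a least-squares affine estimate standing in for the 3D-to-2D mapping matrix (all names are assumptions):

    import numpy as np

    def alternate_fit(image, lms2d, shape3d, lm_regressor, shape_regressor, lm_idx, iters=3):
        """lms2d: (L, 2); shape3d: (N, 3); lm_idx: 3D vertex indices of the L landmarks."""
        for _ in range(iters):
            delta2d = lm_regressor(image, lms2d)           # landmark adjustment from image features
            lms2d = lms2d + delta2d
            shape3d = shape3d + shape_regressor(delta2d)   # 3D shape adjustment driven by the 2D update
            X = np.hstack([shape3d[lm_idx], np.ones((len(lm_idx), 1))])  # homogeneous 3D landmark vertices
            P, *_ = np.linalg.lstsq(X, lms2d, rcond=None)                # (4, 2) affine 3D-to-2D mapping
            lms2d = X @ P                                  # the mapping further refines the 2D landmarks
        return lms2d, shape3d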