Facial Landmark Detection: a Modern Approach
Stefano Cappellini
contact@stefanocappellini.com
Abstract
The aim of this paper is to evaluate the impact of some of the research advances in Deep Learning on
the everyday development and their usefulness to practitioners. A novel Convolutional Neural Network
model is presented to solve a particular instance of the Facial Landmark Detection Problem, the one
obtained by considering only five landmarks (Left and Right Eye center, Nose tip, Mouth Left and Right
corner). The proposed solution is a single Convolutional Neural Network, with a conceptually simple
architecture, employing some of the most recent techniques: inception modules, residual connections,
batch normalization, dropout and ELU units. This solution has been compared with the work by Sun
et al. [52] and the obtained results clearly show the positive impact of those techniques: it is possible for
a practitioner, without previous domain knowledge and with an average machine, to quickly build a
simple model, with competitive computational performance, able to obtain predictive performance
comparable with that of more complex, state-of-the-art models presented in the last few
years. In addition, the proposed solution has proven to be relatively cheap to train. This reveals the
usefulness of the research advances to practitioners.
1 Introduction
In the last decades the face analysis field has gath-
ered a lot of attention and many big companies
like Google [46] and Facebook [56] are currently
working on it. Within this field a key problem is
the Facial Landmark¹ Detection Problem (FLDP):
given an input image containing a single face and
a set of landmarks to be detected, predict the posi-
tion of each of those landmarks on that face. This
problem is crucial since it can help:
Indirectly: it plays a key role in the face align-
ment task [44] and thus it is useful in the face
verification [38, 51, 56], face recognition [66, 67],
facial emotion recognition [35], face attribute in-
ference [39] and even in the face detection [12].
Directly: it can, for example, help the facial recognition/verification task by indicating relevant areas from which to extract interesting features [4, 18], the facial emotion recognition task [17, 57], and the gender recognition task [34].
It is a useful task by itself too, having applications even in medicine, e.g. to help detect dysmorphic facial signs [11]. Zhang et al. [63] did
a great job summarizing the landmark detection
relevant background. Thus, the development of ef-
fective landmark detection techniques would have
an enormous impact.
Convolutional Neural Networks (CNNs) have gathered incredible importance in the last few years and today many state-of-the-art solutions in the computer vision field employ at least one CNN². CNNs are particularly well suited for working with images because they make it possible to take advantage of their spatial information. In addition, the weight sharing property of the convolutions helps to reduce the model complexity and thus the risk of overfitting. CNNs are also able to implicitly perform feature extraction by themselves.
In this paper a novel Convolutional Neural Net-
work model is presented to solve a particular in-
stance of the Facial Landmark Detection Problem,
the one obtained by considering only five land-
marks (Left and Right Eye center, Nose tip, Mouth
Left and Right corner). The aim is to evaluate the
impact of some of the research advances in Deep
Learning on the everyday development and their
usefulness to practitioners. How does a model, built by a practitioner³ who is not a domain expert, using an average computer⁴, with very little time, but employing some of the most recent techniques, compare with other, older state-of-the-art models presented in the last few years?

This work is released under the Creative Commons Attribution 4.0 License.
¹ A facial landmark is a highly informative point of a face, such as the tip of the nose or the left eye center.
² See the R-CNN family [20, 21, 45], the Facebook triad [19], the work of He et al. [25] and [37, 42, 43] for object detection, FaceNet [46] for face recognition, DeepFace [56] for face verification, and [31, 62] for emotion recognition.
³ Author's note: I am, in fact, a practitioner.
⁴ The term "average computer" is used to indicate a generic off-the-shelf home/enterprise computer, slightly tuned, used in order to solve Deep Learning tasks: all the specialized, expensive Deep Learning boxes are thus excluded. The computer used to build the proposed solution is an old Asus CM6870, with an Intel i7-3770, 20 GB of RAM and a single GTX 1080 Ti.
The proposed solution, whose development
took just one week, is a single Convolutional Neu-
ral Network, with a conceptually simple architec-
ture, employing some of the most recent tech-
niques:
Inception modules [53–55]: they allow the network to learn which operations work best and how to combine them. This helps to reduce the modeling time and to obtain good predictive performance.
Residual connections [26]: they make the building of deep models easier, allowing the network to learn which depth works best and preventing the natural performance degradation due to an excessive depth. They also make the training faster because, as the network gets closer to the right solution, each layer only has to learn a small correction to its inputs.
Batch normalization [30]: a technique that normalizes the inputs of each layer to which it is applied. This makes the development easier, helping to prevent the vanishing and exploding gradient problems and reducing the need for careful weight initialization.
Dropout [50]: a fundamental regularization technique that helps both to prevent overfitting and to make the network more robust, forcing its units to learn redundant representations. This helps speed up the model building phase and makes it possible to build a deeper model.
ELUs [13]: they offer almost all the advantages of ReLU units (sparse representations, no vanishing gradient problem, no gradient explosion) without the dying neuron problem. In addition, they outperformed many of the most used ReLU extensions (LReLU, PReLU). Thus ELUs speed up the training and make it more robust.
Adam [32]: the proposed solution was trained using mini-batch learning and Adam was the optimization algorithm chosen for the loss minimization. It offers both momentum, helping to escape from local minima, and learning rate decay, increasing the convergence rate. In particular, it uses a different learning rate for each parameter and so it is able both to reduce the learning rate for the parameters being frequently updated and to keep it large for all the others.
In order to make such an evaluation meaningful and objective, the solution was compared with the work by Sun et al. [52]. This work was chosen for various reasons. (1) First of all, it was the state of the art in 2013, so almost all the techniques employed in the proposed solution were not yet available at that time, allowing their impact to be evaluated. (2) Second, the Authors proposed a CNN-based solution (referred to in this paper as the "benchmark model"), with a novel, very complex architecture, based on twenty-three CNNs arranged in a three-layer cascade. This is perfect to analyze the impact of the research advances. (3) Third, the Authors made available the data they used and the results obtained by their model and by all the models they compared their solution with [1, 2, 6, 9, 36, 58].
2 Related works
Facial Landmark Detection Problem
As reported in [63], two popular approaches for
solving the FLDP are:
Template fitting approaches, that build face
templates to fit input images: some works worth
citing are the one by Cootes et al. [14], the work
of Yu et al. [61] and the one by Zhu et al. [65],
in which the tasks of face detection, pose estimation and landmark localization are jointly solved using Deformable Parts Models.
Regression based methods, that estimate land-
mark locations explicitly by regression: here the
works range from the Support Vector Regres-
sor [56, 58], random regression forests [15], cas-
caded fern regression [9] up to CNNs [52, 63].
This is the approach followed in this paper.
As reported in [6], many solutions are based on the notion of fiducial point detectors: classifiers or regressors, trained to respond to a specific landmark, that are scanned over the image in a sliding-window fashion [6–8, 16, 24, 28, 59]. Many used AdaBoost [36], SVMs [6, 65] or random forests [5] as component detectors. Constraints can then be added to the relative locations of those detections, in the form of Active Appearance Models [14, 16], conditional probabilities [16], predicted locations or bounding regions.
As regards the CNNs, as said, they have al-
ready been successfully applied to the facial land-
mark detection problem [29, 41, 52, 63, 64].
Surveys and literature for practitioners
There are numerous papers giving precious advice
on how to train and build neural networks [22, 33]
and various interesting surveys on the Facial Land-
mark Detection Problem [10, 60]. A recent trend
is to focus more on Deep Learning practitioners
and the works by L. Smith [48, 49] are clear and
relevant examples of this.
3 Problem statement
The Facial Landmark Detection Problem is a learning problem and, in particular, a supervised multi-output regression problem. The particular instance considered can be formally defined as:

GIVEN: a set of annotated examples $\{(I, P)\}$ where $I$ is a fixed-size image containing a single face and $P$ is a real-valued vector of ten elements, representing the x-y coordinates of five specific landmarks: (1) left eye center [LE], (2) right eye center [RE], (3) nose tip [NO], (4) mouth left corner [LM] and (5) mouth right corner [RM].

PROBLEM: build a model $M$ that, given as input a fixed-size and single-face image $I$, predicts the position of the five landmarks considered, that is, produces as output a vector $\hat{P}$ of ten elements, representing the x-y coordinates of those landmarks.
4 Benchmark, metrics and data
As said in the introduction, the proposed solution
was compared with the work by Sun et al. [52].
The results reported in that paper are presented in table 1. In particular, during the training and the validation, the exact same datasets and metrics employed in [52] were used, ensuring the objectiveness of the comparison.
4.1 Evaluation and validation metric
In order to guarantee the objectiveness of the eval-
uation, it is mandatory to employ a widely adopted
metric. A common choice for the FLDP is the
mean detection error, something that could be
defined as a de facto standard. To define it, it is
necessary to introduce the detection error.
Definition 1. Given a labeled example $(I^{(i)}, P^{(i)})$, let $(x_j, y_j) \in P^{(i)}$ be the correct coordinates of the $j$-th landmark and let $(x'_j, y'_j)$ be the corresponding predictions. The detection error for that particular example and for that specific landmark is defined as:

$$\mathrm{err}^{(i)}_j = \frac{\sqrt{(x_j - x'_j)^2 + (y_j - y'_j)^2}}{\text{inter-ocular distance of } I^{(i)}}$$

where the inter-ocular distance is the Euclidean distance between the true left eye center and the true right eye center for that particular input image $I^{(i)}$.
Definition 2 (Mean Detection Error (MDE)). Given a set of labeled examples $S = \{(I^{(i)}, P^{(i)})\}$, the mean detection error for the $j$-th landmark is the value obtained by averaging the detection errors for that landmark over all examples:

$$\overline{\mathrm{err}}_j = \frac{1}{|S|} \sum_{(I^{(i)}, P^{(i)}) \in S} \mathrm{err}^{(i)}_j$$
It is worth noting that the inter-ocular distance
is not the only normalization factor possible and
probably it is not the best: as pointed out by [52],
it biases the detection errors, since it tends to be
smaller when the head is turned, making the cor-
responding error bigger.
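To make the metric concrete, a minimal NumPy sketch of the Mean Detection Error is shown below. This is not the author's code; the array shapes and the landmark ordering are assumptions consistent with the problem statement.

```python
import numpy as np

def mean_detection_error(truth, pred):
    """Mean Detection Error per landmark.

    truth, pred: arrays of shape (n_examples, 5, 2) with the true and
    predicted (x, y) coordinates of the five landmarks, ordered as
    LE, RE, NO, LM, RM. Returns an array of shape (5,) holding the MDE
    of each landmark over the whole set.
    """
    # Euclidean distance between prediction and ground truth, per landmark
    dist = np.linalg.norm(truth - pred, axis=2)                       # (n, 5)
    # Inter-ocular distance: distance between the two true eye centers
    inter_ocular = np.linalg.norm(truth[:, 0] - truth[:, 1], axis=1)  # (n,)
    errors = dist / inter_ocular[:, None]   # detection errors of Definition 1
    return errors.mean(axis=0)              # average over the examples
```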
4.2 Available data
The original available data is organized in five datasets:

A training set and a validation set composed respectively of 10000 and 3466 images, obtained by randomly partitioning a set of 13466 unconstrained images among which (1) 5590 images are taken from the LFW⁵ dataset and (2) 7876 images are taken from the Web.

The BioID⁶ test set, containing 1521 images of 23 subjects, presenting a big variability of illumination, background and face size. This dataset is slightly easier than the others because every image shows the frontal view of a single face.

Two test sets, referred to in this paper as LFPW₁ and LFPW₂, containing respectively 249 and 781 unconstrained images taken from the LFPW⁷ dataset.

As an example, five images are shown in figure 1.
Data collection
The training and validation sets are ready to use, each being defined in a separate CSV file in which each line contains: (1) the image path, (2) the left x, right x, top y and bottom y positions of the bounding box returned by the face detection algorithm used in [52], fundamental in order to specify on which part of the image (and on which face) to focus, and (3) the ten x, y coordinates of the five landmarks considered, in this order: left eye center, right eye center, nose tip, mouth left corner and mouth right corner.
The test sets require some additional work. First of all, each test set definition is split in two different files, one containing the bounding box information and the other one containing the ten landmark coordinates. Each file pair was thus joined, performing a full outer join using the image path as the key. Every missing value was replaced with a −1 placeholder. Second, in [52], the Authors only mentioned two test sets. In order to obtain the exact same setting, the two LFPW₁ and LFPW₂ test sets were merged together into
⁵ Labeled Faces in the Wild - http://vis-www.cs.umass.edu/lfw/
⁶ https://www.bioid.com/About/BioID-Face-Database
⁷ Labeled Face Parts in the Wild - http://neerajkumar.org/databases/lfpw/
Test set Model LE% RE% N% LM% RM%
BioID [52] 1.73 1.47 2.27 2.18 2.45
BioID [36] 4.11 3.88 7.03 9.06 9.23
BioID [58] 7.11 7.77 11.10 9.42 10.65
BioID [1](*) 4.13 3.65 4.95 4.25 5.22
BioID [2](*)(**) - - - 7.12 7.36
LFPW [52] 2.04 1.94 2.13 2.51 2.34
LFPW [36] 6.04 6.02 9.42 9.44 9.34
LFPW [58] 7.55 8.59 14.09 12.48 15.59
LFPW [1](*) 5.44 4.26 5.39 4.51 4.73
LFPW [2](*)(**) - - - 7.95 7.77
Table 1: Results reported by [52]. The values are the mean detection errors expressed as percentages. (*) Commercial software. (**) Only the mouth left and right corner results were available. The rows corresponding to the benchmark model are those labeled [52].
Figure 1: Five example images from the datasets
a single one, referred to in this paper as the "LFPW test set".
Data preprocessing
There is some data missing in the two test sets: the bounding box information is missing for 50 images from the BioID test set (3.3% of the total) and for 47 images from the LFPW test set (4.6% of the total). All the entries for which some information is missing have been discarded, following the same approach as in [52]. In addition, in [52], the Authors focused only on the test images that could be handled by all the considered models. An identical filtering was performed, obtaining two test sets containing 1341 (BioID) and 601 (LFPW) images. After those steps, the processing pipeline described below was applied to all the entries. For each entry:
1. It is converted to grayscale, reducing the model
complexity and thus the training time. In addi-
tion, a model trained on grayscale images main-
tains the same performance even on RGB images
(after the conversion).
2. Taking advantage of the fact that all the bounding boxes are perfect squares (with a tolerance of ±1), it is cropped using a square bounding box obtained from the original one by using the smallest of its two sides. This step produces images containing a single face (with some exceptions) and removes all the unnecessary background, making the problem easier to solve and well defined.
3. The cropped image is resized to a common size d × d. In particular, d = 56 is the value that worked the best.
4. Its landmarks are updated: given k and h the bounding box left x and top y, s the cropped image size and d the desired size, each coordinate is updated as follows: $x' = (x - k)/(s/d)$, $y' = (y - h)/(s/d)$ (see the sketch below). The translation and scale invariance of the Mean Detection Error allow the original landmarks to be discarded.
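The crop/resize/remap logic of steps 1–4 can be summarized with the following OpenCV/NumPy sketch. It is a reconstruction under the assumptions above, not the author's original pipeline; in particular, the bounding box is assumed to be given as (left, top, right, bottom).

```python
import cv2
import numpy as np

def preprocess(image, box, landmarks, d=56):
    """Grayscale, crop and resize one entry; remap its landmarks.

    image: BGR or grayscale array; box: (left, top, right, bottom) of the
    face bounding box (assumed convention); landmarks: array of shape
    (5, 2) with the (x, y) pairs. Returns the d x d grayscale crop with
    values in [0, 1] and the remapped landmarks.
    """
    if image.ndim == 3:
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    left, top, right, bottom = box
    s = min(right - left, bottom - top)          # side of the square crop
    crop = image[top:top + s, left:left + s]
    crop = cv2.resize(crop, (d, d)).astype(np.float32) / 255.0
    # x' = (x - k)/(s/d), y' = (y - h)/(s/d) with k = left, h = top
    scale = s / d
    new_landmarks = (landmarks - np.array([left, top])) / scale
    return crop, new_landmarks
```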
Finally, a data augmentation step was per-
formed, applying three different transformations to
each training image: (1) Histogram Equalization,
(2) Gaussian blur and (3) Image darkening. The
application of such transformations increased the
number of training images from 10000 to 40000.
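A minimal OpenCV sketch of these three transformations, applied to an 8-bit grayscale training image, is given below. The blur kernel size and the darkening factor are illustrative assumptions; the paper does not report the exact values used.

```python
import cv2
import numpy as np

def augment(image):
    """Return the three augmented variants of a grayscale uint8 image:
    histogram equalization, Gaussian blur and darkening (parameters are
    illustrative assumptions)."""
    equalized = cv2.equalizeHist(image)
    blurred = cv2.GaussianBlur(image, (5, 5), 0)
    darkened = np.clip(image.astype(np.float32) * 0.6, 0, 255).astype(np.uint8)
    return [equalized, blurred, darkened]
```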
It is important to notice that, although a manual inspection of a random sample of the available data indicates that sideways images are underrepresented in the training set, the data augmentation was not used to create a more balanced set.
No further preprocessing steps were applied be-
cause all the training images tended to be, on aver-
age, correctly exposed. In addition, the grayscale
conversion implicitly performs a normalization of
all the input pixels, forcing their values to lie be-
tween 0 and 1.
5 Proposed solution
The proposed solution, built entirely in Tensor-
Flow in just one week, is described in table 2. It takes as input images of size 56 × 56 pixels and it was trained using a batch size of 200 in order to guarantee some stability to the learning process.

Figure 2: The image pipeline used during dataset preprocessing.
Convolutions
As regards the initialization, the biases were initialized to a fixed, small positive value (e⁻⁵) to push the ELUs towards their positive side, while all the weights were initialized using the initialization proposed in the work by He et al. [27]. Although many authors, for example [52], suggest the use of unshared convolutions, all the employed convolutions are shared-weight convolutions. There are two reasons for this decision: (1) the weight sharing keeps the model simpler and (2) TensorFlow does not offer, at the moment of writing, unshared-weight convolutions, so they are not immediately accessible to a practitioner.
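For illustration, the initialization choices just described could be expressed as follows in Keras-style TensorFlow code. This is a sketch, not the author's original graph code, and the filter count shown is arbitrary.

```python
import numpy as np
import tensorflow as tf

# He-style weight initialization [27] and a fixed, small positive bias
# (e^-5) that pushes the ELUs towards their positive side; shared-weight
# convolution, as discussed above.
conv_layer = tf.keras.layers.Conv2D(
    filters=64, kernel_size=3, strides=1, padding='same',
    activation='elu',
    kernel_initializer='he_normal',
    bias_initializer=tf.keras.initializers.Constant(np.exp(-5)))
```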
Batch normalization
The batch normalization layers have been applied to all the convolutions (even the ones in the inception modules). In particular, they have been applied after the nonlinearities, and not before them as prescribed in [30]. This is a recent trend [40, 47]⁸ that seems to work better, as confirmed also by the results obtained during the development. The TensorFlow batch norm required some tuning in order to work properly: in particular, the fused parameter was set to True to speed up the computation and the decay parameter of the moving average was set to 0.9 to increase its convergence rate.
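Putting the last two subsections together, a hedged Keras-style sketch of the basic convolutional block (convolution, ELU, then batch normalization) might look as follows; momentum=0.9 mirrors the moving-average decay described above, and the paper additionally enabled TensorFlow's fused batch-norm implementation.

```python
import numpy as np
import tensorflow as tf

def conv_bn(x, filters, kernel_size=3, stride=1):
    """Shared-weight convolution (He-initialized weights, small positive
    bias e^-5), ELU nonlinearity, and then batch normalization placed
    after the nonlinearity, as described in the text. Sketch only, not
    the author's original graph code."""
    x = tf.keras.layers.Conv2D(
        filters, kernel_size, strides=stride, padding='same',
        activation='elu', kernel_initializer='he_normal',
        bias_initializer=tf.keras.initializers.Constant(np.exp(-5)))(x)
    # Moving-average decay of 0.9; the fused implementation was also
    # enabled in the original TensorFlow code.
    return tf.keras.layers.BatchNormalization(momentum=0.9)(x)
```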
The pyramid shape pattern
The first part of the network is composed of traditional layers, following the advice given in [54]. In particular, the first operation is a convolution with 64 filters having a receptive field of 3×3 and a unit stride. Its depth allows the network to extract much information from the input images and the small receptive field is useful in order to focus mostly on local features. After that, a pooling layer with a receptive field of (2,2) and a stride of 2 is applied in order to reduce the memory requirements of the remaining network. It is interesting to notice that the depth of the feature maps is monotonically non-decreasing and that it is increased whenever the feature maps get smaller, preventing or limiting the information loss⁹. This is a design pattern known as pyramid shape [49].
Inception modules
All the inception modules follow the architecture
described in [55]. In particular:
1. The 5×5 convolutions have been replaced by two 3×3 convolutions to increase the performance.
2. It is possible to use an arbitrary stride, something that allows both to reduce the feature map size and to increase their depth, reducing the information loss.
3. The 1×1 convolutions performed before the 3×3 and 5×5 branches correctly reduce the depth of the feature maps, increasing the computational performance.

In addition to the original inception modules, a 1×1 convolution is applied to the output of each module. This modification has proven to work well because: (1) it lets the network decide how to combine the feature maps produced, (2) it adds more nonlinearities and (3) it makes it easier to decide the depth of the feature maps produced as output. It is worth noting that there are no pooling layers following the inception modules: they were replaced with an inception module using a stride of 2, following what is stated in [55]. In this way the information loss is reduced even further. In addition, every couple of consecutive size reductions is separated by some layers in order to extract even more information and features at each scale.
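Under these assumptions, the inception module of figure 3 could be sketched as follows, reusing the conv_bn helper defined earlier. This is a reconstruction from the text and table 2, not the released code.

```python
import tensorflow as tf

def inception(x, d1, r3, d3, r5, d5, dp, dout, stride=1):
    """Inception module of figure 3. d1: 1x1 branch depth; r3/d3: 1x1
    reduction and 3x3 branch depths; r5/d5: reduction and "5x5" branch
    depths (the 5x5 is realized as two stacked 3x3 convolutions); dp:
    depth of the 1x1 following max pooling; dout: depth of the final 1x1
    applied to the concatenated branches."""
    b1 = conv_bn(x, d1, 1, stride)
    b3 = conv_bn(conv_bn(x, r3, 1), d3, 3, stride)
    b5 = conv_bn(conv_bn(x, r5, 1), d5, 3)
    b5 = conv_bn(b5, d5, 3, stride)
    bp = tf.keras.layers.MaxPool2D(pool_size=3, strides=stride,
                                   padding='same')(x)
    bp = conv_bn(bp, dp, 1)
    merged = tf.keras.layers.Concatenate()([b1, b3, b5, bp])
    return conv_bn(merged, dout, 1)   # extra 1x1 combining the branches
```

With the numbers of table 2, a "96, 30, 96, 30, 96, 96, 96" module corresponds to inception(x, 96, 30, 96, 30, 96, 96, 96).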
Residual connections
Inception modules have been combined with resid-
ual connections, following an approach similar to
the one presented in [53]. In particular, each residual inception element listed in table 2 is a composite block that, given an input X: (1) first applies an inception module, producing an output O, (2) then employs the residual connection, adding the input X to the output O, obtaining O + X, and (3) finally applies an ELU nonlinearity. After the nonlinearity no batch normalization is applied, as suggested in [53]. All the residual connections thus have a skip length equal to one: this was the value that worked the best. This is something pleasing because it is usually advisable to keep the residual connections short, in order to avoid some weird behavior, as stated in [49] and shown in [3]. It is worth noting that, as implemented here, the addition cannot be used when the input size and the output size differ, that is, when a stride greater than one is used. In the original paper [53] some solutions to this problem are presented.

⁸ http://bit.ly/2vxtKwE
⁹ The term information is used informally to indicate the number of neurons contained in a set of feature maps, that is: I = WIDTH × HEIGHT × DEPTH.

Figure 3: The inception module used in this project. See table 2 for an explanation of the numbers.
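A minimal sketch of this residual inception element, built on the inception helper above, might look like this. It is a sketch only and assumes the module output depth matches the input depth, as in the residual rows of table 2.

```python
import tensorflow as tf

def residual_inception(x, depths):
    """Residual inception element of table 2: an inception module (whose
    output depth must equal the input depth), a skip connection adding
    the input back to the module output, and a final ELU; no batch
    normalization follows the addition, as described in the text."""
    o = inception(x, *depths)                    # stride 1, depths as in table 2
    return tf.keras.layers.Activation('elu')(tf.keras.layers.Add()([x, o]))
```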
Fully connected layers
Extracting all the information with a good convolutional network is useless if the fully connected layers are not complex enough. This is why the solution employs four fully connected layers. It is interesting to note that dropout has been applied only to the fully connected layers (with a keep probability equal to 0.85). This is due to the fact that there it proved to be very effective, while in other positions it simply did not work well. In particular, (1) applied to the input layer, it produced mixed results and (2) applied to the convolutional layers, it gave unsatisfying results. This is probably due to the weight sharing that reduces the number of parameters of those layers, making them less likely to overfit.
Depth
The model is composed of a single, very deep Convolutional Neural Network. Depth is a key factor in increasing the expressive power of a network and thus its ability to solve more complex problems. This is why the solutions proposed in the literature tend to get deeper over the years [49]. The FLDP is a difficult multi-output regression task and thus, in order to effectively solve it with a single network, it is necessary to go deep. However, an excessive depth may even hurt the performance. To prevent this problem, each inception module employed in the last part of the network has a residual connection, creating a long chain of skip links that allows the network to learn the depth that works the best. It is worth noting that, although deep, the proposed solution has a conceptually very simple architecture, made of a small number of blocks repeated many times, following a common design pattern [49].
Training
The MDE is probably not a good metric to be used during the training for at least two reasons: (1) the square root could be numerically unstable, and (2) it would make the computation slower. This is why the proposed solution was trained using the simple Mean Squared Error between the target vectors $P$ and the predicted ones $\hat{P}$. However, the validation error was computed using the Mean Detection Error, in order to have an estimate of the generalization ability of the models.

The training made use of mini-batch gradient descent and, in particular, of the Adam [32] optimizer. This gave a lot of trouble during the development, with the training performing some unexpected, huge jumps. In order to fix this problem, the epsilon parameter value was changed from the default 10⁻⁸ to a bigger value of 0.1.
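A hedged sketch of this training setup, assuming a tf.keras.Model that maps a 56×56×1 image to the ten target coordinates, is shown below; the partial-validation and early-stopping details discussed elsewhere in the paper are omitted for brevity.

```python
import tensorflow as tf

def train(model, x_train, y_train, x_val, y_val):
    """Training-setup sketch: plain MSE loss, Adam with epsilon raised
    from its default (1e-8) to 0.1, mini-batches of 200 and 200 epochs,
    as described in the text. `model` is assumed to be a tf.keras.Model
    mapping a 56x56x1 image to the 10 target coordinates."""
    model.compile(optimizer=tf.keras.optimizers.Adam(epsilon=0.1),
                  loss='mse')
    return model.fit(x_train, y_train, batch_size=200, epochs=200,
                     validation_data=(x_val, y_val))
```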
The downside of using a bigger epsilon is that it makes all the effective learning rates smaller, making the already long training time even longer. Two small changes were thus made in order to cope with this additional issue. First, during the training, after each epoch, in order to evaluate the generalization ability of the model, it was tested only on the first 500 entries of the validation set, instead of the entire set. This decision actually sped the training up, but it was probably not such a good choice because it heavily biases the evaluations performed on the validation set: the first 500 entries are all from the LFW dataset. Second, to reduce the training time even further, the validation performed after each epoch considered the left eye landmark only. Again, this choice probably affected the choice of the final model, but it greatly reduced the training time, allowing more complex models to be considered.
6 Inspection
The proposed solution was trained for 200 epochs
and reached a training loss equal to 0.93 and a
validation loss of 2.52% for the left eye center land-
mark. The training used the early stopping tech-
nique: at the end of each epoch, the model was
evaluated on the partial validation set considered
and a snapshot was taken every time it outper-
formed the previous best result. At the end of
the training, the final snapshot (the one that per-
formed the best) was taken as the proposed solu-
tion.
What                                                       Output size
Input                                                      56 × 56 × 1
Convolution: 64 (3,3) filters with stride 1                56 × 56 × 64
Max Pooling: receptive field of (2,2) with stride 2        28 × 28 × 64
3 × Convolution: 64 (3,3) filters with stride 1            28 × 28 × 64
Inception: 96, 30, 96, 30, 96, 96, 96 with stride 2        14 × 14 × 96
3 × Inception: 96, 30, 96, 30, 96, 96, 96                  14 × 14 × 96
Inception: 192, 60, 192, 60, 192, 192, 192 with stride 2   7 × 7 × 192
Inception: 192, 60, 192, 60, 192, 192, 192                 7 × 7 × 192
7 × Residual Inception: 192, 60, 192, 60, 192, 192, 192    7 × 7 × 192
Fully connected 1024 + Dropout 0.85                        1024
Fully connected 512 + Dropout 0.85                         512
Fully connected 256 + Dropout 0.85                         256
Output                                                     10

Table 2: Proposed solution architecture. The numbers next to each inception module indicate the depth of the feature maps produced: (1) by the 1×1 branch, (2) by the 1×1 convolution performed before the 3×3 branch, (3) by the 3×3 branch, (4) by the 1×1 convolution performed before the 5×5 branch, (5) by the 5×5 branch, (6) by the 1×1 convolution following the max pooling and (7) by the module. See figure 3.
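For completeness, the whole architecture of table 2 can be assembled from the helpers sketched in section 5. This is a reconstruction from the table, not the released code; the dropout rate of 0.15 corresponds to the keep probability of 0.85.

```python
import tensorflow as tf

def build_model():
    """Assembly of the architecture in table 2, reusing the conv_bn,
    inception and residual_inception sketches given earlier."""
    x_in = tf.keras.Input(shape=(56, 56, 1))
    x = conv_bn(x_in, 64, 3, 1)
    x = tf.keras.layers.MaxPool2D(pool_size=2, strides=2)(x)
    for _ in range(3):
        x = conv_bn(x, 64, 3, 1)
    small = (96, 30, 96, 30, 96, 96, 96)
    big = (192, 60, 192, 60, 192, 192, 192)
    x = inception(x, *small, stride=2)
    for _ in range(3):
        x = inception(x, *small)
    x = inception(x, *big, stride=2)
    x = inception(x, *big)
    for _ in range(7):
        x = residual_inception(x, big)
    x = tf.keras.layers.Flatten()(x)
    for units in (1024, 512, 256):
        x = tf.keras.layers.Dense(units, activation='elu')(x)
        x = tf.keras.layers.Dropout(0.15)(x)   # keep probability 0.85
    out = tf.keras.layers.Dense(10)(x)         # ten landmark coordinates
    return tf.keras.Model(x_in, out)
```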
Figure 4: The five pipelines applied during the robustness evaluation
The model offers competitive computational performance: given a single image as input, the prediction of the five landmarks considered takes an average of 17 ms.
Robustness evaluation
In order to evaluate the robustness of the proposed solution, that is, its ability to perform well even under adverse conditions, a sensitivity analysis of the inputs has been performed. In particular, five different pipelines were independently applied to each validation image (figure 4), keeping track of the model performance. Given an image, each pipeline produces 11 images: the original one and ten others obtained from it by progressively changing a specific image property. The first pipeline increases the image noise, the second the darkness, the third increases the image exposure, the fourth the blurring and the fifth rotates the image. Figure 5 shows how the performance of the proposed solution varies when those properties change. Some interesting considerations can be made. (1) First, the solution is very robust against overexposed images, with the performance remaining almost the same. (2) Second, small perturbations do not heavily affect its performance, something really pleasing and indicating that the model may be considered robust. (3) Third, in plots B, D and E the long term trend seems to follow an exponential growth, suggesting that this robustness is actually limited to small perturbations. (4) Finally, rotations heavily affect the performance of the model, which can be considered robust only up to 8-10 degrees of rotation.
Weight analysis
To analyze the model even further, a weight analysis has been performed. This is an interesting analysis that helps spot overfitting and is also useful in order to check that everything works properly. As regards the proposed solution, the weights and the biases are pleasingly all almost perfectly centered at zero (having a mean equal to 2.01E−5), very well distributed, assuming both negative and positive values, and highly concentrated (having a standard deviation equal to 0.028). Their relatively small size and high concentration suggest that the model was probably not close to overfitting.
Analysis of the errors
Another important step is to analyze the errors committed by the proposed solution, especially on the test set. To this end, for each test image, the Average Detection Error (ADE) was computed, that is, the average of the five Detection Errors (one for each landmark considered) obtained for that particular image. This can be thought of as a rough estimate of the difficulty of the particular example. The ten images with the largest Average Detection Error were then picked. These images are shown in figure 6 together with the predicted landmarks and the committed errors. Unsurprisingly, almost all the selected images present severe occlusions (1, 4, 9) or extreme head poses (3, 6, 7). In addition, some useful considerations can be made:
1. Although the committed absolute error for the seventh image is smaller than the one committed for the eighth image, the bias due to the inter-ocular distance makes the average detection error for the seventh image bigger.
2. The fifth and the eighth faces are not well cen-
tered, increasing the difficulty of the task. Prob-
ably a better face detection algorithm may help
to improve the predictive performance.
3. The third, the sixth and the seventh images do
not come as a surprise: sideways images are un-
derrepresented in the training set and that ex-
plains the poor performance. A possible solution
would be to generate more synthetic sideways
images during the data augmentation phase, in
order to build a more balanced training set.
4. The mouth landmarks are often wrong (first, second and ninth image). This suggests that probably more training examples presenting mouth occlusions are needed.
7 Results
7.1 Evaluation on the BioID and the
LFPW test sets
The proposed solution was tested on both the test
sets built during the preprocessing phase. In par-
ticular, it was evaluated on every single image from
the two test sets independently, collecting all the
Detection Errors. The results obtained, along with the ones presented in [52], are shown in figures 7a and 7b.

The proposed solution outperformed all the models the benchmark solution had been compared with, on both test sets and for all five landmarks considered. In addition, its performance is very close, although worse, to the one obtained by the benchmark model. The superiority of the benchmark solution is not surprising, given the huge difference in complexity of the two models. However, the difference between the two models is not that big: at most 0.76 percentage points. Probably, with more time available, an even better solution could be found.
A deeper analysis
In order to assess the statistical significance of the difference between the proposed solution and the benchmark model, the benchmark model was also evaluated on every single entry from the two test sets independently, obtaining the corresponding sets of Detection Errors. Then, for each landmark and for each test set, a one-tailed, paired t-test, defined as follows, was performed:

$H_0: \mathrm{MDE}_{\mathrm{benchmark}} - \mathrm{MDE}_{\mathrm{final}} = 0$
$H_1: \mathrm{MDE}_{\mathrm{benchmark}} - \mathrm{MDE}_{\mathrm{final}} < 0$

The results obtained, shown in table 3, clearly indicate that the difference is statistically significant.
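A SciPy sketch of this test, applied to the per-image detection errors of one landmark on one test set, is given below. The paper does not state which software was used for the test, so this is only an illustration of the stated hypotheses.

```python
from scipy import stats

def paired_one_tailed_ttest(errors_benchmark, errors_final):
    """One-tailed paired t-test on two equal-length sequences of per-image
    detection errors, with H1: mean(benchmark - final) < 0, i.e. the
    benchmark model has the smaller mean error."""
    t, p_two_sided = stats.ttest_rel(errors_benchmark, errors_final)
    # ttest_rel is two-sided: halve the p-value and check the sign of t
    # to obtain the one-sided p-value for the "less than" alternative.
    p_one_sided = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
    return t, p_one_sided
```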
7.2 Evaluation on the LFPW₁ test set
Following the same approach as in [52], the proposed solution was also compared, on the single LFPW₁ test set only, with two more recent models: (1) one, proposed by Cao et al. [9], was published in 2012 and employs a boosted ensemble of regressors to model the face shapes; (2) the other, proposed by Belhumeur et al. [6], was published in 2011 and uses a consensus of many models in order to produce its predictions. Thus, both models are far more complex than the proposed solution. The results of this comparison are shown in
figure 8. The proposed solution, again, obtained
significant results: (1) first, its performance was
very close to the one obtained by the other mod-
els (the gap with the benchmark model almost re-
mained unchanged) and, (2) second, it was also
able to outperform the two additional models on
some landmarks (the model by Belhumeur et al. [6]
on the nose, mouth left and right corners and the
model by Cao et al. [9] on the nose and mouth
right corner).
8 Conclusions
In this work a novel Convolutional Neural Net-
work model was presented to solve a specific in-
stance of the Facial Landmark Detection Problem,
the one obtained by considering only five specific
landmarks. The aim was to evaluate the impact
of some of the research advances on the everyday
development and their usefulness to practitioners.
In particular, the aim was to understand how a
model, built by a practitioner who is not a do-
main expert, using an average computer, but em-
ploying some of the latest research advancements
compares with other, older, state-of-the-art mod-
els presented in the last few years. The proposed solution is a single, deep Convolutional Neural Network with a conceptually simple architecture, composed of a small number of building blocks repeated multiple times, and employing some of the most recent techniques: inception modules, residual connections, batch normalization, dropout and ELUs. To make such an evaluation meaningful, the proposed solution was compared with the work, published in 2013, by Sun et al. [52]. In that paper a very specialized, complex model (referred to as the benchmark solution in this paper), composed of twenty-three CNNs arranged in a three-layer cascade architecture, is presented.
The obtained results clearly show the positive impact of the employed techniques: it is possible, without being a domain expert and with an average computer, to quickly develop a solution able to obtain performance comparable to that obtained by state-of-the-art models built in the last few years. In particular, the proposed solution is conceptually simple, a lot more so than all the models considered in the comparison, it has proven to
be relatively cheap to train, and it obtained competitive computational performance (17 ms per image to make its predictions).

Figure 5: How the performance of the proposed solution changed when the pipelines were applied.

Figure 6: The ten test images with the largest ADE. The arrows indicate how to correct the predictions.
Figure 7: Comparison of the proposed solution with the benchmark model [52], the work by Liang et al. [36], the
work by Valstar et al. [58], the Luxand Face SDK [1] and the Microsoft Research Face SDK [2].
                 LFPW                     BioID
Landmark    Value     P-value        Value     P-value
LE          3.20      7.5E-4         6.035     2.03E-9
RE          4.66      2.00E-6        17.67     2.95E-63
NO          7.23      7.35E-13       12.31     2.23E-33
LM          4.92      5.56E-7        14.06     2.65E-42
RM          6.21      4.98E-10       5.28      7.6E-7

Table 3: Results of the t-test performed to compare the benchmark with the proposed solution.
Figure 8: Comparison of the proposed solution with the benchmark model [52] and with the two additional models considered [6, 9]. The numeric labels for these two additional models are missing because their exact results were not available: the values were empirically inferred from the original papers and from [52]. Thus they should be taken with care.
As regards the predictive performance, it completely outperformed many state-of-the-art models and commercial software packages presented in the last few years [1, 2, 36, 58], it did better than the two other, more recent models on some landmarks [6, 9], and its performance was also relatively close to the one obtained by the benchmark model [52]. The superiority of the benchmark solution is not surprising, given its more complex architecture, and it is thus a reasonable trade-off.
9 Future work
I think this paper clearly reveals the importance
of evaluating (and maybe comparing) the useful-
ness of the research advances to practitioners and
their impact on the everyday development, from
various points of view. This is surely an interest-
ing and overlooked research area that should be
explored further.
As regards the proposed solution, it would be interesting to tune it in order to improve its predictive performance. For example, it would be nice (1) to somehow incorporate the inter-ocular distance into the training loss, in order to make the model aware of the bias it creates, and (2) to fix the bias of the evaluation performed on the validation set, maybe by shuffling the validation set before use.

In addition, other research advances are worth considering, for example the impact of Adversarial Training [23] on the robustness of the model.
References
[1] http://www.luxand.com/facesdk/.
[2] http://research.microsoft.com/en-us/projects/facesdk/. Link currently not working.
[3] G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes. ArXiv
e-prints, Oct. 2016.
[4] A. Albiol, D. Monzo, A. Martin, J. Sastre, and A. Albiol. Face recognition using hog–ebgm. Pattern
Recognition Letters, 29(10):1537–1543, 2008.
[5] B. Amberg and T. Vetter. Optimal landmark detection using shape models and branch and bound.
In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 455–462. IEEE, 2011.
[6] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, et al. Localizing parts of faces using a consensus
of exemplars. In Proc. IEEE Conf. Computer Vision and Pattern Recognition. Citeseer, 2011.
[7] M. C. Burl, T. K. Leung, and P. Perona. Face localization via shape statistics. In International
Workshop on Automatic Face and Gesture Recognition, pages 154–159. University of Zurich, 1995.
[8] P. Campadelli, R. Lanzarotti, and G. Lipori. Automatic facial feature extraction for face recognition.
In Face recognition. InTech, 2007.
[9] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In 2012 IEEE
Conference on Computer Vision and Pattern Recognition, pages 2887–2894, June 2012.
[10] O. Çeliktutan, S. Ulukaya, and B. Sankur. A comparative study of face landmarking techniques.
EURASIP Journal on Image and Video Processing, 2013(1):13, 2013.
[11] J. J. Cerrolaza, A. R. Porras, A. Mansoor, Q. Zhao, M. Summar, and M. G. Linguraru. Identification
of dysmorphic syndromes using landmark-specific local texture descriptors. In Biomedical Imaging
(ISBI), 2016 IEEE 13th International Symposium on, pages 1080–1083. IEEE, 2016.
[12] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In
European Conference on Computer Vision, pages 109–122. Springer, 2014.
[13] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential
linear units (elus). CoRR, abs/1511.07289, 2015.
[14] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on
pattern analysis and machine intelligence, 23(6):681–685, 2001.
[15] T. F. Cootes, M. C. Ionita, C. Lindner, and P. Sauer. Robust and accurate shape model fitting
using random forest regression voting. In European Conference on Computer Vision, pages 278–291.
Springer, 2012.
[16] D. Cristinacce, T. F. Cootes, and I. M. Scott. A multi-stage approach to facial feature detection. In
BMVC, volume 1, pages 277–286, 2004.
[17] D. Datcu and L. Rothkrantz. Machine learning techniques for face analysis. In Proceedings of
EUROMEDIA Conference, 2005.
[18] O. Déniz, G. Bueno, J. Salido, and F. De la Torre. Face recognition using histograms of oriented
gradients. Pattern Recognition Letters, 32(12):1598–1603, 2011.
[19] P. Dollar. Learning to segment. https://research.fb.com/learning-to-segment/, 2016. Ac-
cessed 07-23-2017.
[20] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[21] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object
detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
[22] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics,
pages 249–256, 2010.
[23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and
Y. Bengio. Generative Adversarial Networks. ArXiv e-prints, June 2014.
[24] N. Gourier, D. Hall, and J. L. Crowley. Facial features detection robust to pose, illumination and
identity. In Systems, Man and Cybernetics, 2004 IEEE International Conference on, volume 1, pages
617–622. IEEE, 2004.
[25] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
[26] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR,
abs/1512.03385, 2015.
[27] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level perfor-
mance on imagenet classification. CoRR, abs/1502.01852, 2015.
[28] E.-J. Holden and R. Owens. Automatic facial point detection. In Proc. Asian Conf. Computer
Vision, volume 2, page 2, 2002.
[29] Q. Hou, J. Wang, L. Cheng, and Y. Gong. Facial landmark detection via cascade multi-channel
convolutional neural network. In Image Processing (ICIP), 2015 IEEE International Conference on,
pages 1800–1804. IEEE, 2015.
[30] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal
covariate shift. CoRR, abs/1502.03167, 2015.
[31] B.-K. Kim, H. Lee, J. Roh, and S.-Y. Lee. Hierarchical committee of deep cnns with exponentially-
weighted decision fusion for static facial expression recognition. In Proceedings of the 2015 ACM on
International Conference on Multimodal Interaction, ICMI ’15, pages 427–434, New York, NY, USA,
2015. ACM.
[32] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[33] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks
of the trade, pages 9–48. Springer, 2012.
[34] J. Lemley, S. Abdul-Wahid, D. Banik, and R. Andonie. Comparison of recent machine learning
techniques for gender recognition from facial images. 2016.
[35] G. Levi and T. Hassner. Emotion recognition in the wild via convolutional neural networks and
mapped binary patterns. In Proceedings of the 2015 ACM on International Conference on Multimodal
Interaction, pages 503–510. ACM, 2015.
[36] L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search.
Computer Vision–ECCV 2008, pages 72–85, 2008.
[37] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot
multibox detector. CoRR, abs/1512.02325, 2015.
[38] C. Lu and X. Tang. Surpassing human-level face verification performance on lfw with gaussianface.
In AAAI, pages 3811–3819, 2015.
[39] P. Luo, X. Wang, and X. Tang. A deep sum-product architecture for robust facial attributes analysis.
In Proceedings of the IEEE International Conference on Computer Vision, pages 2864–2871, 2013.
[40] D. Mishkin, N. Sergievskiy, and J. Matas. Systematic evaluation of convolution neural network
advances on the imagenet. Computer Vision and Image Understanding, 2017.
[41] R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face
detection, landmark localization, pose estimation, and gender recognition. CoRR, abs/1603.01249,
2016.
[42] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time
object detection. CoRR, abs/1506.02640, 2015.
[43] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. CoRR, abs/1612.08242, 2016.
[44] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692,
2014.
[45] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with
region proposal networks. CoRR, abs/1506.01497, 2015.
[46] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and
clustering. CoRR, abs/1503.03832, 2015.
[47] A. Shah, E. Kadam, H. Shah, and S. Shinde. Deep residual networks with exponential linear unit.
CoRR, abs/1604.04112, 2016.
[48] L. N. Smith. Best practices for applying deep learning to novel applications. arXiv preprint
arXiv:1704.01568, 2017.
[49] L. N. Smith and N. Topin. Deep convolutional neural network design patterns. CoRR,
abs/1611.00847, 2016.
[50] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple
way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–
1958, 2014.
[51] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-
verification. In Advances in neural information processing systems, pages 1988–1996, 2014.
[52] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3476–3483,
2013.
[53] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual
connections on learning. CoRR, abs/1602.07261, 2016.
[54] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[55] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture
for computer vision. CoRR, abs/1512.00567, 2015.
[56] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level perfor-
mance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1701–1708, 2014.
[57] U. Tariq, K.-H. Lin, Z. Li, X. Zhou, Z. Wang, V. Le, T. S. Huang, X. Lv, and T. X. Han. Emotion
recognition from an ensemble of features. In Automatic Face & Gesture Recognition and Workshops
(FG 2011), 2011 IEEE International Conference on, pages 872–877. IEEE, 2011.
[58] M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression
and graph models. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference
on, pages 2729–2736. IEEE, 2010.
[59] D. Vukadinovic and M. Pantic. Fully automatic facial feature point detection using gabor feature
based boosted classifiers. In Systems, Man and Cybernetics, 2005 IEEE International Conference
on, volume 2, pages 1692–1698. IEEE, 2005.
[60] N. Wang, X. Gao, D. Tao, H. Yang, and X. Li. Facial feature point detection: A comprehensive
survey. Neurocomputing, 2017.
[61] X. Yu, J. Huang, S. Zhang, W. Yan, and D. N. Metaxas. Pose-free facial landmark fitting via opti-
mized part mixtures and cascaded deformable shape model. In Proceedings of the IEEE International
Conference on Computer Vision, pages 1944–1951, 2013.
[62] Z. Yu and C. Zhang. Image based static facial expression recognition with multiple deep network
learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction,
ICMI ’15, pages 435–442, New York, NY, USA, 2015. ACM.
[63] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning.
In European Conference on Computer Vision, pages 94–108. Springer, 2014.
[64] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Learning deep representation for face alignment with
auxiliary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(5):918–
930, May 2016.
[65] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild.
In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886.
IEEE, 2012.
[66] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity-preserving face space. In Proceedings
of the IEEE International Conference on Computer Vision, pages 113–120, 2013.
[67] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning multi-view representation for face recognition.
arXiv preprint arXiv:1406.6947, 2014.