Efficient 3D Morphable Face Model Fitting
Guosheng Hu (1), Fei Yan, Josef Kittler, William Christmas (*), Chi Ho Chan, Zhenhua Feng, Patrik Huber

Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK

(*) Corresponding author: William Christmas (w.christmas@surrey.ac.uk)
(1) Current address: AnyVision, Queen's Road, Belfast, BT3 9DT, UK
Abstract
3D face reconstruction of shape and skin texture from a single 2D image can be performed using a 3D Morphable Model (3DMM) in an analysis-by-synthesis approach. However, performing this reconstruction (fitting) efficiently and accurately in a general imaging scenario is a challenge. Such a scenario would involve a perspective camera to describe the geometric projection from 3D to 2D, and the Phong model to characterise illumination. Under these imaging assumptions the reconstruction problem is nonlinear and, consequently, computationally very demanding. In this work, we present an efficient stepwise 3DMM-to-2D image-fitting procedure, which sequentially optimises the pose, shape, light direction, light strength and skin texture parameters in separate steps. By linearising each step of the fitting process we derive closed-form solutions for the recovery of the respective parameters, leading to efficient fitting. The proposed optimisation process involves all the pixels of the input image, rather than randomly selected subsets, which enhances the accuracy of the fitting. It is referred to as Efficient Stepwise Optimisation (ESO).
The proposed fitting strategy is evaluated using reconstruction error as a performance measure. In addition, we demonstrate its merits in the context of a 3D-assisted 2D face recognition system which detects landmarks automatically and extracts both holistic and local features using a 3DMM. This contrasts with most other methods, which only report results that use manual face landmarking to initialise the fitting. Our method is tested on the public CMU-PIE and Multi-PIE face databases, as well as one internal database. The experimental results show that the face reconstruction using ESO is significantly faster, and its accuracy is at least as good as that achieved by the existing 3DMM fitting algorithms. A face recognition system integrating ESO to provide a pose and illumination invariant solution compares favourably with other state-of-the-art methods. In particular, it outperforms deep learning methods when tested on the Multi-PIE database.
Keywords: face recognition; face reconstruction; 3D Morphable Model
1. Introduction
The intrinsic properties of 3D faces give scope for a representation that is immune to the kinds of variations in face appearance that are introduced by the imaging process, such as viewpoint, lighting and occlusion. These invariant facial properties are potentially useful in a wide variety of applications in computer graphics and vision. However, recovering the 3D face and scene properties (viewpoint and illumination) from the appearance conveyed by a single 2D image is very challenging. Specifically, as noted in [1], it is impossible to distinguish between texture and illumination effects unless some assumptions are made to constrain them both. The 3D morphable face model (3DMM) [2] encapsulates prior knowledge about human faces that can be used for this purpose, and is therefore potentially a good tool for 3D face reconstruction.
The 3DMM is a concise statistical model of a 3D face population, created from 3D face data using principal component analysis (PCA). The model represents the face shape and surface texture separately. PCA removes data correlation and identifies a small number of latent variables which represent each face instance very efficiently. See also [3, 4] for other developments of generative models applicable to 3D graph structures.
The reconstruction of a 3D face is conducted by a 3DMM fitting process, which estimates the 3D shape, texture, pose and illumination from a single 2D input image. Considerable research has been carried out to achieve efficient and accurate fitting. The methods advocated in the literature can be classified into two categories:

1. Simultaneous Optimisation (SimOpt): all the parameters (shape, texture, pose and illumination) are optimised simultaneously [2, 5, 6, 7];
2. Sequential Optimisation (SeqOpt): the parameters are optimised sequentially [8, 9, 10].
The SimOpt algorithms use gradient-based methods which are often slow and can easily get trapped in local minima. On the other hand, SeqOpt methods can have closed-form solutions for some or all of the parameters, and accordingly have the potential to be much more efficient computationally. However, the existing SeqOpt methods [8, 9, 10] make strong assumptions about the imaging camera, and consequently they do not generalise well to faces distorted by perspective effects.
In this work we introduce a novel SeqOpt fitting framework, referred to as Efficient Stepwise Optimisation (ESO), which overcomes these problems and is an order of magnitude faster than existing methods. This framework groups the parameters to be optimised into 5 categories: camera model (pose), shape, light direction, light strength and albedo (skin texture). The fitting is decomposed into two separate processes: geometric fitting and photometric fitting.
Geometric Model Fitting. Existing fast pose and shape fitting methods assume an affine camera model [9, 10], which is adequate provided the object's depth is small compared with its distance from the camera; a rule of thumb is that the object should be at least 10 times further from the camera than its depth. However, this is often not the case, e.g. when using a laptop camera for video conferencing and authentication, or when a camera is mounted on a vehicle windscreen for driver authentication or for monitoring the driver for tiredness. In such applications it is essential to relax this assumption and adopt a more general perspective camera model, which renders the reconstruction problem nonlinear and, consequently, computationally expensive. In order to address this conundrum, we propose a novel approach to the shape fitting problem: we formulate the fitting cost function in 3D, rather than the usual 2D. This formulation admits linearisation of the optimisation task, which significantly enhances the computational efficiency.
As in [7], the occluding face contour is used to improve the shape fitting accuracy. In order to mitigate the additional processing costs, we propose to use landmarks on the occluding face contour (see Section 4.3), instead of face contour edges, to refine the camera and shape estimates. To this end, we develop a method that automatically establishes the correspondence between the occluding contour landmarks of the input image and vertices of the 3D face model.
Photometric Model Fitting. Both the Phong [2, 7] and Spherical Harmonics [9, 10] models have been used in the past to estimate the illumination parameters. However, in order to model both diffuse light and specularity adequately, the latter approach requires many spherical harmonics bases (81 in total). Compared with the Spherical Harmonics approach, the Phong model has a more compact representation (elaborated further in Section 3), and is therefore used here. We found it adequate to model the illumination as a combination of a single distant point source plus uniform ambient light, thus keeping the number of coefficients to be found to a minimum.
To accelerate the light model fitting and skin texture parameter estimation, we present a novel approach to optimising both the Phong model parameters and the albedo. Specifically, we propose techniques (Section 4) to linearise the Phong model and the subsequent albedo estimation. Because the objective functions of these linear methods are convex, globally optimal solutions are guaranteed.
The measures proposed in the paper to accelerate the illumination and texture reconstruction speed up the fitting process by a factor of ten or more. We evaluate the fitting accuracy and show that it is superior to that achieved by the current alternatives. This performance is a consequence of the ESO fitting process involving all the model vertices simultaneously, rather than just a randomly sampled subset.
We also evaluate the ESO fitting algorithm as part of a fully automatic pose- and illumination-invariant face recognition system. Its performance is at least comparable to that of the best-performing competitors, including solutions based on deep learning [11, 12], when evaluated on the Multi-PIE dataset.
The paper is organised as follows. In the next section we present a brief summary of the related work. The fitting problem is formulated in Section 3 to establish a methodological baseline. Our fast fitting algorithm, ESO, is developed in Section 4. The proposed algorithm is evaluated in Section 5, both in terms of its reconstruction performance and when embedded in a face recognition system. Section 6 draws the paper to a conclusion.
Figure 1: 3D morphable model fitting pipeline, including the inputs and outputs of a fitting, and the applications of the fitting outputs. (Inputs: a 2D image and the 3DMM; fitting outputs: shape, texture, camera and lighting parameters; applications: face synthesis, face recognition, etc.)
2. Related Work on 3D Morphable Model Fitting
The 3DMM, first proposed by Blanz and Vetter [2], has successfully been applied to computer vision and graphics. A 3DMM consists of separate face shape and texture models learned from a set of 3D exemplar faces. These faces are represented as a graph, in which the node attributes are the 3D position and RGB colour at that node, and the edges indicate geometric connectivity. Related work (e.g. [13]) considers more complex image data. By virtue of a fitting process, a 3DMM can recover the face (shape and texture) and scene properties (illumination and camera model) from a single 2D image, in a process schematically summarised in Fig. 1. The recovered parameters can be used for different applications, such as realistic face synthesis and face recognition.
However, it is well known that achieving accurate fitting is particularly difficult, for two reasons. Firstly, when recovering the 3D shape from a single 2D image, the 3D shape is generally projected to 2D in order to compare it with the 2D image features; as a result, the depth information of the 3D shape is lost. Secondly, separating the contributions of albedo and illumination is an ill-posed problem [14, 15]. Motivated by these challenges, considerable research [2, 6, 7, 8, 9, 10] has been carried out to improve the fitting performance in terms of efficiency and accuracy. As mentioned in Section 1, these methods can be classified into two groups: SimOpt and SeqOpt.
In the SimOpt category, the fitting algorithm in [2, 5] minimises the sum of squared differences, over all colour channels and all pixels, between the input and reconstructed images. A Stochastic Newton Optimisation (SNO) technique is used to optimise a non-convex cost function. The performance of this technique is poor in terms of both efficiency and accuracy, because it is an iterative gradient-based optimiser which may end up in a local minimum.
The efficiency of optimisation is the driver behind the work of [6], where an Inverse Compositional Image Alignment algorithm is introduced for fitting. The fitting is conducted by modifying the cost function so that its Jacobian matrix can be regarded as constant. In this way, the Jacobian matrix is precomputed, which greatly reduces the computational costs. However, this method cannot model illumination effects.
The Multi-Feature Fitting (MFF) strategy [7] is known to achieve the best fitting performance of the SimOpt methods. It makes use of many complementary features of an input image, such as edges and specularity highlights, to constrain the fitting process. The advantages of using these features are demonstrated in [7]. Further improvements to the MFF framework have been achieved by enhancing the fitting robustness to varying image resolution with a resolution-aware 3DMM [16], and by deploying a facial symmetry prior [15] to improve the quality of the illumination fitting. However, all the MFF-based fitting methods are rather slow.
In the SeqOpt category, the 'linear shape and texture fitting algorithm' (LiST) [8] was proposed to improve fitting efficiency. The idea is to update the shape and texture parameters by solving linear systems. However, the illumination and camera parameters are optimised by the gradient-based Levenberg-Marquardt method, whose cost surface exhibits many local minima. The experiments reported in [8] show that the fitting is of similar accuracy to the SNO algorithm, but much faster, in spite of the shape being recovered using a relatively slow optical flow algorithm. The drawback of this approach is the prerequisite that the light direction is known before fitting, which is not realistic for automatic analysis.
Another SeqOpt method [9] decomposes the fitting process into geometric and photometric parts. The camera model is optimised by the Levenberg-Marquardt method, and the shape parameters are estimated by a closed-form solution. In contrast to the previous work, this method recovers 3D shape using only facial feature landmarks, and models illumination using spherical harmonics. Illumination and albedo are determined using least squares optimisation. The work in [17] improved the fitting performance of [9] by segmenting the 3D face model into different subregions. In addition, a Markov Random Field is used in [17] to model the spatial coherence of the face texture. However, the illumination models of [9, 17] cannot deal with specular reflectance, because only 9 low-frequency spherical harmonics bases are used. In addition, [9, 17] use an affine camera model, which cannot model perspective effects.
In common with [9], two more recent SeqOpt methods [10, 18] also sequentially fit geometric and photometric models using least squares. Both methods use only facial landmarks to estimate pose and facial shape via an affine camera. They also share the use of spherical harmonics models to estimate illumination. The authors of [18] use 9 spherical harmonics bases, which cannot model specularity. The method in [10] can model specularity, by projecting the RGB values of the model and input images to a specularity-free space for diffuse light and texture estimation; the specularity is then estimated in the original RGB colour space. In common with [9], both methods [10, 18] use an affine camera, which cannot model perspective effects. In addition, the colour of the lighting in [10] is assumed to be known, which limits the applicability of the method.
Some works focus only on shape fitting [19, 20, 21]. In [20], around 100 facial landmarks are used to recover the facial shape, employing the Levenberg-Marquardt algorithm as the optimiser. In contrast to [20], [19] uses local image features rather than facial landmarks, as these features are more robust.
3. 3D Morphable face model and face image rendering
A 3D face model is a representation of the surface of a class of objects — the objects in our case being faces. Each face consists of a set of vertices whose positions in 3D space collectively express the face shape. Each vertex also has an RGB value; collectively these express the face skin texture (albedo). The model describes both the shape of a face and its appearance, determined by the surface texture. It is defined by a mesh of vertices V = {v_i | i = 1, ..., n}, sampling the face surface at a predefined set of facial points of semantic identity (eye corners, nose tip, etc.). The i-th vertex v_i of a face is located at w_i = (x_i, y_i, z_i)^T, and has the RGB colour values (r_i, g_i, b_i). Hence a 3D face is represented, in terms of shape and texture, as a pair of vectors:

s = (x_1, y_1, z_1, ..., x_n, y_n, z_n)^T,   t = (r_1, g_1, b_1, ..., r_n, g_n, b_n)^T    (1)
Even for twins, faces are unique: each individual will have a particular face shape and skin characteristics. The variability of face shape and skin texture in a population of individuals is captured by a statistical 3D face model, defined by a probability distribution in the s and t space. Since many vertex shape and texture measurements are highly correlated, a population of 3D faces inevitably lies in a subspace of the s and t space, typically determined by Principal Component Analysis (PCA) or other sparse representation methods. Focusing on the former, let S ∈ R^{3n×r_s} and T ∈ R^{3n×r_t} denote the PCA bases of the r_s shape and r_t texture variations respectively. A face instance (s, t) can concisely be expressed as

s = s_0 + S α,   t = t_0 + T β    (2)

where s_0 and t_0 are the mean face shape and texture respectively. The parameters α and β are assumed to have normal distributions:

p(α) ∼ N(0, σ_s)    (3)
p(β) ∼ N(0, σ_t)    (4)

where σ_s and σ_t are the vectors of variances of the latent model shape and texture parameters.
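To make the generative model concrete, the following NumPy sketch (our own illustration, not the authors' code) draws a random face instance from Eqs. (2)-(4). The model matrices here are random stand-ins: a real S, T, s_0, t_0 would come from PCA on registered scans, so all sizes and values are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, r_s, r_t = 5000, 60, 60            # illustrative sizes: vertices, basis dimensions

# Stand-ins for the learned model: mean shape/texture and orthonormal PCA bases.
s0 = rng.normal(size=3 * n)
t0 = rng.uniform(size=3 * n)
S = np.linalg.qr(rng.normal(size=(3 * n, r_s)))[0]
T = np.linalg.qr(rng.normal(size=(3 * n, r_t)))[0]
sigma_s = rng.uniform(0.5, 2.0, r_s)  # prior variances, Eq. (3)
sigma_t = rng.uniform(0.5, 2.0, r_t)  # prior variances, Eq. (4)

# Sample latent parameters from the Gaussian priors, then synthesise via Eq. (2).
alpha = rng.normal(0.0, np.sqrt(sigma_s))
beta = rng.normal(0.0, np.sqrt(sigma_t))
s = s0 + S @ alpha                    # stacked (x, y, z) per vertex
t = t0 + T @ beta                     # stacked (r, g, b) per vertex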
As the bases and the mean vectors are fixed for a particular population, all statistical information is conveyed by the parameter vectors α and β, the dimensionality of which is considerably lower than that of the original face space. Each pair of model parameter vectors α and β defines an instance of a 3D face. This provides a very concise representation of the face, which is convenient from the point of view of face synthesis. By changing the shape and texture parameters we can generate different faces. A transition from one pair of parameter vectors to another pair will morph one face into another in a smooth manner. This morphing capability of the statistical 3D face model has given it its name: the 3D Morphable Face Model (3DMM).
A 3DMM can be used for many purposes in face analysis. For instance, the model can be fitted to an input 2D face image, and the estimated shape and texture parameters of the reconstructed 3D face used for face recognition in the face model parameter space. Alternatively, given the pose of an input 2D face image, we can fit the 3DMM to a gallery face image and use the fitted 3D face to synthesise a new pose of the subject; for instance, a pose identical to the given pose, in order to perform matching. Another possibility is to fit the 3DMM to an input 2D image of arbitrary pose, and then frontalise the query image with the help of the estimated 3D face shape. The operative phrase in all these use cases is 3DMM fitting. It is the crucial prerequisite enabling all these applications.
The underlying principle of fitting a 3DMM to an input 2D face image is to identify the shape and texture parameters of the face model that would enable the synthesis of a 2D model image deemed indistinguishable from the input query image. However, the rendering process is quite complex. It involves not only the selection of model shape and texture parameters to produce a 3D face instance, but also its transformation to a new pose, and the subsequent projection of the 3D face to 2D under a particular scene illumination. The assessment of the similarity of the synthesised image to the input image also traditionally involves sampling the input image at the 2D points corresponding to the projection of the 3D mesh vertices onto the 2D input image.
Let us now describe the rendering process, the underlying physics of which is captured in Fig. 2, in more detail. We shall render a 2D view of a face instance showing a particular pose by rotating and translating the camera with respect to the face model coordinate system, by a (3 × 3) rotation matrix R and a 3D translation vector τ respectively. In the camera coordinate system the transformed shape s′ can be expressed in matrix form as

s′ = U s + τ̌    (5)

where U is the block diagonal matrix with n copies of the rotation matrix R on its diagonal, and τ̌ is a vector composed of n copies of the displacement τ.
Figure 2: Physics of rendering. At image pixel position w̃_i, the RGB value output by the camera measures the reflection, at the face surface point w_i, of the light source illuminating the face surface from direction d. The surface normal at w_i is n_i. In the head coordinate system, the camera is located at position τ, and the viewing direction of the vertex w_i is v_i. The specular light reflected from w_i is centred on direction r_i, where r_i is such that the surface normal n_i bisects r_i and the direction d of the incident light.
The pixel values at locations corresponding to s will depend on the albedo t and the scene illumination. Different illumination models can be adopted for lighting the face (e.g. [22]), but we adopt the Phong model, which can represent complex reflectance phenomena, including specular reflectance, using a small number of parameters. The appearance of the generated face at each point, represented by a 3n-dimensional vector a_M, is the product of the interplay of the face surface normal, the skin albedo t and the incident light, assumed to be the sum of contributions from ambient, diffuse and specular lights:

a_M = ľ_a ⊙ t + (ľ_d ⊙ t) ⊙ (N_3 d) + ľ_d ⊙ e    (6)
      (ambient)  (diffuse)             (specular)
where the ambient light ľ_a is a 3n-dimensional vector, composed of n copies of the ambient light intensity l_a = (l_a^r, l_a^g, l_a^b)^T:

ľ_a = (l_a^r, l_a^g, l_a^b, ..., l_a^r, l_a^g, l_a^b)^T ∈ R^{3n}    (7)

Similarly, ľ_d is a 3n-dimensional vector, composed of n copies of the directed light strength l_d = (l_d^r, l_d^g, l_d^b)^T:

ľ_d = (l_d^r, l_d^g, l_d^b, ..., l_d^r, l_d^g, l_d^b)^T ∈ R^{3n}    (8)
The symbol ⊙ denotes element-wise multiplication. The matrix N_3 is a stack of 3 copies of the matrix N:

N_3 = (N^T, N^T, N^T)^T    (9)

where N ∈ R^{n×3} is a stack of the surface normals n_i ∈ R^3 at vertices i = 1, ..., n (see Fig. 2). The unit vector d ∈ R^3 is the light direction. The vector e ∈ R^{3n} is a stack of the specular reflectance e_i of each vertex i = 1, ..., n (the components of which could be different for the three channels), i.e.

e_i = k_s ⟨v_i, r_i⟩^γ    (10)

where v_i is the viewing direction of the i-th vertex. Since in the face model coordinate system the camera is at position τ, the viewing direction can be expressed as v_i = (τ − w_i)/|τ − w_i|, where w_i = (x_i, y_i, z_i)^T is the vector of the 3D coordinates of that vertex. The unit vector r_i denotes the reflection direction of the light source at the i-th vertex: r_i = 2⟨n_i, d⟩n_i − d. The two constants k_s and γ denote the specular reflectance and shininess respectively [23]. Note that k_s and γ are determined by the facial skin reflectance property, which is similar for different people; they are assumed constant over the whole facial region. For the sake of simplicity, in our work we also assume that k_s and γ are the same for the three colour channels, so each entry in e_i is repeated three times. In this work, the components of k_s are each set to 0.175, and γ is set to 30, following [23].

In all cases it is important to check all vertices for visibility, so that parts of the face turned away from the camera do not contribute to the rendered pixel values.
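As a sanity check of Eq. (6), the sketch below evaluates the Phong appearance vector a_M for n vertices. It assumes an interleaved (r, g, b per vertex) layout for all 3n-dimensional vectors (the exact stacking order of N_3 in Eq. (9) is a bookkeeping choice), and it clamps negative dot products, a standard Phong detail that the visibility check otherwise takes care of.

import numpy as np

def phong_appearance(t, N, d, l_a, l_d, v, k_s=0.175, gamma=30.0):
    """Evaluate Eq. (6). t: (3n,) albedo; N: (n, 3) unit normals; d: (3,) unit
    light direction; l_a, l_d: (3,) RGB ambient/directed strengths; v: (n, 3)
    unit viewing directions. Returns a_M: (3n,), RGB interleaved per vertex."""
    n = N.shape[0]
    la_check = np.tile(l_a, n)                      # Eq. (7)
    ld_check = np.tile(l_d, n)                      # Eq. (8)
    n_dot_d = np.clip(N @ d, 0.0, None)             # diffuse factor N3 d, per vertex
    r = 2.0 * (N @ d)[:, None] * N - d              # reflection directions r_i
    spec = k_s * np.clip(np.sum(v * r, axis=1), 0.0, None) ** gamma   # Eq. (10)
    return (la_check * t
            + (ld_check * t) * np.repeat(n_dot_d, 3)
            + ld_check * np.repeat(spec, 3))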
3.1. Fitting 3DMM to a 2D image
Let us consider an input face image I acquired by a camera with focal length f and with the coordinates of its optical axis in the image plane o = (o_x, o_y)^T. It is assumed that the image is landmarked. Let ρ denote the set of extrinsic and intrinsic camera parameters

ρ = {R, τ, f, o}    (11)

and let us lump together all illumination parameters as

µ = {l_a, l_d, d, k_s, γ}    (12)

Fitting the 3D face involves finding the pose, shape, texture and illumination parameters ρ, α, β, µ so that the image reconstructed from the model,

a_M = (r_1^M, g_1^M, b_1^M, ..., r_n^M, g_n^M, b_n^M)^T    (13)

is as close as possible to the input image.
Typically, the quality of the reconstruction is measured in 2D. This involves projecting the mesh of 3D vertices into 2D. For each vertex, the camera projects the triplet of its 3D coordinates to a 2D pixel location in the camera image plane as

s̃ = P s′ + ǒ    (14)

where P ∈ R^{2n×3n} is a block diagonal matrix constructed from the projection matrices P_i, i = 1, ..., n:

P_i = ( f/z′_i   0        0 )
      ( 0       −f/z′_i   0 )    (15)

Note that each P_i is a function of the corresponding depth coordinate z′_i, as well as of the camera focal length f. The negative term in P_i results from an assumption of a clockwise image coordinate system. The 2n-dimensional vector ǒ is a stack of n copies of the 2D position o of the optical axis in the image plane. For faces at a distance exceeding 10× the radius of the subject's head we can use an affine projection, with P_i = P_j ∀ i, j, instead, without incurring any significant approximation errors.
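A minimal sketch of the perspective projection of Eqs. (14)-(15), written per vertex rather than with the large block-diagonal matrix P:

import numpy as np

def project(s_prime, f, o):
    """Project camera-frame vertices to the image plane, Eqs. (14)-(15).
    s_prime: (n, 3) vertices in camera coordinates; f: focal length in
    pixels; o: (2,) optical-axis position. Returns (n, 2) pixel positions."""
    x, y, z = s_prime[:, 0], s_prime[:, 1], s_prime[:, 2]
    return np.stack([f * x / z + o[0],
                     -f * y / z + o[1]], axis=1)   # minus: clockwise image coords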
The 2D mesh of projected vertices, s̃, samples the input 2D image. Stacking the RGB values of the corresponding samples into a vector a_I,

a_I = (r_1^I, g_1^I, b_1^I, ..., r_n^I, g_n^I, b_n^I)^T    (16)

we can then compare the synthesised and input images by measuring the error ||a_I − a_M||. Noting that the samples picked from the input image by the mesh are a function of ρ and α, the objective of the fitting process is to solve the optimisation problem

min_{α,β,ρ,µ} ||a_I(ρ, α) − a_M(ρ, α, β, µ)||² + λ_1 ||α ÷ σ_s||² + λ_2 ||β ÷ σ_t||²    (17)

where the last two terms induce regularisation of the estimated parameters. The symbol ÷ denotes element-wise division.
The problem formulated in (17) is very challenging because of its nonlinearity and its ill-posed nature [14, 15]. The conventional approach to optimisation is to apply the Newton Optimisation algorithm, involving the sampling of random subsets of mesh vertices to achieve computational feasibility [2, 5]. These challenges motivated the developments reviewed in Section 2, but any speed-up therein is only achieved at the expense of restricted applicability.

In the following section we propose a novel method of fitting the 3DMM that is more than an order of magnitude faster than the existing algorithms, without imposing any restrictions on the camera model and lighting. The computational efficiency is achieved by breaking the fitting problem up into a sequence of optimisation tasks, most of which are linearised to yield closed-form solutions. The proposed strategy has the additional major benefit, for illumination and albedo estimation, of simultaneously involving all the model vertices in the optimisation. This avoids local optima and leads to more accurate fitting.
4. Efficient Stepwise Optimisation (ESO)
This section describes our ESO framework. ESO is a SeqOpt method which groups all the parameters into 5 categories: pose with camera parameters, shape, light direction, light strength and albedo. The parameters in each group are optimised under the assumption that those in all the other groups are known, or have no impact on the optimisation process. The parameter grouping strategy aids the linearisation of the 3D face model fitting process, but further group-specific linearisation measures are adopted, as required. These are detailed in the respective sections.
Figure 3: The ESO fitting process topology. The geometric refinement comprises the camera, shape and contour-landmark stages; the photometric refinement comprises the light direction, light strength and albedo stages. Each of the two main phases of the fitting process - geometric and photometric - is iterated until convergence is achieved.
The proposed method divides the fitting process into two phases, namely geometric and photometric optimisation, as shown in Fig. 3. The geometric phase aligns the input image to the 3DMM, and the photometric phase recovers its reflectance. Each phase consists of three stages that are iterated in turn a few times to refine the solution. A key contribution of our approach is the proposed linearisation of all but one stage of the optimisation process, which leads to closed-form solutions and, consequently, computational efficiency. In Sections 4.1 to 4.6, each step of ESO is explained in more detail.
4.1. Camera Parameter Estimation
The first step uses the input image facial landmarks to estimate the subject's pose and the camera parameters that roughly align the input image to the model. Let us consider an identifiable point w̃_i^I = (x̃_i^I, ỹ_i^I)^T on the face of the input image, which semantically corresponds to the i-th vertex of the 3D face model, with coordinates w_i = (x_i, y_i, z_i)^T. Image landmarks typically include the locations of the eye and mouth corners, the tip of the nose, etc.² In this work, a maximum of 28 landmarks are used, as shown in Fig. 4. However, some of these landmarks are not visible for non-frontal poses due to self-occlusion; in those cases, only the visible landmarks are used. Also, in the first iteration, the contour landmarks (7 are shown in Fig. 4) are not available.

Figure 4: Visualisation of the facial landmarks used throughout this paper
For the alignment, we need to find the rigid transformation R, τ that moves the coordinates of the point w_i = (x_i, y_i, z_i)^T to its new position w′_i = (x′_i, y′_i, z′_i)^T, so that after w_i is projected to 2D via the mapping W(ρ), its 2D coordinates w̃_i = (x̃_i, ỹ_i)^T are as close as possible to the image point w̃_i^I = (x̃_i^I, ỹ_i^I)^T.

Let L denote the subset of vertices corresponding to the facial landmark points in the input image. Then the pose and camera parameters ρ = {R, τ, f, o} can be estimated by minimising the distance between the input landmarks and those reconstructed from the model:

min_ρ Σ_{i∈L} ||w̃_i^I − w̃_i||²    (18)

This is the only cost function which is not linearised. It is minimised by the Levenberg-Marquardt algorithm [7], but because of the small number of parameters involved, the convergence is fast. Note that w̃_i depends on both the pose and camera parameters, as well as on the shape model s. The latter is kept constant in this step; in the first iteration, s is set to s_0, and in subsequent iterations s is replaced by the shape update obtained in the previous iteration by the second stage of the fitting process, described in the next subsection. The estimated pose and camera parameters feed into the shape estimation stage described in Section 4.2. The contour landmarks described in Section 4.3 constrain the pose and camera parameters, and the shape estimation.

² Landmark detection itself is outside the scope of this paper.
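For reference, the camera step of Eq. (18) can be prototyped with an off-the-shelf Levenberg-Marquardt solver. The flat angle-axis parameterisation of ρ below, and the names w_model, w_image and rho_init, are our own illustrative choices, not the paper's.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def landmark_residuals(rho, w_model, w_image):
    """Residuals of Eq. (18). rho = [angle-axis (3), tau (3), f, ox, oy];
    w_model: (m, 3) landmark vertices of the current shape s;
    w_image: (m, 2) detected 2D landmarks."""
    R = Rotation.from_rotvec(rho[:3]).as_matrix()
    tau, f, o = rho[3:6], rho[6], rho[7:9]
    w_cam = w_model @ R.T + tau                    # rigid transform, Eq. (5)
    proj = np.stack([ f * w_cam[:, 0] / w_cam[:, 2] + o[0],
                     -f * w_cam[:, 1] / w_cam[:, 2] + o[1]], axis=1)
    return (proj - w_image).ravel()

# rho_hat = least_squares(landmark_residuals, rho_init, method="lm",
#                         args=(w_model, w_image)).x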
4.2. Shape Parameter Estimation

Once the pose and camera parameters are recovered, the shape parameters α can be estimated. We linearise this problem by making use of the current estimate of the model vertex coordinates z_i, ∀i, to define the projection matrices P_i. In addition, in contrast to prior art, we define the cost function in 3D space, as:

min_α Σ_{i∈L} ||w_i^I − w_i||²    (19)

where the image landmarks w̃_i^I, i ∈ L, are back-projected to w_i^I = (x_i^I, y_i^I, z_i^I)^T via w_i^I = W^{-1}(w̃_i^I, ρ). The main motivation for working in 3D is to reduce the computational complexity further.

Since w_i is a vertex of the shape model s, it is a function of α. The cost function is defined in 3D as:

min_α ||ŝ_I − ŝ(α)||² + λ_1 α^T σ_s^{-1} α    (20)

where ŝ_I and ŝ are the stacked vertex positions w_i^I and w_i, i ∈ L, respectively; ŝ = ŝ_0 + Ŝ α; ŝ_0 and Ŝ are constructed by choosing those elements of s_0 and S (defined in Eq. (2)) corresponding to the landmark indices L; λ_1 is a free weighting parameter; and α^T σ_s^{-1} α is a regularisation term based on Eq. (3).

The closed-form solution for α is:

α = (Ŝ^T Ŝ + λ_1 σ_s^{-1})^{-1} Ŝ^T (ŝ_I − ŝ_0)    (21)

where σ_s is defined in Section 3.
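Eq. (21) is a standard ridge-regression solve; a minimal sketch:

import numpy as np

def solve_shape(S_hat, s0_hat, sI_hat, sigma_s, lam1=0.5):
    """Closed-form shape update of Eq. (21). S_hat: (3m, r_s) rows of the
    shape basis for the m landmark vertices; s0_hat: (3m,) mean landmark
    positions; sI_hat: (3m,) back-projected image landmarks; sigma_s: (r_s,)
    prior variances; lam1: regularisation weight (0.5 as in Section 5.1.1)."""
    A = S_hat.T @ S_hat + lam1 * np.diag(1.0 / sigma_s)
    return np.linalg.solve(A, S_hat.T @ (sI_hat - s0_hat))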
Finally, we explain how to implement the inverse projection W^{-1}. Note that w̃_i^I cannot be back-projected to w_i^I in the face model coordinate system unless z_i^I, the depth along the z axis of w_i^I, is known. In the first iteration, z_i^I is approximated by the model vertex depth z_i, which is constructed from the mean shape s_0. As the face shape is updated in subsequent iterations, the latest estimate s is used in place of s_0.

Figure 5: Contour landmark detection. (a) model image, (b) contour edge, (c) image landmarks, (d) correspondence. Yellow and red dots represent the contour landmarks of the input and model-reconstructed images, respectively. Algorithm 1 bridges (c) and (d).
4.3. Contour Landmark Constraints
One impediment to accurate 3D shape reconstruction from a non-frontal 2D face image stems from the lack of constraints on the projection of the occluding face contour. In [7], the authors define the contour edges as the occluding boundary between the face and non-face area, and use them to constrain the fitting. The contour edges of the 2D face image synthesised from a 3D model-based reconstruction of a 2D input image are shown in Fig. 5b. They are formed by linking the 3D model vertices lying on the occluding boundary of the projected 3D face mesh, determined by the vertex visibility check. A recent review of techniques developed to fit a 3DMM to edges can be found in [24]. To reduce the computational cost of working with contour edges, we use only contour landmarks lying on the contour boundary. Here a contour landmark is defined as the point of intersection of the occluding boundary of a face and a horizontal line, in the face coordinate system, passing through the corresponding symmetric landmark, as shown in Fig. 6. Such contour landmarks are labelled in the input image automatically by a cascaded-regression-based facial landmark detection algorithm [25], which has been trained to detect contour landmarks defined in this way.

The vertices L_c that form the contour landmarks along the occluding boundary of the fitted 3DMM are found using Algorithm 1. They are the vertices (red dots in Fig. 5d) closest to the contour landmarks of the input image (yellow dots in Fig. 5c). Once this correspondence is established, these contour landmark pairs are added to the available landmark set L in Eqs. (18) and (20), to improve the estimation of the camera parameters and shape.

Figure 6: Definition of the contour landmarks. The axis of face symmetry is defined by the chin and the centre of the nose bridge. The face contour landmarks are the points of intersection of (i) the input-image face occluding contour and (ii) the horizontal lines, in the thus-defined face coordinate system, passing through the visible facial contour landmarks.
4.4. Light Direction Estimation
After the geometric fitting (Sections 4.1-4.3), the 3DMM is aligned to the input image, and the reflectance parameters can be estimated. In this step, we focus on the light direction d, and regard all other variables as constant. Recalling Eq. (6), the cost function can be formulated as:

min_d ||a_I − ľ_a ⊙ t − (ľ_d ⊙ t) ⊙ (N_3 d) − ľ_d ⊙ e||²    (22)

The minimisation of Eq. (22) is a non-linear problem because of the exponential form of e in Eq. (10). To eliminate this nonlinear dependence, we precompute the value of e based on the assumptions that: i) k_s and γ are constant; and ii) the values of v and r are set to those of the previous iteration.
Algorithm 1: Establishing the contour landmark correspondence

Input:
  2D contour landmark coordinates η = {η_1, ..., η_{k1}} output by [25]
  3DMM-rendered contour edge coordinates ζ = {ζ_1, ..., ζ_{k2}} (k2 ≫ k1) via W
  3D vertex indices φ = {φ_1, ..., φ_{k2}} corresponding to ζ
Output: 3D vertex indices L_c corresponding to η

1  for i = 1; i ≤ k1; i++ do
2      for j = 1; j ≤ k2; j++ do
3          dist_j = ||η_i − ζ_j||²
4      end
5      L_c(i) = φ_{argmin_j dist_j}
6  end
7  return L_c
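Algorithm 1 is a nearest-neighbour search; vectorised, it collapses to a few lines. A sketch with the same inputs as above:

import numpy as np

def contour_correspondence(eta, zeta, phi):
    """Vectorised Algorithm 1. eta: (k1, 2) detected contour landmarks;
    zeta: (k2, 2) projected model contour points; phi: (k2,) vertex indices.
    Returns L_c: (k1,) model vertex indices, one per detected landmark."""
    d2 = ((eta[:, None, :] - zeta[None, :, :]) ** 2).sum(axis=2)  # (k1, k2)
    return phi[np.argmin(d2, axis=1)]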
In order to make the linearity of the light direction fitting problem more transparent, we avoid the element-wise multiplication in (22) by reformulating the cost function as:

min_d ||a_I − ľ_a ⊙ t − ľ_d ⊙ e − (A ⊙ N_3) d||²    (23)

where A = [ľ_d ⊙ t, ľ_d ⊙ t, ľ_d ⊙ t] ∈ R^{3n×3}. With this reformulation, a closed-form solution can be found: d = ((A ⊙ N_3)^T (A ⊙ N_3))^{-1} (A ⊙ N_3)^T (a_I − ľ_a ⊙ t − ľ_d ⊙ e). Then d is normalised to a unit vector.
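The closed form of Eq. (23) is an ordinary least-squares solve in d; a sketch (np.linalg.lstsq returns the same normal-equation solution):

import numpy as np

def solve_light_direction(a_I, la_check, ld_check, t, e, N3):
    """Closed-form solution of Eq. (23). N3: (3n, 3) stacked normals;
    a_I, la_check, ld_check, t, e: (3n,) vectors as defined in Section 3."""
    A_N3 = (ld_check * t)[:, None] * N3      # (A ⊙ N3), A = [ld⊙t, ld⊙t, ld⊙t]
    rhs = a_I - la_check * t - ld_check * e
    d, *_ = np.linalg.lstsq(A_N3, rhs, rcond=None)
    return d / np.linalg.norm(d)             # normalise to a unit vector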
For the first iteration, we initialise the values of t, ľ_a and ľ_d as follows. 1) In common with [26, 27], we assume that the face is a Lambertian surface; consequently, only the diffuse light in Eq. (6) is modelled. 2) The directed light strength ľ_d and the albedo t are set to a vector of ones and to t_0 respectively. With these assumptions, the cost function in the first iteration becomes:

min_d ||a_I − (B ⊙ N_3) d||²    (24)

where B = [t_0, t_0, t_0] ∈ R^{3n×3}. The closed-form solution is d = ((B ⊙ N_3)^T (B ⊙ N_3))^{-1} (B ⊙ N_3)^T a_I.

The estimated light direction is fed into the light strength and albedo estimation steps detailed in Sections 4.5 and 4.6.
4.5. Light Strength Estimation
Having obtained an estimate of d, the ambient and directed light strengths can be recovered. Because the three colour channels can be processed independently, for simplicity only the red channel is described. The cost function for the red channel is:

min_{l_ad^r} ||a_I^r − C l_ad^r||²    (25)

where a_I^r is the red channel of a_I; C = [t^r, t^r ⊙ (N d) + e^r] ∈ R^{n×2}, with t^r and e^r the red channels of t and e; and l_ad^r = (l_a^r, l_d^r)^T, where l_a^r and l_d^r are the strengths of the ambient and directed lights of the red channel respectively. The closed-form solution for l_ad^r is:

l_ad^r = (C^T C)^{-1} C^T a_I^r    (26)

Note that t is set as stated earlier in Section 4.4. The green and blue channels are solved in the same way.
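A per-channel sketch of Eqs. (25)-(26), shown for the red channel:

import numpy as np

def solve_light_strength(aI_r, t_r, e_r, N, d):
    """Closed-form ambient/directed strengths for one channel, Eqs. (25)-(26).
    aI_r, t_r, e_r: (n,) red channels of a_I, t and e; N: (n, 3) normals;
    d: (3,) light direction. Returns (l_a^r, l_d^r)."""
    C = np.stack([t_r, t_r * (N @ d) + e_r], axis=1)   # (n, 2)
    return np.linalg.solve(C.T @ C, C.T @ aI_r)        # Eq. (26)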
4.6. Albedo Estimation
Once the light direction and strengths are recovered, the albedo can be estimated. To avoid over-fitting, we regularise the albedo estimation, giving the cost function:

min_β ||a_I − (t_0 + T β) ⊙ ľ_a − (t_0 + T β) ⊙ ľ_d ⊙ (N_3 d) − ľ_d ⊙ e||² + λ_2 β^T σ_t^{-1} β    (27)

where λ_2 is a free weighting parameter. The closed-form solution is

β = (T^T T + λ_2 σ_t^{-1})^{-1} T^T (a_in^I − t_0)    (28)

where σ_t is as defined in Section 3 and a_in^I, the illumination-normalised image, is given by:

a_in^I = (a_I − ľ_d ⊙ e) ÷ (ľ_a + ľ_d ⊙ (N_3 d))    (29)

where the symbol ÷ denotes element-wise division, as before.
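A sketch of the albedo update, combining the illumination normalisation of Eq. (29) with the ridge solve of Eq. (28):

import numpy as np

def solve_albedo(a_I, t0, T, sigma_t, la_check, ld_check, e, N3, d, lam2=0.7):
    """Closed-form albedo parameters, Eqs. (28)-(29). T: (3n, r_t) texture
    basis; sigma_t: (r_t,) prior variances; the remaining arguments are the
    (3n,) vectors and the (3n, 3) normal stack defined in Section 3."""
    a_in = (a_I - ld_check * e) / (la_check + ld_check * (N3 @ d))  # Eq. (29)
    A = T.T @ T + lam2 * np.diag(1.0 / sigma_t)
    return np.linalg.solve(A, T.T @ (a_in - t0))                    # Eq. (28)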
4.7. Computational complexity
The computational complexity of our method is dominated by the albedo estimation described above. From Eq. (28) we can see that, since n ≫ r_t, the dominating computations are the two matrix multiplications, both of which have a complexity of O(n r_t²).
5. Experiments
In this section, a comprehensive evaluation of our methodology is described. First, face reconstruction performance is evaluated. Then, in face recognition experiments, we compare our ESO with existing 3DMM methods and other state-of-the-art methods. We implemented two effective 3DMM fitting methods, [7] and [10]; the free parameter settings of [7, 10] follow the original papers. The results of all the other methods are cited from their papers, based on the same experimental settings.
5.1. Face Reconstruction
First, we present some qualitative fitting results in Fig. 7. These images are from the Multi-PIE database. The people in these images differ in gender, ethnicity and facial features, such as a beard and/or glasses. All these factors can cause difficulties for fitting. As can be seen in Fig. 7, the input images are well fitted. Note that our 3DMM does not model glasses; therefore, the glasses in an input image, such as that of the 3rd person in Fig. 7, can confuse the fitting process. Despite this, our ESO reconstructs this face well, demonstrating its robustness.

In order to quantitatively measure every component of ESO, the 2D input images and the corresponding ground truth of the camera parameters, 3D shape, light direction and strength, and texture need to be known. To meet all these requirements, we generated a local database of rendered 2D images with full 3D ground truth, as follows: (1) We collected and registered 20 3D face scans; the first 10 scans are used for model selection, and the remaining scans for performance evaluation. (2) The registered 3D scans are projected to PCA space, parameterising the ground truth in terms of the coefficients α and β. (3) Using the registered 3D scans, we rendered 2D images under different poses and illuminations. (4) The 3DMM is fitted to obtain estimates of all these parameters. (5) Reconstruction performance is measured using the cosine similarity between the estimated and ground-truth α or β.

Figure 7: Row 1: input images with different pose and illumination variations. Row 2: ESO-fitted/reconstructed images.
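The reconstruction score used throughout Section 5.1 is the plain cosine similarity between parameter vectors:

import numpy as np

def cosine_similarity(p_est, p_true):
    """Cosine similarity between estimated and ground-truth alpha (or beta)."""
    return float(p_est @ p_true / (np.linalg.norm(p_est) * np.linalg.norm(p_true)))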
5.1.1. Effects of Hyperparameters
Before evaluating the face reconstruction performance, we investigate the sensitivity of the fitting process to the hyperparameters of ESO. The relevant hyperparameters are the regularisation weights λ_1 in Eq. (20) and λ_2 in Eq. (27), and the numbers of iterations (l_1 and l_2) of the geometric and photometric refinements (Fig. 3), respectively. All the renderings in Section 5.1.1 are generated by setting both the focal length and the distance between the object and the camera to 1800 pixels, as suggested in [28].
Impact of the weight λ_1 on shape reconstruction. The weight λ_1 should be selected carefully, because an improper λ_1 will cause under- or over-fitting during shape reconstruction. As shown in Fig. 8, the reconstruction using a large λ_1 (= 1000) looks very smooth and the shape details are lost, exhibiting typical characteristics of under-fitting. On the other hand, a small λ_1 (= 0) causes over-fitting, and the reconstruction in Fig. 8 is excessively stretched. In comparison, the reconstruction with λ_1 = 0.5 recovers the shape well.

To quantitatively evaluate the impact of λ_1, 2D renderings under 3 poses (frontal, side and profile), without directed light, are generated. To decouple the impact of λ_1 and l_1 on shape refinement, l_1 is set to 1. As shown in Fig. 9a, neither small (< 0.4) nor large (> 1) values of λ_1 lead to good reconstruction, which is consistent with Fig. 8. On the other hand, the reconstructions for all 3 poses do not change much with λ_1 in the region between 0.4 and 0.7. Hence, λ_1 is set to 0.5, the average value of the best λ_1 over all the test cases, to simplify parameter tuning.

Figure 8: Impact of λ_1 and λ_2 on shape and albedo reconstruction. Column 1: input image; Column 2: ground truth of shape and albedo; Columns 3-5: reconstructions with different λ_1 and λ_2.
Impact of the number of iterations l_1 on shape refinement. The same renderings are also used to evaluate the sensitivity to l_1. From Fig. 9b, we can see that more than 3 iterations do not greatly improve the reconstruction performance for any pose. Therefore, l_1 is fixed at 3 in the remaining experiments.
Impact of the weight λ_2 on albedo reconstruction. We also examine the impact of λ_2 on albedo reconstruction. Fig. 8 shows some qualitative results. Clearly, the reconstruction with λ_2 = 1000 loses the facial details because of under-fitting. On the other hand, the one with λ_2 = 0 does not separate the illumination and albedo properly, causing over-fitting. In comparison, the one with λ_2 = 0.7 reconstructs the albedo well.

To quantitatively investigate the impact of λ_2 on the estimated light direction and strength, renderings with different light directions d and strengths l_d³ are used, as shown in Fig. 9c. All these renderings are under frontal pose, and l_2 = 1. It is clear that the reconstructed albedo does not change greatly with λ_2 in the region between 0.2 and 1. To simplify parameter tuning, λ_2 is fixed at 0.7, the average value of the best λ_2 over all the test cases.

Impact of the number of iterations l_2 on albedo refinement. To investigate the impact of l_2, the same 2D renderings as for the λ_2 evaluation are used. As shown in Fig. 9d, all the images converge by the 4th iteration. Hence, for simplicity, l_2 is fixed at 4 in ESO.

³ The illumination is set to be white here, i.e. l_d = (l_d, l_d, l_d)^T.
5.1.2. Reconstruction Results
We evaluate the shape and albedo reconstructions separately. ESO is compared with two methods, MFF [7] and [10], which are the best SimOpt and SeqOpt methods respectively. We implemented the whole MFF framework. Regarding [10], we implemented only the geometric (camera model and shape) part, because insufficient implementation details of the photometric part were released.

Shape Reconstruction. As mentioned in Sections 1 and 2, the affine camera used by [10] cannot model perspective effects, while the perspective camera used by ESO and MFF can. Different camera models lead to different shape reconstruction strategies. In order to find out how significant this difference is, we change the distance between the object and the camera to generate perspective effects, at the same time keeping the facial image size constant by adjusting the focal length to match [28]. Note that the shorter this distance, the larger the perspective distortion. To compare shape reconstruction performance, 2D renderings under frontal pose are generated for 6 different distances. We can see from Fig. 10 that the performance of ESO and MFF remains constant under different perspective distortions. However, the performance of [10] degrades greatly as the distance between the object and the camera decreases. Also, ESO consistently works better than MFF under all perspective distortions.
Albedo Reconstruction. We compare ESO with MFF [7] in Table 1, using images rendered under different light directions and strengths.
Figure 9: Effects of hyperparameters on facial shape and albedo reconstruction, measured by cosine similarity. (a) Impact of the regularisation weight λ_1 on shape reconstruction over poses (frontal, side, profile); (b) impact of the number of iterations l_1 on shape reconstruction; (c) impact of the regularisation weight λ_2 on albedo reconstruction over lightings (left, right and frontal light, l_d ∈ {0.1, 0.5, 1}); (d) impact of the number of iterations l_2 on albedo reconstruction.
We see that the albedo reconstruction performance for different light directions is very similar, but that it varies greatly with the directed light strength. This demonstrates that the albedo reconstruction is more sensitive to light strength than to light direction. Also, ESO consistently works better than MFF. The reasons are twofold: 1) MFF uses a gradient-based method that suffers from the non-convexity of the cost function; 2) for computational efficiency, MFF randomly samples only a small number (1000) of polygons to establish the cost function, which is insufficient to capture the information of the whole face, causing under-fitting. Our method, being much faster, makes use of all the polygons. Further discussion of computational efficiency can be found in Section 5.2.1.
Figure 10: Shape reconstruction results measured by cosine similarity, as a function of the distance between the camera and the object (in pixels). Curves: [10] with an affine camera; ESO with a perspective camera; MFF with a perspective camera.
Table 1: Albedo reconstruction results measured by cosine similarity.

light direction | light strength l_d | MFF [7]     | ESO
left            | 0.5                | 0.57 ± 0.15 | 0.61 ± 0.08
right           | 0.5                | 0.57 ± 0.13 | 0.60 ± 0.08
frontal         | 0.5                | 0.58 ± 0.14 | 0.62 ± 0.08
frontal         | 0.1                | 0.60 ± 0.13 | 0.67 ± 0.07
frontal         | 1.0                | 0.49 ± 0.16 | 0.54 ± 0.08
5.2. Pose and Illumination Invariant Face Recognition
Pose- and illumination-invariant face recognition is a challenging problem addressed by a variety of approaches [29]. 2D methods address the pose and illumination problem at either the pixel level or the image feature level. The former aim to create pixel-level correspondence across different poses [30, 31, 32, 33]; for example, regression-based methods [31, 32] learn mapping matrices which project images of one particular pose to another. The latter project the pixel values into pose- and/or illumination-robust feature spaces [11, 34, 35, 36, 37]; for example, Canonical Correlation Analysis [34] projects pixel values to a subspace where the impacts of pose and illumination are effectively removed. Deep learning [11, 36] is also based on the same motivation.

3D methods intrinsically model pose variations using the analysis-by-synthesis approach. This means that a 3D face model has to be fitted to the input 2D image annotated with facial landmarks. The methods can be categorised into 3 groups: pose normalisation [18, 38, 39], pose synthesis [40, 41], and 3D shape and texture feature extraction [2, 7, 42]. Pose normalisation renders all the images (gallery and probe) to a frontal view using 3D models; pose synthesis renders multiple gallery images of different poses for each subject, and a probe image is only matched against that of the most similar pose in the gallery; 3D shape and texture feature extraction methods attempt to match probe and gallery images in a 3D parameter space, providing a pose- and illumination-invariant representation.
Among other options, 3DMM-based face recognition systems [7, 8, 9, 10] have a particular appeal, because the process of 3D face model fitting provides a means of extracting the intrinsic 3D face shape and albedo from an unconstrained input face image. However, for a long time, the widespread use of 3DMMs in face recognition has been inhibited by inefficient 3D face model fitting algorithms. The ESO fitting algorithm presented in Section 4 offers a new means that greatly enhances the applicability of the 3D face model fitting approach.

Most existing 3DMM methods [7, 8, 9, 10] assume that accurate facial landmarks are known. To the best of our knowledge, only one previous work [43] proposes the use of automatically detected landmarks. In [43], the automatic landmark detection and 3DMM fitting are combined by a data-driven Markov chain Monte Carlo method. This method is robust to automatically detected landmarks, but is rather slow. In contrast, we use an efficient cascaded regression technique [25] to detect landmarks automatically; these are then fed into a fully automatic face recognition system.
The conventional pipeline of a 3DMM face recognition system, shown as Scheme 1 in Fig. 11, involves the use of the generative 3D shape and texture parameters (α and β), which are isolated from the input image appearance by suppressing the pose and illumination nuisance parameters. As in previous works [7, 9, 10], α and β are concatenated into a single vector to serve as a holistic descriptor.

The drawback of holistic features is their inability to capture local facial properties, e.g. a scar, which may be very discriminative between people. To overcome this problem, we propose to extract local features as an alternative. Specifically, with the assistance of ESO fitting, we can render a pose- and illumination-normalised face image from an unconstrained input face, as shown in Scheme 2 in Fig. 11. The pose normalisation is achieved by setting ρ = ρ_0 to transform the input face to a canonical frontal view, and the illumination-normalised input image a_in^I is obtained using Eq. (29). Local features, such as those of the Local Phase Quantisation (LPQ) [44] descriptor used in this work, can then be extracted from this rendered image.
Figure 11: Face recognition pipeline. Scheme 1: model fitting of the input image with the 3DMM yields the holistic feature [α, β], which is matched against the gallery. Scheme 2: the fitted model is used to normalise the input image, local LPQ features are extracted, and these are matched against the gallery.
We evaluate the merit of the ESO fitting approach in the context of face recognition on the PIE [45] and Multi-PIE [46] databases, which both have large pose and illumination variations. We set the hyperparameters {λ_1, l_1, λ_2, l_2} of ESO to {0.5, 3, 0.7, 4}, as discussed in Section 5.1.
5.2.1. PIE Database
PIE is a benchmark database that can be used to compare different 3DMM fitting methods.

Protocol. To compare all the methods fairly, the standard experimental protocol is used by our system. In particular, the recognition performance is measured using a subset of PIE comprising 3 poses (frontal, side and profile) and 24 illuminations. In order to conform to the protocol, in this experiment the fitting is initialised by manual landmarks. The gallery set contains frontal face images under neutral illumination, and the remaining images are probes. The holistic features α, β are used to represent a face.

Results. Face recognition performance in the presence of combined pose and illumination variations is reported in Table 2. ESO performs substantially better than [8], and marginally better than [7, 9, 10]. Note that MFF [7], whose performance is very close to that of ESO, has more than 10 hyperparameters, causing difficulties for optimal parameter selection. In contrast, ESO has only 4 hyperparameters.

Table 2: Face recognition rate (%) for different poses, averaged over all the illuminations, on PIE

method       | frontal | side | profile | average
LiST [8]     | 97      | 91   | 60      | 82.6
Zhang [9]    | 96.5    | 94.6 | 78.7    | 89.9
Aldrian [10] | 99.5    | 95.1 | 70.4    | 88.3
MFF [7]      | 98.9    | 96.1 | 75.7    | 90.2
ESO          | 100     | 97.4 | 73.9    | 90.4
Runtime. The optimisation time was measured on a computer with an Intel Core2 Duo E8400 CPU and 4GB of RAM. The results obtained with our implementation of the SimOpt method (MFF [7]) and the results reported for the SeqOpt method [10] are compared with those obtained with ESO. MFF took 23.1 seconds to fit one image, while ESO took only 2.1 seconds on average per fitting. The authors of [10] did not report their runtime, but they also determined the albedo estimation to be the dominant step, with the same complexity of O(n r_t²). Note, however, that [10] uses not only one group of global α and β but also four additional local groups to represent a face, while we use only the global parameters. Therefore r_t in our approach is one fifth of that of [10], giving a 25-fold speed advantage.
5.2.2. Multi-PIE Database
To compare with other state-of-the-art methods, evaluations are also conducted on a larger database, Multi-PIE, containing more than 750,000 images of 337 people. In addition, our face recognition systems initialised by manually and by automatically detected landmarks are compared. We used a cascaded-regression-based automatic landmark detection method [25].

Protocol. There are two settings, Setting-I and Setting-II, widely used in previous work [11, 12, 36, 38]. Setting-I is used for face recognition in the presence of combined pose and illumination variations, Setting-II for that with only pose variations.
Table 3: Face recognition rate (%) for different poses, averaged over all the illuminations, on Multi-PIE (Setting-I)

Method                  | Annotation | Feature  | -45° | -30° | -15° | +15° | +30° | +45° | Mean | 0°
Li [31]                 | Manual     | Gabor    | 63.5 | 69.3 | 79.7 | 75.6 | 71.6 | 54.6 | 69.1 | N/A
Deep Learning: RL [11]  | Automatic  |          | 66.1 | 78.9 | 91.4 | 90.0 | 82.5 | 62.0 | 78.5 | 94.3
Deep Learning: FIP [11] | Automatic  |          | 63.6 | 77.5 | 90.5 | 89.8 | 80.0 | 59.5 | 76.8 | 94.3
Deep Learning: MVP [12] | Automatic  |          | 75.2 | 83.4 | 93.3 | 92.2 | 83.9 | 70.6 | 83.1 | 95.7
ESO                     | Automatic  | Holistic | 73.8 | 87.5 | 95.0 | 95.1 | 90.0 | 76.2 | 86.3 | 98.7
ESO                     | Automatic  | Local    | 79.6 | 91.6 | 98.2 | 97.9 | 92.6 | 81.3 | 90.2 | 99.4
ESO                     | Manual     | Holistic | 80.8 | 88.9 | 96.7 | 97.6 | 93.3 | 81.1 | 89.7 | 99.1
ESO                     | Manual     | Local    | 81.1 | 93.3 | 97.7 | 98.0 | 93.3 | 82.4 | 91.0 | 99.6
Table 4: Face recognition rate (%) for different poses under neutral illumination on Multi-PIE (Setting-II)

Method           | Annotation | -45° | -30° | -15° | +15° | +30° | +45° | Mean
2D: PLS [32]     | Manual     | 51.1 | 76.9 | 88.3 | 88.3 | 78.5 | 56.5 | 73.3
2D: CCA [47]     | Manual     | 53.3 | 74.2 | 90.0 | 90.0 | 85.5 | 48.2 | 73.5
2D: GMA [48]     | Manual     | 75.0 | 74.5 | 82.7 | 92.6 | 87.5 | 65.2 | 79.6
2D: DAE [49]     | Automatic  | 69.9 | 81.2 | 91.0 | 91.9 | 86.5 | 74.3 | 82.5
2D: SPAE [36]    | Automatic  | 84.9 | 92.6 | 96.3 | 95.7 | 94.3 | 84.4 | 91.4
3D: Asthana [38] | Automatic  | 74.1 | 91.0 | 95.7 | 95.7 | 89.5 | 74.8 | 86.8
3D: MDF [50]     | Automatic  | 78.7 | 94.0 | 99.0 | 98.7 | 92.2 | 81.8 | 90.7
3D: ESO+LPQ      | Automatic  | 91.7 | 95.3 | 96.3 | 96.7 | 95.3 | 90.3 | 94.4
In common with [11, 12], Setting-I uses a subset of session 01 consisting of 249 subjects with 7 poses and 20 illumination variations. The images of the first 100 subjects constitute the training set; the remaining 149 subjects form the test set. Within the test set, the frontal images under neutral illumination serve as the gallery and the remaining images are probes. Following [36, 38], Setting-II uses the images of all 4 sessions (01-04) under 7 poses and only neutral illumination. The images of the first 200 subjects are used for training and those of the remaining 137 subjects for testing. Within the test set, the frontal images from session 01 serve as the gallery, and the others are probes.
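The two splits are simple enough to state in code. The following is a hypothetical sketch: the image identifiers and the neutral-illumination index are placeholders, and only the partitioning logic mirrors the protocol described above.

    POSES = ["-45", "-30", "-15", "0", "+15", "+30", "+45"]  # 7 poses
    ILLUMS = list(range(20))                                  # 20 illumination conditions
    NEUTRAL = 0   # placeholder index for the neutral illumination

    def setting_one(subjects):
        """Setting-I: session 01; first 100 subjects train, remaining 149 test."""
        assert len(subjects) == 249
        train, test = subjects[:100], subjects[100:]
        gallery = [(s, "0", NEUTRAL) for s in test]           # frontal, neutral light
        probes = [(s, p, i) for s in test
                  for p in POSES for i in ILLUMS
                  if (p, i) != ("0", NEUTRAL)]
        return train, gallery, probes

    def setting_two(subjects, sessions=("01", "02", "03", "04")):
        """Setting-II: all four sessions, neutral illumination only;
        first 200 subjects train, remaining 137 test."""
        assert len(subjects) == 337
        train, test = subjects[:200], subjects[200:]
        gallery = [(s, "01", "0") for s in test]              # session 01, frontal
        probes = [(s, sess, p) for s in test
                  for sess in sessions for p in POSES
                  if (sess, p) != ("01", "0")]
        return train, gallery, probes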
ESO vs Deep Learning (Setting-I) In recent years, deep learning methods have achieved
considerable success in a range of vision applications. In particular, deep learning
works well for pose- and illumination-invariant face recognition [11, 12]. To our
knowledge, these methods have reported the best face recognition rates so far on Multi-PIE over combined pose and illumination variations. These systems learn three pose- and illumination-invariant feature types using convolutional neural networks (CNNs): FIP (face identity-preserving) features, RL (features reconstructed from FIP), and MVP (multi-view perceptron) features. Table 3 compares ESO with these deep learning methods and
the baseline method [31]. Not surprisingly, deep learning methods work better than
[31] because of their powerful feature learning capability. However, ESO with auto-
matic annotation, using either holistic or local features, outperforms these three deep
learning solutions as shown in Table 3. We conclude that the superior performance
of ESO results from the fact that the fitting process of ESO can explicitly model the
pose. In contrast, the deep learning methods try to learn view/pose-invariant features across different poses. This learning objective is highly non-linear, so these methods tend to get trapped in local minima. ESO, by comparison, solves a sequence of convex problems and avoids this pitfall.
Automatic vs Manual Annotation (Setting-I) Table 3 also compares the performance
of ESO with fully automatic annotation against that based on manual annotation. This
table shows that the mean face recognition rates of the fully automatic system are close to those relying on manual annotation: 88.0% vs 91.2% for holistic features, and 91.5% vs 92.2% for local features (means computed over all seven poses, including the frontal view). This indicates that ESO is reasonably robust to the errors introduced by automatically detected landmarks. The superiority of local features, which capture more facial details than holistic features, is also evident from the results.
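The quoted means can be reproduced from the ESO rows of Table 3: they average over all seven poses, i.e. they include the frontal (0) column that the table's own Mean column excludes. Recomputing from the rounded table entries matches the quoted figures to within 0.1%:

    # ESO recognition rates from Table 3, seven poses including the frontal view.
    rows = {
        "automatic, holistic": [73.8, 87.5, 95.0, 95.1, 90.0, 76.2, 98.7],
        "automatic, local":    [79.6, 91.6, 98.2, 97.9, 92.6, 81.3, 99.4],
        "manual, holistic":    [80.8, 88.9, 96.7, 97.6, 93.3, 81.1, 99.1],
        "manual, local":       [81.1, 93.3, 97.7, 98.0, 93.3, 82.4, 99.6],
    }
    for name, vals in rows.items():
        print(f"{name}: {sum(vals) / len(vals):.1f}%")
    # automatic, holistic: 88.0%   automatic, local: 91.5%
    # manual, holistic:    91.1%   manual, local:    92.2%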
ESO for Pose-robust Face Recognition (Setting-II) Table 4 compares ESO with the
state-of-the-art methods for pose-robust face recognition. The methods can be clas-
sified into 2D and 3D approaches as discussed in Section 5.2. In the 2D category,
PLS [32] and CCA [47] are unsupervised methods, and consequently they deliver in-
ferior performance. GMA [48] benefits from its use of some additional supervisory
information. DAE [49] and SPAE [36] are auto-encoder-based methods, which have
superior capability to learn the non-linear relationships between images of different
poses. SPAE set the state of the art in performance, even compared with the 3D methods [38] and [50]. However, our ESO outperforms SPAE (94.4% vs 91.4%) because of its accurate shape and albedo reconstruction capability.
6. Conclusions
We proposed a new optimisation method — Efficient Stepwise Optimisation (ESO)
— for fitting a 3D morphable face model to a 2D face image. In order to improve the
optimisation efficiency, the method decouples the geometric and photometric optimi-
sations and uses least squares sequentially to optimise the reconstructed shape, light di-
rection, light strength and albedo parameters in separate steps. It also incorporates a perspective camera model, which is increasingly important given the growing interest in near-camera applications.
The computational efficiency of ESO is achieved thanks to the proposed lineari-
sation of the model fitting steps, leading to closed-form solutions. ESO improves the
optimisation efficiency by an order of magnitude in comparison with [7]. Moreover, it
overcomes the weaknesses of earlier SeqOpt methods:
- The shape reconstruction of ESO supports a perspective camera.
- ESO linearises the Phong model.
- It models specularity.
- Occluding contour landmarks (Section 4.3) are used for a more robust fitting.
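As a minimal sketch of what "closed-form" means for the individual steps: after linearisation, each stage reduces to a regularised linear least-squares problem solvable by the normal equations. The generic system matrix A and scalar regulariser below are stand-ins for the step-specific quantities (such as the 3DMM prior term), not the paper's exact formulation.

    import numpy as np

    def ridge_closed_form(A: np.ndarray, b: np.ndarray, lam: float) -> np.ndarray:
        """Solve min_x ||A x - b||^2 + lam * ||x||^2 in closed form via the
        normal equations (A^T A + lam * I) x = A^T b."""
        k = A.shape[1]
        return np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ b)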
The experimental results demonstrate that the face reconstruction achievable by ESO
is an improvement on that obtained from the state-of-the-art methods.
The ESO fitting algorithm can extract both holistic features and local features. A
face recognition system that incorporates ESO to facilitate pose and illumination in-
variance was constructed and evaluated on the PIE and Multi-PIE benchmark datasets
with very promising results.
7. Acknowledgments
Support for this work is gratefully acknowledged from: EPSRC/DSTL project
EP/K014307/1 “Signal processing in a networked battlespace”; EPSRC Programme
Grant EP/L000539 “S3A: Future spatial audio for immersive listener experiences at
home”; and the European Commission FP7 project 284989 “BEAT”.
References
[1] R. Ramamoorthi, P. Hanrahan, A signal-processing framework for inverse ren-
dering, in: Proceedings of the 28th annual conference on Computer graphics and
interactive techniques, ACM, 2001, pp. 117–128.
[2] V. Blanz, T. Vetter, Face recognition based on fitting a 3D morphable model,
Pattern Analysis and Machine Intelligence, IEEE Transactions on 25 (9) (2003)
1063–1074.
[3] X. Bai, E. R. Hancock, R. C. Wilson, A generative model for graph matching and
embedding, Computer Vision and Image Understanding 113 (7) (2009) 777–789.
[4] X. Bai, E. R. Hancock, R. C. Wilson, Graph characteristics from the heat kernel
trace, Pattern Recognition 42 (11) (2009) 2589–2606.
[5] V. Blanz, T. Vetter, A morphable model for the synthesis of 3D faces, in: Pro-
ceedings of the 26th annual conference on Computer graphics and interactive
techniques, 1999, pp. 187–194.
[6] S. Romdhani, T. Vetter, Efficient, robust and accurate fitting of a 3D morphable
model, in: ICCV, IEEE, 2003, pp. 59–66.
[7] S. Romdhani, T. Vetter, Estimating 3D shape and texture using pixel intensity,
edges, specular highlights, texture constraints and a prior, in: CVPR, IEEE, 2005,
pp. 986–993.
[8] S. Romdhani, V. Blanz, T. Vetter, Face identification by fitting a 3D morphable
model using linear shape and texture error functions, in: ECCV, Springer, 2002,
pp. 3–19.
[9] L. Zhang, D. Samaras, Face recognition from a single training image under arbi-
trary unknown lighting using spherical harmonics, Pattern Analysis and Machine
Intelligence, IEEE Transactions on 28 (3) (2006) 351–363.
[10] O. Aldrian, W. A. Smith, Inverse rendering of faces with a 3D morphable model,
Pattern Analysis and Machine Intelligence, IEEE Transactions on 35 (5) (2013)
1080–1093.
[11] Z. Zhu, P. Luo, X. Wang, X. Tang, Deep learning identity preserving face space,
in: Proc. ICCV, Vol. 1, 2013, p. 2.
[12] Z. Zhu, P. Luo, X. Wang, X. Tang, Deep learning multi-view representation for
face recognition, arXiv preprint arXiv:1406.6947.
[13] C. P. Huynh, A. Robles-Kelly, E. R. Hancock, Shape and refractive index from
single-view spectro-polarimetric images, International Journal of Computer Vi-
sion 101 (1) (2013) 64–94.
[14] R. Ramamoorthi, P. Hanrahan, A signal-processing framework for reflection,
ACM Transactions on Graphics (TOG) 23 (4) (2004) 1004–1042.
[15] G. Hu, P. Mortazavian, J. Kittler, W. Christmas, A facial symmetry prior for im-
proved illumination fitting of 3D morphable model, in: International Conference
on Biometrics, IEEE, 2013, pp. 1–6.
[16] G. Hu, C. Chan, J. Kittler, W. Christmas, Resolution-aware 3D morphable model,
in: British Machine Vision Conference, 2012, pp. 1–10.
[17] Y. Wang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, D. Samaras, Face re-lighting from
a single image under harsh lighting conditions, in: Computer Vision and Pattern
Recognition, IEEE Conference on, IEEE, 2007.
[18] X. Zhu, Z. Lei, J. Yan, D. Yi, S. Z. Li, High-fidelity pose and expression normal-
ization for face recognition in the wild, in: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2015, pp. 787–796.
[19] X. Zhu, J. Yan, D. Yi, Z. Lei, S. Z. Li, Discriminative 3D morphable model
fitting, in: Automatic Face and Gesture Recognition (FG), IEEE International
Conference on, 2015.
[20] A. Patel, W. A. Smith, 3D morphable face models revisited, in: CVPR, IEEE,
2009, pp. 1327–1334.
[21] P. Huber, Z. Feng, W. Christmas, J. Kittler, M. Rätsch, Fitting 3D morphable models using local features, in: IEEE International Conference on Image Processing (ICIP), 2015. doi:10.1109/ICIP.2015.7350989.
[22] W. A. P. Smith, E. R. Hancock, Estimating facial reflectance properties using
shape-from-shading, International Journal of Computer Vision 86 (2–3) (2010)
152–170.
[23] J. T. Rodriguez, 3D face modelling for 2D+3D face recognition, Ph.D. thesis, University of Surrey, Guildford, UK (2007). URL http://www.ee.surrey.ac.uk/CVSSP/Publications/papers/tena-2007.pdf
[24] A. Bas, W. A. P. Smith, T. Bolkart, S. Wuhrer, Fitting a 3D morphable
model to edges: A comparison between hard and soft correspondences, CoRR
abs/1602.01125.
URL http://arxiv.org/abs/1602.01125
[25] Z.-H. Feng, P. Huber, J. Kittler, W. Christmas, X.-J. Wu, Random cascaded-
regression copse for robust facial landmark detection, Signal Processing Letters,
IEEE 22 (1) (2015) 76–80.
[26] I. Kemelmacher-Shlizerman, R. Basri, 3D face reconstruction from a single image
using a single reference face shape, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 33 (2) (2011) 394–405.
[27] S. R. Marschner, S. H. Westin, E. P. Lafortune, K. E. Torrance, D. P. Greenberg, Image-based BRDF measurement including human skin, in: Rendering Techniques '99, Springer, 1999, pp. 131–144.
[28] R. Hartley, A. Zisserman, Multiple view geometry in computer vision, Cambridge
university press, 2003.
[29] C. Ding, D. Tao, A comprehensive survey on pose-invariant face recognition,
arXiv preprint arXiv:1502.04383.
[30] S. R. Arashloo, J. Kittler, Energy normalization for pose-invariant face recogni-
tion based on MRF model image matching, Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on 33 (6) (2011) 1274–1280.
[31] A. Li, S. Shan, W. Gao, Coupled bias-variance tradeoff for cross-pose face recog-
nition, Image Processing, IEEE Transactions on 21 (1) (2012) 305–315.
[32] A. Sharma, D. W. Jacobs, Bypassing synthesis: PLS for face recognition with
pose, low-resolution and sketch, in: CVPR, IEEE, 2011, pp. 593–600.
[33] A. B. Ashraf, S. Lucey, T. Chen, Learning patch correspondences for improved
viewpoint invariant face recognition, in: CVPR, IEEE, 2008, pp. 1–8.
[34] T.-K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of im-
age set classes using canonical correlations, Pattern Analysis and Machine Intel-
ligence, IEEE Transactions on 29 (6) (2007) 1005–1018.
[35] S. J. Prince, J. Warrell, J. H. Elder, F. M. Felisberti, Tied factor analysis for face
recognition across large pose differences, Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on 30 (6) (2008) 970–984.
[36] M. Kan, S. Shan, H. Chang, X. Chen, Stacked progressive auto-encoders (SPAE)
for face recognition across poses, in: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2013, pp. 1883–1890.
[37] C. Ding, C. Xu, D. Tao, Multi-task pose-invariant face recognition, Image Pro-
cessing, IEEE Transactions on 24 (3) (2015) 980–993.
[38] A. Asthana, T. K. Marks, M. J. Jones, K. H. Tieu, M. Rohith, Fully automatic
pose-invariant face recognition via 3D pose normalization, in: Computer Vision,
International Conference on, IEEE, 2011, pp. 937–944.
[39] R. Abiantun, U. Prabhu, M. Savvides, Sparse feature extraction for pose-tolerant
face recognition, Pattern Analysis and Machine Intelligence, IEEE Transactions
on 36 (10) (2014) 2061–2073.
[40] K. Niinuma, H. Han, A. K. Jain, Automatic multi-view face recognition via 3D
model based pose regularization, in: Biometrics: Theory, Applications and Sys-
tems (BTAS), IEEE Conference on, 2013.
[41] U. Prabhu, J. Heo, M. Savvides, Unconstrained pose-invariant face recognition
using 3D generic elastic models, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 33 (10) (2011) 1952–1961.
[42] D. Yi, Z. Lei, S. Z. Li, Towards pose robust face recognition, in: CVPR, IEEE,
2013, pp. 3539–3545.
[43] S. Schönborn, A. Forster, B. Egger, T. Vetter, A Monte Carlo strategy to integrate detection and model-based face analysis, in: Pattern Recognition, 2013.
[44] T. Ahonen, E. Rahtu, V. Ojansivu, J. Heikkila, Recognition of blurred faces using
local phase quantization, in: ICPR, IEEE, 2008.
[45] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression (PIE)
database, in: Automatic Face and Gesture Recognition, IEEE International Con-
ference on, 2002, pp. 46–51.
[46] R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi-PIE, Image and Vision
Computing 28 (5) (2010) 807–813.
[47] H. Hotelling, Relations between two sets of variates, Biometrika (1936) 321–377.
[48] A. Sharma, A. Kumar, H. Daume, D. W. Jacobs, Generalized multiview analysis:
A discriminative latent space, in: CVPR, IEEE, 2012, pp. 2160–2167.
[49] Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning 2 (1) (2009) 1–127.
[50] S. Li, X. Liu, X. Chai, H. Zhang, S. Lao, S. Shan, Morphable displacement
field based image matching for face recognition across pose, in: ECCV, Springer,
2012.