Efficient 3D Morphable Face Model Fitting

Guosheng Hu$^{a,1}$, Fei Yan$^{a}$, Josef Kittler$^{a}$, William Christmas$^{a,*}$, Chi Ho Chan$^{a}$, Zhenhua Feng$^{a}$, Patrik Huber$^{a}$

$^{a}$Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK

Abstract

3D face reconstruction of shape and skin texture from a single 2D image can be per-

formed using a 3D Morphable Model (3DMM) in an analysis-by-synthesis approach.

However, performing this reconstruction (ﬁtting) efﬁciently and accurately in a general

imaging scenario is a challenge. Such a scenario would involve a perspective camera to

describe the geometric projection from 3D to 2D, and the Phong model to characterise

illumination. Under these imaging assumptions the reconstruction problem is nonlin-

ear and, consequently, computationally very demanding. In this work, we present an

efﬁcient stepwise 3DMM-to-2D image-ﬁtting procedure, which sequentially optimises

the pose, shape, light direction, light strength and skin texture parameters in separate

steps. By linearising each step of the ﬁtting process we derive closed-form solutions

for the recovery of the respective parameters, leading to efﬁcient ﬁtting. The proposed

optimisation process involves all the pixels of the input image, rather than randomly

selected subsets, which enhances the accuracy of the ﬁtting. It is referred to as Efﬁcient

Stepwise Optimisation (ESO).

The proposed ﬁtting strategy is evaluated using reconstruction error as a perfor-

mance measure. In addition, we demonstrate its merits in the context of a 3D-assisted

2D face recognition system which detects landmarks automatically and extracts both

holistic and local features using a 3DMM. This contrasts with most other methods

which only report results that use manual face landmarking to initialise the ﬁtting.

Our method is tested on the public CMU-PIE and Multi-PIE face databases, as well as one internal database. The experimental results show that the face reconstruction using ESO is significantly faster, and its accuracy is at least as good as that achieved by the existing 3DMM fitting algorithms. A face recognition system integrating ESO to provide a pose and illumination invariant solution compares favourably with other state-of-the-art methods. In particular, it outperforms deep learning methods when tested on the Multi-PIE database.

Keywords: face recognition; face reconstruction; 3D Morphable Model

∗Corresponding author: William Christmas

Email address: w.christmas@surrey.ac.uk (William Christmas)

1Current address: AnyVision, Queen’s Road, Belfast, BT3 9DT, UK

Preprint submitted to Elsevier December 13, 2016

1. Introduction

The intrinsic properties of 3D faces give scope for a representation that is immune

to the kinds of variations in face appearance that are introduced by the imaging pro-

cess such as viewpoint, lighting and occlusion. These invariant facial properties are

potentially useful in a wide variety of applications in computer graphics and vision.

However, recovering the 3D face and scene properties (viewpoint and illumination)

from the appearance conveyed by a single 2D image is very challenging. Speciﬁcally,

as noted in [1], it is impossible to distinguish between texture and illumination effects

unless some assumptions are made to constrain them both. The 3D morphable face

model (3DMM) [2] encapsulates prior knowledge about human faces that can be used

for this purpose, and therefore potentially it is a good tool for 3D face reconstruction.

The 3DMM is a concise statistical model of a 3D face population created from 3D

face data using principal component analysis (PCA). The model separately represents

the face shape and surface texture. PCA removes data correlation and identiﬁes a small

number of latent variables which represent each face instance very efﬁciently. See also

[3, 4] for other developments of generative models applicable to 3D graph structures.

The reconstruction of a 3D face is conducted by a 3DMM ﬁtting process, which

estimates the 3D shape, texture, pose and illumination from a single 2D input image.

Considerable research has been carried out to achieve efﬁcient and accurate ﬁtting. The

methods advocated in the literature can be classiﬁed into two categories:

1. Simultaneous Optimisation (SimOpt): All the parameters (shape, texture, pose

and illumination) are optimised simultaneously [2, 5, 6, 7];


2. Sequential Optimisation (SeqOpt): These parameters are optimised sequentially

[8, 9, 10].

The SimOpt algorithms use gradient-based methods which are often slow and can easily

get trapped in local minima. On the other hand, SeqOpt methods can have closed-form

solutions for some or all of the parameters, and accordingly have the potential to be

much more efﬁcient computationally. However, the existing SeqOpt methods [8, 9,

10] make strong assumptions about the imaging camera and consequently they do not

generalise well to faces distorted by perspective effects.

In this work we introduce a novel SeqOpt ﬁtting framework, referred to as Efﬁ-

cient Stepwise Optimisation (ESO), which overcomes these problems and is an order

of magnitude faster than existing methods. This framework groups the parameters to be

optimised into 5 categories: camera model (pose), shape, light direction, light strength

and albedo (skin texture). The ﬁtting is decomposed into two separate processes: geo-

metric ﬁtting and photometric ﬁtting.

Geometric Model Fitting Existing fast pose and shape ﬁtting methods assume an

afﬁne camera model [9, 10] which is adequate provided the object’s depth is small

compared with its distance from the camera. A rule of thumb is that the object should

be at least 10 times further from the camera than its depth. However, this is often not

the case, e.g. when using a laptop camera for video conferencing, and for authentica-

tion, or when a camera is mounted on a vehicle windscreen for driver authentication,

or monitoring the driver for tiredness. In such applications it is essential to relax the

assumption and adopt a more general perspective camera model, which renders the re-

construction problem nonlinear, and consequently computationally expensive. In order

to address this conundrum, we propose a novel approach to the shape ﬁtting problem

by formulating the ﬁtting cost function in 3D, rather than the usual 2D. This formu-

lation admits linearisation of the optimisation task which signiﬁcantly enhances the

computational efﬁciency.

As in [7] the occluding face contour is used to improve the shape ﬁtting accuracy.

In order to mitigate the additional processing costs, we propose to use landmarks on

the occluding face contour (see Section 4.3) instead of face contour edges to reﬁne


the camera and shape estimates. To this end, we develop a method that automatically

establishes the correspondence between the occluding contour landmarks of the input

image and vertices of the 3D face model.

Photometric Model Fitting Both Phong [2, 7] and Spherical Harmonics models

[9, 10] have been used in the past to estimate the illumination parameters. However,

in order to model adequately both diffuse light and specularity, the latter method re-

quired many bases (81 in total) of spherical harmonics. Compared with the Spherical

Harmonics approach, the Phong model has a more compact representation (elaborated

further in Section 3), and is therefore used here. We found that it was adequate to

model the illumination as a combination of a single distant point source plus uniform

ambient light, thus keeping the number of coefﬁcients to be found to a minimum.

To accelerate the light model ﬁtting and skin texture parameter estimation, we

present a novel approach to optimise both Phong model parameters and albedo. Specif-

ically, we propose techniques (Section 4) to linearise the Phong model and the subse-

quent albedo estimation. Because the objective functions of these linear methods are

convex, globally optimal solutions are guaranteed.

The measures to accelerate the illumination and texture reconstruction proposed in

the paper speed up the ﬁtting process by a factor of ten or more. We evaluate the ﬁtting

accuracy and show that it is superior to that achieved by the current alternatives. The

impressive performance is the consequence of the ESO ﬁtting process involving all the

model vertices simultaneously, rather than just a randomly sampled subset.

We also evaluate the ESO ﬁtting algorithm as part of a fully automatic pose- and

illumination-invariant face recognition system. Its performance is at least comparable to that of the best-performing state-of-the-art competitors, including solutions based on deep learning methods [11, 12], when evaluated on the Multi-PIE dataset.

The paper is organised as follows. In the next section we present a brief sum-

mary of the related work. The ﬁtting problem is formulated in Section 3 to establish

a methodological baseline. Our fast ﬁtting algorithm ESO is developed in Section 4.

The proposed algorithm is evaluated in Section 5 in terms of its reconstruction perfor-

mance, as well as when embedded in a face recognition system. Section 6 draws the

paper to a conclusion.

Figure 1: 3D morphable model fitting pipeline including the inputs and outputs of a fitting, and the applications of the fitting outputs. (Diagram: input image → 3DMM fitting → shape, texture, camera and lighting parameters → applications: face synthesis, face recognition, ...)

2. Related Work on 3D Morphable Model Fitting

The 3DMM, ﬁrst proposed by Blanz and Vetter [2], has successfully been applied

to computer vision and graphics. A 3DMM consists of separate face shape and texture

models learned from a set of 3D exemplar faces. These faces are represented as a

graph, in which the node attributes are the 3D position and RGB colour at that node,

and the edges indicate geometric connectivity. Related work (e.g. [13]) considers more

complex image data. By virtue of a ﬁtting process, a 3DMM can recover the face (shape

and texture) and scene properties (illumination and camera model) from a single 2D

image in a process schematically summarised in Fig. 1. The recovered parameters can

be used for different applications, such as realistic face synthesis and face recognition.

However, it is well known that achieving accurate ﬁtting is particularly difﬁcult for

two reasons. Firstly, when recovering the 3D shape from a single 2D image, the 3D

shape is generally projected to 2D in order to compare it with the 2D image features.

As a result, the depth information of the 3D shape is lost. Secondly, separating the

contributions of albedo and illumination is an ill-posed problem [14, 15]. Motivated

by the above challenges, considerable research [2, 6, 7, 8, 9, 10] has been carried out

to improve the ﬁtting performance in terms of efﬁciency and accuracy. As mentioned


in Section 1, these methods can be classiﬁed into two groups: SimOpt and SeqOpt.

In the SimOpt category, the ﬁtting algorithm in [2, 5] minimises the sum of squared

differences over all colour channels and all pixels between the input and reconstructed

images. A Stochastic Newton Optimisation (SNO) technique is used to optimise a non-

convex cost function. Performance of this technique is poor in terms of both efﬁciency

and accuracy because it is an iterative gradient-based optimiser which may end up in a

local minimum.

The efficiency of optimisation is the driver behind the work of [6], where an Inverse Compositional Image Alignment algorithm is introduced for fitting. The fitting is

conducted by modifying the cost function so that its Jacobian matrix can be regarded

as constant. In this way, the Jacobian matrix is precomputed, which greatly reduces the

computational costs. However, this method cannot model illumination effects.

The Multi-Feature Fitting (MFF) strategy [7] is known to achieve the best ﬁtting

performance of the SimOpt methods. It makes use of many complementary features

from an input image, such as edges and specularity highlights, to constrain the ﬁt-

ting process. The advantages of using these features are demonstrated in [7]. Further

improvements to the MFF framework have been achieved by enhancing the ﬁtting ro-

bustness to varying image resolution with a resolution-aware 3DMM [16], and by de-

ploying a facial symmetry prior in [15] to ameliorate the quality of illumination ﬁtting.

However, all the MFF-based ﬁtting methods are rather slow.

In the SeqOpt category, the ‘linear shape and texture ﬁtting algorithm’ (LiST) [8]

was proposed for improving fitting efficiency. The idea is to update the shape and texture parameters by solving linear systems. However, the illumination and camera parameters are optimised by the gradient-based Levenberg-Marquardt method, and the corresponding cost function exhibits many local minima. The experiments reported in [8] show that the fitting is of similar

accuracy to the SNO algorithm, but much faster, in spite of the shape being recovered

using a relatively slow optical ﬂow algorithm. The drawback of this approach is the

prerequisite that the light direction is known before ﬁtting, which is not realistic for

automatic analysis.

Another SeqOpt method [9] decomposes the ﬁtting process into geometric and pho-

tometric parts. The camera model is optimised by the Levenberg-Marquardt method,


and shape parameters are estimated by a closed-form solution. In contrast to the pre-

vious work, this method recovers 3D shape using only facial feature landmarks, and

models illumination using spherical harmonics. Illumination and albedo are deter-

mined using least squares optimisation. The work in [17] improved the ﬁtting perfor-

mance of [9] by segmenting the 3D face model into different subregions. In addition,

a Markov Random Field is used in [17] to model the spatial coherence of the face tex-

ture. However, the illumination models of [9, 17] cannot deal with specular reﬂectance

because only 9 low-frequency spherical harmonics bases are used. In addition, [9, 17]

use an afﬁne camera model, which cannot model perspective effects.

In common with [9], two more recent SeqOpt methods [10, 18] also sequentially ﬁt

geometric and photometric models using least squares. Both methods use only facial

landmarks to estimate pose and facial shape via an afﬁne camera. They also share the

use of spherical harmonics models to estimate illumination. The authors in [18] use 9

spherical harmonics bases, which cannot model specularity. The method in [10] can

model specularity by projecting the RGB values of the model and input images to a

specularity-free space for diffuse light and texture estimation. The specularity is then

estimated in the original RGB colour space. In common with [9], both methods [10,

18] use an afﬁne camera, which cannot model perspective effects. In addition, the

colour of lighting in [10] is assumed to be known, which limits the applicability of the

method.

Some works only focus on shape ﬁtting [19, 20, 21]. In [20], around 100 facial

landmarks are used to recover the facial shape employing the Levenberg-Marquardt

algorithm as the optimiser. In contrast to [20], [19] uses local image features rather

than facial landmarks as these features are more robust.

3. 3D Morphable face model and face image rendering

A 3D face model is a representation of the surface of a class of objects — the objects

in our case being faces. Each face consists of a set of vertices whose positions in 3D

space collectively express the face shape. The vertices also each have an RGB pixel

value, that collectively express the face skin texture (albedo). The model describes both


the shape of a face and its appearance, determined by the surface texture. It is defined by a mesh of vertices $V = \{v_i \mid i = 1, \ldots, n\}$, sampling the face surface at a predefined set of facial points of semantic identity (eye corners, nose tip, etc). The $i$th vertex $v_i$ of a face is located at $\mathbf{w}_i = (x_i, y_i, z_i)^T$, and has the RGB colour values $(r_i, g_i, b_i)$. Hence a 3D face is represented in terms of shape and texture as a pair of vectors:

\[ \mathbf{s} = (x_1, y_1, z_1, \ldots, x_n, y_n, z_n)^T, \qquad \mathbf{t} = (r_1, g_1, b_1, \ldots, r_n, g_n, b_n)^T \tag{1} \]

Even for twins, faces are unique. Each individual will have a particular face shape

and skin characteristics. The variability of face shape and skin texture in a population of

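individuals is captured by a statistical 3D face model, defined by a probability distribution in the $\mathbf{s}$ and $\mathbf{t}$ space. Since many vertex shape and texture measurements are highly correlated, a population of 3D faces inevitably lies in a subspace of the $\mathbf{s}$ and $\mathbf{t}$ space, typically determined by the Principal Component Analysis (PCA) or other sparse representation methods. Focusing on the former, let $\mathbf{S} \in \mathbb{R}^{3n \times r_s}$ and $\mathbf{T} \in \mathbb{R}^{3n \times r_t}$ denote the PCA bases of the $r_s$ shape and $r_t$ texture variations respectively. A face instance $(\mathbf{s}, \mathbf{t})$ can concisely be expressed as

\[ \mathbf{s} = \mathbf{s}_0 + \mathbf{S}\boldsymbol{\alpha}, \qquad \mathbf{t} = \mathbf{t}_0 + \mathbf{T}\boldsymbol{\beta} \tag{2} \]

where $\mathbf{s}_0$ and $\mathbf{t}_0$ are the mean face shape and texture respectively. The parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are assumed to have normal distributions:

\[ p(\boldsymbol{\alpha}) \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\sigma}_s) \tag{3} \]

\[ p(\boldsymbol{\beta}) \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\sigma}_t) \tag{4} \]

where $\boldsymbol{\sigma}_s$ and $\boldsymbol{\sigma}_t$ are the vectors of variances of the latent model shape and texture parameters.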

As the bases and the mean vectors are fixed for a particular population, all statistical information is conveyed by the parameter vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, the dimensionality of which is considerably lower than that of the original face space. Each pair of model parameter vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ defines an instance of a 3D face. This provides a very concise representation of the face, which is convenient from the point of view of face synthesis. By changing the shape and texture parameters we can generate different faces. A


transition from one pair of parameter vectors to another pair will morph one face to

another in a smooth manner. This morphing capability of the statistical 3D face model

has given it its name as the 3D Morphable Face Model (3DMM).
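As a rough illustration of Eq. (2) and of the morphing described above, the following NumPy sketch generates face instances from parameter vectors and blends two of them. The mean vectors and bases are random stand-ins rather than a trained 3DMM, and the names `face_instance` and `morph` are hypothetical helpers introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): number of vertices, shape components, texture components
n, r_s, r_t = 5, 3, 2
s0 = rng.normal(size=3 * n)          # stand-in for the mean shape s_0
t0 = rng.uniform(size=3 * n)         # stand-in for the mean texture t_0
S = rng.normal(size=(3 * n, r_s))    # stand-in for the shape basis S
T = rng.normal(size=(3 * n, r_t))    # stand-in for the texture basis T

def face_instance(alpha, beta):
    """Eq. (2): s = s0 + S alpha, t = t0 + T beta."""
    return s0 + S @ alpha, t0 + T @ beta

def morph(params_a, params_b, w):
    """Blend two (alpha, beta) pairs linearly; w=0 gives face a, w=1 gives face b."""
    (alpha_a, beta_a), (alpha_b, beta_b) = params_a, params_b
    alpha = (1.0 - w) * alpha_a + w * alpha_b
    beta = (1.0 - w) * beta_a + w * beta_b
    return face_instance(alpha, beta)

face_a = (rng.normal(size=r_s), rng.normal(size=r_t))
face_b = (rng.normal(size=r_s), rng.normal(size=r_t))
s_mid, t_mid = morph(face_a, face_b, 0.5)   # face halfway between a and b
```

Because the model is linear in the parameters, interpolating the parameter vectors is equivalent to interpolating the resulting shape and texture vectors, which is why the transition is smooth.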

A 3DMM can be used for many purposes in face analysis. For instance, the model

can be ﬁtted to an input 2D face image, and the estimated shape and texture parameters

of the reconstructed 3D face used for face recognition in the face model parameter

space. Alternatively, given the pose of an input 2D face image, we can ﬁt the 3DMM

to a gallery face image and use the ﬁtted 3D face to synthesise a new pose of the

subject. For instance this could be a pose identical to the given pose in order to perform

matching. Another possibility is to ﬁt the 3DMM to an input 2D image of arbitrary

pose, and then frontalise the query image with the help of the estimated 3D face shape.

The operative phrase in all these use cases is 3DMM ﬁtting. It is the crucial prerequisite

enabling all these applications.

The underlying principle of ﬁtting a 3DMM to an input 2D face image is to identify

the shape and texture parameters of the face model that would enable the synthesis of a

2D model image deemed indistinguishable from the input query image. However, the

rendering process is quite complex. It involves not only the selection of model shape

and texture parameters to produce a 3D face instance, but also its transformation to a

new pose, and the subsequent projection of the 3D face to 2D under a particular scene

illumination. The assessment of similarity of the synthesised image to the input image

also traditionally involves sampling the input image at 2D points corresponding to the

projection of the 3D mesh vertices onto the 2D input image.

Let us now describe the rendering process in more detail; the underlying physics is captured in Fig. 2. We shall render a 2D view of a face instance showing a particular pose by rotating and translating the camera with respect to the face model coordinate system by a $(3 \times 3)$ rotation matrix $\mathbf{R}$ and a 3D translation vector $\boldsymbol{\tau}$ respectively. In the camera coordinate system the transformed shape $\mathbf{s}'$ can be expressed in matrix form as

\[ \mathbf{s}' = \mathbf{U}\mathbf{s} + \check{\boldsymbol{\tau}} \tag{5} \]

where $\mathbf{U}$ is the block diagonal matrix with $n$ copies of the rotation matrix $\mathbf{R}$ on its diagonal, and $\check{\boldsymbol{\tau}}$ is a vector composed of $n$ copies of the displacement $\boldsymbol{\tau}$.
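The stacked form of Eq. (5) can be checked against the more familiar per-vertex form $\mathbf{R}\mathbf{w}_i + \boldsymbol{\tau}$. The sketch below does this with NumPy on a few random vertices; the rotation about the y axis, the translation and the helper name `rotation_y` are arbitrary choices made for illustration.

```python
import numpy as np

def rotation_y(theta):
    """Rotation matrix about the y axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

n = 4
rng = np.random.default_rng(1)
W = rng.normal(size=(n, 3))      # vertex positions w_i as rows
s = W.reshape(-1)                # stacked shape (x1, y1, z1, ..., xn, yn, zn)
R = rotation_y(np.deg2rad(30.0))
tau = np.array([0.1, -0.2, 5.0])

# Eq. (5) taken literally: block-diagonal U and stacked translation
U = np.kron(np.eye(n), R)        # n copies of R on the diagonal
tau_check = np.tile(tau, n)      # n copies of tau
s_prime = U @ s + tau_check

# Equivalent per-vertex form, avoiding the (3n x 3n) matrix
W_prime = W @ R.T + tau
```

In practice the per-vertex form is preferable, since `U` is sparse and grows quadratically with the number of vertices.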

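Figure 2: Physics of rendering: At image pixel position $\tilde{\mathbf{w}}_i$, the RGB value output by the camera (small green blob) measures the reflection, on the face surface point $\mathbf{w}_i$, of the light source illuminating the face surface in direction $\mathbf{d}$. The surface normal at $\mathbf{w}_i$ is $\mathbf{n}_i$. In the head (purple blob) coordinate system, the camera is located at position $\boldsymbol{\tau}$ and the viewing direction of the vertex $\mathbf{w}_i$ is $\mathbf{v}_i$. The specular light is reflected from $\mathbf{w}_i$ centred on direction $\mathbf{r}_i$, where $\mathbf{r}_i$ is such that the surface normal $\mathbf{n}_i$ bisects $\mathbf{r}_i$ and the direction $\mathbf{d}$ of the incident light. (Diagram labels: camera coordinates, image plane, head mesh, light source, and the symbols $\mathbf{v}_i$, $\mathbf{n}_i$, $\mathbf{r}_i$, $\mathbf{w}_i$, $\tilde{\mathbf{w}}_i$, $\mathbf{o}$, $\boldsymbol{\tau}$, $\mathbf{R}$, $\mathbf{d}$.)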

The pixel values at locations corresponding to $\mathbf{s}$ will depend on the albedo $\mathbf{t}$ and the scene illumination. Different illumination models can be adopted for lighting the face (e.g. [22]), but we will adopt the Phong model which can represent complex reflectance phenomena, including specular reflectance, using a small number of parameters. The appearance of the generated face at each point, represented by a $3n$-dimensional vector $\mathbf{a}_M$, is the product of the interplay of the face surface normal, skin albedo $\mathbf{t}$ and the incident light, assumed to be the sum of contributions from ambient, diffuse and specular lights:

\[ \mathbf{a}_M = \underbrace{\check{\mathbf{l}}_a * \mathbf{t}}_{\text{ambient}} + \underbrace{(\check{\mathbf{l}}_d * \mathbf{t}) * (\mathbf{N}_3 \mathbf{d})}_{\text{diffuse}} + \underbrace{\check{\mathbf{l}}_d * \mathbf{e}}_{\text{specular}} \tag{6} \]

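where the ambient light $\check{\mathbf{l}}_a$ is a $3n$-dimensional vector, composed of $n$ copies of the ambient light intensity $\mathbf{l}_a = (l_a^r, l_a^g, l_a^b)^T$:

\[ \check{\mathbf{l}}_a = (l_a^r, l_a^g, l_a^b, \ldots, l_a^r, l_a^g, l_a^b)^T \in \mathbb{R}^{3n} \tag{7} \]

Similarly, $\check{\mathbf{l}}_d$ is a $3n$-dimensional vector, composed of $n$ copies of the directed light strength $\mathbf{l}_d = (l_d^r, l_d^g, l_d^b)^T$:

\[ \check{\mathbf{l}}_d = (l_d^r, l_d^g, l_d^b, \ldots, l_d^r, l_d^g, l_d^b)^T \in \mathbb{R}^{3n} \tag{8} \]

The symbol $*$ denotes an element-wise multiplication operation. The matrix $\mathbf{N}_3$ is a stack of 3 copies of the matrix $\mathbf{N}$:

\[ \mathbf{N}_3 = \left[\mathbf{N}^T, \mathbf{N}^T, \mathbf{N}^T\right]^T \tag{9} \]

where $\mathbf{N} \in \mathbb{R}^{n \times 3}$ is a stack of the surface normals $\mathbf{n}_i \in \mathbb{R}^3$ at vertices $i = 1, \ldots, n$ (see Fig. 2). Unit vector $\mathbf{d} \in \mathbb{R}^3$ is the light direction. Vector $\mathbf{e} \in \mathbb{R}^{3n}$ is a stack of the specular reflectance $\mathbf{e}_i$ of each vertex $i = 1, \ldots, n$ (the components of which could be different for the three channels), i.e.,

\[ \mathbf{e}_i = \mathbf{k}_s \langle \mathbf{v}_i, \mathbf{r}_i \rangle^{\gamma} \tag{10} \]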

where $\mathbf{v}_i$ is the viewing direction of the $i$th vertex. Since in the face model coordinate system the camera is at position $\boldsymbol{\tau}$, the viewing direction can be expressed as $\mathbf{v}_i = \frac{\boldsymbol{\tau} - \mathbf{w}_i}{|\boldsymbol{\tau} - \mathbf{w}_i|}$ where $\mathbf{w}_i = (x_i, y_i, z_i)^T$ is the vector of the 3D coordinates of that vertex. Unit vector $\mathbf{r}_i$ denotes the reflection direction of the light source at the $i$th vertex: $\mathbf{r}_i = 2\langle \mathbf{n}_i, \mathbf{d} \rangle \mathbf{n}_i - \mathbf{d}$. The two constants $\mathbf{k}_s$ and $\gamma$ denote the specular reflectance and shininess respectively [23]. Note that $\mathbf{k}_s$ and $\gamma$ are determined by the facial skin reflectance property, which is similar for different people. They are assumed constant over the whole facial region. For the sake of simplicity, in our work, we also assume that $\mathbf{k}_s$ and $\gamma$ are the same for the three colour channels. Thus each entry in $\mathbf{e}_i$ is repeated three times. In this work, the components of $\mathbf{k}_s$ are each set to 0.175, and $\gamma$ is set to 30, following [23].

In all cases it is important to check all vertices for visibility so that parts of the face

turned away from camera do not contribute to the rendered pixel values.
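The Phong rendering of Eqs. (6) to (10) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `phong_appearance` is hypothetical, only visible vertices are assumed to be passed in, negative dot products are clamped to zero (a common rendering convention that the text does not spell out), and $\langle \mathbf{n}_i, \mathbf{d} \rangle$ is repeated three times per vertex so that it lines up with the per-vertex RGB ordering of $\mathbf{t}$ in Eq. (1).

```python
import numpy as np

def phong_appearance(t, normals, verts, d, tau, l_a, l_d, k_s=0.175, gamma=30.0):
    """Per-vertex RGB appearance a_M following Eq. (6).

    t: (3n,) albedo in (r1, g1, b1, ...) order; normals, verts: (n, 3) arrays;
    d: unit light direction; tau: camera position in model coordinates;
    l_a, l_d: RGB ambient and directed light strengths, each of shape (3,).
    """
    n = normals.shape[0]
    la_check = np.tile(l_a, n)                        # Eq. (7)
    ld_check = np.tile(l_d, n)                        # Eq. (8)
    n_dot_d = normals @ d                             # <n_i, d> for every vertex
    # Assumption: surfaces facing away from the light receive no diffuse light
    diffuse = np.repeat(np.clip(n_dot_d, 0.0, None), 3)
    v = tau - verts
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # viewing directions v_i
    r = 2.0 * n_dot_d[:, None] * normals - d          # reflection directions r_i
    # Assumption: clamp <v_i, r_i> at zero before raising it to the power gamma
    v_dot_r = np.clip(np.sum(v * r, axis=1), 0.0, None)
    e = np.repeat(k_s * v_dot_r ** gamma, 3)          # Eq. (10), same in all channels
    return la_check * t + (ld_check * t) * diffuse + ld_check * e

# One vertex at the origin, facing both the light and the camera head-on
a_M = phong_appearance(
    t=np.array([0.5, 0.5, 0.5]),
    normals=np.array([[0.0, 0.0, 1.0]]),
    verts=np.array([[0.0, 0.0, 0.0]]),
    d=np.array([0.0, 0.0, 1.0]),
    tau=np.array([0.0, 0.0, 10.0]),
    l_a=np.array([0.2, 0.2, 0.2]),
    l_d=np.array([1.0, 1.0, 1.0]),
)
```

For this head-on configuration each channel is the sum of 0.1 (ambient), 0.5 (diffuse) and 0.175 (specular).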


3.1. Fitting 3DMM to a 2D image

Let us consider an input face image $I$ acquired by a camera with focal length $f$ and the coordinates of its optical axis in the image plane $\mathbf{o} = (o_x, o_y)^T$. It is assumed that the image is landmarked. Let $\rho$ denote the set of extrinsic and intrinsic camera parameters

\[ \rho = \{\mathbf{R}, \boldsymbol{\tau}, f, \mathbf{o}\} \tag{11} \]

and let us lump together all illumination parameters as

\[ \mu = \{\mathbf{l}_a, \mathbf{l}_d, \mathbf{d}, \mathbf{k}_s, \gamma\} \tag{12} \]

Fitting the 3D face involves finding pose, shape, texture and illumination parameters $\rho, \boldsymbol{\alpha}, \boldsymbol{\beta}, \mu$ so that the image reconstructed from the model:

\[ \mathbf{a}_M = (r_1^M, g_1^M, b_1^M, \ldots, r_n^M, g_n^M, b_n^M)^T \tag{13} \]

is as close as possible to the input image.

Typically, the quality of the reconstruction is measured in 2D. This involves projecting the mesh of 3D vertices into 2D. For each vertex the camera projects the triplet of its 3D coordinates into a 2D pixel location in the camera image plane as

\[ \tilde{\mathbf{s}} = \mathbf{P}\mathbf{s}' + \check{\mathbf{o}} \tag{14} \]

where $\mathbf{P} \in \mathbb{R}^{2n \times 3n}$ is a block diagonal matrix constructed from the projection matrices $\mathbf{P}_i$, $i = 1, \ldots, n$:

\[ \mathbf{P}_i = \begin{pmatrix} \dfrac{f}{z'_i} & 0 & 0 \\ 0 & -\dfrac{f}{z'_i} & 0 \end{pmatrix} \tag{15} \]

Note that each $\mathbf{P}_i$ is a function of the corresponding depth coordinate $z'_i$, as well as the camera focal length $f$. The negative term in $\mathbf{P}_i$ results from an assumption of a clockwise image coordinate system. The $2n$-dimensional vector $\check{\mathbf{o}}$ is a stack of $n$ copies of the 2D position $\mathbf{o}$ of the optical axis in the image plane. For faces at a distance exceeding $10\times$ the radius of a subject's head we can use an affine projection with $\mathbf{P}_i = \mathbf{P}_j, \forall i, \forall j$ instead, without incurring any significant approximation errors.
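To make the difference between the perspective projection of Eqs. (14) and (15) and its affine approximation concrete, the sketch below projects three points of a small object at two distances. The function names, the focal length, the principal point and the point coordinates are all assumptions chosen for illustration, and depths are taken to be positive along the camera z axis.

```python
import numpy as np

def project_perspective(W_cam, f, o):
    """Eqs. (14)-(15): camera-frame vertices (n, 3) to pixel positions (n, 2)."""
    x, y, z = W_cam[:, 0], W_cam[:, 1], W_cam[:, 2]
    return np.stack([f * x / z, -f * y / z], axis=1) + o

def project_affine(W_cam, f, o, z_ref):
    """Affine approximation: one shared depth z_ref, so P_i = P_j for all i, j."""
    x, y = W_cam[:, 0], W_cam[:, 1]
    return np.stack([f * x / z_ref, -f * y / z_ref], axis=1) + o

# Three points of an object whose depth extent is +/- 0.1 (arbitrary units)
head = np.array([[0.1, 0.1, -0.1],
                 [0.1, 0.1, 0.1],
                 [-0.1, 0.05, 0.0]])
f, o = 500.0, np.array([320.0, 240.0])

def affine_error(distance):
    """Largest pixel discrepancy between the two projections at a given distance."""
    W_cam = head + np.array([0.0, 0.0, distance])
    diff = project_perspective(W_cam, f, o) - project_affine(W_cam, f, o, distance)
    return np.abs(diff).max()

err_near = affine_error(0.5)   # object close to the camera
err_far = affine_error(5.0)    # object ten times further away
```

The discrepancy shrinks rapidly with distance, which is the behaviour the rule of thumb above relies on.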


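The 2D mesh of projected vertices, $\tilde{\mathbf{s}}$, samples the input 2D image. Stacking the RGB values of the corresponding samples into a vector $\mathbf{a}_I$

\[ \mathbf{a}_I = (r_1^I, g_1^I, b_1^I, \ldots, r_n^I, g_n^I, b_n^I)^T \tag{16} \]

we can then compare the synthesised and input images by measuring the error $\|\mathbf{a}_I - \mathbf{a}_M\|$. Noting that the samples picked from the input image by the mesh are a function of $\rho$ and $\boldsymbol{\alpha}$, the objective of the fitting process is to solve the optimisation problem

\[ \min_{\boldsymbol{\alpha}, \boldsymbol{\beta}, \rho, \mu} \|\mathbf{a}_I(\rho, \boldsymbol{\alpha}) - \mathbf{a}_M(\rho, \boldsymbol{\alpha}, \boldsymbol{\beta}, \mu)\|^2 + \lambda_1 \|\boldsymbol{\alpha} \div \boldsymbol{\sigma}_s\|^2 + \lambda_2 \|\boldsymbol{\beta} \div \boldsymbol{\sigma}_t\|^2 \tag{17} \]

where the last two terms induce regularisation of the estimated parameters. The symbol $\div$ denotes element-wise division.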
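For reference, the value of the objective in Eq. (17) for given appearance vectors and parameters can be evaluated as follows. The sketch only evaluates the cost; it does not perform the sampling of $\mathbf{a}_I$ or the rendering of $\mathbf{a}_M$, and the function name `fitting_cost` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def fitting_cost(a_I, a_M, alpha, beta, sigma_s, sigma_t, lam1, lam2):
    """Eq. (17): squared appearance error plus the two regularisation terms."""
    data = np.sum((a_I - a_M) ** 2)
    reg_shape = lam1 * np.sum((alpha / sigma_s) ** 2)     # element-wise division
    reg_texture = lam2 * np.sum((beta / sigma_t) ** 2)
    return data + reg_shape + reg_texture

cost = fitting_cost(
    a_I=np.array([1.0, 2.0, 3.0]),
    a_M=np.array([1.0, 2.0, 1.0]),
    alpha=np.array([2.0]), beta=np.array([3.0]),
    sigma_s=np.array([2.0]), sigma_t=np.array([1.0]),
    lam1=0.5, lam2=0.1,
)
```

Here the data term contributes 4, the shape regulariser 0.5 and the texture regulariser 0.9.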

The problem formulated in (17) is very challenging because of its nonlinearity and

its ill-posed nature [14, 15]. The conventional approach to optimisation is to apply the Stochastic Newton Optimisation (SNO) algorithm, which samples random subsets of mesh vertices to achieve computational feasibility [2, 5]. These challenges motivated the

developments reviewed in Section 2 but any speed-up therein is only achieved at the

expense of restricted applicability.

In the following section we propose a novel method of ﬁtting 3DMM that is more

than an order of magnitude faster than the existing algorithms, without imposing any

restrictions on the camera model and lighting. The computational efﬁciency is achieved

by breaking the ﬁtting problem up to create a sequence of optimisation tasks, most

of which are linearised to render closed form solutions. The proposed strategy has

the additional major beneﬁt for illumination and albedo estimation of simultaneously

involving all the model vertices in optimisation. This avoids local optima and leads to

more accurate ﬁtting.

4. Efﬁcient Stepwise Optimisation (ESO)

This section describes our ESO framework. ESO is a SeqOpt method which groups

all the parameters into 5 categories: pose with camera parameters, shape, light direc-

tion, light strength and albedo. The parameters in each group are optimised under the

assumption that those in all the other groups are known, or have no impact on the optimisation process. The parameter grouping strategy aids the linearisation of the 3D face model fitting process, but further group-specific linearisation measures are adopted, as required. These are detailed in the respective sections.

Figure 3: The ESO fitting process topology. Each of the two main phases of the fitting process, geometric and photometric, is iterated until convergence is achieved. (Diagram: ESO; geometric refinement: camera, shape, contour landmarks; photometric refinement: light direction, light strength, albedo.)

The proposed method divides the ﬁtting process into two phases, namely geometric

and photometric optimisation as shown in Fig. 3. The geometric phase aligns an input

image to a 3DMM, and the photometric phase recovers its reﬂectance. Each phase

consists of three stages that are iterated in turn a few times to reﬁne the solution. A key

contribution of our approach is the proposed linearisation of all but one stage of the op-

timisation process that leads to closed-form solutions, and consequently computational

efﬁciency. In Sections 4.1 to 4.6, each step of ESO is explained in more detail.

4.1. Camera Parameter Estimation

The ﬁrst step uses the input image facial landmarks to estimate the subject’s pose

and the camera parameters that roughly align the input image to the model. Let us

consider an identifiable point $\tilde{\mathbf{w}}_i^I = (\tilde{x}_i^I, \tilde{y}_i^I)^T$ on the face of the input image, which semantically corresponds to the $i$th vertex of the 3D face model with coordinates $\mathbf{w}_i = (x_i, y_i, z_i)^T$. Image landmarks typically include the locations of the eye and mouth corners, tip of the nose etc.$^{2}$ In this work, a maximum of 28 landmarks are used as shown in Fig. 4. However, some of these landmarks are not visible for non-frontal poses due to self-occlusion. In those cases, only the visible landmarks are used. Also, in the first iteration, the contour landmarks (7 shown in Fig. 4) are not available.

Figure 4: Visualisation of the facial landmarks used throughout this paper

For the alignment, we need to find the rigid transformation $\mathbf{R}, \boldsymbol{\tau}$ that moves the coordinates of the point $\mathbf{w}_i = (x_i, y_i, z_i)^T$ to its new position $\mathbf{w}'_i = (x'_i, y'_i, z'_i)^T$ so that after $\mathbf{w}_i$ is projected to 2D via the mapping $W(\rho)$, its 2D coordinates $\tilde{\mathbf{w}}_i = (\tilde{x}_i, \tilde{y}_i)^T$ are as close as possible to the image point $\tilde{\mathbf{w}}_i^I = (\tilde{x}_i^I, \tilde{y}_i^I)^T$.

Let $\mathcal{L}$ denote the subset of vertices corresponding to the facial landmark points in the input image. Then the pose and camera parameters $\rho = \{\mathbf{R}, \boldsymbol{\tau}, f, \mathbf{o}\}$ can be estimated by minimising the distance between the input landmarks and those reconstructed from the model:

\[ \min_{\rho} \sum_{\forall i \in \mathcal{L}} \|\tilde{\mathbf{w}}_i^I - \tilde{\mathbf{w}}_i\|^2 \tag{18} \]

This is the only cost function which is not linearised. It is minimised by the Levenberg-Marquardt algorithm [7], but because of the small number of parameters involved, the convergence is fast. Note that $\tilde{\mathbf{w}}_i$ depends on both the pose and camera parameters, as well as the shape model $\mathbf{s}$. The latter is kept constant in this step, and in the first iteration, $\mathbf{s}$ is set to $\mathbf{s}_0$. In subsequent iterations, $\mathbf{s}$ is replaced by the shape update obtained in the previous iteration by the second stage of the fitting process described in the next subsection. The estimated pose and camera parameters feed into the shape estimation stage described in Section 4.2. The contour landmarks described in Section 4.3 constrain the pose and camera parameters, and shape estimation.

$^{2}$Landmark detection itself is outside the scope of this paper.
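The landmark-based pose estimation of Eq. (18) can be sketched with a small hand-written Levenberg-Marquardt loop. This is an illustrative sketch under several assumptions rather than the procedure used in the paper: only the rotation (as an axis-angle vector) and translation are optimised, with $f$ and $\mathbf{o}$ treated as known; the Jacobian is approximated by forward differences; the damping uses the identity matrix; and the landmarks are synthetic. All function names are hypothetical.

```python
import numpy as np

def rodrigues(omega):
    """Rotation matrix from an axis-angle vector (Rodrigues' formula)."""
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)
    k = omega / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def project(p, W, f, o):
    """Rigidly transform model landmarks W (n, 3) by p = (omega, tau), then project."""
    W_cam = W @ rodrigues(p[:3]).T + p[3:]
    return np.stack([f * W_cam[:, 0] / W_cam[:, 2],
                     -f * W_cam[:, 1] / W_cam[:, 2]], axis=1) + o

def residuals(p, W, w_img, f, o):
    """Stacked 2D differences whose squared norm is the cost of Eq. (18)."""
    return (project(p, W, f, o) - w_img).ravel()

def fit_pose(p0, W, w_img, f, o, iters=50, damping=1e-3, eps=1e-6):
    """Levenberg-Marquardt with a forward-difference Jacobian."""
    p = np.asarray(p0, dtype=float)
    r = residuals(p, W, w_img, f, o)
    for _ in range(iters):
        J = np.stack([(residuals(p + eps * e, W, w_img, f, o) - r) / eps
                      for e in np.eye(p.size)], axis=1)
        step = np.linalg.solve(J.T @ J + damping * np.eye(p.size), -J.T @ r)
        r_new = residuals(p + step, W, w_img, f, o)
        if r_new @ r_new < r @ r:      # accept the step and trust the model more
            p, r, damping = p + step, r_new, damping * 0.5
        else:                          # reject the step and increase the damping
            damping *= 10.0
    return p

rng = np.random.default_rng(2)
W = rng.uniform(-1.0, 1.0, size=(10, 3))             # synthetic model landmarks
f, o = 800.0, np.array([320.0, 240.0])               # assumed known intrinsics
p_true = np.array([0.10, -0.20, 0.05, 0.2, -0.1, 8.0])
w_img = project(p_true, W, f, o)                     # synthetic image landmarks
p0 = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 10.0])       # frontal pose as the initial guess
p_hat = fit_pose(p0, W, w_img, f, o)
```

With only six unknowns each iteration solves a $6 \times 6$ linear system, which is consistent with the observation above that this nonlinear step remains cheap.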

4.2. Shape Parameters Estimation

Once the pose and camera parameters are recovered, the shape parameters $\boldsymbol{\alpha}$ can be estimated. We linearise this problem by making use of the current estimate of the model vertex coordinates $z_i, \forall i$ to define the projection matrices $\mathbf{P}_i$. In addition, in contrast to prior art, we define the cost function in 3D space, as:

\[ \min_{\boldsymbol{\alpha}} \sum_{\forall i \in \mathcal{L}} \|\mathbf{w}_i^I - \mathbf{w}_i\|^2 \tag{19} \]

where the image landmarks $\tilde{\mathbf{w}}_i^I, \forall i \in \mathcal{L}$, are back-projected to $\mathbf{w}_i^I = (x_i^I, y_i^I, z_i^I)^T$ via $\mathbf{w}_i^I = W^{-1}(\tilde{\mathbf{w}}_i^I, \rho)$. The main motivation for working in 3D is to reduce computational complexity further.

Since wiis a vertex of the shape model s, it is a function of α. The cost function is

deﬁned in 3D as:

min

αkˆ

sI−ˆ

s(α)k2+λ1αTσ−1

sα(20)

where: ˆ

sIand ˆ

sare the stacked vertex positions wI

iand wi,∀i∈ L, respectively;

ˆ

s=ˆ

s0+ˆ

Sα;ˆ

s0and ˆ

Sare constructed by choosing those elements from s0and S

(deﬁned in Eq. (2)) corresponding to the landmark indices L;λ1is a free weighting

parameter; αTσ−1

sαis a regularisation term based on Eq. (3).

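The closed-form solution for $\alpha$ is:

$$\alpha = \left( \hat{S}^T \hat{S} + \lambda_1 \sigma_s^{-1} \right)^{-1} \hat{S}^T \left( \hat{s}^I - \hat{s}_0 \right) \tag{21}$$

where $\sigma_s$ is defined in Section 3.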
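The closed-form shape update of Eq. (21) is a regularised least-squares solve and can be sketched in a few lines. This is only an illustrative sketch: the function name, the argument names, and the treatment of $\sigma_s$ as a vector of per-component scales (so that $\sigma_s^{-1}$ becomes a diagonal matrix) are assumptions made here, since $\sigma_s$ is defined in Section 3 rather than in this section.

```python
import numpy as np

def estimate_shape_params(S_hat, s0_hat, sI_hat, sigma_s, lam1=0.5):
    """Regularised least squares for alpha, in the spirit of Eq. (21).

    S_hat   : (3k, r) rows of the shape basis at the k landmark vertices
    s0_hat  : (3k,)   mean-shape coordinates at those vertices
    sI_hat  : (3k,)   stacked back-projected image landmarks
    sigma_s : (r,)    per-component prior scale (ASSUMED diagonal here)
    lam1    : regularisation weight lambda_1
    """
    # lambda_1 * sigma_s^{-1}, with sigma_s^{-1} taken as a diagonal matrix
    reg = lam1 * np.diag(1.0 / np.asarray(sigma_s, dtype=float))
    lhs = S_hat.T @ S_hat + reg
    rhs = S_hat.T @ (sI_hat - s0_hat)
    # Solve the normal equations rather than forming an explicit inverse
    return np.linalg.solve(lhs, rhs)
```

With `lam1 = 0` the function reduces to ordinary least squares; increasing `lam1` shrinks the estimate towards the mean shape, which is the under-fitting behaviour discussed for large $\lambda_1$ in Section 5.1.1.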

Finally, we explain how to implement the inverse projection $W^{-1}$. Note that $\tilde{w}^I_i$ cannot be back-projected to $w^I_i$ in the face model coordinate system unless $z^I_i$, the depth along the $z$ axis of $w^I_i$, is known. Here, in the first iteration, $z^I_i$ is approximated by the model vertex $z_i$, which is constructed from the mean shape $s_0$. As the face shape is updated in subsequent iterations, the latest estimate $s$ is used in place of $s_0$.

Figure 5: Contour landmarks detection: (a) model image, (b) contour edge, (c) image landmarks, (d) correspondence. Yellow and red dots represent the contour landmarks of the input and model reconstructed images, respectively. Algorithm 1 bridges (c) and (d).

4.3. Contour Landmark Constraints

One impediment to accurate 3D shape reconstruction from a non-frontal 2D face

image stems from the lack of constraints on the projection of the occluding face con-

tour. In [7], the authors deﬁne the contour edges as the occluding boundary between

the face and non-face area, and use them to constrain the ﬁtting. The contour edges

of the 2D face image synthesised from a 3D model-based reconstruction of a 2D input

image are shown in Fig. 5b. They are formed by linking the 3D model vertices lying

on the occluding boundary of the projected 3D face mesh, determined by the vertex

visibility check. A recent review of techniques developed to ﬁt 3DMM to edges can

be found in [24]. To reduce the computational cost of working with contour edges, we

only use contour landmarks lying on the contour boundary. Here a contour landmark is

deﬁned as the point of intersection of the occluding boundary of a face and a horizontal

line in the face coordinate system passing through the corresponding symmetric landmark, as

shown in Fig. 6. Such contour landmarks are labelled in the input image automatically

by a cascaded-regression-based algorithm for automatic facial landmark detection [25],

which has been trained to detect contour landmarks deﬁned in this way.

The vertices $\mathcal{L}^c$ that form the contour landmarks along the occluding boundary of the fitted 3DMM are found using Algorithm 1. They are the vertices (red dots in Fig. 5d) closest to the contour landmarks of the input image (yellow dots of Fig. 5c). Once this correspondence is established, these contour landmark pairs are added to the available landmark set $\mathcal{L}$ in Eqs. (18) and (20) to improve the estimation of camera parameters and shape.
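The nearest-vertex matching of Algorithm 1 can be written compactly with array operations. The following is a minimal sketch, not the authors' implementation; the function name and argument layout are hypothetical, and the inputs are assumed to be already-projected 2D coordinates.

```python
import numpy as np

def contour_correspondence(eta, zeta, phi):
    """Match each detected contour landmark to its nearest model contour vertex.

    eta  : (k1, 2) 2D contour landmarks detected in the input image
    zeta : (k2, 2) 2D projections of the model's contour-edge vertices
    phi  : (k2,)   3D vertex index of each row of zeta
    Returns the (k1,) array of 3D vertex indices corresponding to eta.
    """
    eta = np.asarray(eta, dtype=float)
    zeta = np.asarray(zeta, dtype=float)
    # Pairwise squared Euclidean distances, shape (k1, k2)
    d2 = ((eta[:, None, :] - zeta[None, :, :]) ** 2).sum(axis=2)
    # For every landmark, pick the vertex index of the closest contour point
    return np.asarray(phi)[d2.argmin(axis=1)]
```

The double loop of the pseudocode is replaced by a single broadcasted distance matrix, which costs the same $O(k_1 k_2)$ operations.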

Figure 6: Deﬁnition of the contour landmarks. The axis of face symmetry is deﬁned by the chin and the

centre of the nose bridge. The face contour landmarks are the points of intersection of (i) the input image

face occluding contour and (ii) the horizontal lines in the thus-deﬁned face coordinate system, passing the

visible facial contour landmarks.

4.4. Light Direction Estimation

After geometric ﬁtting (Section 4.1-4.3), the 3DMM model is aligned to the input

image, and the reﬂectance parameters can be estimated. In this step, we focus on the

light direction d, and regard all other variables as constant. Recalling Eq. (6), the cost

function can be formulated as:

$$\min_{d} \left\| a^I - \check{l}_a * t - (\check{l}_d * t) * (N_3 d) - \check{l}_d * e \right\|^2 \tag{22}$$

The minimisation of Eq. (22) is a non-linear problem because of the exponential form of $e$ in Eq. (10). To eliminate this nonlinear dependence we precompute the value of $e$ based on the assumptions that: i) $k_s$ and $\gamma$ are constant; ii) the values of $v$ and $r$ are set to those of the previous iteration. In order to make the linearity of the light direction fitting problem more transparent, we avoid the element-wise multiplication in Eq. (22) by reformulating the cost function as:

$$\min_{d} \left\| a^I - \check{l}_a * t - \check{l}_d * e - (A * N_3)\, d \right\|^2 \tag{23}$$

where $A = [\check{l}_d * t, \check{l}_d * t, \check{l}_d * t] \in \mathbb{R}^{3n \times 3}$. By this reformulation, a closed-form solution can be found as: $d = \left( (A * N_3)^T (A * N_3) \right)^{-1} (A * N_3)^T (a^I - \check{l}_a * t - \check{l}_d * e)$. Then $d$ is normalised to a unit vector.

Algorithm 1: Establishing the contour landmark correspondence

Input:
    2D contour landmark coordinates $\eta = \{\eta_1 \ldots \eta_{k_1}\}$ output by [25]
    3DMM rendered contour edge coordinates $\zeta = \{\zeta_1 \ldots \zeta_{k_2}\}$ ($k_2 \gg k_1$) via $W$
    3D vertex indices $\phi = \{\phi_1 \ldots \phi_{k_2}\}$ corresponding to $\zeta$
Output: 3D vertex indices $\mathcal{L}^c$ corresponding to $\eta$

for $i = 1$ to $k_1$ do
    for $j = 1$ to $k_2$ do
        $dist_j = \|\eta_i - \zeta_j\|^2$
    end
    $\mathcal{L}^c_i = \phi_{\arg\min_j \{dist_j\}}$
end
return $\mathcal{L}^c$

For the first iteration, we initialise the values of $t$, $\check{l}_a$ and $\check{l}_d$ as follows. 1) In common with [26, 27], we assume that the face is a Lambertian surface. Consequently, only the diffuse light in Eq. (6) is modelled. 2) The strengths of diffuse light $\check{l}_d$ and albedo $t$ are set to vectors whose entries are all 1 and to $t_0$, respectively. With these assumptions, the cost function in the first iteration becomes:

$$\min_{d} \left\| a^I - (B * N_3)\, d \right\|^2 \tag{24}$$

where $B = [t_0, t_0, t_0] \in \mathbb{R}^{3n \times 3}$. The closed-form solution is: $d = \left( (B * N_3)^T (B * N_3) \right)^{-1} (B * N_3)^T a^I$.

The estimated light direction is fed into the light strength and albedo estimations

detailed in Section 4.5 and Section 4.6.
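The linear light-direction solves of Eqs. (23) and (24) can be sketched as a single least-squares call. This is an illustrative sketch under stated assumptions: the function and argument names are hypothetical, $N_3$ is assumed to hold one surface normal per entry of $a^I$ (i.e. the per-vertex normals repeated for the three colour channels), and all terms that do not depend on $d$ are passed in as a precomputed `offset`.

```python
import numpy as np

def estimate_light_direction(a_img, t, N3, l_d=None, offset=None):
    """Least-squares light direction for the linearised model of Eqs. (23)-(24).

    a_img  : (3n,)   observed colour values
    t      : (3n,)   albedo (t_0 in the first iteration)
    N3     : (3n, 3) surface normals, one row per entry of a_img (ASSUMED layout)
    l_d    : (3n,)   directed light strength; defaults to ones (first iteration)
    offset : (3n,)   terms independent of d (ambient and precomputed specular);
                     defaults to zeros (Lambertian first iteration)
    """
    l_d = np.ones_like(t) if l_d is None else l_d
    offset = np.zeros_like(t) if offset is None else offset
    # Row-wise scaling of the normals: plays the role of A * N3 (or B * N3)
    M = (l_d * t)[:, None] * N3
    d, *_ = np.linalg.lstsq(M, a_img - offset, rcond=None)
    # The estimate is normalised to a unit vector, as in the text
    return d / np.linalg.norm(d)
```

Using `np.linalg.lstsq` instead of forming $(M^T M)^{-1} M^T$ explicitly is a numerical design choice made for this sketch; it yields the same minimiser when $M$ has full column rank.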


4.5. Light Strength Estimation

Having obtained an estimate of d, the ambient and directed light strengths can

be recovered. Because the three colour channels can be processed independently, for

simplicity only the red channel is described. The cost function for the red channel is:

$$\min_{l^r_{ad}} \left\| a^{I,r} - C\, l^r_{ad} \right\|^2 \tag{25}$$

where $a^{I,r}$ is the red channel of $a^I$; $C = [t^r, \; t^r * (N d) + e^r] \in \mathbb{R}^{n \times 2}$, $t^r$ and $e^r$ are the red channels of $t$ and $e$; $l^r_{ad} = (l^r_a, l^r_d)^T$, where $l^r_a$ and $l^r_d$ are the strengths of ambient and directed lights of the red channel respectively. The closed-form solution for $l^r_{ad}$ is:

$$l^r_{ad} = (C^T C)^{-1} C^T a^{I,r} \tag{26}$$

Note that $t$ is set as stated earlier in Section 4.4. The green and blue channels are solved

in the same way.
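The per-channel solve of Eqs. (25) and (26) is a two-parameter least-squares problem. A minimal sketch follows; the function and argument names are hypothetical, and the inputs are assumed to be the single-channel quantities described above.

```python
import numpy as np

def estimate_light_strength(a_ch, t_ch, N, d, e_ch):
    """Ambient and directed light strength for one colour channel, Eq. (26).

    a_ch : (n,)   observed values of the channel
    t_ch : (n,)   albedo of the channel
    N    : (n, 3) per-vertex surface normals
    d    : (3,)   unit light direction estimated in the previous step
    e_ch : (n,)   precomputed specular term of the channel
    Returns the pair (l_a, l_d) as a length-2 array.
    """
    # C = [t, t * (N d) + e], one column per unknown strength
    C = np.column_stack([t_ch, t_ch * (N @ d) + e_ch])
    l_ad, *_ = np.linalg.lstsq(C, a_ch, rcond=None)
    return l_ad
```

Calling the function once per channel with the red, green and blue slices of the inputs reproduces the channel-independent processing described in the text.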

4.6. Albedo Estimation

Once the light direction and strengths are recovered, the albedo can be estimated.

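To avoid over-fitting, we regularise the albedo estimation and generate the cost function:

$$\min_{\beta} \left\| a^I - (t_0 + T\beta) * \check{l}_a - (t_0 + T\beta) * \check{l}_d * (N_3 d) - \check{l}_d * e \right\|^2 + \lambda_2 \beta^T \sigma_t^{-1} \beta \tag{27}$$

where $\lambda_2$ is a free weighting parameter. The closed-form solution is

$$\beta = \left( T^T T + \lambda_2 \sigma_t^{-1} \right)^{-1} T^T \left( a^I_{in} - t_0 \right) \tag{28}$$

where $\sigma_t$ is as defined in Section 3 and $a^I_{in}$, the illumination-normalised image, is given by:

$$a^I_{in} = \left( a^I - \check{l}_d * e \right) \div \left( \check{l}_a + \check{l}_d * (N_3 d) \right) \tag{29}$$

where the symbol $\div$ denotes element-wise division as before.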
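Eqs. (28) and (29) can be combined into one short routine: first divide out the illumination, then solve a regularised least-squares problem for $\beta$. The sketch below is illustrative only; all names are hypothetical, $\sigma_t$ is assumed to be a vector of per-component scales (so that $\sigma_t^{-1}$ is diagonal), and the shading term in the denominator is assumed to be non-zero everywhere.

```python
import numpy as np

def estimate_albedo_params(a_img, T, t0, l_a, l_d, N3, d, e, sigma_t, lam2=0.7):
    """Illumination normalisation (Eq. 29) followed by the beta solve (Eq. 28).

    a_img   : (3n,)   observed colour values
    T, t0   : (3n, r) albedo basis and (3n,) mean albedo
    l_a,l_d : (3n,)   ambient and directed light strengths
    N3      : (3n, 3) normals, one row per entry of a_img (ASSUMED layout)
    d       : (3,)    unit light direction
    e       : (3n,)   precomputed specular term
    sigma_t : (r,)    per-component prior scale (ASSUMED diagonal)
    """
    shading = l_a + l_d * (N3 @ d)          # assumed non-zero everywhere
    a_in = (a_img - l_d * e) / shading      # element-wise division, Eq. (29)
    lhs = T.T @ T + lam2 * np.diag(1.0 / np.asarray(sigma_t, dtype=float))
    beta = np.linalg.solve(lhs, T.T @ (a_in - t0))
    return beta, a_in
```

The returned `a_in` is the illumination-normalised image, which is also the quantity reused later for local feature extraction.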


4.7. Computational complexity

The computational complexity of our method is dominated by the albedo estimation described above. From Eq. (28) we can see that, since $n \gg r_t$, the dominating computations are the two matrix multiplications, both of which have a complexity of $O(n r_t^2)$.

5. Experiments

In this section, a comprehensive evaluation of our methodology is described. First,

face reconstruction performance is evaluated. Then, in face recognition experiments,

we compare our ESO with the existing 3DMM methods and other state-of-the-art meth-

ods. We implemented two effective 3DMM ﬁtting methods [7] and [10], and the free

parameter settings of [7, 10] follow the original papers. The results of all the other

methods are cited from their papers based on the same experimental settings.

5.1. Face Reconstruction

First, we present some qualitative ﬁtting results in Fig. 7. These images are from

the Multi-PIE database. The people in these images have different gender, ethnicity and

facial features such as a beard and/or glasses. All these factors can cause difﬁculties

for ﬁtting. As can be seen in Fig. 7, the input images are well ﬁtted. Note that our

3DMM does not model glasses. Therefore, the glasses of an input image, such as the

3rd person in Fig. 7, can confuse the fitting process. Despite this, ESO reconstructs

this face well, showing its robustness.

In order to quantitatively measure every component of ESO, the 2D input images

and their corresponding ground truths of camera parameters, 3D shape, light direction

and strength, and texture need to be known. To meet all these requirements, we gen-

erated a local database of rendered 2D images with all the 3D ground truth as follows:

(1) We collected and registered 20 3D face scans. The ﬁrst 10 scans are used for model

selection, and the remaining scans are used for performance evaluation. (2) The regis-

tered 3D scans are projected to PCA space, parameterising the ground truth in terms of

coefficients $\alpha$ and $\beta$. (3) Using the registered 3D scans, we rendered 2D images under different poses and illuminations. (4) The 3DMM is fitted to obtain estimates of all these parameters. (5) Reconstruction performance is measured using cosine similarity between the estimated and ground-truth $\alpha$ or $\beta$.

Figure 7: Row 1: input images with different pose and illumination variations. Row 2: ESO-fitted/reconstructed images.
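The reconstruction measure, cosine similarity between an estimated and a ground-truth coefficient vector, can be sketched as follows. The function name is hypothetical; the formula itself is the standard definition of cosine similarity.

```python
import numpy as np

def cosine_similarity(estimate, ground_truth):
    """Cosine of the angle between two coefficient vectors (1 = same direction)."""
    u = np.asarray(estimate, dtype=float)
    v = np.asarray(ground_truth, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Because the measure is invariant to the overall scale of the coefficient vector, it rewards recovering the right direction in PCA space rather than the exact magnitude.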

5.1.1. Effects of Hyperparameters

Before we evaluate the face reconstruction performance, the sensitivity of the fitting process to the hyperparameters of ESO is investigated. The relevant hyperparameters are the regularisation weights $\lambda_1$ in Eq. (20) and $\lambda_2$ in Eq. (27), and the numbers of iterations ($l_1$ and $l_2$) for geometric and photometric refinements (Fig. 3), respectively. All the renderings in Section 5.1.1 are generated by setting both the focal length and the distance between the object and camera to 1800 pixels as suggested in [28].

Impact of the weight $\lambda_1$ on shape reconstruction The weight $\lambda_1$ should be selected carefully because an improper $\lambda_1$ will cause under- or over-fitting during shape reconstruction. As shown in Fig. 8, the reconstruction using a large $\lambda_1$ (= 1000) looks very smooth and the shape details are lost, exhibiting typical characteristics of under-fitting. On the other hand, a small $\lambda_1$ (= 0) causes over-fitting, and the reconstruction in Fig. 8 is excessively stretched. In comparison, the reconstruction with $\lambda_1 = 0.5$ recovers the shape well.

To quantitatively evaluate the impact of $\lambda_1$, 2D renderings under 3 poses (frontal, side and profile), without directed light, are generated. To decouple the impact of $\lambda_1$ and $l_1$ on shape refinement, $l_1$ is set to 1. As shown in Fig. 9a, neither a small ($< 0.4$) nor a large ($> 1$) $\lambda_1$ leads to good reconstruction, which is consistent with Fig. 8. On the other hand, the reconstructions of all 3 poses do not change much with $\lambda_1$ in the region between 0.4 and 0.7. Hence, $\lambda_1$ is set to 0.5, which is the average value of the best $\lambda_1$ over all the test cases, to simplify parameter tuning.

Figure 8: Impact of $\lambda_1$ and $\lambda_2$ on shape and albedo reconstruction. Column 1: input image, Column 2: ground truth of shape and albedo, Columns 3-5: reconstructions with different $\lambda_1$ and $\lambda_2$.

Impact of the number of iterations $l_1$ on shape refinement The same renderings are also used to evaluate the sensitivity to $l_1$. From Fig. 9b, we can see that more than 3 iterations do not greatly improve the reconstruction performance for any pose. Therefore, $l_1$ is fixed at 3 in the remaining experiments.

Impact of the weight $\lambda_2$ on albedo reconstruction We also examine the impact of $\lambda_2$ on albedo reconstruction. Fig. 8 shows some qualitative results. Clearly, the reconstruction with $\lambda_2 = 1000$ loses the facial details because it is under-fitted. On the other hand, the one with $\lambda_2 = 0$ does not separate the illumination and albedo properly, causing over-fitting. In comparison, the one with $\lambda_2 = 0.7$ reconstructs the albedo well.

To quantitatively investigate the impact of $\lambda_2$ on the estimated light direction and strength, renderings with different light directions $d$ and strengths $l_d$ (see footnote 3) are used, as shown in Fig. 9c. All these renderings are under frontal pose and $l_2 = 1$. It is clear that the reconstructed albedo does not change greatly with $\lambda_2$ in the region between 0.2 and 1. To simplify parameter tuning, $\lambda_2$ is fixed to 0.7, which is the average value of the best $\lambda_2$ over all the test cases.

Impact of the number of iterations $l_2$ on albedo refinement To investigate the impact of $l_2$, the same 2D renderings as for the $\lambda_2$ evaluation are used. As shown in Fig. 9d, all the images converge by the 4th iteration. Hence, for simplicity, $l_2$ is fixed to 4 in ESO.

5.1.2. Reconstruction Results

We evaluate shape and albedo reconstructions separately. ESO is compared with

two methods: MFF [7] and [10], which are the best SimOpt and SeqOpt methods,

respectively. We implemented the whole framework of MFF. Regarding [10], we

only implemented the geometric (camera model and shape) part, because insufﬁcient

implementation details of the photometric part were released.

Shape Reconstruction As mentioned in Sections 1 and 2, the affine camera used by

[10] cannot model perspective effects, while the perspective camera used by ESO and

MFF can. Different camera models lead to different shape reconstruction strategies. In

order to ﬁnd out how signiﬁcant this difference is, we change the distance between the

object and camera to generate perspective effects, at the same time keeping the facial

image size constant by adjusting the focal length to match [28]. Note that the shorter

this distance, the larger the perspective distortion. To compare shape reconstruction

performance, 2D renderings under frontal pose obtained for 6 different distances are

generated. We can see from Fig. 10 that the performance of ESO and MFF remains

constant under different perspective distortions. However, the performance of [10]

reduces greatly as the distance between the object and camera decreases. Also, ESO

consistently works better than MFF under all perspective distortions.

Albedo Reconstruction We compare ESO with MFF [7] in Table 1 using images rendered under different light directions and strengths. We see that the albedo reconstruction performance for different light directions is very similar, but it varies greatly for different directed light strengths. This demonstrates that the albedo reconstruction is more sensitive to light strength than to light direction. Also, ESO consistently works better than MFF. The reasons are twofold: 1) MFF uses a gradient-based method that suffers from the non-convexity of the cost function. 2) For computational efficiency, MFF randomly samples only a small number (1000) of polygons to establish the cost function. This is insufficient to capture the information of the whole face, causing under-fitting. Our method, being much faster, makes use of all the polygons. Further computational efficiency discussions can be found in Section 5.2.1.

Footnote 3: The illumination is set to be white here, i.e. $\mathbf{l}_d = (l_d, l_d, l_d)^T$.

Figure 9: Effects of hyperparameters on facial shape and albedo reconstruction. (a) Impact of regularisation weight $\lambda_1$ on shape reconstruction over poses (x-axis: $\lambda_1$ from 0 to 1000; y-axis: cosine similarity of shape; curves: frontal, side, profile). (b) Impact of the number of iterations $l_1$ on shape reconstruction (x-axis: $l_1$ from 1 to 6; y-axis: cosine similarity of shape; curves: frontal, side, profile). (c) Impact of regularisation weight $\lambda_2$ on albedo reconstruction over lightings (x-axis: $\lambda_2$ from 0 to 1000; y-axis: cosine similarity of albedo; curves: left light with $l_d = 0.5$, right light with $l_d = 0.5$, frontal light with $l_d = 0.5$, $0.1$ and $1$). (d) Impact of the number of iterations $l_2$ on albedo reconstruction (x-axis: $l_2$ from 1 to 6; y-axis: cosine similarity of albedo; same curves as in (c)).

Figure 10: Shape reconstruction results measured by cosine similarity (x-axis: distance between the camera and object, in pixels, from 0 to 2000; y-axis: cosine similarity of shape; curves: the affine-camera method, ESO with perspective camera, MFF with perspective camera).

Table 1: Albedo reconstruction results measured by cosine similarity.

light direction   light strength l_d   MFF [7]        ESO
left              0.5                  0.57 ± 0.15    0.61 ± 0.08
right             0.5                  0.57 ± 0.13    0.60 ± 0.08
frontal           0.5                  0.58 ± 0.14    0.62 ± 0.08
frontal           0.1                  0.60 ± 0.13    0.67 ± 0.07
frontal           1.0                  0.49 ± 0.16    0.54 ± 0.08

5.2. Pose and Illumination Invariant Face Recognition

Pose- and illumination-invariant face recognition is a challenging problem addressed

by a variety of approaches [29]. 2D methods address the pose and illumination prob-

lem at either pixel-level or image feature-level. The former aim to create pixel-level

correspondence across different poses [30, 31, 32, 33]. For example, regression-based

methods [31, 32] learn mapping matrices which project images of one particular pose

to another one. The latter project the pixel values into pose- and/or illumination-robust

feature spaces [11, 34, 35, 36, 37]. For example, Canonical Correlation Analysis [34]

projects pixel values to a subspace where the impacts of pose and illumination are

effectively removed. Deep Learning [11, 36] is also based on the same motivation.

3D methods intrinsically model pose variations using the analysis-by-synthesis approach. This means that a 3D face model has to be fitted to the input 2D image annotated with facial landmarks. The methods can be categorised into 3 groups: pose

normalisation [18, 38, 39], pose synthesis [40, 41] and 3D shape and texture feature

extraction [2, 7, 42]. Pose normalisation renders all the images (gallery and probe) to

a frontal view using 3D models; pose synthesis renders multiple gallery images of dif-

ferent poses for each subject. A probe image is only matched against that of the most

similar pose in the gallery. 3D shape and texture feature extraction methods attempt

to match probe and gallery images in a 3D parameter space, providing a pose- and

illumination-invariant representation.

Among other options, 3DMM-based face recognition systems [7, 8, 9, 10] have a

particular appeal because the process of 3D face model ﬁtting provides a means of ex-

tracting the intrinsic 3D face shape and albedo from an unconstrained input face image.

However, for a long time, the widespread use of 3DMM in face recognition has been

inhibited by inefﬁcient 3D face model ﬁtting algorithms. The ESO ﬁtting algorithm

presented in Section 4 offers a new means that greatly enhances the applicability of the

3D face model ﬁtting approach.

Most existing 3DMM methods [7, 8, 9, 10] assume that accurate facial landmarks

are known. To the best of our knowledge, only one previous work [43] proposes the use

of automatically detected landmarks. In [43], the automatic landmark detection and

3DMM ﬁtting are combined by a data-driven Markov chain Monte Carlo method. This

method is robust to automatically detected landmarks but is rather slow. In contrast, we

use an efﬁcient cascaded regression technique [25] to automatically detect landmarks,

which are then fed into a fully automatic face recognition system.

The conventional pipeline of a 3DMM face recognition system, shown as Scheme 1 in Fig. 11, involves the use of the generative 3D shape and texture parameters ($\alpha$ and $\beta$), which are isolated from the input image appearance by suppressing the pose and illumination nuisance parameters. As in previous works [7, 9, 10], $\alpha$ and $\beta$ are concatenated into a single vector to work as a holistic descriptor.

The drawback of holistic features is their inability to capture local facial properties, e.g. a scar, which may be very discriminative between people. To overcome this problem, we propose to extract local features as an alternative. Specifically, with the assistance of ESO fitting, we can render a pose- and illumination-normalised face image from an unconstrained input face as shown in Scheme 2 in Fig. 11. The pose normalisation is achieved by setting $\rho = \rho_0$ to transform the input face to a canonical frontal view. The illumination-normalised input image $a^I_{in}$ is obtained using Eq. (29). Local features, such as those from the Local Phase Quantisation (LPQ) [44] descriptor used in this work, can then be extracted from this rendered image.

Figure 11: Face recognition pipeline. Scheme 1 and 2 use holistic and local features for face recognition respectively. (Blocks shown: 3DMM model fitting; holistic feature $[\alpha, \beta]$; normalisation; LPQ local feature; gallery; matching.)

We evaluate the merit of the ESO ﬁtting approach in the context of face recognition

on the PIE [45] and Multi-PIE [46] databases, which both have large pose and illumination variations. We set the hyperparameters $\{\lambda_1, l_1, \lambda_2, l_2\}$ of ESO to $\{0.5, 3, 0.7, 4\}$ as discussed in Section 5.1.

5.2.1. PIE Database

PIE is a benchmark database that can be used to compare different 3DMM ﬁtting

methods.

Protocol To compare all the methods fairly, the standard experimental protocol is used

by our system. In particular, the recognition performance is measured using a subset

of PIE including 3 poses (frontal, side and proﬁle) and 24 illuminations. In order to

conform to the protocol, in this experiment the ﬁtting is initialised by manual land-

marks. The gallery set contains frontal face images under neutral illumination, and the

remaining images are probes. The holistic features $\alpha$, $\beta$ are used to represent a face.

Results Face recognition performance in the presence of combined pose and illumination variations is reported in Table 2. On average, ESO performs substantially better than [8], and marginally better than [7, 9, 10]. Note that MFF [7], whose performance is very close to that of ESO, has more than 10 hyperparameters, causing difficulties for optimal parameter selection. In contrast, ESO has only 4 hyperparameters.

Table 2: Face recognition rate (%) on different poses, averaged over all the illuminations, on PIE

Method         frontal   side   profile   average
LiST [8]       97        91     60        82.6
Zhang [9]      96.5      94.6   78.7      89.9
Aldrian [10]   99.5      95.1   70.4      88.3
MFF [7]        98.9      96.1   75.7      90.2
ESO            100       97.4   73.9      90.4

Runtime The optimisation time was measured on a computer with an Intel Core2 Duo E8400 CPU and 4GB of RAM. The results obtained for our implementation of the SimOpt method (MFF [7]) and the results reported for the SeqOpt method [10] are compared with those obtained with ESO. MFF took 23.1 seconds to fit one image, while ESO took only 2.1 seconds on average per fitting. The authors of [10] did not report their run time, but they also determined the albedo estimation to be the dominant step, with the same complexity of $O(n r_t^2)$. Note however that [10] uses not only one group of global $\alpha$ and $\beta$ but also four additional local groups to represent a face, while we only use the global parameters. Therefore $r_t$ in our approach is one fifth of that in [10], giving a 25-fold speed advantage in this dominant step.
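The two speed figures quoted above can be checked with a line of arithmetic. The sketch below takes $r_t$ to be the number of albedo coefficients used by ESO (an assumption about notation defined earlier in the paper) and assumes, for illustration only, that each of the five parameter groups in [10] has the same size.

```latex
\frac{n\,(5 r_t)^2}{n\, r_t^2} = 25,
\qquad
\frac{23.1\ \text{s (MFF)}}{2.1\ \text{s (ESO)}} = 11 .
```

The first ratio concerns only the albedo step against [10]; the second is the measured end-to-end ratio against MFF, which corresponds to roughly an order of magnitude.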

5.2.2. Multi-PIE Database

To compare with other state-of-the-art methods, evaluations are also conducted on

a larger database, Multi-PIE, containing more than 750,000 images of 337 people. In

addition, our face recognition systems, initialised by both manually and automatically

detected landmarks, are compared. We used a cascaded regression-based automatic

landmark detection method [25].

Protocol There are two settings, Setting-I and Setting-II, widely used in previous work [11, 12, 36, 38]. Setting-I is used for face recognition in the presence of combined pose and illumination variations, Setting-II for that with only pose variations.

Table 3: Face recognition rate (%) on different poses, averaged over all the illuminations, on Multi-PIE (Setting-I)

Method          Annotation   Feature    -45°   -30°   -15°   +15°   +30°   +45°   Mean    0°
Li [31]         Manual       Gabor      63.5   69.3   79.7   75.6   71.6   54.6   69.1    N/A
Deep Learning   Automatic    RL [11]    66.1   78.9   91.4   90.0   82.5   62.0   78.5    94.3
Deep Learning   Automatic    FIP [11]   63.6   77.5   90.5   89.8   80.0   59.5   76.81   94.3
Deep Learning   Automatic    MVP [12]   75.2   83.4   93.3   92.2   83.9   70.6   83.1    95.7
ESO             Automatic    Holistic   73.8   87.5   95.0   95.1   90.0   76.2   86.3    98.7
ESO             Automatic    Local      79.6   91.6   98.2   97.9   92.6   81.3   90.2    99.4
ESO             Manual       Holistic   80.8   88.9   96.7   97.6   93.3   81.1   89.7    99.1
ESO             Manual       Local      81.1   93.3   97.7   98.0   93.3   82.4   91.0    99.6

Table 4: Face recognition rate (%) on different poses under neutral illumination on Multi-PIE (Setting-II)

Category   Method         Annotation   -45°   -30°   -15°   +15°   +30°   +45°   Mean
2D         PLS [32]       Manual       51.1   76.9   88.3   88.3   78.5   56.5   73.3
2D         CCA [47]       Manual       53.3   74.2   90.0   90.0   85.5   48.2   73.5
2D         GMA [48]       Manual       75.0   74.5   82.7   92.6   87.5   65.2   79.6
2D         DAE [49]       Automatic    69.9   81.2   91.0   91.9   86.5   74.3   82.5
2D         SPAE [36]      Automatic    84.9   92.6   96.3   95.7   94.3   84.4   91.4
3D         Asthana [38]   Automatic    74.1   91.0   95.7   95.7   89.5   74.8   86.8
3D         MDF [50]       Automatic    78.7   94.0   99.0   98.7   92.2   81.8   90.7
3D         ESO+LPQ        Automatic    91.7   95.3   96.3   96.7   95.3   90.3   94.4

In common with [11, 12], Setting-I uses a subset in session 01 consisting of 249

subjects with 7 poses and 20 illumination variations. The images of the ﬁrst 100 sub-

jects constitute the training set. The remaining 149 subjects form the test set. In the test

set, the frontal images under neutral illumination work as the gallery and the remaining

are probe images. Following [36, 38], Setting-II uses the images of all the 4 sessions

(01-04) under 7 poses and only neutral illumination. The images from the ﬁrst 200

subjects are used for training and the remaining 137 subjects for testing. In the test set,

the frontal images from session 01 work as gallery, and the others are probes.

ESO vs Deep Learning (Setting-I) In recent years, deep learning methods have achieved

considerable success in a range of vision applications. In particular, deep learning

works well for pose- and illumination-invariant face recognition [11, 12]. To our knowledge, these methods have reported the best face recognition rate so far on Multi-PIE over both pose and illumination variations. Systems deploying these methods

learned 3 pose- and illumination-invariant features: FIP (face identity-preserving), RL

(FIP reconstructed features), and MVP (multi-view perceptron) using convolutional

neural networks (CNN). Table 3 compares ESO with these deep learning methods and

the baseline method [31]. Not surprisingly, deep learning methods work better than

[31] because of their powerful feature learning capability. However, ESO with auto-

matic annotation, using either holistic or local features, outperforms these three deep

learning solutions as shown in Table 3. We attribute the superior performance of ESO to the fact that its fitting process models the pose explicitly. The deep learning methods, by contrast, try to learn view/pose-invariant features across different poses. This learning objective is highly non-linear, so the methods tend to get trapped in local minima, whereas ESO solves several convex problems and avoids this pitfall.

Automatic vs Manual Annotation (Setting-I) Table 3 also compares the performance of ESO with fully automatic annotation against that based on manual annotation. The mean face recognition rates of the fully automatic system are close to those relying on manual annotation: 88.0% vs 91.2% for holistic features, and 91.5% vs 92.2% for local features (these means appear to include the frontal 0° pose, and therefore differ from the Mean column of Table 3). This means that ESO is reasonably robust to the errors caused by automatically detected landmarks. The superiority of local features, which can capture more facial details than holistic features, is also evident from the results.

ESO for Pose-robust Face Recognition (Setting-II) Table 4 compares ESO with the

state-of-the-art methods for pose-robust face recognition. The methods can be clas-

siﬁed into 2D and 3D approaches as discussed in Section 5.2. In the 2D category,

PLS [32] and CCA [47] are unsupervised methods, and consequently they deliver in-

ferior performance. GMA [48] beneﬁts from its use of some additional supervisory

information. DAE [49] and SPAE [36] are auto-encoder-based methods, which have

superior capability to learn the non-linear relationships between images of different

poses. SPAE set the state of the art in performance, even compared with the 3D methods [38] and [50]. However, ESO outperforms SPAE (94.4% vs 91.4%), which we attribute to its accurate shape and albedo reconstruction capability.

6. Conclusions

We proposed a new optimisation method — Efﬁcient Stepwise Optimisation (ESO)

— for ﬁtting a 3D morphable face model to a 2D face image. In order to improve the

optimisation efﬁciency, the method decouples the geometric and photometric optimi-

sations and uses least squares sequentially to optimise the reconstructed shape, light di-

rection, light strength and albedo parameters in separate steps. It includes a perspective

camera model that becomes important in view of the growing interest in near-camera

applications.

The computational efﬁciency of ESO is achieved thanks to the proposed lineari-

sation of the model ﬁtting steps, leading to closed-form solutions. ESO improves the

optimisation efﬁciency by an order of magnitude in comparison with [7]. Moreover, it

overcomes the weaknesses of earlier SeqOpt methods:

•The shape reconstruction of ESO supports a perspective camera.

•ESO linearises the Phong model.

•It models specularity.

•Occluding contour landmarks (Section 4.3) are used for a more robust ﬁtting.

The experimental results demonstrate that the face reconstruction achievable by ESO

is an improvement on that obtained from the state-of-the-art methods.

The ESO ﬁtting algorithm can extract both holistic features and local features. A

face recognition system that incorporates ESO to facilitate pose and illumination invariance was constructed, and evaluated on the PIE and Multi-PIE benchmark datasets

with very promising results.

7. Acknowledgments

Support for this work is gratefully acknowledged from: EPSRC/DSTL project

EP/K014307/1 “Signal processing in a networked battlespace”; EPSRC Programme


Grant EP/L000539 “S3A: Future spatial audio for immersive listener experiences at

home”; and the European Commission FP7 project 284989 “BEAT”.

References

[1] R. Ramamoorthi, P. Hanrahan, A signal-processing framework for inverse ren-

dering, in: Proceedings of the 28th annual conference on Computer graphics and

interactive techniques, ACM, 2001, pp. 117–128.

[2] V. Blanz, T. Vetter, Face recognition based on ﬁtting a 3D morphable model,

Pattern Analysis and Machine Intelligence, IEEE Transactions on 25 (9) (2003)

1063–1074.

[3] X. Bai, E. R. Hancock, R. C. Wilson, A generative model for graph matching and

embedding, Computer Vision and Image Understanding 113 (7) (2009) 777–789.

[4] X. Bai, E. R. Hancock, R. C. Wilson, Graph characteristics from the heat kernel

trace, Pattern Recognition 42 (11) (2009) 2589–2606.

[5] V. Blanz, T. Vetter, A morphable model for the synthesis of 3D faces, in: Pro-

ceedings of the 26th annual conference on Computer graphics and interactive

techniques, 1999, pp. 187–194.

[6] S. Romdhani, T. Vetter, Efﬁcient, robust and accurate ﬁtting of a 3D morphable

model, in: ICCV, IEEE, 2003, pp. 59–66.

[7] S. Romdhani, T. Vetter, Estimating 3D shape and texture using pixel intensity,

edges, specular highlights, texture constraints and a prior, in: CVPR, IEEE, 2005,

pp. 986–993.

[8] S. Romdhani, V. Blanz, T. Vetter, Face identiﬁcation by ﬁtting a 3D morphable

model using linear shape and texture error functions, in: ECCV, Springer, 2002,

pp. 3–19.

[9] L. Zhang, D. Samaras, Face recognition from a single training image under arbi-

trary unknown lighting using spherical harmonics, Pattern Analysis and Machine

Intelligence, IEEE Transactions on 28 (3) (2006) 351–363.


[10] O. Aldrian, W. A. Smith, Inverse rendering of faces with a 3D morphable model,

Pattern Analysis and Machine Intelligence, IEEE Transactions on 35 (5) (2013)

1080–1093.

[11] Z. Zhu, P. Luo, X. Wang, X. Tang, Deep learning identity preserving face space,

in: Proc. ICCV, Vol. 1, 2013, p. 2.

[12] Z. Zhu, P. Luo, X. Wang, X. Tang, Deep learning multi-view representation for

face recognition, arXiv preprint arXiv:1406.6947.

[13] C. P. Huynh, A. Robles-Kelly, E. R. Hancock, Shape and refractive index from

single-view spectro-polarimetric images, International Journal of Computer

Vision 101 (1) (2013) 64–94.

[14] R. Ramamoorthi, P. Hanrahan, A signal-processing framework for reﬂection,

ACM Transactions on Graphics (TOG) 23 (4) (2004) 1004–1042.

[15] G. Hu, P. Mortazavian, J. Kittler, W. Christmas, A facial symmetry prior for

improved illumination ﬁtting of 3D morphable model, in: International Conference

on Biometrics, IEEE, 2013, pp. 1–6.

[16] G. Hu, C. Chan, J. Kittler, W. Christmas, Resolution-aware 3D morphable model,

in: British Machine Vision Conference, 2012, pp. 1–10.

[17] Y. Wang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, D. Samaras, Face re-lighting from

a single image under harsh lighting conditions, in: Computer Vision and Pattern

Recognition, IEEE Conference on, IEEE, 2007.

[18] X. Zhu, Z. Lei, J. Yan, D. Yi, S. Z. Li, High-ﬁdelity pose and expression

normalization for face recognition in the wild, in: Proceedings of the IEEE Conference

on Computer Vision and Pattern Recognition, 2015, pp. 787–796.

[19] X. Zhu, J. Yan, D. Yi, Z. Lei, S. Z. Li, Discriminative 3D morphable model

ﬁtting, in: Automatic Face and Gesture Recognition (FG), IEEE International

Conference on, 2015.

[20] A. Patel, W. A. Smith, 3D morphable face models revisited, in: CVPR, IEEE,

2009, pp. 1327–1334.

[21] P. Huber, Z. Feng, W. Christmas, J. Kittler, M. Rätsch, Fitting 3D Morphable

Models using local features, in: IEEE International Conference on Image Processing

(ICIP), 2015. doi:10.1109/ICIP.2015.7350989.

URL http://dx.doi.org/10.1109/ICIP.2015.7350989

[22] W. A. P. Smith, E. R. Hancock, Estimating facial reﬂectance properties using

shape-from-shading, International Journal of Computer Vision 86 (2–3) (2010)

152–170.

[23] J. T. Rodriguez, 3D face modelling for 2D+3D face recognition, Ph.D. thesis,

Surrey University, Guildford, UK (2007).

URL http://www.ee.surrey.ac.uk/CVSSP/Publications/

papers/tena-2007.pdf

[24] A. Bas, W. A. P. Smith, T. Bolkart, S. Wuhrer, Fitting a 3D morphable

model to edges: A comparison between hard and soft correspondences, CoRR

abs/1602.01125.

URL http://arxiv.org/abs/1602.01125

[25] Z.-H. Feng, P. Huber, J. Kittler, W. Christmas, X.-J. Wu, Random cascaded-

regression copse for robust facial landmark detection, Signal Processing Letters,

IEEE 22 (1) (2015) 76–80.

[26] I. Kemelmacher-Shlizerman, R. Basri, 3D face reconstruction from a single image

using a single reference face shape, Pattern Analysis and Machine Intelligence,

IEEE Transactions on 33 (2) (2011) 394–405.

[27] S. R. Marschner, S. H. Westin, E. P. Lafortune, K. E. Torrance, D. P. Greenberg,

Image-based BRDF measurement including human skin, in: Rendering Techniques

'99, Springer, 1999, pp. 131–144.

[28] R. Hartley, A. Zisserman, Multiple view geometry in computer vision, Cambridge

university press, 2003.

[29] C. Ding, D. Tao, A comprehensive survey on pose-invariant face recognition,

arXiv preprint arXiv:1502.04383.

[30] S. R. Arashloo, J. Kittler, Energy normalization for pose-invariant face

recognition based on MRF model image matching, Pattern Analysis and Machine

Intelligence, IEEE Transactions on 33 (6) (2011) 1274–1280.

[31] A. Li, S. Shan, W. Gao, Coupled bias-variance tradeoff for cross-pose face

recognition, Image Processing, IEEE Transactions on 21 (1) (2012) 305–315.

[32] A. Sharma, D. W. Jacobs, Bypassing synthesis: PLS for face recognition with

pose, low-resolution and sketch, in: CVPR, IEEE, 2011, pp. 593–600.

[33] A. B. Ashraf, S. Lucey, T. Chen, Learning patch correspondences for improved

viewpoint invariant face recognition, in: CVPR, IEEE, 2008, pp. 1–8.

[34] T.-K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of

image set classes using canonical correlations, Pattern Analysis and Machine

Intelligence, IEEE Transactions on 29 (6) (2007) 1005–1018.

[35] S. J. Prince, J. Warrell, J. H. Elder, F. M. Felisberti, Tied factor analysis for face

recognition across large pose differences, Pattern Analysis and Machine

Intelligence, IEEE Transactions on 30 (6) (2008) 970–984.

[36] M. Kan, S. Shan, H. Chang, X. Chen, Stacked progressive auto-encoders (SPAE)

for face recognition across poses, in: Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, 2013, pp. 1883–1890.

[37] C. Ding, C. Xu, D. Tao, Multi-task pose-invariant face recognition, Image

Processing, IEEE Transactions on 24 (3) (2015) 980–993.

[38] A. Asthana, T. K. Marks, M. J. Jones, K. H. Tieu, M. Rohith, Fully automatic

pose-invariant face recognition via 3D pose normalization, in: Computer Vision,

International Conference on, IEEE, 2011, pp. 937–944.

[39] R. Abiantun, U. Prabhu, M. Savvides, Sparse feature extraction for pose-tolerant

face recognition, Pattern Analysis and Machine Intelligence, IEEE Transactions

on 36 (10) (2014) 2061–2073.

[40] K. Niinuma, H. Han, A. K. Jain, Automatic multi-view face recognition via 3D

model based pose regularization, in: Biometrics: Theory, Applications and

Systems (BTAS), IEEE Conference on, 2013.

[41] U. Prabhu, J. Heo, M. Savvides, Unconstrained pose-invariant face recognition

using 3D generic elastic models, Pattern Analysis and Machine Intelligence,

IEEE Transactions on 33 (10) (2011) 1952–1961.

[42] D. Yi, Z. Lei, S. Z. Li, Towards pose robust face recognition, in: CVPR, IEEE,

2013, pp. 3539–3545.

[43] S. Schönborn, A. Forster, B. Egger, T. Vetter, A Monte Carlo strategy to integrate

detection and model-based face analysis, in: Pattern Recognition, 2013.

[44] T. Ahonen, E. Rahtu, V. Ojansivu, J. Heikkilä, Recognition of blurred faces using

local phase quantization, in: ICPR, IEEE, 2008.

[45] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression (PIE)

database, in: Automatic Face and Gesture Recognition, IEEE International

Conference on, 2002, pp. 46–51.

[46] R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi-PIE, Image and Vision

Computing 28 (5) (2010) 807–813.

[47] H. Hotelling, Relations between two sets of variates, Biometrika (1936) 321–377.

[48] A. Sharma, A. Kumar, H. Daume, D. W. Jacobs, Generalized multiview analysis:

A discriminative latent space, in: CVPR, IEEE, 2012, pp. 2160–2167.

[49] Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine

Learning 2 (1) (2009) 1–127.

[50] S. Li, X. Liu, X. Chai, H. Zhang, S. Lao, S. Shan, Morphable displacement

ﬁeld based image matching for face recognition across pose, in: ECCV, Springer,

2012.
