Evaluation of Dense 3D Reconstruction from 2D Face Images in the Wild
Zhen-Hua Feng1, Patrik Huber1, Josef Kittler1, Peter Hancock2, Xiao-Jun Wu3, Qijun Zhao4, Paul Koppen1, Matthias Rätsch5
1Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, UK
2Faculty of Natural Sciences, University of Stirling, Stirling FK9 4LA, UK
3School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China
4Biometrics Research Lab, College of Computer Science, Sichuan University, Chengdu 610065, China
5Image Understanding and Interactive Robotics, Reutlingen University, 72762 Reutlingen, Germany
{z.feng, j.kittler, p.koppen}
Abstract—This paper investigates the evaluation of dense
3D face reconstruction from a single 2D image in the wild.
To this end, we organise a competition that provides a new
benchmark dataset that contains 2000 2D facial images of
135 subjects as well as their 3D ground truth face scans. In
contrast to previous competitions or challenges, the aim of this
new benchmark dataset is to evaluate the accuracy of a 3D
dense face reconstruction algorithm using real, accurate and
high-resolution 3D ground truth face scans. In addition to the
dataset, we provide a standard protocol as well as a Python
script for the evaluation. Last, we report the results obtained
by three state-of-the-art 3D face reconstruction systems on the
new benchmark dataset. The competition is organised along
with the 2018 13th IEEE Conference on Automatic Face &
Gesture Recognition.
I. INTRODUCTION

3D face reconstruction from 2D images is a very active topic in many research areas, such as computer vision, pattern recognition and computer graphics [1], [2], [3], [4], [5], [6]. While the topic has been researched for nearly two decades (with one of the seminal papers being Blanz & Vetter [1]), over the last two years these methods have been growing out of laboratory applications and have become applicable to in-the-wild images, containing larger pose variations, difficult illumination conditions, facial expressions, or different ethnicity groups and age ranges [7], [8], [9]. However, it is currently an ongoing challenge to quantitatively evaluate such algorithms: for 2D images captured in the wild, there is usually no 3D ground truth available. And, vice versa, 3D data is usually captured with a 3D scanner in a laboratory, and no in-the-wild 2D images of the same subjects are available.
Thus, researchers often publish qualitative results, alongside some effort at quantitative evaluation, which is often not ideal for the lack of 3D ground truth. Alternatively, people resort to a proxy task, for example face recognition. As an example of the former, people use the Florence 2D/3D hybrid face dataset (MICC) [10], which contains 3D data with 2D videos (e.g. [11], [12]). However, there is no standard evaluation protocol, and very often synthetic renderings of the 3D scans are used for evaluation, which contain no background (e.g. they are rendered on a black background) or natural illumination variations (e.g. [13]), for the lack of better data. Other methods (e.g. 3DDFA [7]) compare their results against a 'ground truth' created by another fitting algorithm, which is itself problematic, as these fitting algorithms have not yet been shown to be effective on in-the-wild data, even after manual correction. There have been a limited number of previous competitions which aimed to improve the situation, but they only solved the problem partially. For example, the workshop organised by Jeni et al. [14] used their own algorithm as 'ground truth' (see also Section I-C). Other datasets have recently been proposed, like KF-ITW [15], but therein Kinect Fusion is used as 3D ground truth, which does not consist of very high-resolution meshes, and the videos are recorded in rather controlled and similar scenarios (i.e. rotating around a chair in a lab).
In this paper, we report the results of a competition on 3D dense face reconstruction from in-the-wild 2D images, evaluated with accurate and high-resolution 3D ground truth obtained from a 3D structured-light system. The competition is co-located with a workshop of the 13th IEEE Conference on Automatic Face & Gesture Recognition (FG 2018).
A. Outcomes
• The competition provides a benchmark dataset with 2000 2D images of 135 subjects as well as their high-resolution 3D ground-truth face scans. Alongside the dataset we supply a standard benchmark protocol to be used on the dataset, for future evaluations and comparisons beyond the competition.

• An independent, objective evaluation and comparison of state-of-the-art 3D face reconstruction algorithms. The plan is to perform two sets of evaluations: one set for single-image reconstruction, and another set where it is allowed to use all images of one particular person to reconstruct the 3D shape, allowing algorithms to leverage information from multiple images. Note that, in this paper, we only report results of the single-image fitting protocol.
B. Impact

This is the first challenge in 3D face reconstruction from single 2D in-the-wild images with real, accurate and high-resolution 3D ground truth. The provided benchmark dataset is publicly available, so that it can become a benchmark and reference point for future evaluations in the community.

The multi-image challenge allows the testing of algorithms that can also work with multiple images or videos, having far-reaching impact, for example in the face recognition community (e.g. for set-to-set matching, and recent 2D face recognition benchmarks such as the IARPA Janus Benchmark face challenges). In addition to that, one of the baseline 3D reconstruction algorithms and the Surrey Face Model (SFM) are publicly available too [16].
C. Relationship to previous workshops (competitions)

The topic of evaluating 3D face reconstruction algorithms on 2D in-the-wild data has gained much traction recently. The 1st Workshop on 3D Face Alignment in the Wild (3DFAW) Challenge [14] was held at ECCV 2016. The benchmark consisted of images from Multi-PIE, synthetically rendered images, and some in-the-wild images from the internet. The 3D 'ground truth' was generated by an automatic algorithm provided by the organisers.

As part of ICCV 2017, the iBUG group from Imperial College, UK, held the 1st 3D Face Tracking in-the-wild Competition workshop. It improved upon the ECCV 2016 challenge in some respects, but the 'ground truth' used was still from an automatic fitting algorithm, introducing bias and resulting in the other algorithms being evaluated against the performance of another algorithm, and not against real 3D ground truth. Also, the evaluation was only done on a set of sparse 2D and 3D landmarks and not over a dense 3D mesh, leaving much room for further improvements in the benchmarking methodology.
The remainder of this paper outlines the data, protocol, evaluation metrics and results of the competition. The aim of the competition is to evaluate the 3D face shape reconstruction performance of participants on true 2D in-the-wild images, with actual 3D ground truth available from 3D face scanners. The data is released to the public, together with a well-defined protocol, to provide a standard and public benchmark to the 3D face reconstruction community.
Figure 1. Some examples of the 2D images in the test set, selected from the Stirling ESRC 3D face dataset: (a) high-quality images; (b) low-quality images.
II. DATASETS

In general, the data used for the evaluation of a 3D face reconstruction algorithm should consist of a number of high-resolution 3D face scans, obtained from a 3D face imaging device, such as the 3dMDface system. Together with this 3D ground truth, each subject is associated with multiple 2D images captured in the wild, with a variety of appearance variations in pose, expression, illumination and occlusion. The aim is to measure the accuracy of an algorithm in reconstructing a subject's neutral 3D face mesh from unconstrained 2D images. To this end, the Stirling ESRC 3D face dataset is used to create the test set and the JNU [6] 3D face dataset is used to form the validation set. For training, any 2D or 3D face dataset is allowed except for the Stirling ESRC and JNU datasets.
A. Test set

The test set is a subset of the Stirling ESRC face database, which contains 2D images, video sequences and 3D face scans of more than 100 subjects. The 2D and 3D faces in the Stirling ESRC dataset were captured under 7 different expression variations. To create the test set, 2000 2D neutral-face images of 135 subjects, including 656 high-quality and 1344 low-quality images, were selected from the Stirling ESRC 3D face database. The high-quality images were captured in constrained scenarios with good lighting conditions; the resolution of a high-quality face image is higher than 1000×1000 pixels. In contrast, the low-quality images or video frames were captured with a variety of image degradation types, such as image blur, low resolution, poor lighting, large pose rotations, etc. Some examples of the selected 2D face images are shown in Fig. 1.
B. Validation set

Participants are allowed to use the validation set to fine-tune the hyper-parameters of their 3D face reconstruction systems, if required. The validation set contains 161 2D in-the-wild images of 10 subjects and their ground-truth 3D face scans. It is a part of the JNU 3D face dataset, collected at Jiangnan University using a 3dMDface system. The full JNU 3D face dataset contains high-resolution 3D face scans of 774 Asian subjects. For more details of the JNU 3D face dataset, please refer to [6].
III. PROTOCOL AND EVALUATION METRICS

This section details the exact protocol, rules and evaluation metrics used for the competition.
A. Protocol

The task of the competition is to reconstruct a subject's neutral 3D face shape from a single 2D input image. Multi-image fitting, which allows the use of multiple input images of the same subject to reconstruct the subject's neutral face, is not included in this competition; we leave this evaluation protocol for future work.

For single-image reconstruction, an algorithm is expected to take a single 2D image as input and output a neutral 3D face shape of the subject's identity. An algorithm should be run on each image of each subject individually. In addition, one set of parameters has to be used for all images of all subjects; no fine-tuning is allowed on a per-image or per-subject basis.
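To make the single-image protocol concrete, the sketch below shows one way a participant could batch-process the test images with a single, fixed parameter set. It is written in Python, in keeping with the provided evaluation script; the reconstruct function and the mesh object's save method are hypothetical placeholders for a participant's own system, not part of the official tooling.

from pathlib import Path

def run_benchmark(reconstruct, image_dir, out_dir, params):
    # One fixed parameter set for all images of all subjects; the
    # reconstruction runs on every image independently, as required.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for image_path in sorted(Path(image_dir).glob("*.jpg")):
        mesh = reconstruct(image_path, **params)  # no per-image tuning
        mesh.save(out / (image_path.stem + ".obj"))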
B. Evaluation metrics

Given an input 2D image or a set of 2D images, we use the 3D Root Mean Square Error (3D-RMSE) between the reconstructed 3D face shape and the ground-truth 3D face scan, calculated over an area consisting of the inner face, as the evaluation metric. The area is defined as every vertex in the ground-truth scan that lies within a radius of the face centre. The face centre is computed as a point between the annotated nose bottom point and the nose bridge (which is computed as the middle of the two annotated eye landmarks):

face_centre = nose_bottom + 0.3 × (nose_bridge − nose_bottom).

The radius is computed as the average of the outer-eye distance and the distance between nose bridge and nose bottom, times a factor of 1.2:

radius = 1.2 × (outer_eye_dist + nose_dist) / 2.

The radius is defined in a relative way because it is desired that it covers roughly the same (semantic) area on each scan; we want to avoid, for example, that with a very wide face the evaluated area covers a smaller part of that particular face. Typically, the resulting radius is around 80 mm. The area is depicted for an example scan in Figure 2.

Figure 2. Left: The pre-defined seven landmarks used for the rigid alignment of the predicted face mesh with its ground truth. In order: 1) right eye outer corner, 2) right eye inner corner, 3) left eye inner corner, 4) left eye outer corner, 5) nose bottom, 6) right mouth corner, and 7) left mouth corner. Right: The area over which face reconstruction is evaluated, defined for each ground-truth 3D scan by a radius around the face centre. This radius is relative to the subject's inter-ocular and eye-nose distances (see Section III-B for details).
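As an illustration, the face centre and radius defined above take only a few lines of NumPy; the function and variable names below are our own, and the official Python script remains the reference implementation.

import numpy as np

def evaluation_region(eye_left, eye_right, nose_bottom):
    # All landmarks are 3D coordinates in mm; the nose bridge is taken as
    # the midpoint of the two annotated (outer) eye landmarks.
    nose_bridge = (eye_left + eye_right) / 2.0
    face_centre = nose_bottom + 0.3 * (nose_bridge - nose_bottom)
    outer_eye_dist = np.linalg.norm(eye_left - eye_right)
    nose_dist = np.linalg.norm(nose_bridge - nose_bottom)
    radius = 1.2 * (outer_eye_dist + nose_dist) / 2.0
    return face_centre, radius

def vertices_in_region(vertices, face_centre, radius):
    # Select every ground-truth vertex within the radius of the face centre.
    mask = np.linalg.norm(vertices - face_centre, axis=1) <= radius
    return vertices[mask]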
The following steps are performed to compute the 3D-RMSE between two meshes:

1) The predicted and ground-truth meshes are rigidly aligned (by translation, rotation, and scaling). Scaling is compensated for because participants' resulting meshes might be in a different coordinate system, whereas the ground-truth scans are in units of millimetres. The rigid alignment is based on seven points: the inner and outer eye corners, the nose bottom and the mouth corners (see Fig. 2). The ground-truth scans have been annotated with these seven points, whereas participants are expected to specify them on their resulting meshes.

2) For each vertex in the ground-truth 3D face scan, the distance to the closest point on the surface of the predicted mesh is computed. These distances are used to compute the 3D-RMSE as well as a more specific analysis of a 3D face reconstruction system, e.g. the distribution of errors across different face regions.

A Python script is provided, performing the alignment and distance computations. The output of the script is an ordered list of distances.
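The provided script is authoritative; purely as an illustration of the two steps above, the sketch below uses the Umeyama similarity transform for the seven-landmark rigid alignment and approximates the point-to-surface distance by the distance to the nearest predicted vertex (a k-d tree query), which slightly overestimates the true point-to-mesh distance.

import numpy as np
from scipy.spatial import cKDTree

def similarity_align(src, dst):
    # Least-squares scale/rotation/translation mapping src landmarks to dst
    # (Umeyama, 1991); src and dst are (7, 3) corresponding point sets.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # avoid reflections
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / (src - mu_s).var(axis=0).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def rmse_3d(pred_vertices, pred_landmarks, gt_vertices, gt_landmarks):
    # Step 1: rigidly align the prediction to the ground-truth scan.
    s, R, t = similarity_align(pred_landmarks, gt_landmarks)
    aligned = (s * (R @ pred_vertices.T)).T + t
    # Step 2: per-ground-truth-vertex distances to the aligned prediction.
    dists, _ = cKDTree(aligned).query(gt_vertices)
    return float(np.sqrt(np.mean(dists ** 2))), dists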
IV. PARTICIPANTS

Three 3D face reconstruction systems have been evaluated for the competition: the system submitted by the Biometrics Research Lab at Sichuan University (SCU-BRL) [17] and two baseline systems implemented by the competition organisers from the University of Surrey. Results were provided in the form of text files with per-vertex errors. In addition, participants were asked to provide a brief summary of their approach; the descriptions below are based on these summaries. A thorough comparison is presented in Section V.
A. University of Surrey

The baseline systems developed by the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey have three main stages: face detection, facial landmark localisation and 3D face reconstruction.

1) Face detection: The Multi-Task CNN (MTCNN) face detector was adopted to obtain a bounding box for each input 2D face image [18]. However, the faces in some low-resolution images with extreme pose variations were missed by MTCNN. For those images, a bounding-box regression approach was used to obtain the face bounding box, as described in [19], [20].
2) Facial landmark detection: For facial landmark localisation, the CNN6 model [21], a simple CNN-based facial landmark localisation algorithm, was adopted. The model was trained on multiple in-the-wild face datasets, including the HELEN [22], LFPW [23], IBUG [24] and AFW [25] datasets. These datasets have 3837 2D face images in total, each annotated with 68 facial landmarks by the iBUG group from Imperial College London. In addition, a subset of the Multi-PIE [26] face dataset was also used for training the CNN6 model. This Multi-PIE subset has 25,200 2D face images, each manually annotated with 68 facial landmarks [27]. Some examples from the Stirling low-quality subset, with the 68 facial landmarks detected by CNN6, are shown in Fig. 3.
3) 3D face reconstruction: Given an input 2D image as well as its 2D facial landmarks, the eos fitting algorithm is used to recover the 3D face of the input [16]. The eos fitting algorithm reconstructs the 3D face shape from the landmarks, using a 3D morphable shape and expression model. It consists of shape-identity and blendshape fitting, a scaled orthographic projection camera model, and dynamic face contour fitting. For this evaluation, the SFM 3448 shape-only model was used, with the 6 standard Surrey expression blendshapes. The fitting was run for 5 iterations, fitting all shape coefficients of the model, with a shape regularisation parameter of λ = 30. We use the term 'MTCNN-CNN6-eos' for this system.
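To give a flavour of this kind of landmark-driven fitting, the sketch below shows a single regularised linear shape step of the type eos alternates with camera estimation. It is a simplification under our own notation, not an excerpt from eos: given a fixed scaled orthographic camera P, it solves a ridge-regularised least-squares problem for the shape coefficients, with the regularisation weight playing the role of λ above.

import numpy as np

def fit_shape_step(landmarks_2d, mean_lms, basis_lms, P, lam=30.0):
    # landmarks_2d: (l, 2) detected landmarks; mean_lms: (3l,) model mean at
    # the l landmark vertices; basis_lms: (3l, K) shape basis at those
    # vertices; P: (2, 4) scaled orthographic camera matrix.
    # Solves min_a ||y - A a||^2 + lam ||a||^2 in closed form.
    l, K = landmarks_2d.shape[0], basis_lms.shape[1]
    A = np.zeros((2 * l, K))
    y = np.zeros(2 * l)
    for i in range(l):
        v = np.append(mean_lms[3 * i:3 * i + 3], 1.0)  # homogeneous mean vertex
        y[2 * i:2 * i + 2] = landmarks_2d[i] - P @ v   # residual to the mean
        # A basis column displaces the vertex; project that displacement.
        A[2 * i:2 * i + 2] = P[:, :3] @ basis_lms[3 * i:3 * i + 3]
    return np.linalg.solve(A.T @ A + lam * np.eye(K), A.T @ y)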
The eos fitting algorithm is tailored for real-time 3D face fitting applications, with a speed of more than 100 fps on a single CPU. It relies only on 2D facial landmarks and does not use any texture information. Therefore, it is interesting to explore more sophisticated 3D face model fitting algorithms that exploit textural information. To this end, we also evaluated the 3DDFA 3D face model fitting algorithm [7]. To be more specific, we first use MTCNN and CNN6 to obtain the same 68 facial landmarks for an input 2D image. The 68 landmarks are then used to initialise the 3DDFA fitting algorithm provided by its authors. As its face model, 3DDFA uses a modified Basel Face Model [28] with expressions. We use the term 'MTCNN-CNN6-3DDFA' for this system.

Figure 3. Some examples of the 68 landmarks detected by CNN6.
B. Sichuan University (SCU-BRL)

The system developed by the Biometrics Research Lab at Sichuan University is based on a novel method that is able to reconstruct 3D faces from an arbitrary number of 2D images using 2D facial landmarks. The method is implemented via cascaded regression in shape space. It can effectively exploit complementary information in unconstrained images of varying poses and expressions. It begins by extracting 2D facial landmarks on the images, and then progressively updates the estimated 3D face shape of the input subject via a set of cascaded regressors, which are learned offline from a training set of paired 3D face shapes and unconstrained face images.

1) Facial landmark detection: For 2D facial landmark detection, the state-of-the-art Face Alignment Network (FAN) [8] was adopted in the SCU-BRL system. For some low-quality images on which FAN failed, the 68 landmarks were annotated manually.
2) 3D face reconstruction: Given an arbitrary number of unconstrained face images {I_i}_{i=1}^{p}, 1 ≤ p ≤ N, of a subject, the goal is to reconstruct the person-specific frontal and neutral 3D face shape of the subject. We represent the 3D face shape by S ∈ R^{3×q}, based on the 3D coordinates of its q vertices, and denote the subset of S whose columns correspond to the l annotated landmarks (l = 68 in our implementation) as S_L. The projection of S_L onto the 2D image plane is represented by U_i ∈ R^{2×l}. The relationship between the 2D facial landmarks U_i and their corresponding 3D landmarks S_L can be described as:

U_i = f_i P_i R_i S_L + t_i,   (1)

where f_i is the scale factor, P_i is the orthographic projection matrix, R_i is the 3×3 rotation matrix and t_i is the translation vector. Here, we employ a weak perspective projection M_i to approximate the 3D-to-2D mapping. To fully utilise the correlation between the landmarks on all the images, we concatenate them to form a unified 2D facial landmark vector U = (U_1, U_2, ..., U_p, U_{p+1}, ..., U_N), where the U_i are zero vectors for (p+1) ≤ i ≤ N.

We reconstruct S from the given 'ground truth' landmarks U (either manually marked or automatically detected by a standalone method) for the unconstrained image set {I_i}_{i=1}^{p}. Let S^{k−1} be the reconstructed 3D shape after k−1 iterations. The corresponding landmarks U^{k−1} can be obtained by projecting S^{k−1} onto the images according to Eqn. (1). The updated 3D shape S^k can then be computed as:

S^k = S^{k−1} + W^k (U − U^{k−1}),   (2)

where W^k is the regressor in the k-th iteration.

The K regressors {W^k}_{k=1}^{K} involved in the reconstruction process can be learned by optimising the following objective function over the m training samples (each sample contains up to N annotated 2D images and one ground-truth 3D face shape):

arg min_{W^k} Σ_{j=1}^{m} || S*_j − (S_j^{k−1} + W^k (U*_j − U_j^{k−1})) ||²,   (3)

where {U*_j, S*_j} is one training sample consisting of the ground-truth landmarks U*_j on the images of a subject and the subject's ground-truth frontal and neutral 3D face shape S*_j. A comprehensive description of the SCU-BRL system can be found in the paper by Tian et al. [17].
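A schematic sketch of this cascade is given below; it illustrates the idea of Eqns. (2)–(3) with ridge-regression-learned regressors and a caller-supplied projection function implementing Eqn. (1). The data layout (shapes and landmark vectors as rows) and the regulariser are our own assumptions, not the SCU-BRL implementation [17].

import numpy as np

def train_cascade(S_true, U_true, S_init, project, K=5, reg=1e-3):
    # S_true: (m, 3q) ground-truth neutral shapes; U_true: (m, d)
    # concatenated ground-truth 2D landmarks; S_init: (m, 3q) initial
    # shapes (e.g. the mean shape); project: maps (m, 3q) shapes to
    # (m, d) landmark vectors via Eqn. (1).
    regressors, S = [], S_init.copy()
    for _ in range(K):
        dU = U_true - project(S)   # landmark residuals
        dS = S_true - S            # shape residuals, the regression target
        # Ridge regression: W minimises ||dS - dU W||^2 + reg ||W||^2.
        W = np.linalg.solve(dU.T @ dU + reg * np.eye(dU.shape[1]), dU.T @ dS)
        S = S + dU @ W             # apply the update of Eqn. (2)
        regressors.append(W)
    return regressors

def apply_cascade(regressors, U, S_init, project):
    # Reconstruct new subjects' shapes from their landmark vectors U (n, d).
    S = S_init.copy()
    for W in regressors:
        S = S + (U - project(S)) @ W
    return S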
V. RESULTS

We first compare the average 3D-RMSE of the different systems on the benchmark dataset. The results are reported in Table I.

Table I. 3D-RMSE (in mm, mean ± standard deviation) on the high-quality (HQ) subset, the low-quality (LQ) subset and the full test set.

Method               HQ          LQ          Full
MTCNN-CNN6-eos       2.70±0.88   2.78±0.95   2.75±0.93
MTCNN-CNN6-3DDFA     2.04±0.67   2.19±0.70   2.14±0.69
SCU-BRL              2.65±0.67   2.87±0.81   2.81±0.80

It is clear that the low-quality subset is more challenging than the high-quality subset: all three methods have higher reconstruction errors on the low-quality subset. However, the difference is minor for all three systems. The reason is twofold. First, the high-quality subset also contains many 2D face images with extreme pose variations, up to
±90°, as well as strong lighting changes, which makes the task challenging too. Second, both the MTCNN-CNN6-eos and SCU-BRL systems are landmark-only 3D face reconstruction methods, so their performance relies solely on the accuracy of the detected 2D facial landmarks. 2D facial landmark detection is already very well developed for unconstrained facial images in the presence of a variety of image degradation types; thus the results of these two landmark-only fitting algorithms on the high-quality and low-quality subsets do not differ greatly. The MTCNN-CNN6-3DDFA system uses the same CNN-based face detector and landmark detector, but only to initialise the 3D face fitting stage of 3DDFA, which is itself CNN-based and trained on a large number of unconstrained faces. In addition, 3DDFA cascades multiple iterations for 3D face reconstruction using textural information. This is also why it performs significantly better than the two landmark-only fitting systems. In this scenario, the 3DDFA algorithm benefits immensely from the 2D facial landmarks detected by the state-of-the-art CNN6 model.

Figure 4. A comparison of the CED curves of the different systems, evaluated on (a) the high-quality subset, (b) the low-quality subset, and (c) the full test set selected from the Stirling ESRC dataset. The evaluation metric is the RMSE in millimetres.
To better understand the performance of the different systems, we also plot the Cumulative Error Distribution (CED) curves of the three systems in Fig. 4. MTCNN-CNN6-3DDFA outperforms the other two systems, which only fit a 3D shape model to 2D facial landmarks. This is an interesting result for the community: it means that textural information plays a very important role in high-performance 3D face reconstruction. However, it should be noted that the fitting of 3DDFA involves multiple iterations with cascaded CNNs, hence the speed of such a system cannot satisfy the requirements of a real-time application. The speed of the 3DDFA implementation provided by its authors is around 1.5 fps on an Intel Core i7-6700HQ CPU @ 3.2 GHz. In contrast, the speed of eos is more than 100 fps, orders of magnitude faster than 3DDFA.
It should be noted that the SCU-BRL group also conducted multi-image fitting, using all the input 2D images of a subject for 3D face reconstruction. On average, their system reduces the 3D-RMSE from 2.81±0.80 to 2.26±0.72 on the benchmark dataset by fitting all the input 2D images of a subject together. This result shows that the SCU-BRL system can effectively utilise the complementary information in multiple images for 3D face reconstruction. For more details of their multi-image fitting method, please refer to [17].
VI. CONCLUSION

A new benchmark dataset was presented in this paper for the evaluation of 3D face reconstruction from single 2D face images in the wild. To this end, a subset of the Stirling ESRC 3D face dataset was used to create the test set. The competition was conducted on real 2D face images of 135 subjects, and the evaluation was performed against their real 3D ground-truth face scans. To facilitate the competition, an evaluation protocol as well as a Python script were provided.

We have compared three state-of-the-art 3D face reconstruction systems on the proposed benchmark dataset, including a system submitted by Sichuan University and two baseline approaches implemented by the organisers from the University of Surrey. From the performance difference between purely landmark-based and texture-based reconstruction methods, one main conclusion is that texture carries a significant amount of extra information about 3D shape. Its exploitation, however, comes at the price of increased computation time.

The presented benchmark, with its evaluation data and protocol, together with a comprehensive analysis of different competing algorithms, supports future evaluations in the community.
ACKNOWLEDGMENTS

The authors gratefully acknowledge support from the EPSRC programme grant (EP/N007743/1), the National Natural Science Foundation of China (61373055, 61672265) and the NVIDIA GPU grant programme. We would also like to express our great thanks to Dr. Muhammad Awais, Dr. Chi-Ho Chan and Mr. Michael Danner from the University of Surrey, and Mr. Hefeng Yin and Mr. Yu Yang from Jiangnan University, for their help in creating the benchmark dataset and revising the paper.
REFERENCES

[1] V. Blanz and T. Vetter, "A Morphable Model for the Synthesis of 3D Faces," in the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), W. N. Waggenspack, Ed., 1999, pp. 187–194.
[2] Z.-H. Feng, G. Hu, J. Kittler, W. Christmas, and X.-J. Wu,
“Cascaded collaborative regression for robust facial landmark
detection trained using a mixture of synthetic and real images
with dynamic weighting,” IEEE Transactions on Image Pro-
cessing, vol. 24, no. 11, pp. 3425–3440, 2015.
[3] J. Kittler, P. Huber, Z.-H. Feng, G. Hu, and W. Christmas, “3d
morphable face models and their applications,” in 9th Inter-
national Conference on Articulated Motion and Deformable
Objects (AMDO), vol. 9756, 2016, pp. 185–206.
[4] D. Zeng, Q. Zhao, S. Long, and J. Li, “Examplar coherent 3d
face reconstruction from forensic mugshot database,” Image
and Vision Computing, vol. 58, pp. 193–203, 2017.
[5] F. Liu, J. Hu, J. Sun, Y. Wang, and Q. Zhao, “Multi-dim:
A multi-dimensional face database towards the application
of 3d technology in real-world scenarios,” in 2017 IEEE
International Joint Conference on Biometrics (IJCB), 2017,
pp. 342–351.
[6] P. Koppen, Z.-H. Feng, J. Kittler, M. Awais, W. Christmas, X.-
J. Wu, and H.-F. Yin, “Gaussian mixture 3d morphable face
model,” Pattern Recognition, vol. 74, pp. 617–628, 2018.
[7] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li, "Face Alignment Across Large Poses: A 3D Solution," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] A. Bulat and G. Tzimiropoulos, "How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks)," in IEEE International Conference on Computer Vision (ICCV), 2017.
[9] Y. Liu, A. Jourabloo, W. Ren, and X. Liu, “Dense face
alignment,” arXiv preprint arXiv:1709.01442, 2017.
[10] A. D. Bagdanov, A. Del Bimbo, and I. Masi, "The Florence 2D/3D Hybrid Face Dataset," in the Joint ACM Workshop on Human Gesture and Behavior Understanding, 2011, pp. 79–80.
[11] M. Hernandez, T. Hassner, J. Choi, and G. G. Medioni, "Accurate 3D face reconstruction via prior constrained structure from motion," Computers & Graphics, vol. 66, pp. 14–22, 2017.
[12] A. T. Tran, T. Hassner, I. Masi, and G. G. Medioni, "Regressing Robust and Discriminative 3D Morphable Models with a Very Deep Neural Network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos,
“Large Pose 3D Face Reconstruction from a Single Image via
Direct Volumetric CNN Regression,” arXiv, 2017.
[14] L. A. Jeni, S. Tulyakov, L. Yin, N. Sebe, and J. F. Cohn, "The First 3D Face Alignment in the Wild (3DFAW) Challenge," in European Conference on Computer Vision Workshops (ECCVW), G. Hua and H. Jégou, Eds., vol. 9914, 2016.
[15] J. Booth, E. Antonakos, S. Ploumpis, G. Trigeorgis, Y. Panagakis, and S. Zafeiriou, "3D Face Morphable Models 'In-the-Wild'," arXiv, 2017.
[16] P. Huber, G. Hu, J. R. Tena, P. Mortazavian, W. P. Koppen, W. J. Christmas, M. Rätsch, and J. Kittler, "A Multiresolution 3D Morphable Face Model and Fitting Framework," in the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), 2016, pp. 79–86.
[17] W. Tian, F. Liu, and Q. Zhao, “Landmark-based 3D Face
Reconstruction from an Arbitrary Number of Unconstrained
Images,” in IEEE International Conference and Workshops
on Automatic Face and Gesture Recognition (FG), 2018.
[18] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detec-
tion and alignment using multitask cascaded convolutional
networks,” IEEE Signal Processing Letters, vol. 23, no. 10,
pp. 1499–1503, 2016.
[19] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu,
“Face Detection, Bounding Box Aggregation and Pose Es-
timation for Robust Facial Landmark Localisation in the
Wild,” in IEEE Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), 2017, pp. 160–169.
[20] Z.-H. Feng, J. Kittler, W. Christmas, P. Huber, and X.-J.
Wu, “Dynamic attention-controlled cascaded shape regression
exploiting training data augmentation and fuzzy-set sample
weighting,” in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2017, pp. 2481–2490.
[21] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu,
“Wing loss for robust facial landmark localisation with convo-
lutional neural networks,” in IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2018.
[22] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, “In-
teractive facial feature localization,” in European Conference
on Computer Vision (ECCV). Springer, 2012, pp. 679–692.
[23] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Ku-
mar, “Localizing parts of faces using a consensus of exem-
plars,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 35, no. 12, pp. 2930–2940, 2013.
[24] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic,
“300 faces in-the-wild challenge: The first facial landmark
localization challenge,” in IEEE International Conference on
Computer Vision Workshops (ICCVW), 2013, pp. 397–403.
[25] X. Zhu and D. Ramanan, “Face detection, pose estimation,
and landmark localization in the wild,” in IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2012,
pp. 2879–2886.
[26] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker,
“Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp.
807–813, 2010.
[27] Z.-H. Feng, J. Kittler, W. Christmas, and X.-J. Wu, "A unified tensor-based active appearance face model," arXiv, 2016.
[28] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, "A 3D face model for pose and illumination invariant face recognition," in IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2009.
We propose a novel structure from motion (SfM) based method for reconstructing the 3D shapes of faces appearing in unconstrained videos. SfM techniques were studied since the early days of computer vision and their limitations are well known. In particular, they are unsuited for the low resolution scenes typical of videos of faces in the wild. To address this, we propose using a parametric face representation as a shape prior to constrain the estimated 3D face shape. Our proposed reconstruction method explores the space around the prior, modifying its 3D shape along with the estimated, per frame cameras and expressions in a manner designed to maximize photometric consistency across video frames yet produce face shapes. Our tests show that this process is both faster and provides superior 3D reconstruction accuracy when compared to existing alternatives.