DETECTING ARBITRARILY ROTATED FACES FOR FACE ANALYSIS
Frerk Saxen, Sebastian Handrich, Philipp Werner, Ehsan Othman, Ayoub Al-Hamadi
Faculty of Electrical Engineering and Information Technology, Neuro-Information Technology
Otto von Guericke University, Magdeburg, Germany
ABSTRACT
Current face detection concentrates on detecting tiny faces and severely occluded faces. Face analysis methods, however, require a good localization and would benefit greatly from some rotation information. We propose to predict a face direction vector (FDV), which provides the face size and orientation and can be learned by a common object detection architecture better than the traditional bounding box. It provides a more consistent definition of face location and size. Using the FDV is promising for all succeeding face analysis methods. As an example, we show that facial landmark detection can benefit greatly from pre-aligned faces.
Index Terms—Face detection, face analysis, rotation invariance, face alignment, facial landmark detection
1. INTRODUCTION
Face detection algorithms are widely used and are key to the success of many face analysis methods. While current face detection research mainly concentrates on detecting tiny faces [1, 2], current face analysis methods focus on dealing with head rotations [3, 4], which are their main difficulty. We argue that the traditional bounding box is not an ideal starting point for further facial analysis because (1) bounding boxes do not provide any hint of the head rotation to initialize face analysis methods; (2) the edges and the center of face bounding boxes do not correlate with facial features; (3) bounding boxes vary significantly between face detection datasets; and (4) what constitutes a face is not consistent across face detection datasets.
Almost all face analysis methods align the face by predicting facial landmarks as a preprocessing step [5, 6] or learning the face alignment within the network [7, 8]. Both require additional resources, which could be reduced because the face detector already has a rough idea about the location of facial parts. Some authors argue that the inconsistent bounding box output requires an additional cascade stage [9] or a refinement step [10] prior to landmark localization.
This work has been funded by the Federal Ministry of Education and
Research (BMBF), projects 03ZZ0443G, 03ZZ0459C, and 03ZZ0470. The
sole responsibility for the content lies with the authors.
Fig. 1: Example image from WIDER [14] with the output
of our proposed model – the face direction vector (black line
defining the rotated red box) – and the original bounding box
annotation (dashed blue).
There are, in general, two ways of utilizing this information from the face detector: (1) by including the face detection network into the face analysis method (using e.g. parameter sharing, region pooling, or local transformer networks) or (2) by changing the face detection output so that it serves the face analysis more effectively.
Including the face detection network into a fully end-to-
end model is an interesting topic and there is already research
towards this goal [11, 12]. However, most face analysis
datasets just provide the images of cropped faces, which lack
the variability needed for high performance face detection
in fully end-to-end models. In fact, current face analysis
methods depend on a preceding face detection step [13, 7, 4].
Contributions: We propose to redefine the face bounding box using a face direction vector (FDV) based on 5 facial landmarks and to change the face detection output to include rotation information (without increasing the number of parameters; Sec. 2-3). We show that a CNN can learn the FDV better than common bounding boxes (Sec. 4). Another experiment shows that the landmark localization accuracy of a state-of-the-art method can be improved by using our FDV approach for face detection.
2. FACE DIRECTION VECTOR
As with traditional bounding boxes, we define a face through four parameters. Two describe the position of the face, in our case the origin or center of the face $c \in \mathbb{R}^2$. The other two, which are traditionally width and height, are redefined as our face direction $v \in \mathbb{R}^2$, describing the rotation and size of the face. Both are defined by 5 facial landmarks $lm$: left eye center, right eye center, nose tip, left mouth corner, and right mouth corner. These 5 landmarks are typically annotated in face alignment datasets. We define the face center $c = \frac{1}{5}\sum_{i=1}^{5} lm(i)$ as the mean of all 5 landmarks and the direction vector length $l = \frac{\theta}{10}\sum_{i=1}^{4}\sum_{j=i+1}^{5} \|lm(i) - lm(j)\|$ as the average pairwise Euclidean norm $\|\cdot\|$ of all 5 facial landmarks multiplied by a constant $\theta$. The direction vector length $l$ directly links to the size of the face, and we chose $\theta = 1.1$ to mimic the output of dlib's face detector [15]. The direction of $v$ is defined by the eyes' center point $e = \frac{1}{2}\sum_{i=1}^{2} lm(i)$ and has length $l$, i.e. $v = l \cdot \frac{e - c}{\|e - c\|}$. Thus, the face direction vector $v$ points from the face center $c$ towards the center of the eyes $e$, and its vector length $\|v\| = l$ directly links to the size of the face. Finally, we scale $c$ and $v$ so that both are defined in relative image coordinates, with $s = [s_w, s_h]^T$ denoting the image width and height. The relative and absolute width and height of the face can easily be obtained: $w_r = h_r = 2 \cdot \|v\|$, $w_a = s_w \cdot w_r$, and $h_a = s_h \cdot h_r$. The black line in each face in Fig. 1 shows the FDV.
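The following is a minimal NumPy sketch of this computation; it is an illustration of the definitions above (the landmark order left eye, right eye, nose tip, left mouth corner, right mouth corner and pixel-coordinate input are assumptions), not released implementation code.

```python
import numpy as np

def face_direction_vector(lm, image_size, theta=1.1):
    """lm: (5, 2) array of landmark (x, y) pixel coordinates.
    image_size: (width, height). Returns center c and FDV v in relative coords."""
    lm = np.asarray(lm, dtype=float)
    c = lm.mean(axis=0)                      # face center: mean of all 5 landmarks
    # average pairwise distance of the 5 landmarks, scaled by theta
    dists = [np.linalg.norm(lm[i] - lm[j]) for i in range(4) for j in range(i + 1, 5)]
    l = theta * np.mean(dists)               # FDV length, linked to the face size
    e = lm[:2].mean(axis=0)                  # center of the two eyes
    v = l * (e - c) / np.linalg.norm(e - c)  # points from the face center towards the eyes
    s = np.asarray(image_size, dtype=float)
    return c / s, v / s                      # relative image coordinates; w_r = h_r = 2*|v|
```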
The use of our proposed face direction vector has some major advantages: (1) A face rotation is given that provides a rough alignment, e.g. for face analysis or as a better starting point for landmark localization. (2) Our face direction vector can simply be rotated. Thus, data augmentation utilizing image rotation is easily possible during training; traditional upright bounding boxes, in contrast, cannot be rotated. (3) Our face definition is based on facial features that are quite easy to locate, which makes it consistent. Bounding boxes often vary across (and within) datasets because they are not bound to distinguishable facial features. (4) Most face analysis methods require a square cropped face because most network architectures are built upon square input images. Our definition provides this cropping consistently. (5) Our definition also allows neglecting the rotation to yield an upright square box if necessary; advantages (2) and (5) are illustrated in the sketch after this list.
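A minimal NumPy sketch of advantages (2) and (5), assuming rotation about the image center and a square image (or pixel coordinates); the exact augmentation code may differ.

```python
import numpy as np

def rotate_fdv(c, v, angle_deg):
    """Rotate face center c and direction v (relative coords) by angle_deg to match
    an image rotated about its center (0.5, 0.5). Assumes a square image."""
    a = np.deg2rad(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ (np.asarray(c) - 0.5) + 0.5, R @ np.asarray(v)

def upright_square_box(c, v):
    """Advantage (5): ignore the rotation and return an upright square box (x, y, w, h)."""
    l = np.linalg.norm(v)
    x, y = np.asarray(c) - l
    return x, y, 2.0 * l, 2.0 * l
```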
3. FACE DETECTION
Dataset: The one major disadvantage of our proposed FDV-based face definition is the lack of annotated facial landmarks in common face detection datasets like WIDER [14] and FDDB [16]. The IJB-A face detection dataset [17] provides 3 landmarks, but we did not use it because we observed that the nose tip is crucial for an FDV that is robust against high pitch and yaw angles, as it is never in the same plane as the eye and mouth points. We use CelebA [18], which provides bounding boxes, 5 facial landmarks, and 40 facial attributes for about 200k faces. CelebA comprises a rich set of facial expressions, head rotations, and identities. It is accepted in the domain of face analysis, e.g. for face attribute estimation. However, CelebA is not ideal for benchmarking face detection algorithms due to a lack of occlusions, out-of-focus faces, and challenging lighting conditions. Therefore, we compare our proposed FDV with the traditional bounding box approach using the same network and data.
Method: Our face detection architecture is based on the YOLOv3 object detection network [19]. YOLO (you only look once) is a fully convolutional neural network that simultaneously predicts the object score, location, and size. We use the tiny model, which has about 8.7 million parameters and can process 220 fps on a Pascal Titan X. The tiny model takes 416×416 input images and has two output layers. The architecture and training mechanism are explained in the original paper [19] and also quite well in the Medium blog post by Ayoosh Kathuria¹.
Our training only differs in the loss function of the bounding box prediction. Specifically, in YOLOv3 the width's network output $x_w$ from the last convolutional layer is activated linearly and the width is trained using the MSE loss
$$L_w = \frac{1}{2}\left(\log\frac{t_w}{a_w} - x_w\right)^2. \quad (1)$$
$t_w$ is the ground truth width and $a_w$ is the width of the nearest anchor to $t$ in Euclidean space. The height is trained in the same way.
In our FDV approach, we train the FDV length $l = 2 \cdot \|v\|$ with linear activation of $x_l$ and the MSE loss
$$L_l = \frac{1}{2}\left(\log\frac{\|v^t\|}{\|a\|} - x_l\right)^2. \quad (2)$$
$a$ is the nearest anchor to the ground truth FDV $v^t$ in normalized Euclidean space as defined in Sec. 2. If we assume square bounding boxes with $2 \cdot \|v^t\| = t_w = t_h$, both loss functions $L_w$ and $L_l$ are the same. However, we propose to also train the angle offset
$$\Delta\alpha = \operatorname{atan}\frac{a_y}{a_x} - \operatorname{atan}\frac{v^t_y}{v^t_x} \quad (3)$$
between the ground truth FDV $v^t$ and its anchor $a$ with linear activation of $x_\alpha$ and the MSE loss
$$L_\alpha = \frac{1}{2}\left(\Delta\alpha - x_\alpha\right)^2. \quad (4)$$
¹ https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b
Fig. 2: FDV values for the augmented data distribution, anchor positions on two circles, and anchor membership of each data element in color. The anchors are placed such that each anchor covers the same amount of data.
In the forward pass (with the linearly activated network outputs $x_l$ and $x_\alpha$), we rearrange the equations (based on the loss functions) to obtain the FDV prediction $v$:
$$\alpha = \operatorname{atan}\frac{a_y}{a_x} - x_\alpha \quad (5)$$
$$v_x = \cos(\alpha) \cdot e^{x_l} \cdot \|a\| \quad (6)$$
$$v_y = \sin(\alpha) \cdot e^{x_l} \cdot \|a\|. \quad (7)$$
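A minimal NumPy sketch of this encoding and decoding (Eqs. 2-7); it uses atan2 instead of atan to avoid quadrant ambiguity, which is an implementation choice rather than part of the formulation above.

```python
import numpy as np

def fdv_targets(v_t, a):
    """Regression targets for ground-truth FDV v_t and its anchor a: the length loss
    (Eq. 2) compares x_l to t_l, the angle loss (Eq. 4) compares x_alpha to t_alpha."""
    t_l = np.log(np.linalg.norm(v_t) / np.linalg.norm(a))          # Eq. 2 target
    t_alpha = np.arctan2(a[1], a[0]) - np.arctan2(v_t[1], v_t[0])  # Eq. 3
    return t_l, t_alpha

def decode_fdv(x_l, x_alpha, a):
    """Forward pass (Eqs. 5-7): recover the predicted FDV from the linear outputs."""
    alpha = np.arctan2(a[1], a[0]) - x_alpha                           # Eq. 5
    length = np.exp(x_l) * np.linalg.norm(a)                           # inverse of Eq. 2
    return np.array([np.cos(alpha) * length, np.sin(alpha) * length])  # Eqs. 6-7
```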
Augmentation: For each training sample, we choose a random face and rotate, scale, shift, and crop the image based on the following 4 criteria. (1) The target FDV length $l$ is randomly selected from the probability density function (pdf)
$$P_l = \frac{1}{l \cdot \log\frac{l_{max}}{l_{min}}}, \quad (8)$$
with $l_{min}$ and $l_{max}$ denoting the minimum and maximum FDV length (we used $l_{min} = 0.015$ and $l_{max} = 0.65$, i.e. the smallest face covers 1.5% of the image height and the biggest face 65%). $P_l$ assures that small faces are dominant in the training set. We hypothesize from dataset statistics [14] and augmentation customs [19] that faces that are half as big are twice as difficult to detect and thus should be present twice as often during training, which leads us to $P_l$. (2) The target FDV angle ($\alpha = \operatorname{atan}(v_x / -v_y)$) is a random value from the pdf of a normal distribution $P_r = \mathcal{N}(0, \sigma)$ (we used $\sigma = 80°$) to cover a wide range of head rotations while making sure that upright faces are dominant. Fig. 2 shows a set of FDVs randomly sampled from $P_l$ and $P_r$. Note that upright faces have negative $v_y$ because it is defined in relative image coordinates with the origin at the upper left corner. (3) The target center position $c$ of the face is uniformly distributed. (4) Depending on (1)-(3), the image needs to be extended and/or cropped to match the input size of 416×416. We also slightly augment in HSV space, randomly mirror the image, and convert to grayscale 25% of the time.
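One way to realize this sampling is sketched below, assuming inverse-CDF sampling for $P_l$; the exact sampler used in training is an implementation detail not specified here.

```python
import numpy as np

def sample_augmentation(l_min=0.015, l_max=0.65, sigma_deg=80.0, rng=np.random):
    """Draw target FDV length (Eq. 8), angle, and center for one training sample."""
    u = rng.uniform()
    l = l_min * (l_max / l_min) ** u                 # inverse CDF of P_l = 1/(l*log(l_max/l_min))
    alpha = np.deg2rad(rng.normal(0.0, sigma_deg))   # target in-plane rotation from N(0, sigma)
    c = rng.uniform(0.0, 1.0, size=2)                # target face center in relative coordinates
    return l, alpha, c
```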
Fig. 3: Face detection performance (precision over recall) on the original CelebA test set (low variation in face rotation and size) for BB (traditional), BB (FDV-based), and FDV (proposed). Almost all 20k faces in the test set are correctly detected by the three approaches.
Anchor Membership and Placement: Redmon and Farhadi [19] suggest k-means clustering (on BB width and height) to obtain the anchors (which we do for the BB baseline). However, because we define the training set distribution of the FDV ourselves, we manually place the anchors such that each anchor can expect the same amount of training data. We place 20 anchors on two circles. The anchor angles are distributed using the normal inverse cumulative distribution function with the $\sigma$ of $P_r$. Fig. 2 shows the anchors and the membership of each sample (assigned to the nearest anchor under the normalized Euclidean distance) in color.
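A sketch of one plausible placement satisfying these constraints is given below; the choice of the two radii and the quantile spacing are assumptions, since only 20 anchors on two circles, normal-inverse-CDF angle spacing, and equal expected data per anchor are specified above.

```python
import numpy as np
from scipy.stats import norm

def place_anchors(l_min=0.015, l_max=0.65, sigma_deg=80.0, n_circles=2, per_circle=10):
    """Return (n_circles * per_circle, 2) anchor FDVs on circles of equal expected data mass."""
    # radii at the centers of equal-probability bands of the log-uniform P_l (assumption)
    q = (np.arange(n_circles) + 0.5) / n_circles
    radii = l_min * (l_max / l_min) ** q
    # angles at equal-probability quantiles of N(0, sigma), per the anchor angle spacing above
    p = (np.arange(per_circle) + 0.5) / per_circle
    angles = np.deg2rad(norm.ppf(p, loc=0.0, scale=sigma_deg))
    # upright faces have negative v_y (image origin at the upper left corner)
    return np.array([[r * np.sin(a), -r * np.cos(a)] for r in radii for a in angles])
```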
4. EXPERIMENTS AND RESULTS
Face detection: We compare the bounding box (BB) baseline with our FDV prediction on the CelebA test set. The BB baseline is trained on the original bounding box annotations of the CelebA training set. We calculated 6 anchors (3 anchors per output layer) by k-means clustering the width and height of the BB training set and trained tiny-yolo using the code and architecture from Redmon and Farhadi [19]. For our FDV prediction, we use the same architecture but change the augmentation and the loss layer as explained in Sec. 3. The FDV is based on the landmark annotations as described in Sec. 2. Because the traditional BB approach cannot utilize our augmentation strategy, we additionally trained a bounding box approach (standard yolo loss function) based on the FDV by predicting upright square bounding boxes (width and height are both set to the vector length $l$). This shows that the advantage of our method does not result from our augmentation strategy, because it is the same for BB (FDV-based) and FDV. We use the evaluation protocol from WIDER [14] to calculate the precision and recall curve. Fig. 3 shows the performance of the baseline method and our proposed FDV prediction. All methods correctly detected almost the entire test set.
Fig. 4: Facial landmark localization accuracy with [20] (CED
curve) on the augmented CelebA test set. Blue: landmark
localization on traditional upright bounding boxes predicted
by [21] (solid blue) and calculated from FDV ground truth
(dashed blue); Red: landmark localization on pre-aligned
faces (rotation compensation) using our predicted FDV (solid
red) and ground truth FDV (dashed red).
However, our FDV approach outperforms the BB method and the FDV-based BB method. It is somewhat surprising that the FDV-based BB approach performs similarly to the original BB approach, despite the fact that the training set distribution has been changed significantly by our augmentation strategy. This might indicate that our augmentation strategy does not help very much, probably because the CelebA dataset mainly contains big faces without a lot of rotation; however, we want a model that generalizes well across different face analysis datasets. The proposed FDV approach utilizes the same augmentation strategy and only differs from the FDV-based BB approach by additionally training the rotation component. This shows the benefit of including a rotation loss term and the superiority of our FDV approach.
Landmark localization: Traditional face analysis methods first detect the face, predict facial landmarks, and align the face based on the landmarks before proceeding with the specific task like face recognition, attribute detection, etc. To show the impact of our FDV approach in a more challenging setup, we first augment the CelebA test set with high variation in face rotation, size, and partial truncation (the augmentation strategy is explained in Sec. 3). Next, we predict the FDV and roughly align each face by compensating the predicted face rotation. We then predict the landmarks of the aligned faces using [20]. To compare this with the traditional methodology, we use S3FD [21] to predict the bounding boxes of the augmented CelebA test set and use the predicted bounding boxes as input of the same landmark detector [20]. We use the evaluation protocol from [22]. A normalized point-to-point (p2p) error higher than 0.05 relative to the face diagonal is considered a failure. Fig. 4 shows the cumulative error distribution (CED) curve for the FDV aligned images (red) and for the detected bounding boxes (blue). The average p2p error for the FDV aligned model is 0.0306. The traditional bounding box model has an average p2p error of 2.002. Our FDV approach outperforms the traditional bounding box strategy significantly. To remove the effect of the bounding box predictor, we provide the ground truth bounding boxes (based on the ground truth FDV) of the augmented test set as input to the landmark detector (dashed blue curve), with an average p2p error of 0.0449. We also use the ground truth FDVs to compensate the face rotation and then estimate the landmarks (dashed red curve), with an average p2p error of 0.0282. Fig. 4 and the reported p2p errors show that compensating rotation can greatly improve the landmark detection accuracy. Further, using the predicted FDVs even outperforms using the ground truth bounding boxes.
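The rotation compensation used above can be sketched with OpenCV as follows; this is a minimal sketch assuming an FDV in relative coordinates, and the crop size and margin are illustrative choices rather than the exact evaluation settings.

```python
import cv2
import numpy as np

def align_face(image, c_rel, v_rel, out_size=256, margin=2.0):
    """Rotate, scale, and crop the image so that the face defined by the FDV is
    upright and centered in an out_size x out_size patch."""
    h, w = image.shape[:2]
    c = np.array([c_rel[0] * w, c_rel[1] * h])             # face center in pixels
    v = np.array([v_rel[0] * w, v_rel[1] * h])             # FDV in pixels
    angle = np.degrees(np.arctan2(v[0], -v[1]))            # in-plane rotation of the face
    scale = out_size / (margin * 2.0 * np.linalg.norm(v))  # face width is about 2*|v|, plus margin
    M = cv2.getRotationMatrix2D((float(c[0]), float(c[1])), angle, scale)
    M[:, 2] += np.array([out_size / 2.0, out_size / 2.0]) - c  # move face center to patch center
    return cv2.warpAffine(image, M, (out_size, out_size))
```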
5. CONCLUSIONS
Current face detection concentrates on detecting tiny faces [1] and severely occluded faces [23]. In contrast, face analysis methods require a good localization and would benefit greatly from rotation information. The traditional bounding box is not well suited for face analysis. We propose to predict a face direction vector (FDV), which we define based on 5 facial landmarks. It provides a consistent definition of face location, size, and orientation. We have shown that a common object detection architecture can learn the FDV more efficiently than bounding boxes. We believe that this has two reasons: 1) The FDV approach can utilize its anchors much better than the BB approach, i.e. the FDV has several anchors for the same face size. This is usually not possible with bounding box methods, because most face boxes have a very similar width/height ratio. 2) The FDV is based on facial features and not on the face shape (as bounding boxes are). We expect the latter to be harder for the network to learn. Further, forcing the network to distinguish between different poses might act as a rotation-dependent regularization, which might explain the improved performance.
More research needs to be done, especially towards more competitive datasets. The major drawback of the FDV is the dependency on facial landmarks, which are not available in many datasets. The major advantage is a better face localization with additional rotation information that simplifies succeeding face analysis tasks. The proposed approach provides the necessary information for applying similarity alignment without needing additional resources (compared to traditional face detection), i.e. neither a higher capacity network nor a subsequent landmark localization is needed. Similarity alignment is widely known to improve face analysis results compared to no alignment, e.g. see [11]. Additionally, we have shown that similarity alignment can improve landmark localization, which may e.g. be used for gaining further head pose invariance through advanced face frontalization [24].
6. REFERENCES
[1] Yancheng Bai, Yongqiang Zhang, Mingli Ding, and Bernard
Ghanem, “Finding Tiny Faces in the Wild with Generative
Adversarial Network,” in 2018 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2018.
[2] Xu Tang, Daniel K. Du, Zeqiang He, and Jingtuo Liu, “Pyra-
midBox: A Context-assisted Single Shot Face Detector,” in
The European Conference on Computer Vision (ECCV), sep
2018.
[3] Kaidi Cao, Yu Rong, Cheng Li, Xiaoou Tang, and
Chen Change Loy, “Pose-Robust Face Recognition via Deep
Residual Equivariant Mapping,” in IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR). mar 2018,
IEEE.
[4] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma, “Deep
Adaptive Attention for Joint Facial Action Unit Detection and
Face Alignment,” in ECCV, Munich, sep 2018.
[5] Hu Han, Anil K. Jain, Fang Wang, Shiguang Shan, and Xilin
Chen, “Heterogeneous Face Attribute Estimation: A Deep
Multi-Task Learning Approach,” IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, jun 2017.
[6] Junliang Xing, Kai Li, Weiming Hu, Chunfeng Yuan, and
Haibin Ling, “Diagnosing deep learning models for high accu-
racy age estimation from a single image,” Pattern Recognition,
vol. 66, pp. 106–116, jun 2017.
[7] Hui Ding, Hao Zhou, Shaohua Kevin Zhou, and Rama Chel-
lappa, “A Deep Cascade Network for Unaligned Face Attribute
Classification,” in The Thirty-Second AAAI Conference on Ar-
tificial Intelligence (AAAI-18). 2018, AAAI.
[8] Pau Rodríguez, Guillem Cucurull, Josep M. Gonfaus, F. Xavier Roca, and Jordi González, “Age and gender recognition in the wild with deep attention,” Pattern Recognition, vol. 72, pp. 563–571, dec 2017.
[9] Amit Kumar, Azadeh Alavi, and Rama Chellappa, “KE-
PLER: Keypoint and Pose Estimation of Unconstrained Faces
by Learning Efficient H-CNN Regressors,” in IEEE Interna-
tional Conference on Automatic Face and Gesture Recognition
(FG). may 2017, pp. 258–265, IEEE.
[10] Zhenliang He, Jie Zhang, Meina Kan, Shiguang Shan, and
Xilin Chen, “Robust FEC-CNN: A High Accuracy Fa-
cial Landmark Detection System,” in 2017 IEEE Confer-
ence on Computer Vision and Pattern Recognition Workshops
(CVPRW). jul 2017, pp. 2044–2050, IEEE.
[11] Yuanyi Zhong, Jiansheng Chen, and Bo Huang, “Toward End-
to-End Face Recognition Through Alignment Learning,” IEEE
Signal Processing Letters, vol. 24, no. 8, pp. 1213–1217, aug
2017.
[12] Xiaohu Shao, Junliang Xing, Jiangjing Lv, Chunlin Xiao,
Pengcheng Liu, Youji Feng, and Cheng Cheng, “Uncon-
strained Face Alignment Without Face Detection,” in CVPRW.
jul 2017, pp. 2069–2077, IEEE.
[13] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou,
“Joint 3D Face Reconstruction and Dense Alignment with Po-
sition Map Regression Network,” in ECCV, sep 2018.
[14] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang,
“WIDER FACE: A Face Detection Benchmark,” in IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR),
2016.
[15] Davis E. King, “Easily Create High Quality Object Detectors
with Deep Learning,” 2016.
[16] Vidit Jain and Erik Learned-Miller, “FDDB: A Benchmark for
Face Detection in Unconstrained Settings,” Tech. Rep. UM-
CS-2010-009, University of Massachusetts, Amherst, 2010.
[17] Brendan F. Klare, Ben Klein, Emma Taborsky, Austin Blan-
ton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah,
Mark Burge, and Anil K. Jain, “Pushing the frontiers of uncon-
strained face detection and recognition: IARPA Janus Bench-
mark A,” in 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). jun 2015, pp. 1931–1939, IEEE.
[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang,
“Deep Learning Face Attributes in the Wild,” in 2015 IEEE In-
ternational Conference on Computer Vision (ICCV). dec 2015,
pp. 3730–3738, IEEE.
[19] Joseph Redmon and Ali Farhadi, “YOLOv3: An Incremental
Improvement,” arXiv, 2018.
[20] Adrian Bulat and Georgios Tzimiropoulos, “How Far are We
from Solving the 2D and 3D Face Alignment Problem? (and a
Dataset of 230,000 3D Facial Landmarks),” in ICCV. oct 2017,
pp. 1021–1030, IEEE.
[21] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo
Wang, and Stan Z. Li, “S3FD: Single Shot Scale-Invariant
Face Detector,” in 2017 IEEE International Conference on
Computer Vision (ICCV). oct 2017, pp. 192–201, IEEE.
[22] Stefanos Zafeiriou, George Trigeorgis, Grigorios Chrysos,
Jiankang Deng, and Jie Shen, “The Menpo Facial Landmark
Localisation Challenge: A Step Towards the Solution,” in
CVPRW. jul 2017, pp. 2116–2125, IEEE.
[23] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang,
“Faceness-Net: Face Detection through Deep Facial Part Re-
sponses,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, no. 8, pp. 1845–1859, aug 2018.
[24] Philipp Werner, Frerk Saxen, Ayoub Al-Hamadi, and Hui
Yu, “Generalizing to Unseen Head Poses in Facial Expres-
sion Recognition and Action Unit Intensity Estimation,” in
IEEE International Conference on Automatic Face and Ges-
ture Recognition (FG 2019), Lille, France, 2019, IEEE.