DETECTING ARBITRARILY ROTATED FACES FOR FACE ANALYSIS
Frerk Saxen, Sebastian Handrich, Philipp Werner, Ehsan Othman, Ayoub Al-Hamadi
Faculty of Electrical Engineering and Information Technology, Neuro-Information Technology
Otto von Guericke University, Magdeburg, Germany
ABSTRACT
Current face detection concentrates on detecting tiny faces and severely occluded faces. Face analysis methods, however, require a good localization and would benefit greatly from some rotation information. We propose to predict a face direction vector (FDV), which provides the face size and orientation and can be learned by a common object detection architecture better than the traditional bounding box. It provides a more consistent definition of face location and size. Using the FDV is promising for all succeeding face analysis methods. As an example, we show that facial landmark detection can highly benefit from pre-aligned faces.
Index Terms— Face detection, face analysis, rotation invariance, face alignment, facial landmark detection
1. INTRODUCTION
Face detection algorithms are widely used and key to the success of many face analysis methods. While current face detection research mainly concentrates on detecting tiny faces [1, 2], current face analysis methods focus on dealing with head rotations [3, 4], which are their main difficulty. We argue that the traditional bounding box is not an ideal starting point for further facial analysis because (1) bounding boxes do not provide any hint of the head rotation to initialize face analysis methods; (2) the edges and the center of face bounding boxes do not correlate with facial features; (3) bounding boxes vary significantly between face detection datasets; (4) what constitutes a face is not consistent across face detection datasets.
Almost all face analysis methods align the face by predicting facial landmarks as a preprocessing step [5, 6] or by learning the face alignment within the network [7, 8]. Both require additional resources, which could be reduced because the face detector already has a rough idea about the location of facial parts. Some authors argue that the inconsistent bounding box output requires an additional cascade stage [9] or a refinement step [10] prior to landmark localization.
This work has been funded by the Federal Ministry of Education and
Research (BMBF), projects 03ZZ0443G, 03ZZ0459C, and 03ZZ0470. The
sole responsibility for the content lies with the authors.
Fig. 1: Example image from WIDER [14] with the output
of our proposed model – the face direction vector (black line
defining the rotated red box) – and the original bounding box
annotation (dashed blue).
In general, there are two ways of utilizing this information of the face detector: (1) by including the face detection network into the face analysis method (using, e.g., parameter sharing, region pooling, or local transformer networks) or (2) by changing the face detection output so that it serves the face analysis more effectively.
Including the face detection network into a fully end-to-end model is an interesting topic, and there is already research towards this goal [11, 12]. However, most face analysis datasets just provide images of cropped faces, which lack the variability needed for high-performance face detection in fully end-to-end models. In fact, current face analysis methods depend on a preceding face detection step [13, 7, 4].
Contributions: We propose to redefine the face bounding box using a face direction vector (FDV) based on 5 facial landmarks and to change the face detection output to include rotation information (without increasing the number of parameters; Sec. 2-3). We show that a CNN can learn the FDV better than common bounding boxes (Sec. 4). Another experiment shows that the landmark localization accuracy of a state-of-the-art method can be improved by using our FDV approach for face detection.
Saxen et al., "Detecting Arbitrarily Rotated Faces for Face Analysis," IEEE International Conference on Image Processing (ICIP), 2019,
DOI: 10.1109/ICIP.2019.8803631.
This is the accepted manuscript. The final, published version is available on IEEEXplore.
(C) 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works,
for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
2. FACE DIRECTION VECTOR
As with traditional bounding boxes, we define a face through four parameters. Two describe the position of the face, in our case the origin or center of the face $c \in \mathbb{R}^2$. The other two, which traditionally are width and height, are redefined as our face direction vector $v \in \mathbb{R}^2$ describing the rotation and size of the face. Both are defined by 5 facial landmarks $lm$: left eye center, right eye center, nose tip, left mouth corner, and right mouth corner. These 5 landmarks are typically annotated in face alignment datasets. We define the face center $c = \frac{1}{5}\sum_{i=1}^{5} lm^{(i)}$ as the mean of all 5 landmarks and the direction vector length $l = \frac{\theta}{10}\sum_{i=1}^{4}\sum_{j=i+1}^{5} \|lm^{(i)} - lm^{(j)}\|$ as the average pairwise Euclidean norm $\|\cdot\|$ of all 5 facial landmarks multiplied by a constant $\theta$. The direction vector length $l$ directly links to the size of the face, and we chose $\theta = 1.1$ to mimic the output of dlib's face detector [15]. The direction of $v$ is defined by the eyes' center point $e = \frac{1}{2}\sum_{i=1}^{2} lm^{(i)}$ and has length $l$, i.e., $v = l \cdot \frac{e - c}{\|e - c\|}$. Thus, the face direction vector $v$ points from the face center $c$ towards the center of the eyes $e$, and its vector length $\|v\| = l$ directly links to the size of the face. Finally, we scale $c$ and $v$ so that both are defined in relative image coordinates, with $s = [s_w, s_h]^T$ denoting the image width and height. The relative and absolute width and height of the face can easily be obtained: $w_r = h_r = 2 \cdot \|v\|$, $w_a = s_w \cdot w_r$, and $h_a = s_h \cdot h_r$. The black line in each face in Fig. 1 shows the FDV.
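To make the definition concrete, the following minimal NumPy sketch computes $c$, $l$, and $v$ from the 5 landmarks (function and variable names are ours; the component-wise scaling to relative coordinates follows the description above):

```python
import numpy as np

def face_direction_vector(lm, image_size, theta=1.1):
    """Compute face center c and FDV v (Sec. 2) from 5 landmarks.

    lm: array of shape (5, 2) -- left eye, right eye, nose tip,
        left mouth corner, right mouth corner, in pixel coordinates.
    image_size: (width, height), used to scale to relative coordinates.
    """
    lm = np.asarray(lm, dtype=float)
    s = np.asarray(image_size, dtype=float)

    c = lm.mean(axis=0)                          # face center
    # average pairwise distance of the 5 landmarks (10 pairs), times theta
    pair_dists = [np.linalg.norm(lm[i] - lm[j])
                  for i in range(4) for j in range(i + 1, 5)]
    l = theta / 10.0 * np.sum(pair_dists)        # FDV length

    e = lm[:2].mean(axis=0)                      # center of the eyes
    v = l * (e - c) / np.linalg.norm(e - c)      # points from c towards e

    # relative image coordinates; w_r = h_r = 2 * ||v||
    return c / s, v / s
```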
The use of our proposed face direction vector has some major advantages: (1) A face rotation is given that provides a rough alignment, e.g., for face analysis or as a better starting point for landmark localization. (2) Our face direction vector can simply be rotated; thus, data augmentation utilizing image rotation is easily possible during training (see the sketch after this paragraph). Traditional upright bounding boxes, in contrast, do not allow rotation. (3) Our face definition is based on facial features that are quite easy to locate, which makes it consistent. Bounding boxes often vary across (and within) datasets because they are not bound to distinguishable facial features. (4) Most face analysis methods require a square cropped face because most network architectures are built upon square input images. Our definition provides this cropping consistently. (5) Our definition also allows neglecting the rotation to yield an upright square box if necessary.
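Advantages (2) and (5) translate directly into code. This sketch (our own naming; for simplicity it assumes the image is rotated about the face center) rotates an FDV label and derives an upright square box using $w_r = h_r = 2\|v\|$ from above:

```python
import numpy as np

def rotate_fdv(c, v, angle_deg):
    """Advantage (2): an FDV label follows an image rotation via a plain
    2D rotation matrix; axis-aligned boxes cannot be transformed this way."""
    a = np.deg2rad(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return c, R @ v          # c stays fixed (rotation about c), v rotates

def upright_box(c, v):
    """Advantage (5): drop the rotation to get an upright square box."""
    half = np.linalg.norm(v)                     # half side = ||v||
    return (c[0] - half, c[1] - half, 2 * half, 2 * half)  # x, y, w, h
```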
3. FACE DETECTION
Dataset: The one major disadvantage of our proposed FDV-based face definition is the lack of annotated facial landmarks in common face detection datasets like WIDER [14] and FDDB [16]. The IJB-A face detection dataset [17] provides 3 landmarks. IJB-A was not used because we observed that the nose tip is crucial for an FDV that is robust against high pitch and yaw angles, as it is never in the same plane with the eye and mouth points. We use CelebA [18], which provides bounding boxes, 5 facial landmarks, and 40 facial attributes for about 200k faces. CelebA comprises a rich set of facial expressions, head rotations, and identities. It is accepted in the domain of face analysis, e.g., for face attribute estimation. However, CelebA is not ideal for benchmarking face detection algorithms due to a lack of occlusions, out-of-focus faces, and challenging lighting conditions. Therefore, we compare our proposed FDV with the traditional bounding box approach using the same network and data.
Method: Our face detection architecture is based on the YOLOv3 object detection network [19]. YOLO (you only look once) is a fully convolutional neural network that simultaneously predicts the object score, location, and size. We use the tiny model, which has about 8.7 million parameters and can process 220 fps on a Pascal Titan X. The tiny model takes 416×416 input images and has two output layers. The architecture and training mechanism are explained in the original paper [19] and also quite well in the Medium blog post by Ayoosh Kathuria¹.
Our training only differs in the loss function of the bounding box prediction. Specifically, in YOLOv3 the width's network output $x_w$ from the last convolutional layer is activated linearly, and the width is trained using the MSE loss

$L_w = \frac{1}{2}\left(\log\frac{t_w}{a_w} - x_w\right)^2$.   (1)

$t_w$ is the ground truth width and $a_w$ is the width of the nearest anchor to $t$ in Euclidean space. The height is trained equally. In our FDV approach we train the FDV length $l = 2 \cdot \|v\|$ with linear activation of $x_l$ and the MSE loss

$L_l = \frac{1}{2}\left(\log\frac{\|v^t\|}{\|a\|} - x_l\right)^2$.   (2)

$a$ is the nearest anchor to the ground truth FDV $v^t$ in normalized Euclidean space as defined in Sec. 2. If we assume square bounding boxes with $2 \cdot \|v^t\| = t_w = t_h$, both loss functions $L_w$ and $L_l$ are the same. However, we propose to also train the angle offset

$\Delta\alpha = \mathrm{atan}\,\frac{a_y}{a_x} - \mathrm{atan}\,\frac{v^t_y}{v^t_x}$   (3)

between the ground truth FDV $v^t$ and its anchor $a$, with linear activation of $x_\alpha$ and the MSE loss

$L_\alpha = \frac{1}{2}\left(\Delta\alpha - x_\alpha\right)^2$.   (4)
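A minimal sketch of the two FDV loss terms, Eqs. (2) and (4), for one ground-truth/anchor pair (names are ours; we use atan2 instead of the plain atan above to cover the full angle range):

```python
import numpy as np

def fdv_losses(x_l, x_alpha, v_t, a):
    """MSE losses of Eqs. (2) and (4) for a ground-truth FDV v_t and
    its nearest anchor a (both 2D vectors in normalized coordinates)."""
    # length loss, Eq. (2): x_l regresses the log-ratio between the
    # ground-truth and anchor FDV lengths
    L_l = 0.5 * (np.log(np.linalg.norm(v_t) / np.linalg.norm(a)) - x_l) ** 2

    # angle loss, Eq. (4): x_alpha regresses the angle offset of Eq. (3)
    d_alpha = np.arctan2(a[1], a[0]) - np.arctan2(v_t[1], v_t[0])
    L_alpha = 0.5 * (d_alpha - x_alpha) ** 2
    return L_l, L_alpha
```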
¹ https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b
Fig. 2: FDV values for the augmented data distribution, anchor positions on two circles, and anchor membership of each data element in color (axes: $v_x$, $v_y$). The anchors are placed such that each anchor covers the same amount of data.
In the forward pass (linearly activated network outputs $x_l$ and $x_\alpha$), we rearrange the equations (based on the loss functions) to obtain the FDV prediction $v$:

$\alpha = \mathrm{atan}\,\frac{a_y}{a_x} - x_\alpha$   (5)
$v_x = \cos(\alpha) \cdot e^{x_l} \cdot \|a\|$   (6)
$v_y = \sin(\alpha) \cdot e^{x_l} \cdot \|a\|$.   (7)
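The decoding step, Eqs. (5)-(7), as a sketch matching the loss code above (names are ours; atan2 again replaces the plain atan):

```python
import numpy as np

def decode_fdv(x_l, x_alpha, a):
    """Invert the loss targets to recover the predicted FDV, Eqs. (5)-(7)."""
    alpha = np.arctan2(a[1], a[0]) - x_alpha        # Eq. (5)
    length = np.exp(x_l) * np.linalg.norm(a)        # ||v|| from Eq. (2)
    return np.array([np.cos(alpha) * length,        # Eq. (6)
                     np.sin(alpha) * length])       # Eq. (7)
```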
Augmentation: For each training sample we choose a random face and rotate, scale, shift, and crop the image based on the following 4 criteria (a sampling sketch follows below). (1) The target FDV length $l$ is randomly selected from the probability density function (pdf)

$P_l = \frac{1}{l \cdot \log\frac{l_{max}}{l_{min}}}$,   (8)

with $l_{min}$ and $l_{max}$ denoting the minimum and maximum FDV length (we used $l_{min} = 0.015$ and $l_{max} = 0.65$, i.e., the smallest face covers 1.5% of the image height and the biggest face 65%). $P_l$ assures that small faces are dominant in the training set. We hypothesize from dataset statistics [14] and augmentation customs [19] that faces that are half as big are twice as difficult to detect and thus should be present twice as often during training, which leads us to $P_l$. (2) The target FDV angle ($\alpha = \mathrm{atan}(v_x / v_y)$) is a random value from the pdf of a normal distribution $P_r = P(0, \sigma)$ (we used $\sigma = 80$) to cover a wide range of head rotations while making sure that upright faces are dominant. Fig. 2 shows a set of randomly sampled FDVs from $P_l$ and $P_r$. Note that upright faces have negative $v_y$ because it is defined in relative image coordinates with the origin at the upper left corner. (3) The target center position $c$ of the face is uniformly distributed. (4) Depending on (1)-(3), the image needs to be extended and/or cropped to match the input size of 416×416. We also slightly augment in HSV space, randomly mirror the image, and convert to gray 25% of the time.
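A sketch of criteria (1)-(3) (our own naming; $P_l$ is sampled via its inverse CDF, which for Eq. (8) is $l = l_{min}(l_{max}/l_{min})^u$ with $u \sim U(0,1)$; we assume $\sigma = 80$ is given in degrees, which the text does not state explicitly):

```python
import numpy as np

rng = np.random.default_rng()
L_MIN, L_MAX, SIGMA = 0.015, 0.65, 80.0   # values from the paper

def sample_augmentation_target():
    """Draw a target FDV for one training sample (Sec. 3, Augmentation)."""
    # (1) length l ~ P_l of Eq. (8), via inverse-CDF sampling
    l = L_MIN * (L_MAX / L_MIN) ** rng.uniform()

    # (2) angle (deviation from upright) ~ N(0, sigma), assumed degrees
    alpha = np.deg2rad(rng.normal(0.0, SIGMA))

    # upright faces point up, i.e. negative v_y (image origin at top left)
    v = l * np.array([np.sin(alpha), -np.cos(alpha)])

    # (3) face center uniformly distributed in relative coordinates
    c = rng.uniform(size=2)
    return c, v
```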
Fig. 3: Face detection performance (precision over recall) of the traditional BB, the FDV-based BB, and the proposed FDV approach on the original CelebA test set (low variation in face rotation and size). Almost all 20k faces in the test set are correctly detected by the three approaches.
Anchor Membership and Placement: Redmon and Farhadi [19] suggest k-means clustering (on BB width and height) to obtain the anchors (which we do for the BB baseline). However, because we define the training set distribution of the FDV, we manually place the anchors such that each anchor can expect the same amount of training data (see the sketch below). We place 20 anchors on two circles. The anchor angles are distributed using the normal inverse cumulative distribution function with the $\sigma$ of $P_r$. Fig. 2 shows the anchors and the membership of each sample (to the nearest anchor in the normalized Euclidean distance) in color.
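A placement sketch under stated assumptions: the angles take equally spaced quantiles of $N(0, \sigma)$ so each anchor covers the same angular mass; the two circle radii are our assumption (the text does not give them) and are placed at the 25%/75% quantiles of $P_l$ so each circle represents half of the length distribution:

```python
import numpy as np
from scipy.stats import norm

def place_anchors(n_per_circle=10, sigma=80.0, l_min=0.015, l_max=0.65):
    """Place 20 FDV anchors on two circles with equal expected data
    per anchor (Sec. 3); radii are our assumption, not from the paper."""
    # angles at equally spaced quantiles of N(0, sigma) -> equal mass
    q = (np.arange(n_per_circle) + 0.5) / n_per_circle
    angles = np.deg2rad(norm.ppf(q, loc=0.0, scale=sigma))

    # circle radii at the 25%/75% quantiles of P_l (inverse CDF of Eq. 8)
    ratio = l_max / l_min
    radii = [l_min * ratio ** 0.25, l_min * ratio ** 0.75]

    # same upright-pointing parametrization as the sampler (negative v_y)
    anchors = [r * np.array([np.sin(a), -np.cos(a)])
               for r in radii for a in angles]
    return np.array(anchors)    # shape (20, 2)
```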
4. EXPERIMENTS AND RESULTS
Face detection: We compare the bounding box (BB) baseline with our FDV prediction on the CelebA test set. The BB baseline is trained on the original bounding box annotations of the CelebA training set. We calculated 6 anchors (3 anchors per output layer) by k-means clustering the width and height of the BB training set and trained tiny-yolo using the code and architecture from Redmon and Farhadi [19]. For our FDV prediction we use the same architecture but change the augmentation and the loss layer as explained in Sec. 3. The FDV is based on the landmark annotations as described in Sec. 2. Because the traditional BB approach cannot utilize our augmentation strategy, we additionally trained a bounding box approach (standard yolo loss function) based on the FDV by predicting upright square bounding boxes (width and height are both set to the vector length $l$). This shows that the advantage of our method does not result from our augmentation strategy, because it is the same for BB (FDV-based) and FDV. We use the evaluation protocol from WIDER [14] to calculate the precision and recall curve. Fig. 3 shows the performance of the baseline method and our proposed FDV prediction. All methods correctly detected almost the entire test set.
Fig. 4: Facial landmark localization accuracy with [20] (CED curve: proportion of faces over normalized point-to-point error) on the augmented CelebA test set. Blue: landmark localization on traditional upright bounding boxes predicted by [21] (solid blue) and calculated from FDV ground truth (dashed blue); Red: landmark localization on pre-aligned faces (rotation compensation) using our predicted FDV (solid red) and ground truth FDV (dashed red).
However, our FDV approach outperforms the BB method and the FDV-based BB method. It is somewhat surprising that the FDV-based BB approach performs similarly to the original BB approach despite the fact that the training set distribution has been changed significantly by our augmentation strategy. This might indicate that our augmentation strategy does not help very much, probably because the CelebA dataset mainly contains big faces without a lot of rotation; however, we want a model that generalizes well across different face analysis datasets. The proposed FDV approach utilizes the same augmentation strategy and only differs from the FDV-based BB approach by additionally training the rotation component. This shows the benefit of including a rotation loss term and the superiority of our FDV approach.
Landmark localization: Traditional face analysis methods first detect the face, predict facial landmarks, and align the face based on the landmarks before proceeding with the specific task like face recognition, attribute detection, etc. To show the impact of our FDV approach in a more challenging setup, we first augment the CelebA test set with high variation in face rotation, size, and partial truncation (the augmentation strategy is explained in Sec. 3). Next, we predict the FDV and roughly align each face by compensating the predicted face rotation (see the sketch below). We then predict the landmarks of the aligned faces using [20]. To compare this with the traditional methodology, we use S³FD [21] to predict the bounding boxes of the augmented CelebA test set and use the predicted bounding boxes as input of the same landmark detector [20]. We use the evaluation protocol from [22]. A normalized point-to-point (p2p) error higher than 0.05 relative to the face diagonal is considered a failure. Fig. 4 shows the cumulative error distribution (CED) curve for the FDV-aligned images (red) and for the detected bounding boxes (blue). The average p2p error for the FDV-aligned model is 0.0306. The traditional bounding box model has an average p2p error of 2.002. Our FDV approach outperforms the traditional bounding box strategy significantly. To remove the effect of the bounding box predictor, we provide the ground truth bounding boxes (based on the ground truth FDV) of the augmented test set as input to the landmark detector (dashed blue curve), with an average p2p error of 0.0449. We also use the ground truth FDVs to compensate the face rotation and then estimate the landmarks (dashed red curve), with an average p2p error of 0.0282. Fig. 4 and the reported p2p errors show that compensating rotation can highly improve landmark detection accuracy. Further, using the predicted FDVs even outperforms the usage of the ground truth bounding boxes.
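A minimal sketch of the rotation compensation step, assuming OpenCV (names and crop size are ours): the image is rotated so the FDV points up, scaled so the $2\|v\|$ face box fills the crop, and the face center is shifted to the crop center.

```python
import cv2
import numpy as np

def align_by_fdv(image, c_rel, v_rel, out_size=256):
    """Rotate, scale, and crop so the face is upright, using only the FDV."""
    h, w = image.shape[:2]
    c = np.array([c_rel[0] * w, c_rel[1] * h])      # center in pixels
    v = np.array([v_rel[0] * w, v_rel[1] * h])

    # tilt of the FDV relative to the upright direction (0, -1)
    angle = np.degrees(np.arctan2(v[0], -v[1]))
    scale = out_size / (2.0 * np.linalg.norm(v))    # box side = 2 * ||v||

    M = cv2.getRotationMatrix2D((float(c[0]), float(c[1])), angle, scale)
    # shift the face center to the center of the output crop
    M[:, 2] += np.array([out_size / 2.0, out_size / 2.0]) - c
    return cv2.warpAffine(image, M, (out_size, out_size))
```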
5. CONCLUSIONS
Current face detection concentrates on detecting tiny faces [1] and severely occluded faces [23]. In contrast, face analysis methods require a good localization and would benefit greatly from rotation information. The traditional bounding box is not well suited for face analysis. We propose to predict a face direction vector (FDV), which we define based on 5 facial landmarks. It provides a consistent definition of face location, size, and orientation. We have shown that a common object detection architecture can learn the FDV more efficiently than bounding boxes. We believe that this has two reasons: 1) The FDV approach can utilize its anchors much better than the BB approach; the FDV has several anchors for the same face size. This is usually not possible with bounding box methods, because most face boxes have a very similar width/height ratio. 2) The FDV is based on facial features and not on the face shape (as bounding boxes are). We expect the latter to be harder to learn by the network. Further, forcing the network to distinguish between different poses might act as a rotation-dependent regularization, which might explain the improved performance.
More research needs to be done, especially towards more competitive datasets. The major drawback of the FDV is the dependency on facial landmarks, which are not available in many datasets. The major advantage is a better face localization with additional rotation information that simplifies succeeding face analysis tasks. The proposed approach provides the necessary information for applying similarity alignment without needing additional resources (compared to traditional face detection), i.e., neither a higher-capacity network nor a subsequent landmark localization is needed. Similarity alignment is widely known to improve face analysis results compared to no alignment, e.g., see [11]. Additionally, we have shown that similarity alignment can improve landmark localization, which, e.g., may be used for gaining further head pose invariance through advanced face frontalization [24].
6. REFERENCES
[1] Yancheng Bai, Yongqiang Zhang, Mingli Ding, and Bernard Ghanem, "Finding Tiny Faces in the Wild with Generative Adversarial Network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[2] Xu Tang, Daniel K. Du, Zeqiang He, and Jingtuo Liu, "PyramidBox: A Context-assisted Single Shot Face Detector," in European Conference on Computer Vision (ECCV), Sep. 2018.
[3] Kaidi Cao, Yu Rong, Cheng Li, Xiaoou Tang, and Chen Change Loy, "Pose-Robust Face Recognition via Deep Residual Equivariant Mapping," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Mar. 2018.
[4] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma, "Deep Adaptive Attention for Joint Facial Action Unit Detection and Face Alignment," in ECCV, Munich, Sep. 2018.
[5] Hu Han, Anil K. Jain, Fang Wang, Shiguang Shan, and Xilin Chen, "Heterogeneous Face Attribute Estimation: A Deep Multi-Task Learning Approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, Jun. 2017.
[6] Junliang Xing, Kai Li, Weiming Hu, Chunfeng Yuan, and Haibin Ling, "Diagnosing deep learning models for high accuracy age estimation from a single image," Pattern Recognition, vol. 66, pp. 106–116, Jun. 2017.
[7] Hui Ding, Hao Zhou, Shaohua Kevin Zhou, and Rama Chellappa, "A Deep Cascade Network for Unaligned Face Attribute Classification," in Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.
[8] Pau Rodríguez, Guillem Cucurull, Josep M. Gonfaus, F. Xavier Roca, and Jordi González, "Age and gender recognition in the wild with deep attention," Pattern Recognition, vol. 72, pp. 563–571, Dec. 2017.
[9] Amit Kumar, Azadeh Alavi, and Rama Chellappa, "KEPLER: Keypoint and Pose Estimation of Unconstrained Faces by Learning Efficient H-CNN Regressors," in IEEE International Conference on Automatic Face and Gesture Recognition (FG), May 2017, pp. 258–265.
[10] Zhenliang He, Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen, "Robust FEC-CNN: A High Accuracy Facial Landmark Detection System," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jul. 2017, pp. 2044–2050.
[11] Yuanyi Zhong, Jiansheng Chen, and Bo Huang, "Toward End-to-End Face Recognition Through Alignment Learning," IEEE Signal Processing Letters, vol. 24, no. 8, pp. 1213–1217, Aug. 2017.
[12] Xiaohu Shao, Junliang Xing, Jiangjing Lv, Chunlin Xiao, Pengcheng Liu, Youji Feng, and Cheng Cheng, "Unconstrained Face Alignment Without Face Detection," in CVPRW, Jul. 2017, pp. 2069–2077.
[13] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou, "Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network," in ECCV, Sep. 2018.
[14] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang, "WIDER FACE: A Face Detection Benchmark," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] Davis E. King, "Easily Create High Quality Object Detectors with Deep Learning," 2016.
[16] Vidit Jain and Erik Learned-Miller, "FDDB: A Benchmark for Face Detection in Unconstrained Settings," Tech. Rep. UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[17] Brendan F. Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah, Mark Burge, and Anil K. Jain, "Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1931–1939.
[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang, "Deep Learning Face Attributes in the Wild," in IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 3730–3738.
[19] Joseph Redmon and Ali Farhadi, "YOLOv3: An Incremental Improvement," arXiv, 2018.
[20] Adrian Bulat and Georgios Tzimiropoulos, "How Far are We from Solving the 2D and 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks)," in ICCV, Oct. 2017, pp. 1021–1030.
[21] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li, "S³FD: Single Shot Scale-Invariant Face Detector," in IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 192–201.
[22] Stefanos Zafeiriou, George Trigeorgis, Grigorios Chrysos, Jiankang Deng, and Jie Shen, "The Menpo Facial Landmark Localisation Challenge: A Step Towards the Solution," in CVPRW, Jul. 2017, pp. 2116–2125.
[23] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang, "Faceness-Net: Face Detection through Deep Facial Part Responses," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1845–1859, Aug. 2018.
[24] Philipp Werner, Frerk Saxen, Ayoub Al-Hamadi, and Hui Yu, "Generalizing to Unseen Head Poses in Facial Expression Recognition and Action Unit Intensity Estimation," in IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), Lille, France, 2019.