HyperFace: A Deep Multi-task Learning
Framework for Face Detection, Landmark
Localization, Pose Estimation, and Gender
Recognition
Rajeev Ranjan, Member, IEEE, Vishal M. Patel, Senior Member, IEEE, and Rama Chellappa, Fellow, IEEE
Abstract—We present an algorithm for simultaneous face detection, landmark localization, pose estimation and gender recognition using deep convolutional neural networks (CNN). The proposed method, called HyperFace, fuses the intermediate layers of a deep CNN using a separate CNN, followed by a multi-task learning algorithm that operates on the fused features. It exploits the synergy among the tasks, which boosts their individual performances. Extensive experiments show that the proposed method is able to capture both global and local information in faces and performs significantly better than many competitive algorithms on each of these four tasks.
Index Terms—Face Detection, Landmarks Localization, Head Pose Estimation, Gender Recognition, Deep Convolutional Neural
Networks, Multi-task Learning.
1 INTRODUCTION
DETECTION and analysis of faces is a challenging prob-
lem in computer vision, and has been actively re-
searched for applications such as face verification, face
tracking, person identification, etc. Although recent meth-
ods based on deep Convolutional Neural Networks (CNN)
have achieved remarkable results for the face detection task
[10], [35], [50], it is still difficult to obtain facial landmark
locations, head pose estimates and gender information from
face images containing extreme poses, illumination and
resolution variations. The tasks of face detection, landmark
localization, pose estimation and gender classification have
generally been solved as separate problems. Recently, it has
been shown that learning correlated tasks simultaneously
can boost the performance of individual tasks [58], [57], [5].
In this paper, we present a novel framework based
on CNNs for simultaneous face detection, facial landmark
localization, head pose estimation and gender recognition
from a given image (see Figure 1). We design a CNN
architecture to learn common features for these tasks and
exploit the synergy among them. We exploit the fact that in-
formation contained in features is hierarchically distributed
throughout the network as demonstrated in [53]. Lower
layers respond to edges and corners, and hence contain
better localization properties. They are more suitable for
learning landmark localization and pose estimation tasks.
On the other hand, higher layers are class-specific and
suitable for learning complex tasks such as face detection
R. Ranjan and R. Chellappa are with the Department of Electrical and
Computer Engineering, University of Maryland, College Park, MD,
20742.
E-mail: {rranjan1,rama}@umiacs.umd.edu
V. M. Patel is with Rutgers University.
Fig. 1. Our method can simultaneously detect the face, localize land-
marks, estimate the pose and recognize the gender. The blue boxes
denote detected male faces, while pink boxes denote female faces. The
green dots provide the landmark locations. Pose estimates for each face
are shown on top of the boxes in the order of roll, pitch and yaw.
and gender recognition. It is evident that we need to make
use of all the intermediate layers of a deep CNN in order
to train the different tasks under consideration. We refer to the set of intermediate layer features as hyperfeatures. We borrow this term from [1], which uses it to denote a stack of local histograms for multilevel image coding.
Since a CNN architecture contains multiple layers with
hundreds of feature maps in each layer, the overall di-
mension of hyperfeatures is too large to be efficient for
learning multiple tasks. Moreover, the hyperfeatures must
be associated in a way that they efficiently encode the
features common to the multiple tasks. This can be handled
using feature fusion techniques. Feature fusion aims to
transform the features to a common subspace where they
can be combined linearly or non-linearly. Recent advances
Fig. 2. The architecture of the proposed HyperFace. The network is able to classify a given image region as face or non-face, estimate the head
pose, locate face landmarks and recognize gender.
in deep learning have shown that CNNs are capable of approximating complex functions. Hence, we construct a
separate fusion-CNN to fuse the hyperfeatures. In order
to learn the tasks, we train them simultaneously using
multiple loss functions. In this way, the features get better
at understanding faces, which leads to improvements in the
performances of individual tasks. The deep CNN combined
with the fusion-CNN can be learned together end-to-end.
We also study the performance of face detection, landmark localization, pose estimation and gender recognition using an off-the-shelf Region-based CNN (R-CNN [12]) approach. Although R-CNN for face detection has been explored in DP2MFD [35], we provide a comprehensive study of all these tasks based on R-CNN. Furthermore, we study the multitask approach without fusing the intermediate layers of the CNN. Detailed experiments show that multitask learning performs better than learning each task individually, and fusing the intermediate layer features provides an additional performance boost. This paper makes the following contributions.
1) We propose a novel CNN architecture that performs
face detection, landmarks localization, pose estimation
and gender recognition by fusing the intermediate lay-
ers of the network.
2) We propose two post-processing methods: iterative re-
gion proposals and landmarks-based non-maximum
suppression, which leverage the multitask information
obtained from the CNN to improve the overall perfor-
mance.
3) We study the performance of R-CNN based approaches
for individual tasks and the multitask approach with-
out intermediate layer fusion.
4) We achieve new state-of-the-art performances on chal-
lenging unconstrained datasets for all of these four
tasks.
This paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed HyperFace framework in detail. Section 4 describes the implementation of the R-CNN-based approaches as well as Multitask Face. Section 5 provides the results of HyperFace along with the R-CNN baselines on challenging datasets. Finally, Sections 6 and 7 conclude the paper with a discussion and a brief summary.
2 RELATED WORK
One of the earlier approaches for jointly addressing the tasks
of face detection, pose estimation, and landmark localization
was proposed in [57] and later extended in [58]. This method
is based on a mixture of trees with a shared pool of parts in
the sense that every facial landmark is modeled as a part
and uses global mixtures to capture the topological changes
due to viewpoint changes. A joint cascade-based method
was recently proposed in [5] for simultaneously detecting
faces and landmark points on a given image. This method
yields improved detection performance by incorporating
a face alignment step in the cascade structure. Multi-task
learning using CNNs has also been studied recently in [55],
which learns gender and other attributes to improve land-
mark localization, while [13] trains a CNN for person pose
estimation and action detection, using features only from the
last layer. The intermediate layer features have been used
for image segmentation [14], image classification [51] and
pedestrian detection [37].
Face detection: The Viola-Jones detector [44] is a classic method which uses cascaded classifiers on Haar-like features to detect faces. This method provides real-time face detection, but works best for full, frontal, and well-lit faces.
Deformable Parts Model (DPM) [11]-based face detection
methods have also been proposed in the literature where
a face is essentially defined as a collection of parts [57],
[32]. It has been shown that in unconstrained face detection, features like HOG or Haar wavelets do not capture the discriminative facial information under different illumination variations or poses. To overcome these limitations, vari-
ous deep CNN-based face detection methods have been
proposed in the literature [35], [27], [50], [10], [49]. These
methods have produced state-of-the-art results on many
challenging publicly available face detection datasets. Some
of the other recent face detection methods include NDPFace [30], PEP-Adapt [26], and [5].
Landmark localization: Fiducial point extraction or
landmark localization is one of the most important steps in
face recognition. Several approaches have been proposed in
the literature. These include both regression-based [4], [47],
[46], [45], [21], [42] and model-based [6], [33], [29] methods. While the former learn the shape increment given a mean initial shape, the latter train an appearance model to predict the keypoint locations. CNN-based landmark localization
methods have also been proposed in recent years [40], [55],
[24] and have achieved remarkable performance. Although
a lot of work has been done for localizing landmarks for
frontal faces, not much attention has been given to profile
faces which occur often in real world scenarios. Recent
methods like PIFA [20], CLVM [16] and 3DDFA [56] have
attempted the landmark localization task on faces with
varying pose angles.
Pose estimation: The task of head pose estimation is to infer the orientation of a person's head relative to the camera view. It is extremely useful in face verification for matching
face similarity across different orientations. However, not
much research has gone into pose estimation from uncon-
strained images. Non-linear manifold-based methods have
been proposed in [2], [15], [38] to classify face images based
on pose. A survey of various head pose estimation methods
is provided in [34].
Gender recognition: Previous works on gender recogni-
tion have focused on finding good discriminative features
for classification. Most previous methods use one or more
combination of features such as LBP, SURF, HOG or SIFT.
In recent years, attribute-based methods for face recognition
have gained a lot of traction. Binary classifiers were used in [25] for each attribute, such as male, long hair, white, etc. Different features were computed for different attributes, and they were used to train a separate SVM for each attribute.
CNN-based methods have also been proposed for learning
attribute-based representations in [31], [54].
3 HYPERFACE
We propose a single CNN model for simultaneous face
detection, landmark localization, pose estimation and gen-
der classification. The network architecture is deep in both
vertical and horizontal directions and is shown in Figure 2.
In this section, we provide a brief overview of the system
and then discuss the different components in detail.
The proposed algorithm, called HyperFace, consists of three modules. The first one generates class-independent region proposals from the given image and scales them to 227×227 pixels. The second module is a CNN which takes in the resized candidate regions and classifies them as face or non-face. If a region gets classified as a face, the network additionally provides the facial landmark locations, estimated head pose and gender information. The third module is a post-processing step which involves Iterative Region Proposals and Landmarks-based Non-Maximum Suppression (NMS) to boost the face detection score and improve the performance of the individual tasks.
3.1 HyperFace Architecture
We start with AlexNet [23] for image classification. The network consists of five convolutional layers along with three fully connected layers. We initialize the network with the weights of the RCNN Face network trained for the face detection task, as described in Section 4. All the fully connected layers are removed, as they encode image-classification-specific information, which is not needed for pose estimation and landmark extraction. We exploit the following two obser-
vations to create our network. 1) The features in a CNN are distributed hierarchically in the network: while the lower-layer features are informative for landmark localization and pose estimation, the higher-layer features are suitable for more complex tasks such as detection or classification [53]. 2) Learning multiple correlated tasks simultaneously builds a synergy and improves the performance of individual tasks, as shown in [5], [55]. Hence, in order to simultaneously learn face detection, landmarks, pose and gender, we need to fuse the features from the intermediate layers of the network (hyperfeatures) and learn multiple tasks on top of them. Since adjacent layers are highly correlated, we do not consider all the intermediate layers for fusion.
We fuse the max1, conv3 and pool5 layers of AlexNet using a separate network. A naive way of fusing is to directly concatenate the features. Since the feature maps of these layers have different dimensions, 27×27×96, 13×13×384 and 6×6×256 respectively, they cannot be concatenated directly. We therefore add conv1a and conv3a convolutional layers on top of the max1 and conv3 layers to obtain consistent feature maps of dimension 6×6×256 at the output. We then concatenate the outputs of these layers along with pool5 to form a 6×6×768-dimensional feature map. This dimension is still quite high for training a multi-task framework. Hence, a 1×1-kernel convolution layer (conv_all) is added to reduce the dimensions [41] to 6×6×192. We add a fully connected layer (fc_all) to conv_all, which outputs a 3072-dimensional feature vector. At this point, we split the network into five separate branches corresponding to the different tasks. We add fc_detection, fc_landmarks, fc_visibility, fc_pose and fc_gender fully connected layers, each of dimension 512, to fc_all. Finally, a fully
connected layer is added to each branch to predict the individual task labels. After every convolutional or fully connected layer, we apply the Rectified Linear Unit (ReLU) non-linearity. We did not include any pooling operation in the fusion network, as pooling provides local invariance, which is not desired for the face landmark localization task. Task-specific loss functions are then used to learn the weights of the network.
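To make the fusion concrete, the following is a minimal sketch of the fusion network in PyTorch. It is our illustration, not the authors' code: the conv1a/conv3a kernel and stride values are assumptions chosen so that the 27×27 and 13×13 maps reduce to 6×6, and the layer names follow the text.

import torch
import torch.nn as nn

class HyperFaceFusionHead(nn.Module):
    def __init__(self):
        super().__init__()
        # Project max1 (96x27x27) and conv3 (384x13x13) down to 256x6x6
        # (kernel/stride values are assumptions that yield 6x6 outputs)
        self.conv1a = nn.Conv2d(96, 256, kernel_size=4, stride=4)
        self.conv3a = nn.Conv2d(384, 256, kernel_size=2, stride=2)
        # 1x1 convolution reduces the concatenated 768 channels to 192
        self.conv_all = nn.Conv2d(768, 192, kernel_size=1)
        self.fc_all = nn.Linear(6 * 6 * 192, 3072)
        # One 512-d fully connected branch per task, then a prediction layer
        tasks = {"detection": 2, "landmarks": 42, "visibility": 21,
                 "pose": 3, "gender": 2}
        self.branches = nn.ModuleDict({t: nn.Linear(3072, 512) for t in tasks})
        self.heads = nn.ModuleDict({t: nn.Linear(512, n) for t, n in tasks.items()})
        self.relu = nn.ReLU(inplace=True)

    def forward(self, max1, conv3, pool5):
        fused = torch.cat([self.relu(self.conv1a(max1)),
                           self.relu(self.conv3a(conv3)),
                           pool5], dim=1)            # 768x6x6
        x = self.relu(self.conv_all(fused))          # 192x6x6
        x = self.relu(self.fc_all(x.flatten(1)))     # 3072-d shared feature
        return {t: self.heads[t](self.relu(self.branches[t](x)))
                for t in self.branches}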
3.2 Training
We use the AFLW [22] dataset for training the HyperFace network. It contains 25,993 faces in 21,997 real-world images with full pose, expression, ethnicity, age and gender variations. It provides annotations for 21 landmark points per face, along with the face bounding box, face pose (yaw, pitch and roll) and gender information. We randomly selected 1000 images for testing and kept the rest for training the network. Different loss functions are used for training the tasks of face detection, landmark localization, pose estimation and gender classification.
Face Detection: We use the Selective Search [43] algorithm from R-CNN [12] to generate region proposals for faces in an image. A region having an overlap of more than 0.5 with the ground truth bounding box is considered a positive sample (l = 1). Candidate regions with overlap less than 0.35 are treated as negative instances (l = 0). All other regions are ignored. We use the softmax loss function given by (1) for training the face detection task:

$$\mathrm{loss}_D = -(1 - l)\cdot\log(1 - p) - l\cdot\log(p), \quad (1)$$

where $p$ is the probability that the candidate region is a face. The probability values $p$ and $1 - p$ are obtained from the last fully connected layer for the detection task.
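As a concrete illustration of the labeling rule and the loss in (1), the following Python sketch (our own, with hypothetical helper names) assigns detection labels from the overlap and evaluates the softmax loss.

import numpy as np

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def detection_label(region, gt_box):
    # 1 = face, 0 = non-face, None = ignored, per the overlap thresholds above
    o = iou(region, gt_box)
    return 1 if o > 0.5 else (0 if o < 0.35 else None)

def loss_d(p, l):
    # Softmax (cross-entropy) detection loss of (1)
    return -(1 - l) * np.log(1 - p) - l * np.log(p)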
Landmark Localization: We use the 21-point markup for face landmark locations as provided in the AFLW [22] dataset. Since the faces have full pose variations, some of the landmark points are invisible. The dataset provides the annotations for the visible landmarks. We consider regions with overlap greater than 0.35 with the ground truth for learning this task, while ignoring the rest. A region can be characterized by $\{x, y, w, h\}$, where $(x, y)$ are the co-ordinates of the center of the region and $w$, $h$ are the width and height of the region, respectively. Each visible landmark point is shifted with respect to the region center $(x, y)$ and normalized by $(w, h)$, as given by (2):

$$(a_i, b_i) = \left(\frac{x_i - x}{w}, \frac{y_i - y}{h}\right), \quad (2)$$

where the $(x_i, y_i)$ are the given ground truth fiducial co-ordinates. The $(a_i, b_i)$ are treated as labels for training the landmark localization task using the Euclidean loss weighted by the visibility factor. The labels for landmarks which are not visible are taken to be $(0, 0)$. The loss in predicting the landmark locations is computed from (3):

$$\mathrm{loss}_L = \frac{1}{2N}\sum_{i=1}^{N} v_i\left((\hat{x}_i - a_i)^2 + (\hat{y}_i - b_i)^2\right), \quad (3)$$

where $(\hat{x}_i, \hat{y}_i)$ is the $i$-th landmark location predicted by the network, relative to a given region, and $N$ is the total number of landmark points (21 for AFLW [22]). The visibility factor $v_i$ is 1 if the $i$-th landmark is visible in the candidate region, and 0 otherwise. This implies that there is no loss corresponding to invisible points, and hence they do not take part in back-propagation.
Learning Visibility: We also learn the visibility factor in order to test for the presence of each predicted landmark. For a given region with overlap higher than 0.35, we use a simple Euclidean loss to train the visibility, as shown in (4):

$$\mathrm{loss}_V = \frac{1}{N}\sum_{i=1}^{N} (\hat{v}_i - v_i)^2, \quad (4)$$

where $\hat{v}_i$ is the predicted visibility of the $i$-th landmark. The true visibility $v_i$ is 1 if the $i$-th landmark is visible in the candidate region, and 0 otherwise.
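The label construction of (2), the visibility-weighted loss of (3) and the visibility loss of (4) can be sketched as follows (our illustration; the array shapes are assumptions).

import numpy as np

def normalize_landmarks(points, vis, region):
    # points: (N, 2) image co-ordinates; vis: (N,) 0/1; region: (x, y, w, h)
    x, y, w, h = region
    labels = (points - np.array([x, y])) / np.array([w, h])
    labels[vis == 0] = 0.0   # invisible landmarks get the (0, 0) label
    return labels

def loss_l(pred, labels, vis):
    # Visibility-weighted Euclidean loss of (3) over N landmarks
    n = len(labels)
    sq = ((pred - labels) ** 2).sum(axis=1)
    return (vis * sq).sum() / (2.0 * n)

def loss_v(pred_vis, vis):
    # Euclidean visibility loss of (4)
    return ((pred_vis - vis) ** 2).mean()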
Pose Estimation: We use the Euclidean loss to train the head pose estimates of roll ($p_1$), pitch ($p_2$) and yaw ($p_3$). We compute the loss for a candidate region having an overlap of more than 0.5 with the ground truth, from (5):

$$\mathrm{loss}_P = \frac{(\hat{p}_1 - p_1)^2 + (\hat{p}_2 - p_2)^2 + (\hat{p}_3 - p_3)^2}{3}, \quad (5)$$

where $(\hat{p}_1, \hat{p}_2, \hat{p}_3)$ are the estimated pose labels.
Gender Recognition: Predicting gender is a two-class problem similar to face detection. For a candidate region with overlap of more than 0.5 with the ground truth, we compute the softmax loss given in (6):

$$\mathrm{loss}_G = -(1 - g)\cdot\log(1 - p_0) - g\cdot\log(p_1), \quad (6)$$

where $g = 0$ if the gender is male, and $g = 1$ otherwise. Here, $(p_0, p_1)$ is the two-dimensional probability vector computed from the network.
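The pose and gender losses of (5) and (6) are similarly simple; a sketch:

import numpy as np

def loss_p(pred, gt):
    # pred, gt: length-3 (roll, pitch, yaw) arrays; Euclidean loss of (5)
    return np.mean((np.asarray(pred) - np.asarray(gt)) ** 2)

def loss_g(p, g):
    # p = (p0, p1) probability vector; g = 0 (male) or 1 (female); loss of (6)
    return -(1 - g) * np.log(1 - p[0]) - g * np.log(p[1])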
The total loss is computed as the weighted sum of the five individual losses, as shown in (7):

$$\mathrm{loss}_{full} = \sum_{t=1}^{5} \lambda_t\,\mathrm{loss}_t, \quad (7)$$

where $\mathrm{loss}_t$ is the individual loss corresponding to the $t$-th task. The weight parameter $\lambda_t$ is decided based on the importance of the task in the overall loss. We choose $(\lambda_D = 1, \lambda_L = 5, \lambda_V = 0.5, \lambda_P = 5, \lambda_G = 2)$ for our experiments. Higher weights are assigned to the landmark localization and pose estimation tasks as they need spatial accuracy.
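The overall objective of (7) then reduces to a weighted sum over the per-task losses computed above (a sketch, not the authors' training code):

TASK_WEIGHTS = {"detection": 1.0, "landmarks": 5.0, "visibility": 0.5,
                "pose": 5.0, "gender": 2.0}

def loss_full(task_losses):
    # task_losses: dict mapping task name to its scalar loss for a batch
    return sum(w * task_losses[t] for t, w in TASK_WEIGHTS.items())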
3.3 Testing
Fig. 3. Candidate face region (red box on left) obtained using Selective Search gives a low score for face detection, while landmarks are correctly localized. We generate a new face region (red box on right) using the landmarks information and feed it through the network to increase the detection score.

From a given test image, we first extract candidate region proposals using [43]. For each region, we predict the task labels by a forward pass through the HyperFace network. Only regions with detection scores above a certain threshold are classified as faces and processed for the subsequent tasks. The predicted landmark points are scaled and shifted to the image co-ordinates using (8):

$$(x_i, y_i) = (\hat{x}_i\,w + x,\ \hat{y}_i\,h + y), \quad (8)$$

where $(\hat{x}_i, \hat{y}_i)$ are the predicted locations of the $i$-th landmark from the network, and $\{x, y, w, h\}$ are the region parameters defined in (2). Points with predicted visibility below a certain threshold are marked invisible. The pose labels obtained from the network are the estimated roll, pitch and yaw for the face region. The gender is assigned according to the label with the maximum predicted probability.
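A sketch of this inverse mapping of (8) and the visibility thresholding (the threshold value here is an assumption):

import numpy as np

def to_image_coords(pred, pred_vis, region, vis_thresh=0.5):
    # pred: (N, 2) region-relative landmark predictions; region: (x, y, w, h)
    x, y, w, h = region
    points = pred * np.array([w, h]) + np.array([x, y])  # (x̂·w + x, ŷ·h + y)
    visible = pred_vis > vis_thresh  # low-visibility points marked invisible
    return points, visible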
Obtaining the landmark locations along with the detections enables us to improve the post-processing step so that all the tasks benefit from it. We propose two novel methods to improve the performance: Iterative Region Proposals (IRP) and Landmarks-based Non-Maximum Suppression (L-NMS).
Iterative Region Proposals (IRP): We use a fast version of Selective Search [43], which extracts around 2000 regions from an image; we call this version Fast_SS. It is quite possible that some faces with poor illumination or small size fail to be captured by any candidate region with a high overlap, and the network would then fail to detect that face due to a low score. In these situations, it is desirable to have a candidate box which precisely captures the face. Hence, we generate a new candidate bounding box from the predicted landmark points using the FaceRectCalculator provided by [22], and pass it again through the network. The new region, being more localized, yields a higher detection score and the corresponding task outputs, thus increasing the recall. This procedure can be repeated (say, T times), so that the boxes at a given step are more localized to faces than at the previous step. From our experiments, we found that the localization component saturates in just one step (T = 1), which shows the strength of the predicted landmarks. The pseudo-code of IRP is presented in Algorithm 1. The usefulness of IRP can be seen in Figure 3, which shows a low-resolution face region cropped from the top-right image in Figure 14.
Algorithm 1 Iterative Region Proposals
1: boxes ← selective_search(image)
2: scores ← get_hyperface_scores(boxes)
3: detected_boxes ← boxes(scores ≥ threshold)
4: new_boxes ← detected_boxes
5: for stage = 1 to T do
6:    fids ← get_hyperface_fiducials(new_boxes)
7:    new_boxes ← FaceRectCalculator(fids)
8:    detected_boxes ← [detected_boxes | new_boxes]
9: end
10: final_scores ← get_hyperface_scores(detected_boxes)

Landmarks-based Non-Maximum Suppression (L-NMS): The traditional approach to non-maximum suppression involves selecting the top scoring region and discarding all other regions with overlap more than a certain threshold. This method can fail in the following two scenarios: 1) If a region corresponding to the same detected face has less
overlap with the highest scoring region, it can be detected as a separate face. 2) The highest scoring region might not always be well localized for the face, which can create some discrepancy if two faces are close together. To overcome these issues, we perform NMS on a new region whose bounding box is defined by the boundary co-ordinates $[\min_i x_i, \min_i y_i, \max_i x_i, \max_i y_i]$ of the landmarks for the given region. In this way, the candidate regions get close to each other, thus decreasing the ambiguity of the overlap and improving localization.
Algorithm 2 Landmarks-based NMS
1: Get detected_boxes from Algorithm 1
2: fids ← get_hyperface_fiducials(detected_boxes)
3: precise_boxes ← [min_x, min_y, max_x, max_y](fids)
4: faces ← nms(precise_boxes, overlap)
5: for each face in faces do
6:    top_k_boxes ← get top-k scoring boxes
7:    final_fids ← median(fids(top_k_boxes))
8:    final_pose ← median(pose(top_k_boxes))
9:    final_gender ← median(gender(top_k_boxes))
10:   final_visibility ← median(visibility(top_k_boxes))
11:   final_bounding_box ← FaceRectCalculator(final_fids)
12: end
We apply landmarks-based NMS to keep the top-k boxes, based on the detection scores. The detected face corresponds to the region with the maximum score. The landmark points, pose estimates and gender classification scores are decided by the median of the top-k boxes obtained. Hence, the predictions do not rely on only one face region, but consider the votes from the top-k regions when generating the final output. From our experiments, we found that the best results are obtained with k = 5. The pseudo-code for L-NMS is given in Algorithm 2.
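The landmark-tight boxes and the median fusion over the top-k regions can be sketched as follows (our illustration; the data structures are assumptions):

import numpy as np

def precise_box(fids):
    # Landmark-tight box [min x, min y, max x, max y] used for NMS
    return [fids[:, 0].min(), fids[:, 1].min(),
            fids[:, 0].max(), fids[:, 1].max()]

def lnms_outputs(candidates, k=5):
    # candidates: dicts with 'score', 'fids', 'pose', 'gender', 'visibility'
    top = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
    fuse = lambda key: np.median(np.stack([c[key] for c in top]), axis=0)
    return {"fids": fuse("fids"), "pose": fuse("pose"),
            "gender": fuse("gender"), "visibility": fuse("visibility"),
            "score": top[0]["score"]}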
4 NETWORK ARCHITECTURES
To emphasize the importance of the multitask approach and the fusion of intermediate CNN layers, we study the performance of simpler CNNs devoid of such features. We evaluate four R-CNN-based models, one for each task of face detection, landmark localization, pose estimation and gender recognition. We also build a separate Multitask Face
Fig. 4. R-CNN-based network architectures for (a) Face Detection (RCNN Face), (b) Landmark Localization (RCNN Fiducial), (c) Pose Estimation
(RCNN Pose), and (d) Gender Recognition (RCNN Gender). The numbers on the left denote the kernel size and the numbers on the right denote
the cardinality of feature maps for a given layer.
model which performs multitask learning just like HyperFace, but does not fuse information from the intermediate layers. These models are described as follows:
RCNN Face: This model is used for the face detection task. The network architecture is shown in Figure 4(a). For training RCNN Face, we use the region proposals from the AFLW [22] training set, each associated with a face label based on its overlap with the ground truth. The loss is computed as per (1). The model parameters are initialized using the AlexNet [23] weights trained on the ImageNet dataset [7]. Once trained, the learned parameters from this network are used to initialize other models, including Multitask Face and HyperFace, as the standard ImageNet initialization does not converge well. We also perform a linear bounding-box regression to localize the face co-ordinates.
RCNN Fiducial: This model is used for locating the
landmarks. The network architecture is shown in Fig-
ure 4(b). It simultaneously learns the visibility of the points
to account for the invisible points at test time, and thus
can be used as a standalone fiducial extractor. The loss functions for landmark localization and visibility of points are computed using (3) and (4), respectively. Only region proposals which have an overlap greater than 0.5 with the ground truth bounding box are used for training. The model parameters are initialized from RCNN Face.
RCNN Pose: This model is used for the head pose estimation task. The outputs of the network are the roll, pitch and yaw of the face. Figure 4(c) presents the network architecture. Similar to RCNN Fiducial, only region proposals with overlap greater than 0.5 with the ground truth bounding box are used for training. The training loss is computed using (5).
Fig. 5. Network Architecture of Multitask Face. The numbers on the
left denote the kernel size and the numbers on the right denote the
cardinality of feature maps for a given layer.
Fig. 6. Performance evaluation on (a) the AFW dataset, (b) the PASCAL faces dataset. The numbers in the legend are the mean average precision
for the corresponding datasets.
RCNN Gender: This model is used for the face gender recognition task. The network architecture is shown in Figure 4(d). It has the same training set as RCNN Fiducial and RCNN Pose. The training loss is computed using (6).
Multitask Face: Similar to HyperFace, this model is used to simultaneously detect faces, localize landmarks, estimate the pose and predict the gender. The only difference between Multitask Face and HyperFace is that HyperFace fuses the intermediate layers of the network, whereas Multitask Face combines the tasks using the common fully connected layer at the end of the network, as shown in Figure 5. Since it provides the landmarks and face score, it leverages the iterative region proposals and landmarks-based NMS post-processing algorithms during evaluation.
The performances of all the above models on their respective tasks are evaluated and discussed in detail in Section 5.
5 EXPERIMENTAL RESULTS
We evaluated the proposed HyperFace method, along with Multitask Face, RCNN Face, RCNN Fiducial, RCNN Pose and RCNN Gender, on six challenging datasets:
• Annotated Face in-the-Wild (AFW) [57] for evaluating face detection, landmark localization, and pose estimation tasks
• Annotated Facial Landmarks in the Wild (AFLW) [22] for evaluating landmark localization and pose estimation tasks
• Face Detection Dataset and Benchmark (FDDB) [18] and PASCAL faces [48] for evaluating face detection results
• Large-scale CelebFaces Attributes (CelebA) [31] and LFWA [17] for evaluating gender recognition results.
Our method was trained on randomly selected 20,997 im-
ages from the AFLW dataset using Caffe [19]. The remaining
1000 images were used for testing.
5.1 Face Detection
We show face detection results for the AFW, PASCAL and FDDB
datasets. The AFW dataset [57] was collected from Flickr
and the images in this dataset contain large variations in
appearance and viewpoint. In total there are 205 images
with 468 faces in this dataset. The FDDB dataset [18] consists
of 2,845 images containing 5,171 faces collected from news
articles on the Yahoo website. This dataset is the most
widely used benchmark for unconstrained face detection.
The PASCAL faces dataset [48] was collected from the test
set of PASCAL person layout dataset, which is a subset from
PASCAL VOC [8]. This dataset contains 1335 faces from 851
images with large appearance variations. For improved face detection performance, we learn an SVM classifier on top of the fc_detection features, using the training splits from the FDDB dataset.
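A minimal sketch of this rescoring step (the features below are random stand-ins for the 512-d fc_detection activations, and the C value is an assumption):

import numpy as np
from sklearn.svm import LinearSVC

X_train = np.random.randn(1000, 512)          # placeholder fc_detection features
y_train = np.random.randint(0, 2, size=1000)  # placeholder face / non-face labels

svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)
scores = svm.decision_function(np.random.randn(10, 512))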
Some of the recently published methods compared in our
evaluations include DP2MFD [35], Faceness [50], Head-
Hunter [32], JointCascade [5], CCF [49], SquaresChnFtrs-
5 [32], CascadeCNN [27], Structured Models [48],
DDFD [10], NDPFace [30], PEP-Adapt [26], TSM [57], as well
as three commercial systems Face++, Picasa and Face.com.
[Figure 7 plot: ROC curves (true positive rate vs. number of false positives) on FDDB. Legend mAP values: TSM 0.766, PEP-Adapt 0.809, NDPFace 0.817, DDFD 0.84, StructuredModels 0.846, CascadeCNN 0.857, SquaresChnFtrs-5 0.858, CCF 0.859, JointCascade 0.863, DPM (HeadHunter) 0.864, HeadHunter 0.871, Faceness 0.903, DP2MFD 0.913, RCNN_Face 0.838, Multitask_Face 0.885, HyperFace 0.901.]
Fig. 7. Performance evaluation on the FDDB dataset. The numbers in
the legend are the mean average precision.
The precision-recall curves of different detectors cor-
responding to the AFW and the PASCAL faces datasets
are shown in Figures 6 (a) and (b), respectively. Figure 7
compares the performance of different detectors using the
Receiver Operating Characteristic (ROC) curves on the
FDDB dataset. As can be seen from these figures, HyperFace outperforms all the reported academic and commercial detectors on the AFW and PASCAL datasets, with a high mean average precision (mAP) of 97.9% and 92.46%, respectively. The FDDB dataset is very challenging for HyperFace and any other R-CNN-based face detection method, as the dataset contains many small and blurred faces. First, a few of these faces do not get included in the region proposals from selective search. Second, re-sizing small faces to the input size of 227×227 adds distortion to the face, resulting in a low detection score. In spite of these issues, HyperFace's performance is comparable to recently published deep learning-based face detection methods such as DP2MFD [35] and Faceness [50] on the FDDB dataset¹, with a mAP of 90.1%.
It is interesting to note the performance differences between RCNN Face, Multitask Face and HyperFace for the face detection task. Figures 6 and 7 clearly show that the multitask CNNs (Multitask Face and HyperFace) outperform RCNN Face by a wide margin. The performance gain is mainly due to the following two reasons. First, the multitask learning approach helps the network learn improved features for face detection, which is evident from the mAP values on the AFW dataset: using just the linear bounding-box regression and traditional NMS, HyperFace obtains a mAP of 94% (Figure 12), while RCNN Face achieves a mAP of 90.3%. Second, having landmark information associated with detection boxes makes it easier to localize the bounding box to a face, by using the IRP and L-NMS algorithms. On the other hand, HyperFace and Multitask Face perform comparably to each other on all the face detection datasets, which suggests that the network does not gain much by fusing intermediate layers for the face detection task.
[Figure 8 plot: cumulative error distribution (fraction of test faces vs. normalized mean error). Legend values: Multi. AAMs 15.9%, CLM 24.1%, Oxford 57.9%, face.com 69.6%, FaceDPL 76.7%, RCNN_Fiducial 78.0%, Multitask_Face 81.2%, HyperFace 84.5%.]
Fig. 8. Cumulative error distribution curves for landmark localization on the AFW dataset. The numbers in the legend are the fraction of testing faces that have an average error below 5% of the face size.
5.2 Landmark Localization
We evaluate the performance of different landmark localiza-
tion algorithms on the AFW [57] and AFLW [22] datasets.
Both of these datasets contain faces with full pose varia-
tions. Some of the methods compared include Multiview
Active Appearance Model-based method (Multi. AAM) [57],
Constrained Local Model (CLM) [36], Oxford facial land-
mark detector [9], Zhu [57], FaceDPL [58], JointCascade [5],
1. http://vis-www.cs.umass.edu/fddb/results.html
CDM [52], RCPR [3], ESR [4], SDM [46] and 3DDFA [56]. Although both of these datasets provide ground truth bounding boxes, we do not use them when evaluating HyperFace, Multitask Face and RCNN Fiducial. Instead, we use the respective algorithms to detect both the face and its fiducial points. Since RCNN Fiducial cannot detect faces, we provide it with the detections from HyperFace.
Figure 8 compares the performance of different landmark localization methods on the AFW dataset, using the protocol defined in [58]. In this figure, (*) indicates models that are evaluated only on near-frontal faces or use hand initialization [57]. The dataset provides six keypoints for each face: left eye center, right eye center, nose tip, mouth left, mouth center and mouth right. We compute the error as the mean distance between the predicted and ground truth keypoints, normalized by the face size. The plots for comparison were obtained from [58].
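The error measure can be sketched as follows (our illustration; the exact face-size normalizer is dataset-specific):

import numpy as np

def nme(pred, gt, face_size):
    # pred, gt: (N, 2) arrays of visible keypoints; face_size: scalar
    return np.linalg.norm(pred - gt, axis=1).mean() / face_size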
[Figure 9 plot: cumulative error distribution (fraction of test faces vs. normalized mean error, %). Legend NME values: CDM 12.44, ESR 8.24, RCPR 7.85, SDM 6.55, 3DDFA 5.60, 3DDFA+SDM 5.32, RCNN_Fiducial 4.76, Multitask_Face 4.79, HyperFace 4.26.]
Fig. 9. Cumulative error distribution curves for landmark localization on the AFLW dataset. The numbers in the legend are the mean NME (%) for each method.
For the AFLW dataset, we calculate the error using all the visible keypoints and adopt the same protocol as defined in [56]. The only difference is that our AFLW test set consists of only 1000 images with 1132 face samples, since we use the rest of the images for training. To be consistent with the protocol, we randomly create a subset of 450 samples from our test set, with 1/3 of samples each having absolute yaw angles within [0°, 30°], [30°, 60°] and [60°, 90°]. Figure 9 compares the performance of different landmark localization methods. We obtain the comparison plots from [56], where the evaluations for RCPR, ESR and SDM are carried out after adapting the algorithms to face profiling. Table 1 provides the Normalized Mean Error (NME) on the AFLW dataset for each pose group.
As can be seen from the figures, HyperFace outperforms many recent state-of-the-art landmark localization methods, including FaceDPL [58], 3DDFA [56] and SDM [46]. Table 1 shows that HyperFace is consistently accurate over all pose angles. This clearly suggests that while most of the methods work well on frontal faces, HyperFace is able to predict landmarks for faces with full pose variations. Moreover, we find that RCNN Fiducial and Multitask Face also outperform the earlier methods, performing comparably to each other. HyperFace has an advantage over them as it uses the intermediate layers for fusion.
[Figure 10 plots: cumulative error distributions for (a) roll, (b) pitch and (c) yaw. Mean errors in degrees — roll: RCNN_Pose 4.91, Multitask_Face 4.64, HyperFace 3.92; pitch: RCNN_Pose 6.47, Multitask_Face 6.46, HyperFace 6.13; yaw: RCNN_Pose 9.91, Multitask_Face 9.06, HyperFace 7.61.]
Fig. 10. Performance evaluation of Pose Estimation on AFLW dataset for (a) roll (b) pitch and (c) yaw angles. The numbers in the legend are the
mean error in degrees for the respective pose angles.
TABLE 1
The NME (%) of face alignment results on the AFLW test set, with the best results highlighted.

AFLW Dataset (21 pts)
Method           [0°, 30°]   [30°, 60°]   [60°, 90°]   mean    std
CDM                8.15        13.02        16.17       12.44   4.04
RCPR               5.43         6.58        11.53        7.85   3.24
ESR                5.66         7.12        11.94        8.24   3.29
SDM                4.75         5.55         9.34        6.55   2.45
3DDFA              5.00         5.06         6.74        5.60   0.99
3DDFA+SDM          4.75         4.83         6.38        5.32   0.92
RCNN Fiducial      4.49         4.70         5.09        4.76   0.30
Multitask Face     4.20         4.93         5.23        4.79   0.53
HyperFace          3.93         4.14         4.71        4.26   0.41
The local information is well contained in the lower layers of the CNN and becomes invariant as depth increases. Fusing the layers brings out this hidden information, which boosts the performance of the landmark localization task.
5.3 Pose Estimation
We evaluate RCNN Pose, Multitask Face and HyperFace on the AFW [57] and AFLW [22] datasets for the pose estimation task. The detection boxes used for evaluating the landmark localization task are used here as well for initialization. For the AFW dataset, we compare our approach with Multi. AAM [57], Multiview HoG [57], FaceDPL² [58] and face.com. Note that the multiview AAMs are initialized using the ground truth bounding boxes (denoted by *). Figure 11 shows the cumulative error distribution curves on the AFW dataset. The curve provides the fraction of faces for which the estimated pose is within some error tolerance. As can be seen from the figure, the HyperFace method achieves the best performance and beats FaceDPL by a large margin. For the AFLW dataset, there is no pose estimation evaluation for any previous method. Hence, we show the performance of our method for different pose angles, roll, pitch and yaw, in Figure 10 (a), (b) and (c), respectively. It can be seen that the network is able to learn roll and pitch information better than yaw.
The performance traits of RCNN Pose, Multitask Face and HyperFace for the pose estimation task are similar to those of the landmark localization task. RCNN Pose and Multitask Face perform comparably to each other, whereas
2. Available at: http://www.ics.uci.edu/~dramanan/software/face/face-journal.pdf
[Figure 11 plot: cumulative error distribution (fraction of test faces vs. pose estimation error in degrees). Legend values: Multi. HoG 74.6%, Multi. AAMs 36.8%, face.com 64.3%, FaceDPL 89.4%, RCNN_Pose 95.0%, Multitask_Face 95.9%, HyperFace 97.7%.]
Fig. 11. Cumulative error distribution curves for pose estimation on the AFW dataset. The numbers in the legend are the percentage of faces that are labeled within a ±15° error tolerance.
HyperFace achieves a boosted performance due to the intermediate-layer fusion. This shows that tasks which rely on the structure and orientation of the face work well with features from the lower layers of the CNN.
5.4 Gender Recognition
We show the gender recognition performance on
CelebA [31] and LFWA [17] datasets since these datasets
come with gender information. The CelebA and LFWA
datasets contain labeled images selected from the Celeb-
Faces [39] and LFW [17] datasets, respectively [31]. The
CelebA dataset contains 10,000 identities and there are
200,000 images in total. The LFWA dataset has 13,233 images
of 5,749 identities. We compare our approach with FaceTracer [25], PANDA-w [54], PANDA-1 [54], [28]+ANet and LNets+ANet [31]. The gender recognition performance of different methods is reported in Table 2. On the LFWA dataset, our method outperforms PANDA [54] and FaceTracer [25], and matches [31]. On the CelebA dataset, our method performs comparably to [31]. Unlike [31], which uses 180,000 images for training and validation, we only use 20,000 images from the validation set of CelebA to fine-tune the network.
Similar to the face detection task, we find that gender recognition performs better with HyperFace and Multitask Face than with RCNN Gender, showing that learning related tasks together improves the discriminative capability of the individual tasks. Again, we do not see
TABLE 2
Performance comparison (in %) of gender recognition on the CelebA and LFWA datasets.

Method              CelebA   LFWA
FaceTracer [25]       91       84
PANDA-w [54]          93       86
PANDA-1 [54]          97       92
[28]+ANet             95       91
LNets+ANet [31]       98       94
RCNN Gender           95       91
Multitask Face        97       93
HyperFace             97       94
much difference in the performance of Multitask Face and HyperFace, suggesting that the intermediate layers do not contribute much to the gender recognition task.
5.5 Effect of Post-Processing
Figure 12 provides an experimental analysis of the post-processing methods, IRP and L-NMS, on the face detection task on the AFW dataset. Fast_SS denotes the quick version of selective search, which produces around 2000 region proposals and takes 2 s per image to compute. On the other hand, Quality_SS refers to its slow version, which outputs more than 10,000 region proposals, consuming more than 10 s per image. HyperFace with a linear bounding-box regression and traditional NMS achieves a mAP of 94%. Replacing them with L-NMS alone provides a boost of 1.2%; in this case, the bounding box is constructed using the landmark information rather than linear regression. Additionally, we can see from the figure that although Quality_SS generates more region proposals, it performs worse than Fast_SS with iterative region proposals. IRP adds around 300 new regions for a typical image, consuming less than 0.5 s, which makes it highly efficient compared to Quality_SS.
[Figure 12 plot: precision vs. recall on AFW. Legend AP values: Fast_SS + LR 94%, Fast_SS + L-NMS 95.2%, Quality_SS + L-NMS 97.3%, Fast_SS + L-NMS + IRP 97.9%.]
Fig. 12. Variations in performance of HyperFace with respect to the
Iterative Region Proposals and Landmarks-based NMS. The numbers
in the legend are the mean average precision.
5.6 Runtime
The HyperFace method was tested on a machine with 8 cores and a GTX TITAN-X GPU. The overall time taken to perform all four tasks was 3 s per image. The limitation was not the CNN, but selective search, which takes approximately 2 s to generate candidate region proposals. One forward pass through the HyperFace network takes only 0.2 s.
6 DISCUSSION
We discuss a few crucial observations from our experiments. First, all the face-related tasks benefit from the multitask learning framework. The gain is mainly due to the network's ability to learn more discriminative features, and to post-processing methods which can be leveraged by having landmarks as well as detection scores for a region. Second, fusing the intermediate layers improves the performance on the structure-dependent tasks of pose estimation and landmark localization, as the features become invariant to geometry in the deeper layers of the CNN. HyperFace exploits these observations to improve the performance on all four tasks.
We also visualize the features learned by the HyperFace network. Figure 13 shows the network activations for a few selected feature maps out of the 192 in the conv_all layer. It can be seen that some feature maps are dedicated solely to a single task, while others can be used to predict different tasks. For example, feature maps 27 and 186 can be used for face detection and gender recognition, respectively. The former distinguishes the face and non-face regions, whereas the latter outputs high activations for female faces. Similarly, feature map 19 shows high activation near the eye and mouth regions, while feature map 96 gives a rough contour of the face orientation. Both of these features can be used for the landmark localization and pose estimation tasks.
Several qualitative results of our method on the AFW,
PASCAL and FDDB datasets are shown in Figure 14. As
can be seen from this figure, our method is able to simul-
taneously perform all the four tasks on images containing
extreme pose, illumination, and resolution variations with
cluttered background.
7 CONCLUSION
In this paper, we presented a multi-task deep learning
method called HyperFace for simultaneously detecting
faces, localizing landmarks, estimating head pose and iden-
tifying gender. Extensive experiments using various pub-
licly available unconstrained datasets demonstrate the ef-
fectiveness of our method on all four tasks. In the future, we will evaluate the performance of our method on other applications such as simultaneous human detection and human pose estimation, object recognition and pedestrian detection.
Fig. 13. Activations of selected feature maps from the conv_all layer of the HyperFace architecture. Green and yellow denote high activations, whereas blue denotes low activation units. These features depict the distinguishable face traits for the tasks of face detection, landmark localization, pose estimation and gender recognition.

ACKNOWLEDGMENTS
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
REFERENCES
[1] A. Agarwal and B. Triggs. Multilevel image coding with hyperfea-
tures. International Journal of Computer Vision, pages 15–27, 2008.
[2] V. Balasubramanian, J. Ye, and S. Panchanathan. Biased manifold
embedding: A framework for person-independent head pose es-
timation. In Computer Vision and Pattern Recognition, 2007. CVPR
’07. IEEE Conference on, pages 1–7, June 2007.
[3] X. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. In Proceedings of the IEEE International Conference on Computer Vision, pages 1513–1520, 2013.
[4] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape
regression. International Journal of Computer Vision, 107(2):177–190,
2014.
[5] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade
face detection and alignment. In D. Fleet, T. Pajdla, B. Schiele,
and T. Tuytelaars, editors, European Conference on Computer Vision,
volume 8694, pages 109–122. 2014.
[6] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active
shape models—their training and application. Comput. Vis.
Image Underst., 61(1):38–59, Jan. 1995.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet:
A large-scale hierarchical image database. In Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages
248–255. IEEE, 2009.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and
A. Zisserman. The pascal visual object classes (voc) challenge.
International Journal of Computer Vision, 88(2):303–338, June 2010.
[9] M. R. Everingham, J. Sivic, and A. Zisserman. “Hello! My name is... Buffy” – automatic naming of characters in TV video. In Proceedings of the British Machine Vision Conference, pages 92.1–92.10, 2006.
[10] S. S. Farfade, M. Saberian, and L.-J. Li. Multi-view face detection
using deep convolutional neural networks. In International Confer-
ence on Multimedia Retrieval, 2015.
[11] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan.
Object detection with discriminatively trained part-based mod-
els. IEEE Transactions on Pattern Analysis and Machine Intelligence,
32(9):1627–1645, Sept 2010.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hi-
erarchies for accurate object detection and semantic segmentation.
In Computer Vision and Pattern Recognition, 2014.
[13] G. Gkioxari, B. Hariharan, R. B. Girshick, and J. Malik. R-cnns for
pose estimation and action detection. CoRR, abs/1406.5212, 2014.
[14] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Computer Vision and Pattern Recognition (CVPR), 2015.
[15] N. Hu, W. Huang, and S. Ranganath. Head pose estimation by
non-linear embedding and mapping. In Image Processing, 2005.
ICIP 2005. IEEE International Conference on, volume 2, pages II–342–5, Sept 2005.

Fig. 14. Qualitative results of our method. The blue boxes denote detected male faces, while pink boxes denote female faces. The green dots provide the landmark locations. Pose estimates for each face are shown on top of the boxes in the order of roll, pitch and yaw.
[16] P. Hu and D. Ramanan. Bottom-up and top-down reasoning with
convolutional latent-variable models. CoRR, abs/1507.05699, 2015.
[17] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled
faces in the wild: A database for studying face recognition in
unconstrained environments. Technical Report 07-49, University
of Massachusetts, Amherst, Oct. 2007.
[18] V. Jain and E. Learned-Miller. Fddb: A benchmark for face
detection in unconstrained settings. Technical Report UM-CS-
2010-009, University of Massachusetts, Amherst, 2010.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture
for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[20] A. Jourabloo and X. Liu. Pose-invariant 3d face alignment. In The
IEEE International Conference on Computer Vision (ICCV), December
2015.
[21] V. Kazemi and J. Sullivan. One millisecond face alignment with
an ensemble of regression trees. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 1867–1874, June 2014.
[22] M. Kostinger, P. Wohlhart, P. Roth, and H. Bischof. Annotated
facial landmarks in the wild: A large-scale, real-world database
for facial landmark localization. In IEEE International Conference on
Computer Vision Workshops, pages 2144–2151, Nov 2011.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classi-
fication with deep convolutional neural networks. In F. Pereira,
C. Burges, L. Bottou, and K. Weinberger, editors, Advances in
Neural Information Processing Systems 25, pages 1097–1105. Curran
Associates, Inc., 2012.
[24] A. Kumar, R. Ranjan, V. M. Patel, and R. Chellappa. Face align-
ment by local deep descriptor regression. CoRR, abs/1601.07950,
2016.
[25] N. Kumar, P. N. Belhumeur, and S. K. Nayar. FaceTracer: A Search
Engine for Large Collections of Images with Faces. In European
Conference on Computer Vision (ECCV), pages 340–353, Oct 2008.
[26] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang. Probabilistic elastic
part model for unsupervised face detector adaptation. In IEEE
International Conference on Computer Vision, pages 793–800, Dec
2013.
[27] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional
neural network cascade for face detection. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 5325–5334, June
2015.
[28] J. Li and Y. Zhang. Learning surf cascade for fast and accurate
object detection. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 3468–3475, 2013.
[29] L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via
component-based discriminative search. In D. A. Forsyth, P. H. S.
Torr, and A. Zisserman, editors, ECCV (2), volume 5303 of Lecture
Notes in Computer Science, pages 72–85. Springer, 2008.
[30] S. Liao, A. Jain, and S. Li. A fast and accurate unconstrained
face detector. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2015.
[31] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes
in the wild. In International Conference on Computer Vision, Dec.
2015.
[32] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face
detection without bells and whistles. In European Conference on
Computer Vision, volume 8692, pages 720–735. 2014.
[33] I. Matthews and S. Baker. Active appearance models revisited. Int.
J. Comput. Vision, 60(2):135–164, Nov. 2004.
[34] E. Murphy-Chutorian and M. Trivedi. Head pose estimation in
computer vision: A survey. Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on, 31(4):607–626, April 2009.
[35] R. Ranjan, V. M. Patel, and R. Chellappa. A deep pyramid de-
formable part model for face detection. In International Conference
on Biometrics Theory, Applications and Systems, 2015.
[36] J. Saragih, S. Lucey, and J. Cohn. Deformable model fitting by
regularized landmark mean-shift. International Journal of Computer
Vision, 91(2):200–215, 2011.
[37] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. Lecun. Pedes-
trian detection with unsupervised multi-stage feature learning.
In Proceedings of the 2013 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR ’13, pages 3626–3633, Washington, DC,
USA, 2013. IEEE Computer Society.
[38] S. Srinivasan and K. Boyer. Head pose estimation using view
based eigenspaces. In Pattern Recognition, 2002. Proceedings. 16th
International Conference on, volume 4, pages 302–305 vol.4, 2002.
[39] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face
representation by joint identification-verification. In Advances in
Neural Information Processing Systems, pages 1988–1996. 2014.
[40] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade
for facial point detection. In Proceedings of the 2013 IEEE Conference
on Computer Vision and Pattern Recognition, CVPR ’13, pages 3476–
3483, Washington, DC, USA, 2013. IEEE Computer Society.
[41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with
convolutions. CoRR, abs/1409.4842, 2014.
[42] G. Tzimiropoulos and M. Pantic. Gauss-newton deformable part
models for face alignment in-the-wild. In IEEE Conference on
Computer Vision and Pattern Recognition, pages 1851–1858, June
2014.
[43] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M.
Smeulders. Segmentation as selective search for object recognition.
In Proceedings of the 2011 International Conference on Computer
Vision, ICCV ’11, pages 1879–1886, Washington, DC, USA, 2011.
IEEE Computer Society.
[44] P. A. Viola and M. J. Jones. Robust real-time face detection.
International Journal of Computer Vision, 57(2):137–154, 2004.
[45] X. Xiong and F. De la Torre. Global supervised descent method. In
CVPR, 2015.
[46] X. Xiong and F. De la Torre. Supervised descent method and
its application to face alignment. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2013.
[47] J. Yan, Z. Lei, D. Yi, and S. Z. Li. Learn to combine multiple
hypotheses for accurate face alignment. In Proceedings of the 2013
IEEE International Conference on Computer Vision Workshops, ICCVW
’13, pages 392–396, Washington, DC, USA, 2013. IEEE Computer
Society.
[48] J. Yan, X. Zhang, Z. Lei, and S. Z. Li. Face detection by structural
models. Image and Vision Computing, 32(10):790 – 799, 2014.
[49] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Convolutional channel features.
In IEEE International Conference on Computer Vision, 2015.
[50] S. Yang, P. Luo, C. C. Loy, and X. Tang. From facial parts responses
to face detection: A deep learning approach. In IEEE International
Conference on Computer Vision, 2015.
[51] S. Yang and D. Ramanan. Multi-scale recognition with dag-cnns.
In The IEEE International Conference on Computer Vision (ICCV),
December 2015.
[52] X. Yu, J. Huang, S. Zhang, W. Yan, and D. Metaxas. Pose-free
facial landmark fitting via optimized part mixtures and cascaded
deformable shape model. In Proceedings of the IEEE International
Conference on Computer Vision, pages 1944–1951, 2013.
[53] M. D. Zeiler and R. Fergus. Visualizing and understanding
convolutional networks. CoRR, abs/1311.2901, 2013.
[54] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev.
Panda: Pose aligned networks for deep attribute modeling. In
IEEE Conference on Computer Vision and Pattern Recognition, pages
1637–1644, 2014.
[55] Z. Zhang, P. Luo, C. Loy, and X. Tang. Facial landmark detection
by deep multi-task learning. In European Conference on Computer
Vision, pages 94–108, 2014.
[56] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across
large poses: A 3d solution. CoRR, abs/1511.07212, 2015.
[57] X. Zhu and D. Ramanan. Face detection, pose estimation, and
landmark localization in the wild. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 2879–2886, June 2012.
[58] X. Zhu and D. Ramanan. FaceDPL: Detection, pose estimation,
and landmark localization in the wild. preprint 2015.