A Comparison of Face Verification with Facial Landmarks and Deep Features
Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro and Claudio Vairo
Institute of Information Science and Technologies of the National Research Council of Italy (ISTI-CNR)
via G. Moruzzi 1, 56124 Pisa, Italy
Email: {giuseppe.amato, fabrizio.falchi, claudio.gennaro, claudio.vairo}
Abstract—Face verification is a key task in many application
fields, such as security and surveillance. Several approaches and
methodologies are currently used to try to determine if two faces
belong to the same person. Among these, facial landmarks are
very important in forensics, since the distance between some
characteristic points of a face can be used as an objective
measure in court during trials. However, the accuracy of the
approaches based on facial landmarks in verifying whether
a face belongs to a given person or not is often not quite
good. Recently, deep learning approaches have been proposed
to address the face verification problem, with very good results.
In this paper, we compare the accuracy of facial landmarks and
deep learning approaches in performing the face verification task.
Our experiments, conducted on a real case scenario, show that
the deep learning approach greatly outperforms in accuracy the
facial landmarks approach.
Keywords—Face Verification; Facial Landmarks; Deep Learning;
Surveillance; Security.
I. INTRODUCTION

Face verification has recently been gaining importance. It consists in determining whether two faces in two different images belong to the same person or not. Face recognition, on the other hand, aims at assigning an identity to the person a face belongs to. In this paper, we are interested in the face verification problem.
To address the face verification problem, several ap-
proaches and techniques have been proposed. Some ap-
proaches are based on local features of the images, such
as Local Binary Pattern (LBP) [1]. Some other approaches
are based on detecting the facial landmarks from the de-
tected face and on measuring the distance between some
of these landmarks. Recently, Deep Learning approaches based on Convolutional Neural Networks (CNNs), such as [2], have been proposed to address the face verification problem. Facial
landmarks are particularly useful when forensics cases have
to be discussed in court since they provide objective measures
that can be presented to discuss face verification. However, as
we will show in the paper, face verification with distances of
automatically extracted facial landmarks is outperformed by
methods based on Deep Learning. Facial landmarks should
be used after verification is executed using Deep Learning
approaches, to provide objective motivation to the decision.
In this paper, we compare the results of performing the
face verification with facial landmarks and a Deep Learning
based approach. We validated our comparison by analyzing
some videos taken in a real-scenario by surveillance cameras
placed in the Instytut Ekspertyz Sądowych in Krakow [3]. To
this purpose, we used the Labeled Faces in the Wild (LFW)
dataset [4] as confusion dataset. In particular, we used the
faces detected in these videos as queries to perform a Nearest
Neighbor (NN) search with a joined dataset comprising both
LFW and the test set videos, in order to classify the persons
according to their face similarity.
The rest of the paper is organized as follows: Section
II gives a brief overview of the current approaches to the
face verification problem. In Section II-A, we describe the
features obtained from the facial landmarks that we analyze
and compare in this work. In Section II-B, we present the
deep feature that we compare to the facial landmarks features.
Section III presents an analysis on some of the facial landmarks
features and the experiments on the accuracy of all the features
considered. Finally, Section IV concludes this work.
II. RELATED WORK

The use of face information to verify the identity of a person is a research area experiencing rapid development, thanks to recent advances in deep learning. This approach falls under the umbrella of the more general identity verification problem [5]. Among the various types of facial information that can be used, a fairly obvious one is that coming from the facial landmarks [6]–[9]. Deep features learned by convolutional networks have shown impressive performance in classification and recognition problems. For instance, 99.77% accuracy on LFW under the 6,000-pair evaluation protocol has been achieved by Liu et al. [10], and 99.33% by Schroff et al. of Google [11]. As in our proposed approach, approximate nearest neighbor search methods can be used to improve scalability; they work very well as a lazy learning method [12], [13] and can also be implemented on top of a full-text search engine [14].
Figure 1. 68 facial landmarks.
(a) Selected 5 nodal points. (b) 5 nodal points distances.
Figure 2. Nodal points and distances used to build the 5-points features.
A. Facial Landmarks Features
Facial landmarks are key points along the shape of the
detected face and they can be used as face features to perform
several tasks, such as improving face recognition, aligning facial images, distinguishing males and females, estimating the head pose, and so on.
The raw landmark key points are rarely used directly as a representation for face verification tasks; typically, facial nodal points are used instead. As nodal points, we can either directly use some
of the facial landmarks or we can compute some new points
starting from the facial landmarks. For example, the eyes, the
nose, and the mouth are very representative parts of a person’s
face, so points relative to these parts of the face can be relevant
to represent that face. In particular, for example, for the eyes,
we can use the centroid of the eye instead of using the facial
landmarks that constitute the contour of the eye.
In order to perform the face detection and to extract the
facial landmarks from an image, we used the dlib library [15].
In particular, the face detector uses the Histogram of Oriented Gradients (HOG) feature combined with a linear classifier, an image pyramid, and a sliding-window detection scheme. The facial landmark detector is an implementation of the approach presented by Kazemi et al. in [16]. It returns an array of 68 points in the form of (x, y) coordinates that map to facial structures of the face, as shown in Figure 1. The
computational time for extracting the facial landmarks from
the image reported in Figure 2 on a MacBook Pro 2013 with
an Intel Core i7 2.5 GHz is about 70 ms.
The distances between nodal points and facial landmarks
can be used to build a feature of the face that can be compared with the features of other faces. In particular, we computed three
features based on the distances between nodal points and facial
landmarks: the 5-points feature, the 68-points feature and the
Pairs feature. All the distances used to compute these features
are normalized to the size of the bounding box of the face.
In particular, each distance is divided by the diagonal of the
bounding box.
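As an illustration, the normalization step can be sketched in Python as follows (the function names are ours, not taken from the paper's implementation):

```python
import math

def bbox_diagonal(left, top, right, bottom):
    """Diagonal length of the face bounding box."""
    return math.hypot(right - left, bottom - top)

def normalized_distance(p, q, diag):
    """Euclidean distance between two landmark points (x, y), divided by
    the bounding-box diagonal so that faces of different sizes are comparable."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) / diag
```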
1) 5-points feature: In order to build the 5-points feature,
we used five specific nodal points: the centroids of the two
eyes, the center of the nose, and the sides of the mouth.
The centroids of the two eyes are computed from the six
facial landmarks for each eye returned by the dlib library. For
the nodal points of the nose and of the mouth, instead, we
used directly some of the facial landmarks, respectively the
Figure 3. Distances from the centroid of the face to all 68 facial landmarks,
used to build the 68-points features.
landmark #31 for the nose and the landmarks #49 and #55 for
the sides of the mouth (see Figure 2(a)). We used these nodal
points to compute the following 5 distances (see Figure 2(b)):
left eye centroid - right eye centroid
left eye centroid - nose
right eye centroid - nose
nose - left mouth
nose - right mouth
This produces a 5-dimensional float vector that we used as the 5-points feature of the face.
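A minimal sketch of how the 5-points feature can be computed from a (68, 2) array of landmark coordinates, assuming dlib's standard 0-based 68-point indexing (the index constants and function name are our own, not from the paper's code):

```python
import numpy as np

# dlib's 68-landmark indexing (0-based): right eye 36-41, left eye 42-47,
# nose tip is point 30 (#31 in 1-based numbering), mouth corners 48 and 54.
RIGHT_EYE, LEFT_EYE = slice(36, 42), slice(42, 48)
NOSE, MOUTH_L, MOUTH_R = 30, 48, 54

def five_points_feature(landmarks, diag):
    """5-points feature: normalized distances among the 5 nodal points.
    `landmarks` is a (68, 2) array of (x, y) coordinates and `diag` is the
    bounding-box diagonal used for normalization."""
    lm = np.asarray(landmarks, dtype=float)
    left_eye = lm[LEFT_EYE].mean(axis=0)    # centroid of the 6 eye landmarks
    right_eye = lm[RIGHT_EYE].mean(axis=0)
    nose, m_l, m_r = lm[NOSE], lm[MOUTH_L], lm[MOUTH_R]
    pairs = [(left_eye, right_eye), (left_eye, nose), (right_eye, nose),
             (nose, m_l), (nose, m_r)]
    return np.array([np.linalg.norm(p - q) for p, q in pairs]) / diag
```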
2) 68-points feature: For the 68-points feature, we com-
puted the centroid of all the 68 facial landmarks returned by the
dlib library and we computed the distance between this point
and all the 68 facial landmarks (see Figure 3). This produces a 68-dimensional float vector that we used as the 68-points feature of the face.
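Assuming a (68, 2) array of landmark coordinates, the 68-points feature can be sketched in Python as follows (the function name is ours):

```python
import numpy as np

def sixty_eight_points_feature(landmarks, diag):
    """68-points feature: normalized distances from the centroid of all
    68 facial landmarks to each of the 68 landmarks."""
    lm = np.asarray(landmarks, dtype=float)   # shape (68, 2)
    centroid = lm.mean(axis=0)
    return np.linalg.norm(lm - centroid, axis=1) / diag
```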
3) Pairs feature: The pairs feature is obtained by comput-
ing the distance of all unique pairs of points taken from the
68 facial landmarks computed on the input face, as suggested
in [9]. This produces a vector of 2,278 float distances that we used as the Pairs feature of the face.
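The Pairs feature can be sketched in Python as follows (the function name is ours; note that C(68, 2) = 2278):

```python
from itertools import combinations
import math

def pairs_feature(landmarks, diag):
    """Pairs feature: normalized distances of all unique pairs of the
    68 facial landmarks, i.e. C(68, 2) = 2278 values."""
    return [math.hypot(p[0] - q[0], p[1] - q[1]) / diag
            for p, q in combinations(landmarks, 2)]
```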
B. Deep Features
Deep Learning [17] is a branch of machine learning that
uses lots of labeled data to teach computers how to perform
perceptive tasks like vision or hearing, with a near-human
level of accuracy. In particular, in computer vision tasks,
CNNs are exploited to learn features from labeled data. A
CNN learns a hierarchy of features, starting from low level
(pixels), to high level (classes). The learned feature is thus
optimized for the task and there is no need to handcraft it. Deep
Figure 4. Structure of the VGG-Face CNN used to extract the deep features.
Learning approaches give very good results in tasks like image classification, object detection and recognition, scene understanding, natural language processing, traffic sign recognition, cancer cell detection, and so on [18]–[21].
However, CNNs are good not only for classification pur-
poses. In fact, as said before, each convolutional layer of a
CNN learns a feature of the input image. In particular, the output of one of the last layers before the output of the network is, in fact, a high-level representation of the input image that can be used as a feature for that image. We call this representation of the image a deep feature. This feature can
be compared to other deep features computed on other faces,
and close deep features vectors mean that the input faces are
semantically similar. Therefore, if their distance is below a
given threshold, we can conclude that the two faces belong to
the same person.
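This threshold-based decision can be sketched in Python as follows (the function name and the threshold handling are ours; the actual threshold must be tuned on validation data for the feature being used):

```python
import numpy as np

def same_person(feat_a, feat_b, threshold):
    """Verification decision: declare that the two faces belong to the
    same person when the Euclidean distance between their feature
    vectors is below the given threshold."""
    dist = float(np.linalg.norm(np.asarray(feat_a) - np.asarray(feat_b)))
    return dist < threshold
```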
For this work, we used the VGG-Face network [2] that is
a CNN composed of 16 layers, 13 of which are convolutional.
We took the output of the fully connected layer 7 (FC7) as deep feature, that is, a vector of 4,096 floats (see Figure 4). The computational time for extracting the deep feature from the image reported in Figure 2 on a MacBook Pro 2013 with an Intel Core i7 2.5 GHz is about 300 ms, that is, four times the time needed to extract the facial landmarks from the same image.
III. EXPERIMENTS

In this section, we describe the experiments performed to
compare the accuracy of the different features described in
Sections II-A and II-B in performing the face verification task.
We first describe the test set used in our experiments, that
is constituted by six videos acquired by surveillance cameras
deployed in some of the corridors of the Instytut Ekspertyz
Sądowych in Krakow and by the famous face dataset LFW,
that we used as confusion set. We then present an analysis of
the distances computed over the facial landmarks and, finally,
we report some accuracy results obtained by our experiments
on the considered features.
A. Test set
We used six videos as test set, provided by the EU Frame-
work Programme Horizon 2020 COST Association COST
Action CA16101 [22]. These videos are taken from three dif-
ferent surveillance cameras deployed in the Instytut Ekspertyz
Sądowych in Krakow, and they capture two different persons (we call them “Person1” and “Person2”). Each of them is recorded
in all the environments where the cameras are installed. So, we
have three videos for Person1 and three videos for Person2.
For each video, we analyzed each frame independently. In
particular, for each frame, we executed the face detection
(a) Sample from P1-video2. (b) Sample from P1-video3.
Figure 5. Samples of videos for Person1.
(a) Sample from P2-video1. (b) Sample from P2-video2.
(c) Sample from P2-video3.
Figure 6. Samples of videos for Person2.
phase, and for the frames where a face has been detected,
we executed the facial landmarks detection algorithm. We
then computed the 5-points, 68-points and Pairs features, by
exploiting the 68 detected landmarks.
The videos used in our experiments are very challenging
because the resolution is low (768x576), and the person is in
the foreground of the scene. We have obtained 59 total frames
containing faces in all the six videos, which are composed as follows:

Person1 (P1):
- video1: 0 faces detected (the face was never recorded clearly in the video);
- video2: 5 faces detected;
- video3: 19 faces detected.

Person2 (P2):
- video1: 5 faces detected;
- video2: 16 faces detected;
- video3: 14 faces detected.
Figures 5 and 6 show some samples of the Person1 videos
and Person2 videos, respectively.
B. Facial landmarks distances measurements
In order to understand if there is a way to better exploit the
distance between facial landmark points, we have performed
an analysis and computed some measurements on the distances
between 5 nodal points and on the distances between the 68
facial landmarks and the centroid, in different frames collected from the sample videos that we used as test set.
(a) 5-points features for Person1 videos.
(b) 5-points features for Person2 videos.
Figure 7. Distances between the 5 nodal points in different frames of
Person1 (a) and Person2 (b) videos.
Figures 7 and 8 show, respectively, the trend of the compo-
nents of the 5-points and 68-points features in different frames
of the videos, for both persons. Please recall that Person1's face has been detected in just two videos, while Person2's face has been detected in all three videos. It is possible to notice that, for frames of the same video, the lines of the distances are quite regular, while they differ greatly when moving to another video. This shows that, while a person is seen by the same camera, with the same angle of view, it is possible to use the distances of facial landmarks to recognize a person by their face with good accuracy.
We also computed the average and the variance of the
distances between nodal points and facial landmarks reported
in Figure 9. In particular, Figure 9(a) reports the average and
the variance of the distances between the 5 nodal points and
Figure 9(b) reports the average and variance of the distances
between the centroid of the face and the 68 facial landmarks.
In both cases, the average and the variance are computed on
the distance of the same pair of points in all the different
frames of Person1 and Person2 videos. The figure shows that
the variance is very small in almost every pair of points, and
also that the average values of the two persons are quite different in four of the five pairs of nodal points (Figure 9(a)) and in many of the 68 facial landmarks (Figure 9(b)). This means that, by analyzing consecutive frames of a video, when this is feasible, it is possible to increase the chances of recognizing a certain person by using the distances of the facial landmarks.
(a) 68-point features for Person1 videos.
(b) 68-points features for Person2 videos.
Figure 8. Distances between the 68 facial landmarks and the face centroid in
different frames of Person1 (a) and Person2 (b) videos.
C. Classification Accuracy
We performed some experiments to compare the accuracy
in performing the face verification task by using the four
different features described above. To this purpose, the faces
extracted from the videos were merged with LFW, which has been used as a distractor.
LFW is a very famous face dataset, which contains around 13 thousand faces and 5,750 different identities. All images in LFW are 250x250 pixels and the face is aligned to be in the center of the image. However, there is a lot of background in the images, sometimes also capturing other people's faces. This
could lead to multiple face detection. Therefore, we cropped
each image in the LFW dataset to the size of 150x150 pixels,
by keeping the same center, in order to cut the background and
avoid multiple face detection. In this case also, we performed
the face detection and we computed the facial landmark points
by using the dlib library (Figure 10 shows some examples of
LFW faces with facial landmarks highlighted). We merged the
LFW dataset with the 59 faces that we detected in the test
videos and we created a unified dataset. We then extracted
the four different features (5-points, 68-points, Pairs and deep
features), from all the faces in the new dataset.
We used each of the faces detected in the test set videos
as a query for a NN search in the unified dataset. We used the
Euclidean distance as dissimilarity measure between features and sorted the entire dataset according to the distance from the given query, from the nearest to the farthest. We discarded
the first result of each query since it is the query itself.
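The NN search described above can be sketched in Python as follows (names are ours; the dataset is assumed to be a matrix with one feature vector per row, the query being one of its rows):

```python
import numpy as np

def nearest_neighbors(query, dataset, k=5):
    """Sort the whole dataset by Euclidean distance from the query and
    return the indices of the k nearest entries, discarding the first
    result because it is the query itself (which is part of the dataset)."""
    dists = np.linalg.norm(dataset - query, axis=1)
    order = np.argsort(dists)
    return order[1:k + 1]
```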
(a) 5-points feature average and variance for Person1 videos.
(b) 68-points feature average and variance for Person2 videos.
Figure 9. Average and variance of the 5 and 68 distances for Person1 (a)
and Person2 (b).
Figure 10. Some examples from LFW dataset and the corresponding
detected faces with facial landmarks.
Figure 11 reports some query examples with the Top5
results, for all the features analyzed. For each feature, we report
the best and the worst results, i.e., those with the highest number of correct and wrong matches, respectively, in the first five results. The best result of the 5-points feature only got three correct matches in the Top5 results, while all the other features got all correct matches in the Top5 results. The worst result is the same for all the facial landmark features, that is, no correct match in the Top5 results. On the other hand, the worst result of the deep feature only has one wrong match, which is ranked last among the Top5 results.
The different size of the faces detected in the videos is due
to the different size of the bounding box of the face computed
by the face detector library. This is caused by the different
position of the person in the scene with respect to the camera;
a bigger face means that the person is closer to the camera.
TABLE I. MEAN AVERAGE PRECISION (mAP) FOR EACH FEATURE.

Feature             mAP
5-points feature    0.03
68-points feature   0.06
Pairs feature       0.07
Deep feature        0.81
We compared all the four different features by computing
the mean Average Precision (mAP) on the results of the
queries, so we measured how well the results are ordered
according to the query. In particular, for each query, we sum
the number of correct results, weighted by their position in the
result set, and we divide this value by all the correct elements
in the dataset. We then average the precision of all queries,
thus obtaining the mean Average Precision for each feature.
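The described computation of the Average Precision of a single query, and of the mAP over all queries, can be sketched in Python as follows (function names are ours):

```python
def average_precision(result_ids, query_id, n_relevant):
    """AP for one query: precision at each rank where a correct match
    occurs, averaged over the number of relevant elements in the dataset."""
    hits, precision_sum = 0, 0.0
    for rank, person in enumerate(result_ids, start=1):
        if person == query_id:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_relevant

def mean_average_precision(per_query_aps):
    """mAP: mean of the per-query Average Precision values."""
    return sum(per_query_aps) / len(per_query_aps)
```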
The results are reported in Table I. They show that the 68-
points feature is two times better than the 5-points feature, and
the Pairs feature slightly improves the 68-points feature result.
However, the deep feature is more than one order of magnitude
better than all the features based on the facial landmarks.
TABLE II. TOP1 AND TOP5 ACCURACY FOR EACH FEATURE.

Feature             Top1    Top5
5-points feature    24%     47%
68-points feature   51%     76%
Pairs feature       64%     78%
Deep feature        97%     98%
We also computed the Top1 and Top5 accuracy for all the
features considered. The Top1 accuracy counts the percentage
of queries in which the first person of the result set is the
same person of the corresponding query. The Top5 accuracy
considers the first five persons of the result set to check if
the correct one is present. Table II shows that the 5-points feature performs very badly in this scenario, with small and low-resolution faces, achieving a Top1 accuracy of only 24% and a Top5 accuracy of 47%. The 68-points and the Pairs features improve the Top1 accuracy by more than a factor of two with respect to the 5-points feature, and reach up to 78% Top5 accuracy. Also in this case, however, the deep feature performs much better, obtaining a 97% Top1 accuracy and a 98% Top5 accuracy.
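The Top-k accuracy used above can be sketched in Python as follows (names are ours; each query result is assumed to be a ranked list of person identities):

```python
def top_k_accuracy(results_per_query, true_ids, k):
    """Fraction of queries whose correct identity appears among the
    first k identities returned by the NN search."""
    hits = sum(1 for results, truth in zip(results_per_query, true_ids)
               if truth in results[:k])
    return hits / len(true_ids)
```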
Facial landmarks do have the property of being accepted as proof in trials, and they can be used to classify people under some conditions with a certain accuracy; they are also faster to compute than the deep features.
However, the deep feature shows much better performance,
especially in challenging scenarios with low-resolution faces.
IV. CONCLUSION

In this paper, we presented a comparison between facial landmarks and deep learning approaches in performing the face verification task.

Figure 11. Query examples for all the kinds of features, with Top5 results. For each feature, the best and the worst results are reported.

Facial landmarks are very important
in forensics because they can be used as objective proof in
trials. We performed our experiments on videos taken in a
real scenario and by exploiting the widely used face dataset
LFW. Results show that the accuracy of the deep features in verifying whether a face belongs to a given person is much greater than that of the facial landmarks based approach. On the other hand, the deep learning results cannot be used as proof in court. We think, however, that the deep features approach should support the forensics process along with facial landmarks. In particular, the latter should be used after the face verification has been executed with deep features, in order to provide an objective measure for the decision.
ACKNOWLEDGMENT

This work has been partly funded by the “Renewed Energy” project of the DIITET Department of CNR and by the EU Framework Programme Horizon 2020 COST Association COST Action CA16101. Special thanks to Prof. Dariusz Zuba and the Instytut Ekspertyz Sądowych in Krakow for the videos
used as test set.
REFERENCES

[1] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local
binary patterns: Application to face recognition,” IEEE transactions on
pattern analysis and machine intelligence, vol. 28, no. 12, 2006, pp.
[2] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,”
in British Machine Vision Conference, 2015.
[3] “Instytut Ekspertyz Sądowych - Krakow,” accessed:
[4] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “La-
beled faces in the wild: A database for studying face recognition in
unconstrained environments,” Technical Report 07-49, University of
Massachusetts, Amherst, Tech. Rep., 2007.
[5] P. Verlinde, G. Chollet, and M. Acheroy, “Multi-modal identity verifi-
cation using expert fusion,” Information Fusion, vol. 1, no. 1, 2000, pp.
[6] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face rep-
resentation by joint identification-verification,” in Advances in neural
information processing systems, 2014, pp. 1988–1996.
[7] C. Sanderson, M. T. Harandi, Y. Wong, and B. C. Lovell, “Combined
learning of salient local descriptors and distance metrics for image set
face verification,” in Advanced Video and Signal-Based Surveillance
(AVSS), 2012 IEEE Ninth International Conference on. IEEE, 2012,
pp. 294–299.
[8] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute
and simile classifiers for face verification,” in Computer Vision, 2009
IEEE 12th International Conference on. IEEE, 2009, pp. 365–372.
[9] A. G. Rassadin, A. S. Gruzdev, and A. V. Savchenko, “Group-level
emotion recognition using transfer learning from face identification,”
arXiv preprint arXiv:1709.01688, 2017.
[10] J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang, “Targeting ulti-
mate accuracy: Face recognition via deep embedding,” arXiv preprint
arXiv:1506.07310, 2015.
[11] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified
embedding for face recognition and clustering,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2015,
pp. 815–823.
[12] J. Park, K. Lee, and K. Kang, “Arrhythmia detection from heartbeat
using k-nearest neighbor classifier,” in Bioinformatics and Biomedicine
(BIBM), 2013 IEEE International Conference on. IEEE, 2013, pp.
[13] D. Wang, C. Otto, and A. K. Jain, “Face search at scale,” IEEE
transactions on pattern analysis and machine intelligence, vol. 39, no. 6,
2017, pp. 1122–1136.
[14] G. Amato, F. Carrara, F. Falchi, and C. Gennaro, “Efficient indexing
of regional maximum activations of convolutions using full-text search
engines,” in Proceedings of the 2017 ACM on International Conference
on Multimedia Retrieval. ACM, 2017, pp. 420–423.
[15] “Dlib library,”, accessed: 2018-04-13.
[16] V. Kazemi and J. Sullivan, “One millisecond face alignment with an
ensemble of regression trees,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.
[17] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, 5 2015, pp. 436–444.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural
information processing systems, 2012, pp. 1097–1105.
[19] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based
convolutional networks for accurate object detection and segmentation,”
IEEE transactions on pattern analysis and machine intelligence, vol. 38,
no. 1, 2016, pp. 142–158.
[20] G. Amato, F. Falchi, and L. Vadicamo, “Visual recognition of ancient
inscriptions using convolutional neural network and fisher vector,”
Journal on Computing and Cultural Heritage (JOCCH), vol. 9, no. 4,
2016, p. 21.
[21] G. Amato, F. Carrara, F. Falchi, C. Gennaro, C. Meghini, and C. Vairo,
“Deep learning for decentralized parking lot occupancy detection,”
Expert Systems with Applications, vol. 72, 2017, pp. 327–334.
[22] “EU Framework Programme Horizon 2020 COST Action CA16101,” Actions/ca/CA16101, accessed: 2018-04-13.
... They used FGNet annotation that has 20 points, as shown in Figure 1. They have found that the following landmark points-2,3,4,5,6,7-were very influenced by FE, where they considered only the other 14 points in their analysis, as shown in Figure 2. Amato et al. [13] compared between 5-points features and 68-points features, as shown in Figures 3 and 4. They conducted their experiments on videos taken in a real scenario by surveillance cameras. They used dlib library and the FL detectors to implement the approach represented by [14], which returns an array of 68-points in the form of (x,y) coordinated. ...
... FGNet annotation[12].Amato et al.[13] compared between 5-points features and 68-points features, as shown inFigures 3 and 4. They conducted their experiments on videos taken in a real scenario by surveillance ...
... 55-point features[13]. ...
Full-text available
The human mood has a temporary effect on the face shape due to the movement of its muscles. Happiness, sadness, fear, anger, and other emotional conditions may affect the face biometric system's reliability. Most of the current studies on facial expressions are concerned about the accuracy of classifying the subjects based on their expressions. This study investigated the effect of facial expressions on the reliability of a face biometric system to find out which facial expression puts the biometric system at greater risk. Moreover, it identified a set of facial features that have the lowest facial deformation caused by facial expressions to be generalized during the recognition process, regardless of which facial expression is presented. In order to achieve the goal of this study, an analysis of 22 facial features between the normal face and six universal facial expressions is obtained. The results show that the face biometric systems are affected by facial expressions where the disgust expression achieved the most dissimilar score, while the sad expression achieved the lowest dissimilar score. Additionally, the study identified the five and top ten facial features that have the lowest facial deformations on the face shape in all facial expressions. Besides that, the relativity score showed less variances between the sample using the top facial features. The obtained results of this study minimized the false rejection rate in the face biometric system and subsequently the ability to raise the system's acceptance threshold to maximize the intrusion detection rate without affecting the user convenience.
... However, it requires a GPU while training. A face verification system framework can be categorized into three sections; face detection or facial landmark localization, feature extraction or deep feature, and face verification [5][6][7]. ...
... Face detection mainly deals with finding the whole human face from the image and video. The facial landmark localization is defined as the localization of specific key points on the frontal face, such as eye contours, eyebrow contours, nose, mouth corners, lip, and chin [5,7]. • Feature Extraction Techniques. ...
... Khan [6] proposed a framework that detected only 49 facial landmarks from eyes (12 marks), eyebrows (10 marks), nose (9 marks), and lips (18 marks). Furthermore, Amato et al. [7] compared the effectiveness between facial landmarks features and deep feature or feature extraction for verifying faces. For facial landmarks features, it returned 68 key points located on the face. ...
Full-text available
Face verification systems have many challenges to address because human images are obtained in extensively variable conditions and in unconstrained environments. Problem occurs when capturing the human face in low light conditions, at low resolution, when occlusions are present, and even different orientations. This paper proposes a face verification system that combines the convolutional neural network and max-margin object detection called MMOD + CNN, for robust face detection and a residual network with 50 layers called ResNet-50 architecture to extract the deep feature from face images. First, we experimented with the face detection method on two face databases, LFW and BioID, to detect human faces from an unconstrained environment. We obtained face detection accuracy > 99.5% on the LFW and BioID databases. For deep feature extraction, we used the ResNet-50 architecture to extract 2,048 deep features from the human face. Second, we compared the query face image with the face images from the database using the cosine similarity function. Only similarity values higher than 0.85 were considered. Finally, the top-1 accuracy was used to evaluate the face verification. We achieved an accuracy of 100% and 99.46% on IMM frontal face and IMM face databases, respectively.
... We also have compared the recognition accuracy in performing the face recognition task by using the distance of facial landmarks and some CNN-based approaches. Facial landmarks are very important in forensics because they can be used as objective proof in trials, however, the recognition accuracy of these approaches is much lower than the ones based on Deep Learning [20]. ...
... Some of the results obtained by this Action have been presented in [20]. ...
Technical Report
Full-text available
The Artificial Intelligence for Multimedia Information Retrieval (AIMIR) research group is part of the NeMIS laboratory of the Information Science and Technologies Institute ``A. Faedo'' (ISTI) of the Italian National Research Council (CNR). The AIMIR group has long experience in topics related to: Artificial Intelligence, Multimedia Information Retrieval, Computer Vision, and similarity search on a large scale. We aim at investigating the use of Artificial Intelligence and Deep Learning for Multimedia Information Retrieval, addressing both effectiveness and efficiency. Multimedia information retrieval techniques should be able to provide users with pertinent results quickly, even on huge amounts of multimedia data. Application areas of our research results range from cultural heritage to smart tourism, from security to smart cities, from mobile visual search to augmented reality. This report summarizes the 2019 activities of the research group.
... ; Kazemi and Sullivan (2014); Amato et al. (2018). (5) `input_shape = (40, transfer_value_size_m)` ...
Violence detection and face recognition of the individuals involved in violence have a noticeable influence on the development of automated video surveillance research. With increasing risks in society and insufficient staff to monitor them, there is a growing demand for drone-based and computerized video surveillance. Violence detection is fast and can be used to selectively filter surveillance videos and to identify, or take note of, the individual who is causing the anomaly. Individual identification from drone surveillance videos of a crowded area is difficult because of rapid movement, overlapping features, and cluttered backgrounds. The goal is to build a better drone surveillance system that recognizes the individuals implicated in violence and raises a distress signal so that help can be offered quickly. This paper uses recently developed deep learning techniques and proposes the concept of transfer learning using different deep-learning-based hybrid models with LSTM for violence detection. Identifying individuals implicated in violence from drone-captured images involves major variations in human facial appearance, hence the paper uses a CNN model combined with image processing techniques. For testing, a drone-captured video dataset was developed for an unconstrained environment. Ultimately, the features extracted from a hybrid of inception modules and residual blocks, with an LSTM architecture, yielded an accuracy of 97.33%, demonstrating its superiority over the other models tested. For the individual identification module, the best accuracy of 99.20%, obtained on our dataset, came from a CNN model with residual blocks trained for face identification.
... The problem of detecting and recognizing people in images or videos has become of central importance in many video surveillance applications [3][4][5][16]. The issue of facial recognition from a drone perspective has, however, been addressed more recently in the literature [8]. ...
... Deep Learning [20] is a branch of Machine Learning that allows a neural network, composed of a large number of layers, to learn representations of input data with increasing levels of abstraction. Deep learning approaches provide near-human-level accuracy in tasks like image classification [19], object detection [13], object recognition [9], sentiment analysis [28], speech recognition [12], parking monitoring [5,6], face recognition [8,23], and more. Deep features learned by CNNs have shown impressive performance in classification and recognition tasks. ...
Conference Paper
With the advent of deep learning based methods, facial recognition algorithms have become more effective and efficient. However, these algorithms usually have the disadvantage of requiring dedicated hardware devices, such as graphics processing units (GPUs), which restricts their usage on embedded devices with limited computational power. In this paper, we present an approach that allows building an intrusion detection system, based on face recognition, running on embedded devices. It relies on deep learning techniques and does not exploit GPUs. Face recognition is performed using a kNN classifier on features extracted from a 50-layer Residual Network (ResNet-50) trained on the VGGFace2 dataset. In our experiment, we determined the optimal confidence threshold that allows distinguishing legitimate users from intruders. In order to validate the proposed system, we created a ground truth composed of 15,393 face images and 44 identities, captured by two smart cameras placed in two different offices over a test period of six months. We show that the obtained results are good from both the efficiency and the effectiveness perspective.
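A minimal sketch of the open-set recognition scheme this abstract describes: a kNN classifier over deep face features, with a rejection rule so unknown faces are flagged as intruders. The Euclidean metric, the `max_dist` rejection threshold, and the toy 2-d features are assumptions for illustration; the paper tunes its own confidence threshold on real ResNet-50 features.

```python
import numpy as np

def knn_verify(probe, gallery_feats, gallery_ids, k=3, max_dist=1.0):
    """Majority vote among the k nearest gallery features (Euclidean);
    reject the probe as an intruder when even its nearest neighbour
    is farther than max_dist."""
    dists = np.linalg.norm(np.asarray(gallery_feats) - probe, axis=1)
    nearest = np.argsort(dists)[:k]
    if dists[nearest[0]] > max_dist:          # too far from every known identity
        return "intruder"
    votes = {}
    for i in nearest:
        votes[gallery_ids[i]] = votes.get(gallery_ids[i], 0) + 1
    return max(votes.items(), key=lambda kv: kv[1])[0]

# Toy 2-d stand-ins for deep face features of two legitimate users.
gallery = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
           np.array([5.0, 5.0]), np.array([5.1, 5.0])]
ids = ["alice", "alice", "bob", "bob"]
print(knn_verify(np.array([0.05, 0.0]), gallery, ids))  # alice
print(knn_verify(np.array([2.5, 2.5]), gallery, ids))   # intruder
```

Keeping the classifier this simple is what makes the approach viable on embedded devices without a GPU: at runtime only one forward pass and a nearest-neighbour scan are needed.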
Face comparison/face mapping is one of the promising methods in face biometrics, since it needs relatively little effort compared with face identification. Various factors may be used to verify whether two faces are of the same person, among which facial landmarks are one of the most objective indicators, since they have the same anatomical definition for every face. This study identified major landmarks from 2D and 3D facial images of the same Korean individuals and calculated the distance between the reciprocal landmarks of two images to examine the acceptable range for identifying an individual, obtaining standard values across diverse facial angles and image resolutions. Given that reference images obtained in the real world can come from various angles and resolutions, this study created a 3D face model from multiple 2D images of different angles and oriented the 3D model to the angle of the reference image to calculate the distance between reciprocal landmarks. In addition, we used an artificial-intelligence super-resolution method to address the inaccurate assessments that low-quality videos can yield. A portion of the process was automated for speed and convenience of face analysis. We conclude that the results of this study could provide a standard for future face-to-face analysis studies aiming to determine whether different images are of the same person.
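The core measurement in this kind of study, the distances between reciprocal landmarks of two (already aligned and scaled) facial images, can be sketched as below. The `tolerance` value and the tiny four-landmark sets are purely illustrative assumptions, not the study's standard values.

```python
import numpy as np

def landmark_distances(landmarks_a, landmarks_b):
    """Euclidean distance between each pair of reciprocal landmarks of
    two aligned facial images, given as (N, 2) arrays of pixel coords."""
    a = np.asarray(landmarks_a, float)
    b = np.asarray(landmarks_b, float)
    return np.linalg.norm(a - b, axis=1)

def same_person(landmarks_a, landmarks_b, tolerance=2.0):
    """Accept the pair when the mean reciprocal-landmark distance
    (in pixels) stays within an acceptable range."""
    return float(np.mean(landmark_distances(landmarks_a, landmarks_b))) <= tolerance

face1 = [(30, 40), (70, 40), (50, 60), (50, 80)]   # eyes, nose tip, mouth
face2 = [(31, 40), (69, 41), (50, 61), (51, 80)]   # same person, slight shift
face3 = [(25, 35), (80, 38), (50, 70), (50, 95)]   # clearly different geometry
print(same_person(face1, face2), same_person(face1, face3))
```

Real pipelines must first normalize pose, scale, and resolution (which is exactly why the study builds a 3D model and reorients it to the reference image before measuring).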
We report an investigation into the application of the logistic regression classifier for estimation of gender in facial images. We used 2000 images, 1000 of each gender, from a publicly available database, automatically detected facial landmarks, and derived some morphometric facial indices. These indices were used as predictors for the classification. As the traditional manual extraction of facial landmarks is time consuming, automatic detection of the landmarks improves efficiency. The logistic regression classification is also compared with two other classification methods: the likelihood-ratio (LR) based method, where the features of a face are evaluated in terms of the probability distribution of these features in both sexes, and Convolutional Neural Network (CNN) methods. While the former is desirable from the viewpoint of interpretability and for assessing the strength of evidence, the latter is more sophisticated. We report an AUC of 0.94 with a true positive (TP) rate of 88.4% for males and 87.9% for females for logistic-regression-based classification. This performance is better than the likelihood-ratio classifier, with TP rates of 79.6% for males and 82.2% for females. The overall performance of logistic regression is slightly lower than the CNN classifier, which has TP rates of 89.3% for males and 92.6% for females. We have extended these models to a CCTV image database, more representative of the forensic scenario, and found logistic regression performing better than the CNN method on average across 8 different types of cameras. We conclude that, as a trade-off between simplicity and sophistication, the logistic regression classifier can be used for a two-class problem like classification of sex from facial morphometric indices, and that the likelihood-ratio approach can assess the strength of the classification, in conformance with the requirements of evidence interpretation.
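A minimal, self-contained sketch of the logistic-regression step on morphometric indices. The 1-d toy "facial index", the learning rate, and the epoch count are assumptions for illustration; the study uses several indices and its own fitting procedure.

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent.
    X: (n, d) morphometric indices; y: 0/1 gender labels."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # sigmoid of the linear score
        w -= lr * (X.T @ (p - y)) / len(y)     # gradient of the log-loss
    return w

def predict(w, X):
    X = np.hstack([np.ones((len(X), 1)), X])
    return (1.0 / (1.0 + np.exp(-X @ w)) >= 0.5).astype(int)

# Toy 1-d "facial index" that separates the two classes around 1.0.
X = np.array([[0.80], [0.85], [0.90], [1.10], [1.15], [1.20]])
y = np.array([0, 0, 0, 1, 1, 1])
w = train_logreg(X, y)
print(predict(w, X))  # recovers the training labels
```

The fitted weights stay interpretable (each coefficient scales one facial index), which is the interpretability advantage the abstract contrasts against the CNN.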
In this paper, we describe the algorithmic approach used for our submissions to the fifth Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition sub-challenge. We extracted feature vectors of detected faces using a Convolutional Neural Network trained for the face identification task, rather than the traditional pre-training on emotion recognition problems. In the final pipeline, an ensemble of Random Forest classifiers was learned to predict the emotion score using the available training set. When no faces are detected, one member of our ensemble extracts features from the whole image. During our experimental study, the proposed approach showed the lowest error rate when compared to the other explored techniques. In particular, we achieved 75.4% accuracy on the validation data, which is 20% higher than the handcrafted-feature-based baseline. The source code, using the Keras framework, is to be made publicly available.
Conference Paper
In this paper, we adapt a surrogate text representation technique to develop efficient instance-level image retrieval using Regional Maximum Activations of Convolutions (R-MAC). R-MAC features have recently shown outstanding performance in visual instance retrieval. However, contrary to the activations of hidden layers adopting ReLU (Rectified Linear Unit), these features are dense. This constitutes an obstacle to the direct use of inverted indexes, which rely on the sparsity of data. We propose the use of deep permutations, a recent approach for the efficient evaluation of permutations, to generate a surrogate text representation of R-MAC features, enabling the indexing of visual features as text in a standard search engine. The experiments, conducted on Lucene, show the effectiveness and efficiency of the proposed approach.
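A simplified sketch of the deep-permutations idea: rank the feature components by activation and emit rank-weighted codewords, so that a text engine's term-frequency scoring approximates the permutation-based similarity. The encoding details below (codeword scheme, truncation to the top k, handling of dense or negative R-MAC values) are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def deep_permutation_text(features, k=4, prefix="f"):
    """Encode a dense feature vector as surrogate text: keep the k
    components with the largest activation and repeat each component's
    codeword (k - rank) times, so higher-ranked dimensions get a
    larger term frequency in the inverted index."""
    top = np.argsort(-np.asarray(features, float))[:k]
    words = []
    for rank, dim in enumerate(top):
        words.extend([f"{prefix}{dim}"] * (k - rank))   # rank-weighted TF
    return " ".join(words)

vec = np.array([0.1, 0.9, 0.0, 0.7, 0.3])
print(deep_permutation_text(vec, k=3))  # f1 f1 f1 f3 f3 f4
```

The resulting string can be stored as an ordinary document field in a search engine such as Lucene, and similar feature vectors produce overlapping codewords with similar term frequencies.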
By bringing together the most prominent European institutions and archives in the field of Classical Latin and Greek epigraphy, the EAGLE project has collected the vast majority of the surviving Greco-Latin inscriptions into a single, readily searchable database. Text-based search engines are typically used to retrieve information about ancient inscriptions (or about other artifacts). These systems require that users formulate a text query containing information such as the place where the object was found or where it is currently located. Conversely, visual search systems can be used to provide information to users (such as tourists and scholars) in a most intuitive and immediate way, just using an image as the query. In this article, we provide a comparison of several approaches for visually recognizing ancient inscriptions. Our experiments, conducted on 17,155 photos related to 14,560 inscriptions, show that BoW and VLAD are outperformed by both Fisher Vector (FV) and Convolutional Neural Network (CNN) features. More interestingly, combining FV and CNN features into a single image representation allows achieving very high effectiveness, correctly recognizing the query inscription in more than 90% of the cases. Our results suggest that combinations of FV and CNN can also be exploited to effectively perform visual retrieval of other types of objects related to cultural heritage, such as landmarks and monuments.
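One simple way to combine FV and CNN features into a single image representation, as the article does, is to L2-normalize each descriptor and concatenate them, so that distance comparisons weight the two representations comparably. The weighting scheme below is an assumption for illustration, not necessarily the article's exact fusion.

```python
import numpy as np

def combine_features(fv, cnn, weight=0.5):
    """Concatenate L2-normalized Fisher Vector and CNN features into one
    descriptor; `weight` balances the two parts' contribution to
    Euclidean/cosine comparisons."""
    fv = np.asarray(fv, float)
    cnn = np.asarray(cnn, float)
    fv = fv / np.linalg.norm(fv)
    cnn = cnn / np.linalg.norm(cnn)
    return np.concatenate([weight * fv, (1.0 - weight) * cnn])

# Toy low-dimensional stand-ins for the real FV and CNN descriptors.
combined = combine_features([3.0, 4.0], [1.0, 0.0, 1.0])
print(combined.shape)  # (5,)
```

After this fusion, a plain nearest-neighbour search over the combined vectors takes both representations into account at once.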
A smart camera is a vision system capable of extracting application-specific information from the captured images. This paper proposes a decentralized and efficient solution for visual parking lot occupancy detection based on a deep Convolutional Neural Network (CNN) specifically designed for smart cameras. This solution is compared with state-of-the-art approaches using two visual datasets: PKLot, an existing dataset that allowed us to compare exhaustively with previous works, and CNRPark-EXT, a dataset created in the context of this research by accumulating data across various seasons of the year, to test our approach in particularly challenging situations exhibiting occlusions and diverse, difficult viewpoints. This dataset is publicly available to the scientific community and is another contribution of our research. Our experiments show that our solution outperforms and generalizes better than the best-performing approaches on both datasets. The performance of our proposed CNN architecture on the parking lot occupancy detection task is comparable to that of the well-known AlexNet, which is three orders of magnitude larger.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state of the art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
... persons of interest among the billions of shared photos on these websites. Despite significant progress in face recognition, searching a large collection of unconstrained face images remains a difficult problem. To address this challenge, we propose a face search system which combines a fast search procedure, coupled with a state-of-the-art commercial off-the-shelf (COTS) matcher, in a cascaded framework. Given a probe face, we first filter the large gallery of photos to find the top-k most similar faces using features learned by a convolutional neural network. The k retrieved candidates are re-ranked by combining similarities based on deep features and those output by the COTS matcher. We evaluate the proposed face search system on a gallery containing 80 million web-downloaded face images. Experimental results demonstrate that while the deep features perform worse than the COTS matcher on a mugshot dataset (93.7% vs. 98.6% TAR@FAR of 0.01%), fusing the deep features with the COTS matcher improves the overall performance (99.5% TAR@FAR of 0.01%). This shows that the learned deep features provide complementary information over representations used in state-of-the-art face matchers. On the unconstrained face image benchmarks, the performance of the learned deep features is competitive with reported accuracies. LFW database: 98.20% accuracy under the standard protocol and 88.03% TAR@FAR of 0.1% under the BLUFR protocol; IJB-A benchmark: 51.0% TAR@FAR of 0.1% (verification), rank-1 retrieval of 82.2% (closed-set search), 61.5% FNIR@FAR of 1% (open-set search). The proposed face search system offers an excellent trade-off between accuracy and scalability on galleries with millions of images.
Additionally, in a face search experiment involving photos of the Tsarnaev brothers, convicted of the Boston Marathon bombing, the proposed cascade face search system could find the younger brother's (Dzhokhar Tsarnaev) photo at rank 1 in 1 second on a 5M gallery and at rank 8 in 7 seconds on an 80M gallery.
Face recognition has been studied for many decades. As opposed to traditional hand-crafted features such as LBP and HOG, much more sophisticated features can be learned automatically by deep learning methods in a data-driven way. In this paper, we propose a two-stage approach that combines a multi-patch deep CNN and deep metric learning, which extracts low-dimensional but very discriminative features for face verification and recognition. Experiments show that this method outperforms other state-of-the-art methods on the LFW dataset, achieving 99.85% pair-wise verification accuracy and significantly better accuracy under two other, more practical protocols. This paper also discusses the importance of data size and the number of patches, showing a clear path to practical high-performance face recognition systems in the real world.
Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 50% relative to the previous best result on VOC 2012—achieving a mAP of 62.4%. Our approach combines two ideas: (1) one can apply high-capacity convolutional networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data are scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, boosts performance significantly. Since we combine region proposals with CNNs, we call the resulting model an R-CNN or Region-based Convolutional Network. Source code for the complete system is available at