A Novel Deep Learning Architecture with
Image Diffusion for Robust Face
Presentation Attack Detection
MADINI O. ALASSAFI1, MUHAMMAD SOHAIL IBRAHIM2, IMRAN NASEEM3,4,5, RAYED ALGHAMDI1, REEM ALOTAIBI1, FARIS A. KATEB1, HADI MOHSEN OQAIBI1, ABDULRAHMAN A. ALSHDADI6, AND SYED ADNAN YUSUF7
1Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia (e-mail: malasafi, raalghamdi8, ralotibi, fakateb, Hoqaibi@kau.edu.sa)
2College of Electrical Engineering, Zhejiang University, Hangzhou, China (e-mail: msohail@zju.edu.cn)
3School of Electrical, Electronic, and Computer Engineering, The University of Western Australia, Perth, Australia
4College of Engineering, Karachi Institute of Economics and Technology, Karachi, Pakistan (e-mail: imrannaseem@kiet.edu.pk)
5Research and Development, Love for Data, Karachi, Pakistan
6College of Computer Science and Engineering, Department of Information Systems and Technology, University of Jeddah, Jeddah, Saudi Arabia (e-mail: alshdadi@uj.edu.sa)
7Research and Innovation Department, Intelexica Pvt. Ltd., United Kingdom (e-mail: adnan@intelexica.com)
Corresponding author: Muhammad Sohail Ibrahim (e-mail: msohail@zju.edu.cn).
This research work was funded by the Institutional Fund Projects under grant no. IFPRC-044-611-2020. The authors therefore gratefully acknowledge the technical and financial support from the Ministry of Education and King Abdulaziz University, Jeddah, Saudi Arabia.
ABSTRACT Face presentation attack detection (PAD) is an essential and critical step in modern face recognition systems. Face PAD aims at exposing an imposter or unauthorized person seeking to deceive the authentication system. Presentation attacks are typically mounted using a fake identity presented through a digital/printed photograph, a video, a paper mask, a 3D mask, make-up, etc. In this research, we propose a novel face PAD solution that uses interpolation-based image diffusion augmented by transfer learning of a MobileNet convolutional neural network. The proposed interpolation-based image diffusion method and face PAD approach, implemented in a single framework, show promising results on various anti-spoofing databases. The experimental results illustrate that the proposed face PAD method offers superior performance compared to most state-of-the-art methods.
INDEX TERMS Anti-spoofing, deep learning, face liveness detection, face presentation attack detection,
interpolation-based image diffusion.
I. INTRODUCTION
Biometric authentication systems such as face recognition and fingerprint recognition have gained considerable popularity over the past decade due to their increased security and reliability compared to conventional password-based authentication systems. Face recognition in particular, owing to its non-intrusive nature, high accuracy, and usability, has found applications in various domains including surveillance [1], classroom attendance systems [2], school examination monitoring [3], mobile phone unlocking [4], and access control systems [5], to name a few. This has been made possible by the availability of large databases and greater computing power, thanks to which many deep learning-based automatic face recognition systems have been reported to achieve accuracies of over 99% [6]. However, face authentication systems suffer from an intrinsic drawback: false acceptance, which entails a security risk because the system can be vulnerable to a spoofing attack by a malicious adversary/imposter.
Face spoofing attacks, or face presentation attacks, refer to attacks in which an adversary obtains unauthorized access to a face recognition system by impersonating an authorized person. The ease of acquiring facial images from social media has also enabled attackers to spoof face recognition systems using a variety of attacks such as print photo attacks, recorded facial videos, and 3D mask attacks. Therefore, the demand for efficient face anti-spoofing, or face presentation attack detection (PAD), systems has also risen to alleviate the spoofing risks to face recognition applications.
During the past decade, a large number of face PAD methods have been reported. These methods focus primarily on exploring efficient features for face PAD, and can be divided into categories such as distortion-, motion-, texture-, and deep learning-based face PAD techniques. Each of these categories has made important contributions to the performance of face PAD systems. Distortion-based techniques exploit distortion-sensitive features to perform face anti-spoofing. Motion-based techniques extract motion cues such as eye blinking, optical flow, head movement, and lip movement to perform face liveness detection. Texture-based face PAD techniques adopt texture features such as local binary patterns (LBP) to detect face spoofing. Deep learning-based techniques perform anti-spoofing by learning feature representations using deep neural networks.
Early face PAD solutions relied heavily on motion-based features such as head movement, eye blinking, and lip movement to detect 2D face spoofing attacks. In [7], a technique leveraging mouth localization and lip motion analysis was presented, performing face detection and liveness detection with AdaBoost and SVM, respectively. The authors in [8] performed liveness detection using the difference between the optical flow fields generated by the movement of 3D objects (such as a face) and 2D planes (such as a printed photograph). A face liveness detection technique that analyzes the correlation between the background and the foreground using optical flow is presented in [9]. Motion-based face PAD techniques also use eye-blinking movements [10], [11], dynamic facial textures [12], and nose and ear movement [13], among other facial movement cues.
Most face PAD schemes presented in the literature focus on the detection of replay and print spoofing attacks, which can be addressed using texture and color features. A number of early works incorporate hand-crafted features such as color texture features and local binary patterns (LBP) [14]–[16], and histograms of oriented gradients (HOG) [17]. Other texture-based methods employ the scale-invariant feature transform [4], [18], speeded-up robust features (SURF) [19], and optical flow combined with texture analysis [20].
Despite the recent trend towards the detection of 3D mask attacks [21], most works in the literature focus on the detection of 2D spoofing attacks such as print photo attacks, replay attacks, and masking attacks. This is because most 3D face PAD approaches require additional hardware (cameras for multiple channels, e.g., thermal, NIR, etc.), and their computational complexity makes them infeasible for low-cost systems. Sun et al. [22] proposed to revisit the face PAD task using local label supervision, where pixel-level spoofing cues are classified into spoof fore-ground, genuine fore-ground, and genuine background by a depth-based fully convolutional network (FCN). A face PAD method that uses a shape-from-shading algorithm to extract intrinsic properties of faces such as depth, reflectance, and albedo, followed by a novel shallow CNN architecture that learns useful features for the face PAD task, is presented in [23]. The work in [24] presented an approach that uses a mix of real-world face images and images generated by a deep convolutional autoencoder, followed by a CNN feature fusion layer that adaptively balances the fusion of the two image types to efficiently perform face anti-spoofing. A CNN-based face PAD approach that learns dynamic disparity-map-based hand-crafted features within the network, using an additional disparity layer in a custom CNN architecture, is presented in [25]. The use of multiple channels such as near-infrared (NIR) and thermal, besides the visual spectrum, for face PAD has also been reported in the literature. The authors of [26] presented a multi-channel CNN-based technique that uses four channels, namely color, NIR, depth, and thermal, to address the 2D and 3D face PAD problem. A hybrid approach that uses a region-based CNN and an improved Retinex-based LBP in a cascaded fashion to perform face PAD is presented in [27].
In many deep learning and computer vision applications, transfer learning has been widely used for the extraction of deep convolutional features. This has been made possible by the availability of very deep CNN architectures from which complex and discriminative features can be extracted using pre-trained models or fine-tuning; such pre-trained CNN architectures provide excellent feature extraction capabilities. While most existing works design and train CNN models from scratch on face PAD datasets, such models usually suffer from overfitting due to the unavailability of large training datasets. To address this overfitting problem and enhance the performance of computer vision and deep learning tasks, the use of models pre-trained on, or fine-tuned from, large image classification databases such as ImageNet [28] has been actively reported in the literature. In this regard, a two-stream CNN-based face PAD technique is presented in [29], which leverages a pre-trained ResNet18 model to learn features from the RGB space and an illumination-invariant space, which are then provided to an attention-based feature fusion mechanism for efficient face PAD performance. Face PAD approaches that feed multiple channels such as RGB, depth, NIR, and thermal into a pre-trained multi-channel CNN (MC-CNN), followed by a small network of a few layers that classifies spoof vs. real faces, have also been presented in the literature [26], [30]. The hybrid approach presented in [27] likewise uses deep features extracted from a pre-trained VGG16 base network, as well as illumination features extracted using an improved Retinex algorithm, for the face anti-spoofing task. The work presented in [31] also proposed to use a pre-trained InceptionV4 model as a liveness feature extraction network in its face PAD framework.
In this context, however, most studies do not explore the potential of some state-of-the-art deep CNN models, e.g., MobileNet [32], for the face PAD problem. In this research, we propose a framework for efficient deep learning-based face PAD for 2D attacks. In the proposed approach, an interpolation-based diffusion method is implemented to enhance the distinguishing features of real and spoof images. Image diffusion is a process that introduces smoothness into an image and can be viewed as applying a Gaussian filter to it [33]; this research presents an interpolation-based technique to generate diffused images. The diffused images are then fed to a deep CNN followed by a detection model that classifies real and spoof images. This study presents an end-to-end approach where the diffusion and the deep CNN are combined into a single model.
The rest of the paper is organized as follows: the proposed method is outlined in Section II, the details of the datasets and performance metrics are provided in Section III, followed by the experimental results and conclusion in Sections IV and V, respectively.
II. PROPOSED METHOD
In this paper, we propose a face PAD approach by designing
an interpolation-based image diffusion mechanism followed
by a deep CNN-based face PAD network using a pre-trained
MobileNet [32] as our base model.
A. IMAGE DIFFUSION
Image diffusion is a process where input images are
smoothed either at a constant rate (linear diffusion) [33],
where the smoothness is achieved at a constant rate through-
out the image, or in a nonlinear fashion where the important
image features such as edges are retained in the diffused
image [34], [35]. In computer vision applications, linear
diffusion is among the oldest and most investigated partial
differential equation method which can be seen as an evo-
lution process where an image is diffused/smoothed in all
directions at a constant rate. Such diffusion processes tend
to suppress the finer scale structural details in the image
subjected to diffusion.
Contemporary image diffusion schemes include linear and non-linear diffusion models. Gaussian smoothing is considered the most popular of the linear diffusion schemes [36]. Among the non-linear schemes, Perona-Malik diffusion [34], commonly known as anisotropic diffusion, is considered the most widely used image diffusion technique in image processing. Other non-linear diffusion techniques include continuous, semi-discrete, and discrete diffusion filtering schemes [36]. Further image diffusion models include hybrid image diffusion [37] and modified Perona-Malik diffusion models [38], [39].
The proposed diffusion strategy is carried out as a pre-processing step in which the captured frames are compressed using an inter-area interpolation scheme. During the compression stage, the inter-area interpolation technique calculates the ratios of the output image height and width to the input image height and width, termed scale_x and scale_y. The product of these ratios gives Area = scale_x × scale_y. Depending on the values of the ratios, the corresponding pixels from each channel are selected to form an array of pixels per channel; each selected pixel array is then summed and divided by Area to obtain the corresponding output image pixel. The ratios can be either integer or fractional. If they are non-integer, the sum is taken as a weighted sum, where the weight of each pixel depends on the fraction of that pixel falling within the pixel array; the weight is calculated as the ratio of this fraction to Area.

FIGURE 1: PCA embeddings for CASIA-FASD database, (a) Anisotropic diffusion, (b) Proposed diffusion method
The compressed frames are then expanded, taking the pixel-area relationships into consideration, thereby producing smooth/diffused images. The proposed image diffusion method enhances the class discrimination cues in the images. In general, image diffusion is performed for noise removal while preserving important information such as lines, edges, and other content that is vital for interpreting the image [34], whereas naive noise reduction can remove significant information from an image. Inter-area interpolation, where pixel-area relationships are used for re-sampling, removes noise content while preserving the useful information in the image, and generates images free from Moiré patterns.
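As an illustration, this pre-processing step can be sketched as follows, assuming it corresponds to OpenCV's pixel-area (INTER_AREA) resampling; the expansion interpolation mode and the exact sizes (beyond the 30 × 40 frames reported in Section III-C) are our assumptions, not the authors' released code.

```python
# Hedged sketch of the interpolation-based diffusion: compress with
# pixel-area resampling, then re-expand to obtain a smoothed image.
import cv2

def diffuse(frame, small_hw=(30, 40), out_hw=(224, 224)):
    # cv2.resize expects (width, height); sizes here are illustrative.
    small = cv2.resize(frame, (small_hw[1], small_hw[0]),
                       interpolation=cv2.INTER_AREA)   # weighted pixel-area sum / Area
    # Re-expansion yields the smooth, Moire-free diffused image; bilinear
    # expansion is an assumption, not stated in the text.
    return cv2.resize(small, (out_hw[1], out_hw[0]),
                      interpolation=cv2.INTER_LINEAR)
```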
To verify the effectiveness of the proposed diffusion method, the PCA embedding of the CASIA-FASD [40] database using the proposed diffusion is visualized in Fig. 1, in comparison with anisotropic diffusion [34]. The green regions in the figure represent the real class samples while the red regions represent the attack/spoof class samples. It can be observed that the PCA embedding of the proposed technique shows better class separability than that of anisotropic diffusion (i.e., less overlap between the red and green regions). It is also noteworthy that this class separability is achieved in the image domain, which can essentially aid CNN training in obtaining superior classification ability. A comparison of different pre-processing methods detailing different diffusion techniques is discussed in Section IV-A.
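A check of this kind can be reproduced in a few lines with scikit-learn; the sketch below (flattened diffused frames projected to 2-D) is illustrative and not the authors' plotting code.

```python
# Minimal sketch of the PCA-embedding visualization in Fig. 1.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_pca_embedding(images, labels):
    """images: (N, H, W, C) diffused frames; labels: 1 = real, 0 = attack."""
    X = images.reshape(len(images), -1).astype(np.float32)
    Z = PCA(n_components=2).fit_transform(X)  # 2-D embedding of image space
    plt.scatter(Z[labels == 1, 0], Z[labels == 1, 1], s=4, c="green", label="real")
    plt.scatter(Z[labels == 0, 0], Z[labels == 0, 1], s=4, c="red", label="attack")
    plt.legend()
    plt.show()
```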
B. DEEP CNN-BASED FACE PAD MODEL
The proposed face PAD approach harnesses the concept of transfer learning using a pre-trained deep CNN. The methodology is motivated by the fact that the available face PAD datasets are typically not sufficient for training a deep CNN from scratch. Transfer learning is a technique whereby the knowledge learned by a deep network trained on one task is passed on to another network [41]; it can also mitigate the overfitting caused by insufficient and imbalanced training data. Transfer learning can be used in two ways: (1) as a feature extractor [42], and (2) by fine-tuning the source model [43].
For the proposed face PAD approach, we used a MobileNet [32] deep CNN architecture pre-trained on the ImageNet dataset [28]. We used the MobileNet architecture in a fine-tuning setting, where the pre-trained ImageNet weights are fine-tuned on the face PAD databases. The use of transfer learning with a deep CNN model is motivated by its strong feature extraction capability as well as the reduced computational requirements of such training. An experiment evaluating the two transfer learning strategies mentioned above suggests that fine-tuning a pre-trained model yields better performance; the details are presented in Section IV-B. The block diagram of the proposed deep CNN-based face PAD network is presented in Fig. 2.
The MobileNet model is based on depth-wise separable convolution [44], in which a standard convolution is factored into a depth-wise convolution and a point-wise convolution. The filtering and combining of inputs into an output are done in a single step in a standard convolution, whereas the depth-wise separable convolution splits this task into two layers, significantly reducing the model size and the number of computations performed in the model. A standard convolution takes D_A × D_A × M input features and yields an output feature map of dimensions D_A × D_A × N, resulting in a computational cost of:

D_K · D_K · M · N · D_A · D_A    (1)

where D_K × D_K is the kernel size, M is the number of input channels, N is the number of output channels, and D_A × D_A is the size of the feature map. The MobileNet architecture uses 3 × 3 depth-wise separable convolutions, which reduce the computations substantially. Using depth-wise separable convolution results in a computational cost of

M · D_K · D_K · D_A · D_A + N · M · D_A · D_A    (2)

which can be simplified as

M · D_A · D_A · (D_K · D_K + N)    (3)

so the computations are reduced relative to the standard convolution in (1) by a factor of (D_K · D_K · N)/(D_K · D_K + N), i.e., roughly 8-9× for 3 × 3 kernels.
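The saving can be verified numerically; the sketch below evaluates (1) and (3) for one of the 14 × 14 × 512 layers of Table 1 with D_K = 3, an illustrative choice of values.

```python
# Cost of a standard vs. a depth-wise separable convolution, per Eqs. (1)-(3).
def conv_costs(D_K=3, D_A=14, M=512, N=512):
    standard = D_K * D_K * M * N * D_A * D_A      # Eq. (1)
    separable = M * D_A * D_A * (D_K * D_K + N)   # Eq. (3)
    return standard, separable, standard / separable

std, sep, ratio = conv_costs()
print(f"standard: {std:,}, separable: {sep:,}, reduction: {ratio:.2f}x")
# -> reduction of about 8.85x, i.e., D_K^2 * N / (D_K^2 + N)
```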
The proposed face PAD method uses the pre-trained MobileNet model as its base; the resulting model architecture is presented in Table 1.
TABLE 1: Proposed Face PAD Model Architecture
Layer Type / Strides         Input Shape       Output Shape
Diffusion                    30 × 40 × 3       224 × 224 × 3
Conv2D / 2                   224 × 224 × 3     112 × 112 × 32
Conv2D Depth-wise / 1        112 × 112 × 32    112 × 112 × 32
Conv2D Point-wise / 1        112 × 112 × 32    112 × 112 × 64
Zero Padding                 112 × 112 × 64    113 × 113 × 64
Conv2D Depth-wise / 2        113 × 113 × 64    56 × 56 × 64
Conv2D Point-wise / 1        56 × 56 × 64      56 × 56 × 128
Conv2D Depth-wise / 1        56 × 56 × 128     56 × 56 × 128
Conv2D Point-wise / 1        56 × 56 × 128     56 × 56 × 128
Zero Padding                 56 × 56 × 128     57 × 57 × 128
Conv2D Depth-wise / 2        57 × 57 × 128     28 × 28 × 128
Conv2D Point-wise / 1        28 × 28 × 128     28 × 28 × 256
Conv2D Depth-wise / 1        28 × 28 × 256     28 × 28 × 256
Conv2D Point-wise / 1        28 × 28 × 256     28 × 28 × 256
Zero Padding                 28 × 28 × 256     29 × 29 × 256
Conv2D Depth-wise / 2        29 × 29 × 256     14 × 14 × 256
Conv2D Point-wise / 1        14 × 14 × 256     14 × 14 × 512
5× Conv2D Depth-wise / 1     14 × 14 × 512     14 × 14 × 512
5× Conv2D Point-wise / 1     14 × 14 × 512     14 × 14 × 512
Zero Padding                 14 × 14 × 512     15 × 15 × 512
Conv2D Depth-wise / 2        15 × 15 × 512     7 × 7 × 512
Conv2D Point-wise / 1        7 × 7 × 512       7 × 7 × 1024
Conv2D Depth-wise / 1        7 × 7 × 1024      7 × 7 × 1024
Conv2D Point-wise / 1        7 × 7 × 1024      7 × 7 × 1024
Dense / 1                    50176             512
Sigmoid                      512               1
The top layers of the MobileNet base model are replaced by a simple MLP classifier network containing two fully-connected layers with 512 and 1 units, respectively. All layers are followed by batch normalization and ReLU activation, except for the final fully-connected layer, where a sigmoid activation is used for binary classification. We used the Adam optimizer with a learning rate of 1 × 10^-4 for model compilation.
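A minimal sketch of this head on a fine-tuned MobileNet, assuming the Keras implementation; everything other than the stated choices (512/1 units, batch normalization, ReLU, sigmoid, Adam with lr = 1 × 10^-4) is illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = True  # fine-tuning setting (Scheme-3 in Section IV-B)

model = models.Sequential([
    base,
    layers.Flatten(),                       # 7x7x1024 -> 50176, as in Table 1
    layers.Dense(512),
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.Dense(1, activation="sigmoid"),  # real vs. spoof
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
```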
III. MATERIALS AND METHODS
A. DATASETS
The proposed deep learning framework was extensively evaluated on four standard datasets, namely: (1) Replay-Attack, (2) Replay-Mobile, (3) CASIA-FASD, and (4) ROSE-Youtu. Each dataset is described below.
1) Replay-Attack Dataset
The Replay-Attack dataset [45] is a 2D face presentation attack dataset consisting of 1300 video clips (9-15 seconds in duration) of photo and video attack attempts on 50 clients. The video clips were shot using different cameras under controlled and adverse lighting conditions; examples of real and attack video frames under adverse and controlled lighting are presented in Fig. 3. The first, second, and third columns show the video frames for a real user, an attack using a fixed stand, and an attack using hands to hold the spoofing device, respectively. The first row shows the video frames under adverse lighting and the second row under controlled lighting conditions. All videos are recorded at a frame rate of 25 FPS with a resolution of 320 × 240. The dataset is divided into training, testing, and development sets and contains 4 real and 20 attack videos per client/subject. The training and development sets each contain 15 subjects with 360 video clips, while the test set consists of 20 subjects with 480 video clips. The subjects present in one set do not appear in any other set. The dataset details are listed in Table 2.

FIGURE 2: Proposed face PAD network
FIGURE 3: Example of real and attack video frames under
adverse (first row) and controlled (second row) lighting in
Replay-Attack dataset
FIGURE 4: Sample frames from real and attack video clips
from Replay-Mobile database
2) Replay-Mobile Dataset
The Replay-Mobile dataset [46] was developed for face recognition and face PAD in 2016. It contains 1030 video clips of photo and video attacks of 40 subjects. These video clips were recorded on different mobile devices under different lighting conditions. The dataset is divided into train, test, and development sets, and the subjects present in one set do not appear in any other set. The details of the Replay-Mobile dataset are listed in Table 2. Some examples of
the extracted frames from the Replay-Mobile dataset are
presented in Fig. 4.
3) CASIA-FASD Dataset
The CASIA-FASD dataset [40], developed for face anti-spoofing, was released in 2012. CASIA-FASD is a small dataset containing diverse photo and video attacks of different image qualities for 50 subjects. Each subject has 3 genuine video clips and 9 attack video clips; hence the dataset contains 600 video clips in total. The dataset is divided into 20 subjects for training and 30 subjects for testing, and each subject appears in only one of the sets. Some examples of the extracted frames from the CASIA-FASD dataset are presented in Fig. 5.

FIGURE 5: Sample frames from real and attack video clips from CASIA-FASD database
4) ROSE-Youtu Dataset
The ROSE-Youtu face liveness detection dataset [47], [48] is a comprehensive face anti-spoofing database. It covers a variety of lighting conditions, attack types, and camera models, and consists of 3350 video clips of 20 subjects. For each subject, there are around 150-200 video clips with an average duration of 10-12 seconds. Three types of spoofing attack are covered in this database: video replay attacks, print photo attacks, and masking (paper mask) attacks. Performance on the ROSE-Youtu database is usually measured using the equal error rate (EER). Some examples of the extracted frames from the ROSE-Youtu dataset are presented in Fig. 6.
TABLE 2: Dataset Description
Dataset              Year   Number of Subjects   Real / Attack
Replay-Attack [45]   2012   50                   200 / 1000
Replay-Mobile [46]   2016   40                   390 / 640
CASIA-FASD [40]      2012   50                   150 / 450
ROSE-Youtu [47]      2018   20                   1000 / 2350
FIGURE 6: Sample frames from real and attack (masking attack) video clips from ROSE-Youtu dataset

B. PERFORMANCE EVALUATION METRICS
Since face PAD is essentially a classification problem, standard threshold-dependent performance evaluation metrics such as sensitivity (Sen), specificity (Spe), Youden's index (YI), and F1-score are reported in this paper. Besides these metrics, face liveness classifiers, commonly called face PAD methods, can also be evaluated on the basis of classification accuracy [31]. Each of these parameters can be calculated using the following equations:

Sen = TP / (TP + FN)    (4)

Spe = TN / (TN + FP)    (5)

YI = Sensitivity + Specificity − 1    (6)

F1-Score = 2 × (Prec × Recall) / (Prec + Recall)    (7)

Prec = TP / (TP + FP)    (8)

where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively. Sensitivity is thus the fraction of real/genuine face images correctly classified as genuine, while specificity is the fraction of spoof/attack faces correctly classified as attacks. Youden's index integrates sensitivity and specificity in a way that emphasizes both; its value ranges between 0 and 1, with 0 being the worst result and 1 the perfect value, indicating no false positives or false negatives. The F1-score is also a popular measure of test accuracy; it is the harmonic mean of precision (Prec) and recall/sensitivity.

Besides these measures, most of the face PAD literature also reports the half total error rate (HTER), defined as:

HTER = (FAR + FRR) / 2    (9)

where FAR and FRR are the false acceptance rate and the false rejection rate, respectively:

FAR = FP / (FP + TN)    (10)

FRR = FN / (FN + TP)    (11)
HTER is used as a performance metric in most of the face anti-spoofing literature because most face PAD databases exhibit a significant imbalance between the real and attack classes, mainly due to the difficulty of obtaining real data compared to imposter/attack data. Since FAR and FRR are independent of the database distribution, and HTER is their average, HTER is more convenient for performance comparison of face PAD methods than the standard classification error metrics.
For performance evaluation on the ROSE-Youtu and CASIA-FASD databases, it is recommended to use the equal error rate (EER) instead of HTER. Both HTER and EER are reported as percentages in the literature. EER is defined here as:

EER = (FP + FN) / (TP + FP + FN + TN)    (12)
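For reference, the metrics in (4)-(12) can be computed directly from the confusion counts as below; this is a transcription of the equations, with the real/genuine class taken as positive, as in the text.

```python
# Metrics of Eqs. (4)-(12) from confusion counts (positive = real/genuine).
def pad_metrics(tp, fp, tn, fn):
    sen = tp / (tp + fn)                    # Eq. (4), sensitivity / recall
    spe = tn / (tn + fp)                    # Eq. (5), specificity
    yi = sen + spe - 1                      # Eq. (6), Youden's index
    prec = tp / (tp + fp)                   # Eq. (8), precision
    f1 = 2 * prec * sen / (prec + sen)      # Eq. (7), F1-score
    far = fp / (fp + tn)                    # Eq. (10), false acceptance rate
    frr = fn / (fn + tp)                    # Eq. (11), false rejection rate
    hter = (far + frr) / 2                  # Eq. (9), half total error rate
    eer = (fp + fn) / (tp + fp + fn + tn)   # Eq. (12), as defined above
    return dict(sen=sen, spe=spe, yi=yi, f1=f1, hter=hter, eer=eer)
```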
Among the various performance evaluation metrics, one of the most important for evaluating the classification performance of a binary classification system, or any classifier in general, is the receiver operating characteristic (ROC) curve, often summarized as the area under the curve - receiver operating characteristics (AUC-ROC), or simply the area under the ROC (AUROC) curve. The ROC curve is a probability curve, and the AUC represents the degree of class separability: the higher the AUC, the better the classifier is at discriminating class 0 from class 1. The ROC curve plots sensitivity, or the true positive rate (TPR), against the false positive rate (FPR), where FPR = 1 − specificity.
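A sketch of this evaluation with scikit-learn, assuming the sigmoid scores produced by the model of Section II-B; function and variable names are illustrative.

```python
from sklearn.metrics import roc_curve, auc

def roc_auc(y_true, y_score):
    """y_true: 1 = real, 0 = attack; y_score: sigmoid outputs of the model."""
    fpr, tpr, _ = roc_curve(y_true, y_score)  # FPR = 1 - specificity, TPR = sensitivity
    return fpr, tpr, auc(fpr, tpr)
```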
C. TRAINING SETUP
Face anti-spoofing datasets are generally imbalanced, i.e., there are fewer real video clips than attack videos. For the Replay-Attack and Replay-Mobile datasets, we therefore randomly selected 25 frames from each video in the real training, development, and testing sets, and 10 frames from each attack video in the training, development, and testing sets. To keep the number of samples/images in both classes balanced, we captured 40 frames from each real video clip in the CASIA-FASD database and 15 frames from the respective attack video clips. Similarly, for the ROSE-Youtu database, 6 random frames were selected from each real video clip and 2 frames from each attack video clip. The frames were captured at dimensions of 30 × 40 for all datasets used in this study except CASIA-FASD, where the captured frame dimensions were kept at 224 × 224.
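The per-video sampling can be sketched as below, assuming OpenCV decoding; the frame counts per dataset follow the text, while the helper itself is illustrative.

```python
import random
import cv2

def sample_frames(video_path, n_frames, size_hw=(30, 40)):
    """Randomly pick n_frames from a clip and resize them to size_hw."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in sorted(random.sample(range(total), n_frames)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)   # seek to the sampled frame
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (size_hw[1], size_hw[0])))
    cap.release()
    return frames

# e.g., 25 real / 10 attack frames per clip for Replay-Attack and Replay-Mobile
```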
The proposed model was trained for 50 epochs with early stopping (patience of 10 epochs) to avoid over-fitting, halting training when the validation loss stopped improving. The best model was saved and used for testing/inference on the test set. The code for training and testing the proposed method on the CASIA-FASD database has been made available in the authors' GitHub repository (https://github.com/mhdshl/Face-PAD-transfer-learning-diffusion).
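Under the assumption that training uses Keras (consistent with the MobileNet sketch above), the stated settings map onto standard callbacks; the file name and dataset variables here are illustrative.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # stop when validation loss has not improved for 10 epochs
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    # keep the best model for test-time inference
    ModelCheckpoint("best_pad_model.h5", monitor="val_loss", save_best_only=True),
]
# model, train_ds, dev_ds are assumed to come from the earlier sketches
history = model.fit(train_ds, validation_data=dev_ds,
                    epochs=50, callbacks=callbacks)
```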
IV. EXPERIMENTAL RESULTS
Extensive experiments were conducted to evaluate the efficacy of the proposed framework. Five independent simulation runs were performed to calculate the standard threshold-dependent metrics, and the results for accuracy, sensitivity, specificity, Youden's index, F1-score, etc., for all datasets are reported in Table 3. The results clearly show the superior discriminating capability of the proposed approach in terms of these standard threshold-dependent metrics.
TABLE 3: Performance of the proposed method evaluated on standard threshold-dependent metrics for Replay-Attack, Replay-Mobile, CASIA-FASD, and ROSE-Youtu datasets.
Dataset        Replay-Attack   Replay-Mobile   CASIA-FASD   ROSE-Youtu
Accuracy (%)   99.93           99.04           99.90        95.04
Sensitivity    0.9980          1.0             0.9902       0.9542
Specificity    1.0             0.9771          0.9709       0.9473
Youden Index   0.9980          0.9771          0.9612       0.9015
F1-Score       0.9990          0.9917          0.9738       0.9461
To verify the robustness of the proposed approach, data visualization was performed for all four databases using PCA embeddings (see Fig. 7). PCA plots of the diffused input images and of the latent feature space of the proposed face PAD network are presented; the latent features are extracted from the 512-node hidden layer of the MLP network. This data visualization exercise not only verifies the effectiveness of the proposed image diffusion scheme, but also shows the discriminating capability of the proposed face PAD model: the latent feature space PCA embeddings in Fig. 7 exhibit strong class discrimination.
We also calculated the AUC and plotted AUC-ROC curves for the performance evaluation of the proposed technique on each dataset. The AUC-ROC curves for the Replay-Attack, Replay-Mobile, ROSE-Youtu, and CASIA-FASD datasets are presented in Figs. 8a, 8b, 8c, and 8d, respectively. The curves and AUC values in Fig. 8 showcase the promising face PAD performance of the proposed approach, with AUCs of 0.9995, 0.9918, 0.9496, and 0.9989 for the Replay-Attack, Replay-Mobile, ROSE-Youtu, and CASIA-FASD databases, respectively.
A. COMPARISON OF DIFFERENT PRE-PROCESSING SCHEMES
In order to evaluate the effectiveness of the proposed diffusion technique and its utility in model training, an experiment was designed to evaluate the performance of the proposed face PAD model on the ROSE-Youtu database with four pre-processing schemes: (1) a model trained without any diffusion in the pre-processing module, (2) a model trained with Perona-Malik anisotropic diffusion, (3) a model trained with Gaussian filtering-based diffusion, and (4) a model trained with the proposed diffusion scheme. The training setup for these experiments is kept the same as outlined in the previous section.

FIGURE 7: PCA embeddings of input space vs. latent feature space, (a) Replay-Attack database, (b) Replay-Mobile database, (c) ROSE-Youtu database, (d) CASIA-FASD database.

FIGURE 8: Receiver operating characteristics (ROC) curves. (a) AUC-ROC curve for training and testing on Replay-Attack database (AUC = 0.9995, EER(%) = 0.06). (b) AUC-ROC curve for training and testing on Replay-Mobile database (AUC = 0.9918, EER(%) = 0.96). (c) AUC-ROC curve for training and testing on ROSE-Youtu database (AUC = 0.9496, EER(%) = 4.95). (d) AUC-ROC curve for training and testing on CASIA-FASD database (AUC = 0.9989, EER(%) = 0.09).
For Perona-Malik anisotropic diffusion, parameters such as the number of iterations, conduction coefficient, stability constant, and step size (distance between adjacent pixels) are kept at their default values. Similarly, for diffusion using Gaussian filtering, the standard deviation of the Gaussian filter is kept at its default value. Table 4 presents the accuracy and HTER/EER scores of the models trained with each of the pre-processing schemes detailed above.

The results presented in Table 4 clearly show that the proposed pre-processing scheme achieves 42.5%, 71.1%, and 44.7% improvements in HTER compared to no diffusion, Perona-Malik diffusion, and Gaussian filtering-based diffusion, respectively.
B. COMPARISON OF DIFFERENT LEARNING SCHEMES
In this section, an experiment evaluating the performance of different learning schemes is conducted. As in the previous experiment, the ROSE-Youtu database is used to compare the two transfer learning schemes discussed in Section II as well as a scheme in which the MobileNet architecture is trained from scratch. The network architecture and the training setup are kept the same for all learning schemes. For simplicity, the learning schemes are denoted as follows. Scheme-1: the MobileNet architecture is loaded without any weights and trained from scratch; Scheme-2: pre-trained ImageNet weights are loaded and the layers of the MobileNet model are frozen (i.e., the MobileNet model is used in a feature extraction setting); Scheme-3: pre-trained ImageNet weights are loaded and the model is trained in a fine-tuning setting. It is noteworthy that in Scheme-2 only the fully-connected layers are trained, while in Scheme-3 the MobileNet layers as well as the fully-connected layers are fine-tuned. Performance is evaluated using the metrics presented in Section III, and the results are presented in Table 5.
TABLE 4: Comparison of the performance of the face PAD model with different pre-processing schemes on ROSE-Youtu database
Pre-Processing Scheme               Accuracy (%)   HTER / EER (%)
No Diffusion                        90.99          8.56 / 9.01
Perona-Malik Diffusion [34], [35]   79.26          17.03 / 20.74
Gaussian Filtering [36]             90.52          8.90 / 9.48
Proposed                            95.04          4.92 / 4.95
TABLE 5: Performance Comparison for Different Learning Schemes on ROSE-Youtu Database
Performance Metric   Scheme-1   Scheme-2   Scheme-3
Accuracy (%)         89.35      91.91      95.04
Sensitivity          0.8692     0.9392     0.9542
Specificity          0.9216     0.8958     0.9473
Youden Index         0.7908     0.8350     0.9015
F1-Score             0.8975     0.9257     0.9461
HTER (%)             10.46      8.25       4.92
The comparative results presented in Table 5 clearly indicate that learning Scheme-3 (fine-tuning) shows superior results compared to Scheme-1 and Scheme-2 in terms of almost every performance evaluation metric. In particular, the HTER results show 52.96% and 40.36% improvements for Scheme-3 over Scheme-1 and Scheme-2, respectively. Transfer learning in a fine-tuning setting is therefore selected for the proposed scheme.
C. COMPARISON WITH STATE-OF-THE-ART FACE PAD METHODS
The performance of the proposed face PAD approach, using our interpolation-based image diffusion method, was compared with state-of-the-art methods on the Replay-Attack and Replay-Mobile datasets.
TABLE 6: Comparative Test Results for Replay-Attack
Dataset.
Method HTER (%)
Diffusion Speed [49] (2015) 12.5
FASNet [50] (2017) 1.20
Diffusion-CNN [51] (2017) 10.0
SfSNet [23] (2020) 3.1
InceptionV4 [31] (2020) 13.54
SCNN [31] (2020) 7.53
SE-ResNet-18 [52] (2020) 3.3
CompactNet [53] (2020) 0.7
VGG16+GMM [54] (2021) 1.46
GoogleNet+GMM [54] (2021) 3.76
WA (PSO+PS) [55] (2021) 0.0
WA (GA+MMS+PS) [56] (2022) 0.0
Proposed Method 0.09
We compared our method with different CNN- and diffusion-based face PAD methods; the comparative results are presented in Table 6 and Table 7 for the Replay-Attack and Replay-Mobile datasets, respectively.
TABLE 7: Comparative Test Results for Replay-Mobile
Dataset.
Method HTER (%)
Arashloo et al. [57] (2020) 6.7
InceptionV4 [31] (2020) 5.94
SCNN [31] (2020) 4.96
VGG16+GMM [54] (2021) 17.21
MKL [58] (2021) 6.7
GoogleNet+GMM [54] (2021) 13.56
WA (PSO+PS) [55] (2021) 5.85
WA (GA+MMS+PS) [56] (2022) 5.12
Proposed Method 1.14
It is evident from the results that the proposed approach shows superior performance compared to most of the competing approaches. For the Replay-Attack database, for instance, the proposed approach outperforms most of the contemporary methods by a large margin and achieves comparable HTER performance to the methods presented in [55] and [56]. Similarly, as presented in Table 7, the proposed method outperforms the contemporary approaches on the Replay-Mobile database.
The results for the ROSE-Youtu and CASIA-FASD datasets are presented in Tables 8 and 9, respectively. The proposed method achieves an HTER/EER (%) score of 4.92/4.95 for the ROSE-Youtu database, outperforming the contemporary methods presented in the literature. Similarly, for CASIA-FASD, the proposed method achieves an HTER of 0.09%, outperforming the other methods presented in the results.
D. CROSS-DOMAIN PERFORMANCE EVALUATION
Extensive experiments were conducted to perform cross-domain performance evaluation of the proposed face PAD technique. In these experiments, the proposed face PAD model was trained on the Replay-Attack training set and tested on the ROSE-Youtu and CASIA-FASD datasets; in a similar setting, we trained the proposed model on each individual dataset, performed testing on the others, and report the HTER (%) values. Table 10 presents the cross-database testing results in terms of HTER. The results exhibit the robustness and versatility of the proposed technique in a cross-database testing scheme: the proposed method achieves a better mean HTER than the face PAD techniques presented in [67], [47], and [68] by 34.25%, 19.11%, and 20.08%, respectively.
For training on Replay-Attack and testing on CASIA-FASD, the proposed method outperforms [67] and [68] by HTER margins of 53.63% and 44.36%, respectively.
TABLE 8: Comparative Test Results for ROSE-Youtu
Dataset.
Method HTER/EER (%)
LPQ (HSV) [16] (2016) 30.4/-
LPQ (YCbCr) [16] (2016) 27.6/-
Wavelet [59] (2017) 26.6/-
Ensemble of Classifiers [60] (2019) 9.3/-
SE-ResNet-18 [52] (2020) -/8.0
WA (PSO+PS) [55] (2021) 5.61/-
FASNet [61] (2021) 8.57/-
ResNet50+GMM [54] (2021) 14.69/-
WA (GA+MMS+PS) [56] (2022) 5.12/-
Fatemifar et al. [62] (2022) 6.34/-
Proposed Method 4.92/4.95
TABLE 9: Comparative Test Results for CASIA-FASD
Dataset.
Method HTER/EER (%)
Deep Metric Learning [63] (2019) 16.74/-
Motion Pattern [64] (2020) 17.81/-
Noise Pattern [64] (2020) 13.33/-
Decision Fusion [64] (2020) 10.54/-
GFA-CNN [65] (2020) -/8.3
S-CNN [66] (2021) -/0.69
Proposed Method 0.09/0.09
The cross-database performance is, however, inferior to [47]. For training on Replay-Attack and testing on ROSE-Youtu, the proposed approach shows superior cross-database HTER, achieving gains of 17.03%, 33.46%, and 20.21% over [47], [67], and [68], respectively. For training on ROSE-Youtu and testing on Replay-Attack, the proposed method outperforms [67], [47], and [68] by HTER margins of 75.63%, 78.27%, and 72.18%, respectively. For testing on CASIA-FASD, the proposed technique improves the HTER margin by 1.06% and 12.67% compared to the techniques presented in [47] and [68]. Lastly, for training on CASIA-FASD and testing on the Replay-Attack database, the proposed method outperforms [67] and [47] by HTER margins of 30.88% and 26.49%, respectively. A model trained on CASIA-FASD and tested on ROSE-Youtu also shows comparable performance.
TABLE 10: Cross-database performance (in terms of HTER (%)) of the proposed approach and the baseline approaches. Columns denote Train dataset → Test dataset (Replay-Attack: RA, ROSE-Youtu: RY, CASIA-FASD: CF).
Method                     RA→RY   RA→CF   RY→RA   RY→CF   CF→RA   CF→RY   Mean HTER (%)
Tzeng et al. [67] (2017)   50.0    49.8    34.6    28.7    41.8    31.4    39.38
Li et al. [47] (2018)      40.1    12.3    38.8    30.1    39.3    31.6    32.03
Wang et al. [68] (2019)    41.7    41.5    30.3    34.1    17.5    29.4    32.42
Proposed Method            33.27   23.09   8.43    29.78   28.89   32.03   25.91
V. CONCLUSION
In this paper, a hybrid face PAD approach is proposed that combines interpolation-based image diffusion with transfer learning of a MobileNet CNN. The proposed framework has shown promising results on the Replay-Attack, Replay-Mobile, CASIA-FASD, and ROSE-Youtu databases, attaining accuracies and HTERs of 99.93% and 0.09%, 99.04% and 1.14%, 99.90% and 0.09%, and 95.04% and 4.92%, respectively. The proposed method also demonstrated superior performance in cross-domain evaluation. The applications of such face PAD approaches are vast; in future work, we aim to combine our face PAD method with face recognition and gesture recognition for student attendance and examination monitoring in educational settings, providing a combined deep learning-based framework for day-to-day school activities. We also aim to improve the cross-domain performance of the proposed method by leveraging it in an unsupervised learning scheme to perform domain adaptation across various complex face PAD databases.
REFERENCES
[1] Pejman Rasti, Tonis Uiboupin, Sergio Escalera, and Gholamreza Anbarja-
fari. Convolutional neural network super resolution for face recognition in
surveillance monitoring. In International conference on articulated motion
and deformable objects, pages 175–184. Springer, 2016.
[2] Samuel Lukas, Aditya Rama Mitra, Ririn Ikana Desanti, and Dion Kris-
nadi. Student attendance system in classroom using face recognition
technique. In 2016 International Conference on Information and Commu-
nication Technology Convergence (ICTC), pages 1032–1035. IEEE, 2016.
[3] Ayham Fayyoumi and Anis Zarrad. Novel solution based on face recogni-
tion to address identity theft and cheating in online examination systems.
Advances in Internet of Things, 2014, 2014.
[4] Keyurkumar Patel, Hu Han, and Anil K Jain. Secure face unlock: Spoof
detection on smartphones. IEEE transactions on information forensics and
security, 11(10):2268–2283, 2016.
[5] Hansung Lee, So-Hee Park, Jang-Hee Yoo, Se-Hoon Jung, and Jun-Ho
Huh. Face recognition at a distance for a stand-alone access control
system. Sensors, 20(3):785, 2020.
[6] Patrick J Grother, Mei L Ngan, Kayee K Hanaoka, et al. Ongoing face
recognition vendor test (FRVT) part 2: Identification. 2018.
[7] Klaus Kollreider, Hartwig Fronthaler, Maycel Isaac Faraj, and Josef Bigun.
Real-time face detection and motion analysis with application in “liveness”
assessment. IEEE Transactions on Information Forensics and Security,
2(3):548–558, 2007.
[8] Wei Bao, Hong Li, Nan Li, and Wei Jiang. A liveness detection method
for face recognition based on optical flow field. In 2009 International
Conference on Image Analysis and Signal Processing, pages 233–236.
IEEE, 2009.
[9] André Anjos, Murali Mohan Chakka, and Sébastien Marcel. Motion-based
counter-measures to photo attacks in face recognition. IET biometrics,
3(3):147–158, 2014.
[10] Gang Pan, Zhaohui Wu, and Lin Sun. Liveness detection for face
recognition. Recent advances in face recognition, pages 109–124, 2008.
[11] Keyurkumar Patel, Hu Han, and Anil K Jain. Cross-database face anti-
spoofing with robust feature representation. In Chinese Conference on
Biometric Recognition, pages 611–619. Springer, 2016.
[12] Rui Shao, Xiangyuan Lan, and Pong C Yuen. Deep convolutional dynamic
texture learning with adaptive channel-discriminability for 3d mask face
anti-spoofing. In 2017 IEEE International Joint Conference on Biometrics
(IJCB), pages 748–755. IEEE, 2017.
[13] Klaus Kollreider, Hartwig Fronthaler, and Josef Bigun. Evaluating liveness
by face images and the structure tensor. In Fourth IEEE Workshop on
Automatic Identification Advanced Technologies (AutoID’05), pages 75–
80. IEEE, 2005.
[14] Tiago de Freitas Pereira, André Anjos, José Mario De Martino, and
Sébastien Marcel. LBP-TOP based countermeasure against face spoofing
attacks. In Asian Conference on Computer Vision, pages 121–132.
Springer, 2012.
[15] Tiago de Freitas Pereira, André Anjos, José Mario De Martino, and
Sébastien Marcel. Can face anti-spoofing countermeasures work in a real
world scenario? In 2013 international conference on biometrics (ICB),
pages 1–8. IEEE, 2013.
[16] Zinelabidine Boulkenafet, Jukka Komulainen, and Abdenour Hadid. Face
spoofing detection using colour texture analysis. IEEE Transactions on
Information Forensics and Security, 11(8):1818–1830, 2016.
[17] Jianwei Yang, Zhen Lei, Shengcai Liao, and Stan Z Li. Face liveness
detection with component dependent descriptor. In 2013 International
Conference on Biometrics (ICB), pages 1–6. IEEE, 2013.
[18] Diego Gragnaniello, Giovanni Poggi, Carlo Sansone, and Luisa Verdoliva.
An investigation of local descriptors for biometric spoofing detection.
IEEE transactions on information forensics and security, 10(4):849–863,
2015.
[19] Zinelabidine Boulkenafet, Jukka Komulainen, and Abdenour Hadid. Face
antispoofing using speeded-up robust features and fisher vector encoding.
IEEE Signal Processing Letters, 24(2):141–145, 2016.
[20] Lei Li, Zhaoqiang Xia, Jun Wu, Lei Yang, and Huijian Han. Face
presentation attack detection based on optical flow and texture analysis.
Journal of King Saud University-Computer and Information Sciences,
34(4):1455–1467, 2022.
[21] Lei Li, Zhaoqiang Xia, Xiaoyue Jiang, Yupeng Ma, Fabio Roli, and Xiaoyi
Feng. 3d face mask presentation attack detection based on intrinsic image
analysis. IET Biometrics, 9(3):100–108, 2020.
[22] Wenyun Sun, Yu Song, Changsheng Chen, Jiwu Huang, and Alex C Kot.
Face spoofing detection based on local ternary label supervision in fully
convolutional networks. IEEE Transactions on Information Forensics and
Security, 15:3181–3196, 2020.
[23] Allan Pinto, Siome Goldenstein, Alexandre Ferreira, Tiago Carvalho, He-
lio Pedrini, and Anderson Rocha. Leveraging shape, reflectance and albedo
from shading for face presentation attack detection. IEEE Transactions on
Information Forensics and Security, 15:3347–3358, 2020.
[24] Yasar Abbas Ur Rehman, Lai-Man Po, Mengyang Liu, Zijie Zou, Weifeng
Ou, and Yuzhi Zhao. Face liveness detection using convolutional-features
fusion of real and deep network generated face images. Journal of Visual
Communication and Image Representation, 59:574–582, 2019.
[25] Yasar Abbas Ur Rehman, Lai-Man Po, and Mengyang Liu. SLNet: Stereo
face liveness detection via dynamic disparity-maps and convolutional
neural network. Expert Systems with Applications, 142:113002, 2020.
[26] Anjith George, Zohreh Mostaani, David Geissenbuhler, Olegs Nikisins,
André Anjos, and Sébastien Marcel. Biometric face presentation attack
detection with multi-channel convolutional neural network. IEEE Trans-
actions on Information Forensics and Security, 15:42–55, 2019.
[27] Haonan Chen, Yaowu Chen, Xiang Tian, and Rongxin Jiang. A cascade
face spoofing detector based on face anti-spoofing R-CNN and improved
Retinex LBP. IEEE Access, 7:170116–170133, 2019.
[28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bern-
stein, et al. ImageNet large scale visual recognition challenge. International
journal of computer vision, 115(3):211–252, 2015.
[29] Haonan Chen, Guosheng Hu, Zhen Lei, Yaowu Chen, Neil M Robertson,
and Stan Z Li. Attention-based two-stream convolutional networks for
face spoofing detection. IEEE Transactions on Information Forensics and
Security, 15:578–593, 2019.
[30] Guillaume Heusch, Anjith George, David Geissbühler, Zohreh Mostaani,
and Sébastien Marcel. Deep models and shortwave infrared information
to detect face presentation attacks. IEEE Transactions on Biometrics,
Behavior, and Identity Science, 2(4):399–409, 2020.
[31] Ranjana Koshy and Ausif Mahmood. Enhanced deep learning architec-
tures for face liveness detection for static and video sequences. Entropy,
22(10):1186, 2020.
[32] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam.
MobileNets: Efficient convolutional neural networks for mobile vision
applications. arXiv preprint arXiv:1704.04861, 2017.
[33] Erkut Erdem. Linear diffusion. Hacettepe University, February 24th, 2012.
[34] Pietro Perona, Takahiro Shiota, and Jitendra Malik. Anisotropic diffusion.
In Geometry-driven diffusion in computer vision, pages 73–92. Springer,
1994.
[35] Guido Gerig, Olaf Kubler, Ron Kikinis, and Ferenc A Jolesz. Nonlinear
anisotropic filtering of MRI data. IEEE Transactions on Medical Imaging,
11(2):221–232, 1992.
[36] Joachim Weickert et al. Anisotropic diffusion in image processing,
volume 1. Teubner Stuttgart, 1998.
[37] Djemel Ziou and Alain Horé. Reducing aliasing in images: a pde-based
diffusion revisited. Pattern recognition, 45(3):1180–1194, 2012.
[38] YQ Wang, Jichang Guo, Wufan Chen, and Wenxue Zhang. Image denoising using modified Perona–Malik model based on directional Laplacian. Signal Processing, 93(9):2548–2558, 2013.
[39] Na Wang, Yu Shang, Yang Chen, Min Yang, Quan Zhang, Yi Liu, and Zhiguo Gui. A hybrid model for image denoising combining modified isotropic diffusion model and modified Perona–Malik model. IEEE Access, 6:33568–33582, 2018.
[40] Zhiwei Zhang, Junjie Yan, Sifei Liu, Zhen Lei, Dong Yi, and Stan Z Li. A face antispoofing database with diverse attacks. In 2012 5th IAPR International Conference on Biometrics (ICB), pages 26–31. IEEE, 2012.
[41] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transfer-
able are features in deep neural networks? arXiv preprint arXiv:1411.1792,
2014.
[42] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
[43] Eva Cetinic, Tomislav Lipic, and Sonja Grgic. Fine-tuning convolutional
neural networks for fine art classification. Expert Systems with Applica-
tions, 114:107–118, 2018.
[44] Laurent Sifre. Rigid-Motion Scattering for Image Classification. PhD thesis, Ecole Polytechnique, CMAP, 2014.
[45] Ivana Chingovska, André Anjos, and Sébastien Marcel. On the effectiveness of local binary patterns in face anti-spoofing. In 2012 BIOSIG - Proceedings of the International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1–7. IEEE, 2012.
[46] Artur Costa-Pazo, Sushil Bhattacharjee, Esteban Vazquez-Fernandez, and Sébastien Marcel. The Replay-Mobile face presentation-attack database. In 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1–7. IEEE, 2016.
[47] Haoliang Li, Wen Li, Hong Cao, Shiqi Wang, Feiyue Huang, and Alex C
Kot. Unsupervised domain adaptation for face anti-spoofing. IEEE
Transactions on Information Forensics and Security, 13(7):1794–1809,
2018.
[48] Zhi Li, Rizhao Cai, Haoliang Li, Kwok-Yan Lam, Yongjian Hu, and
Alex C Kot. One-class knowledge distillation for face presentation attack
detection. IEEE Transactions on Information Forensics and Security, 2022.
[49] Wonjun Kim, Sungjoo Suh, and Jae-Joon Han. Face liveness detection
from a single image via diffusion speed model. IEEE Transactions on
Image Processing, 24(8):2456–2465, 2015.
[50] Oeslle Lucena, Amadeu Junior, Vitor Moia, Roberto Souza, Eduardo Valle, and Roberto Lotufo. Transfer learning using convolutional neural networks for face anti-spoofing. In International Conference on Image Analysis and Recognition, pages 27–34. Springer, 2017.
[51] Aziz Alotaibi and Ausif Mahmood. Deep face liveness detection based on
nonlinear diffusion using convolution neural network. Signal, Image and
Video Processing, 11(4):713–720, 2017.
[52] Guoqing Wang, Hu Han, Shiguang Shan, and Xilin Chen. Unsupervised
adversarial domain adaptation for cross-domain face presentation attack
detection. IEEE Transactions on Information Forensics and Security,
16:56–69, 2020.
[53] Lei Li, Zhaoqiang Xia, Xiaoyue Jiang, Fabio Roli, and Xiaoyi Feng. CompactNet: Learning a compact space for face presentation attack detection. Neurocomputing, 409:191–207, 2020.
[54] Soroush Fatemifar, Shervin Rahimzadeh Arashloo, Muhammad Awais,
and Josef Kittler. Client-specific anomaly detection for face presentation
attack detection. Pattern Recognition, 112:107696, 2021.
[55] Soroush Fatemifar, Muhammad Awais, Ali Akbari, and Josef Kittler.
Particle swarm and pattern search optimisation of an ensemble of face
anomaly detectors. In 2021 IEEE International Conference on Image
Processing (ICIP), pages 3622–3626. IEEE, 2021.
[56] Soroush Fatemifar, Shahrokh Asadi, Muhammad Awais, Ali Akbari, and
Josef Kittler. Face spoofing detection ensemble via multistage optimisa-
tion and pruning. Pattern Recognition Letters, 158:1–8, 2022.
[57] Shervin Rahimzadeh Arashloo. Matrix-regularized one-class multiple
kernel learning for unseen face presentation attack detection. IEEE
Transactions on Information Forensics and Security, 16:4635–4647, 2021.
[59] Wen Li, Lin Chen, Dong Xu, and Luc Van Gool. Visual recognition in RGB images and videos by learning from RGB-D data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(8):2030–2036, 2017.
[60] Soroush Fatemifar, Muhammad Awais, Shervin Rahimzadeh Arashloo,
and Josef Kittler. Combining multiple one-class classifiers for anomaly
based face spoofing attack detection. In 2019 International Conference on
Biometrics (ICB), pages 1–7. IEEE, 2019.
[61] Naima Bousnina, Lilei Zheng, Mounia Mikram, Sanaa Ghouzali, and
Khalid Minaoui. Unraveling robustness of deep face anti-spoofing models
against pixel attacks. Multimedia Tools and Applications, 80(5):7229–
7246, 2021.
[62] Soroush Fatemifar, Muhammad Awais, Ali Akbari, and Josef Kittler. De-
veloping a generic framework for anomaly detection. Pattern Recognition,
124:108500, 2022.
[63] Daniel Pérez-Cabo, David Jiménez-Cabello, Artur Costa-Pazo, and
Roberto J López-Sastre. Deep anomaly detection for generalized face
anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops, pages 0–0, 2019.
[64] Yukun Ma, Yaowen Xu, and Fanghao Liu. Multi-perspective dynamic
features for cross-database face presentation attack detection. IEEE
Access, 8:26505–26516, 2020.
[65] Xiaoguang Tu, Zheng Ma, Jian Zhao, Guodong Du, Mei Xie, and Jiashi
Feng. Learning generalizable and identity-discriminative representations
for face anti-spoofing. ACM Transactions on Intelligent Systems and
Technology (TIST), 11(5):1–19, 2020.
[66] Ruijie Quan, Yu Wu, Xin Yu, and Yi Yang. Progressive transfer learning
for face anti-spoofing. IEEE Transactions on Image Processing, 30:3946–
3955, 2021.
[67] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
[68] Guoqing Wang, Hu Han, Shiguang Shan, and Xilin Chen. Improving
cross-database face presentation attack detection via adversarial domain
adaptation. In 2019 International Conference on Biometrics (ICB), pages
1–8. IEEE, 2019.
MADINI O. ALASSAFI received his B.S. degree in Computer Science from King Abdulaziz University, Saudi Arabia, in 2006, and his M.S. degree in Computer Science from California Lutheran University, United States of America, in 2013. He was awarded the Ph.D. in “Security Cloud Computing” in February 2018 from the University of Southampton, United Kingdom. He is currently working as an associate professor in the Information Technology Department of the Faculty of Computing and Information Technology at King Abdulaziz University. His research interests span mainly around Cloud Computing and Security, Distributed Systems, Internet of Things (IoT) Security issues, Cloud Security Adoption, Risks, Cloud Migration Project Management, Cloud of Things, and Security Threats. He is now the Vice-Dean of the Faculty of Computing and Information Technology at King Abdulaziz University, Jeddah, Saudi Arabia.
MUHAMMAD SOHAIL IBRAHIM received his
B.E. degree in Electronic Engineering from Iqra
University, Pakistan in 2012, and received his
M.E. degree in Telecommunications from N.E.D.
University of Engineering & Technology, Pakistan
in 2016. From 2013 to 2019, he was a Lec-
turer with the Faculty of Engineering, Science,
and Technology, Iqra University, Karachi, Pak-
istan. From 2013 to 2014, he also served as a
Research Assistant with Embedded Systems Re-
search Group, Karachi Institute of Economics and Technology, Pakistan.
He is currently with the Smart Energy Systems Lab, College of Electrical
Engineering, Zhejiang University, China. His research interests include deep
learning, computer vision, and deep learning applications in energy systems. Muhammad Sohail Ibrahim was a recipient of the 2020 Highly Cited Review Paper Award from Applied Energy (Elsevier) for his review paper titled "Machine learning driven smart electric systems: Current trends and new perspectives".
IMRAN NASEEM received his B.E. (Electrical
Engineering) degree in 2002 from the NED Uni-
versity of Engineering and Technology, Pakistan.
He received his M.S. (Electrical Engineering) in 2005 from the King Fahd University of Petroleum and Minerals (KFUPM), KSA, and his Ph.D. in 2010 from The University of Western Australia. He completed his postdoctoral research at the Institute for Multi-sensor Processing and Content Analysis, Curtin University of Technology, Australia. He joined the College of Engineering, KIET, Pakistan, in 2011, where he is currently a Professor. He
is also an Adjunct Research Fellow at the School of Electrical, Electronic
and Computer Engineering at The University of Western Australia. His
research interests include pattern classification and machine learning with
a special emphasis on biometrics and bioinformatics applications. He has
authored several publications in top journals and conferences, including IEEE Transactions on Pattern Analysis and Machine Intelligence and the IEEE International Conference on Image Processing. His benchmark work
on face recognition has received more than 180 citations in less than four
years. He is also a reviewer of IEEE Transactions on Pattern Analysis and
Machine Intelligence, IEEE Transactions on Image Processing and IEEE
Signal Processing Letters.
RAYED ALGHAMDI holds a bachelor's degree in Computer Science and master's and Ph.D. degrees in Communication and Information Technology. He is currently involved in designing a fully interactive e-learning course aimed at enhancing soft skills for computing students. His current research interests involve e-learning applications and computing students' readiness to confidently join the industry workforce.
REEM ALOTAIBI (Co-author photograph not
available) is an associate professor at the Faculty
of Computing and Information Technology, King
Abdulaziz University, Jeddah, Saudi Arabia. Cur-
rently, she is the supervisor of the Information
Technology Department. Dr. Alotaibi received her PhD in computer science from the University of Bristol, Bristol, U.K., in 2017. During 2017-2018, she was a visiting lecturer at the Intelligent Systems Laboratory, University of Bristol. Her research interests include Artificial Intelligence, Machine Learning, Data Mining, and Crowd Management. Dr. Alotaibi’s research has been funded by several sources in Saudi Arabia, including the Deputyship for Research & Innovation, Ministry of Education, King Abdulaziz City for Science and Technology (KACST), and the Deanship of Scientific Research (DSR), King Abdulaziz University.
FARIS A. KATEB received a Ph.D. in Computer
Science from the University of Colorado, USA.
Currently, he is an assistant professor and head of the IT Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, KSA. His research interests include Computer Vision and Image Processing applications such as Object Detection, Face Recognition, and Adversarial Examples. He is also working on Natural Language Processing for Arabic and English. He participates as a speaker and presenter at conferences in artificial intelligence and serves as a member of the advisory board in other departments.
HADI MOHSEN OQAIBI received his B.S.,
M.S., and Ph.D. degrees in Computer Science
from King Abdulaziz University, Saudi Arabia. He
is currently working as an assistant professor in the
Faculty of Computing and Information Technol-
ogy at King Abdulaziz University, Saudi Arabia.
His research interests are mainly focused on Machine Learning, Deep Learning, Pattern Recognition, and Image Processing. He has published several peer-reviewed journal articles and conference papers.
ABDULRAHMAN A. ALSHDADI is an assistant professor of Computer Science in the Faculty of Computing and Information Technology at the University of Jeddah. He was awarded the Ph.D. in cloud computing in February 2018 from the University of Southampton, Southampton, UK. His research interests span mainly around Industry 4.0 issues pertaining to Cloud Computing and Fog Computing Security, Internet of Things (IoT) and Smart Cities, Intelligent Systems, Deep Learning, and Data Science Analytics and Modelling. He has published numerous conference papers, journal papers, and one book chapter. He is now acting as the head of the Computer Science and Artificial Intelligence Department (CSAI) as well as Vice Dean of the College of Computer Science and Engineering (CCSE) at the University of Jeddah, Jeddah, Saudi Arabia.
SYED ADNAN YUSUF is currently the Director
of a UK-based research and development firm spe-
cializing in advanced computer vision algorithms
with a focus on deep-learning technologies. His
career originates from a computer systems engi-
neering background with an interest in advanced
biometrics, intelligent transport systems, long-
range object detection, tracking and identification.
As a research scientist, he has led various teams working in the domains of document verification, facial identity analysis, and traffic violation detection. In his previous role with Hitachi Europe, he led a team of scientists and engineers developing autonomous perception and motion planning systems for the Nissan Leaf electric vehicle. The work led to a fully autonomous 200+-mile journey on a variety of U.K. roads as part of the Human Drive project. In the research domain, his focus is on CNN/RNN algorithms for driverless autonomous control and video analytics systems, and on deep residual networks for the face recognition domain. With a background in the Computer Vision and Deep Learning domains, he has contributed to projects including firefighter safety, maritime condition monitoring, financial technologies, and intelligent transport systems.