Citation: Li, J.; Sun, W. A Neural Network Based on Supervised Multi-View Contrastive Learning and Two-Stage Feature Fusion for Face Anti-Spoofing. Electronics 2024, 13, 4865. https://doi.org/10.3390/electronics13244865
Academic Editor: Dah-Jye Lee
Received: 10 November 2024; Revised: 4 December 2024; Accepted: 6 December 2024; Published: 10 December 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
A Neural Network Based on Supervised Multi-View Contrastive
Learning and Two-Stage Feature Fusion for Face Anti-Spoofing
Jin Li 1 and Wenyun Sun 2,*
1 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 215500, China; jinli@nuist.edu.cn
2 School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 215500, China
* Correspondence: wenyunsun@nuist.edu.cn
Abstract: As one of the most crucial safeguards for face recognition systems, face anti-spoofing demands both high accuracy and strong generalization. Therefore, this paper proposes a multi-branch network to improve the accuracy and generalization of the detection of unknown spoofing attacks. The branches consist of several frequency map encoders and one depth map encoder, which are trained together; the network leverages multiple frequency features and generates depth map features. High-frequency edge texture is beneficial for capturing moiré patterns, while low-frequency features are sensitive to color distortion. Depth maps are more discriminative than RGB images at the visual level and serve as useful auxiliary information. Supervised Multi-view Contrastive Learning enhances multi-view feature learning, and a two-stage feature fusion method effectively integrates the multi-branch features. Experiments on four public datasets, namely CASIA-FASD, Replay–Attack, MSU-MFSD, and OULU-NPU, demonstrate the effectiveness of the model. In inter-set evaluations, the average Half Total Error Rate (HTER) of our model is 4 percentage points lower (25% to 21%) than that of the Adversarial Domain Adaptation method.
Keywords: face anti-spoofing; domain transfer network; supervised contrastive learning; feature fusion
1. Introduction
Face recognition has increasingly been used in interactive intelligent applications, such as check-in and mobile payment, due to its convenience and accuracy. However, existing face recognition systems are vulnerable to presentation attacks, ranging from print and replay to makeup and 3D masks [1]. As a result, considerable effort from both academia and industry has been devoted to developing face anti-spoofing (FAS) technology to secure face recognition systems [2]. In recent studies, some hybrid methods have combined handcrafted features with deep learning features for FAS [3,4] and achieved satisfactory results. However, handcrafted features rely on a priori designs that cannot be learned. Furthermore, handcrafted features and generated depth map features may be incompatible. Therefore, this study proposes a multi-branch network combining frequency images and depth maps. In addition, a supervised contrastive learning method and a suitable feature fusion method are proposed, which are expected to alleviate both the feature training and the feature fusion problems.
Handcrafted features have been shown to be discriminative for distinguishing bona fide samples from presentation attacks in FAS [2], and recent works combine them with depth map features to further improve FAS performance [3,5,6]. Regarding distinguishing genuine and spoofing samples, high-frequency features emphasize high-frequency interference such as moiré patterns in print and video attacks, while low-frequency features highlight color distortion [3]. Employing pixel-level supervision offers more precise, task-relevant cues, enabling improved intrinsic feature learning. Recent studies have explored pseudo-depth labels for guiding model training [7–9]. Liu et al. [7] noted that depth can be utilized as auxiliary
information to supervise both live and spoof faces, because, from the spatial perspective, live faces have face-like depth, while faces in print or replay attacks have flat or planar depth. Wang et al. [8] utilized Principal Component Analysis (PCA) to visualize the features of Convolutional Neural Networks (CNNs) in both the RGB and depth domains. Upon comparing the visualizations, they observed that the CNN features of live and spoofing faces could always be easily separated in the depth domain.
Tian et al. [10] found that contrastive loss outperforms cross-view prediction, with more views improving semantic capture. Khosla et al. [11] extended self-supervised batch contrastive learning to the fully supervised setting, using label information to pull same-class points together and push different-class clusters apart in embedding space. This paper combines fully supervised contrastive learning [11] with contrastive multi-view coding [10], proposing Supervised Multi-view Contrastive Learning (SMCL), as shown in Figure 1. When the SMCL method is not applied, a classifier will undesirably leverage spurious correlations to make genuine/spoof predictions, and the distribution of multi-view features with the same label remains relatively dispersed, as shown on the left side of Figure 1. Guided by the labels, originally dispersed multi-view features such as depth maps and frequency images become more tightly clustered on the normalized hyperplane by SMCL, as shown on the right side of Figure 1. Additionally, Asymmetric Triplet Loss [12] is applied to the fused features, further minimizing the distance between positive samples for a more compact positive sample distribution.
Figure 1. Cases with and without Supervised Multi-view Contrastive Learning. (Left) Without SMCL, multi-view features of genuine and spoofing samples are dispersed on the normalized hyperplane; (right) with SMCL, same-label features form tight clusters.
Simple feature fusion methods like concatenation and weighted summation are commonly used for multi-scale feature fusion [13–15]. However, they may not be optimal when dealing with differences across multiple views. Therefore, this paper proposes a two-stage feature fusion method. In stage one, fusing the three frequency features is treated as multi-scale feature fusion, and weighted summation with independent learnable weights is employed. In stage two, fusing frequency and depth map features is treated as multi-view feature fusion, and a bilinear pooling technique [16] is introduced to fuse these features. The main contributions of this paper are as follows:
• This study proposes a multi-branch network combining frequency images and depth maps. The model incorporates high-frequency and low-frequency encoders that extract often-overlooked high-frequency interference information and low-frequency color distortion. It also uses a Generative Adversarial Network to generate depth maps, utilizing the depth map features as an additional input to the model.
• Supervised Multi-view Contrastive Learning (SMCL) is proposed, maximizing the similarity between different views of the same face to better capture potential semantic information.
• A two-stage feature fusion method (TFF) is proposed to address the challenges of varying scales and incompatible features across high-frequency images, low-frequency images, depth maps, and enhanced RGB images.
2. Related Work
2.1. Face Anti-Spoofing
Some CNN-based FAS models performed well in intra-dataset experiments. However,
when faced with inter-dataset experiments, these models, supervised by binary classifica-
tion, showed poor generalization. Additionally, end-to-end models are difficult to interpret.
Furthermore, hybrid algorithms have the obvious drawback of incompatibility between handcrafted features and depth map features, which can limit model performance [2].
Generative Adversarial Networks (GANs) have garnered significant popularity in face anti-spoofing (FAS) algorithms, particularly for style transfer and domain transformation tasks. The dual encoder–decoder GAN [17] is adept at generating realistic facial images with continuous pose variations. Many contemporary domain-adaptive algorithms harness GANs to bridge the discrepancy between source and target domains. Jun et al. [18] utilized a GAN for mapping features from the source domain to the target domain. The style feature scrambling network [19] leverages GANs to distill domain-independent content features by blending content features from diverse domains.
The use of facial depth maps in FAS is a common practice. Several recent studies [7,20] incorporate pseudo-depth labels to assist FAS models. These labels enforce the prediction of accurate depths for genuine samples, while spoofing samples are mapped to zero depth. Liu et al. [7] introduced depth maps as auxiliary information for FAS models. Wang et al. [8] adopted Domain-Adversarial Training of Neural Networks [21] to train an encoder for transferring RGB images to depth images.
2.2. Contrastive Learning
Contrastive learning is a discriminative method that aims to pull similar samples closer together and push dissimilar samples farther apart [22]. Chen et al. [23] introduced a simple framework for contrastive learning of visual representations. Khosla et al. [11] incorporated labels as a mask matrix to distinguish positive and negative samples. In the context of FAS, Sun et al. [24] employed supervised contrastive learning to achieve data separation across different domains. This was accomplished by augmenting classification labels with domain labels and introducing domain information into the mask matrix. Tian et al. [10] proposed that humans perceive the world through multiple sensory channels, each noisy and incomplete, yet informative. Their study aimed to learn shared information representations between multiple sensory views. This was achieved by using pairwise permutations of different views as sample pairs, with positive or negative class labels.
2.3. Feature Fusion
Feature fusion is an integral part of multi-branch networks and is important for enhancing the features extracted by different encoders. The basic fusion operation is usually an element-wise addition or concatenation. Multi-scale features are fused through lateral connections [14], where element-wise addition mitigates aliasing. He et al. [25] employ an attention-based method to fuse multi-view features. Kim et al. [16] utilize a dot-product approximation of bilinear pooling to fuse multi-view features. Bilinear pooling involves an element-level product operation on features, effectively preserving information from each view.
3. Materials and Methods
3.1. Overall Structure
The proposed network, as shown in Figure 2, can be divided into three parts. In Part I, the RGB images are decomposed into frequency images. Then, facial depth maps are generated using the Generative Adversarial Network; the depth map encoder structure follows DepthNet [7]. In Part II, depth map features and multi-scale frequency features are extracted using multiple encoders. The encoders extracting frequency features are ResNet18 [26], and the encoder for depth maps is a 3-layer Convolutional Neural Network (CNN) with 3 × 3 convolutional kernels. After that, these features are mapped to a normalized hyperplane
by full connection and L2 normalization. Subsequently, Supervised Multi-view Contrastive
Learning (SMCL) is employed to train the multi-view features. In Part III, frequency
features and depth maps are fused by TFF. We also use Asymmetric Triplet Loss to form a
compact positive sample feature distribution. Finally, feature classification is performed by
Binary Cross Entropy (BCE) loss.
Figure 2. The structure of the multi-branch network combining frequency images and depth maps.
3.2. Image Decomposition and Ground Truth Depth Map Generation
This paper follows the method of [9] to decompose RGB images. Figure 3 shows the process of image decomposition. First, the original image is downsampled using an average pooling layer. The original image size is 256 × 256, and the downsampled image sizes are 64 × 64 and 128 × 128. Then, each downsampled image is upsampled back to the original size using bilinear interpolation. The high-frequency image is obtained by subtracting the upsampled image from the RGB image, while the upsampled image itself serves as the low-frequency image. The formula is as follows:

$$I_L = U_{256}(D_{128}(I)), \qquad I_H = \left(I - U_{256}(D_{64}(I))\right) \cdot k \tag{1}$$

where $I$ is the original image, $I_H$ is the high-frequency image, and $I_L$ is the low-frequency image. $D_{64}(\cdot)$ and $D_{128}(\cdot)$ denote downsampling the image to 64 × 64 and 128 × 128, respectively, and $U_{256}(\cdot)$ denotes upsampling the image to 256 × 256. The content in the high-frequency bands becomes visible when it is multiplied by 25 [9]; thus, $k$ is an integer with value 25. In order to ensure the integrity of the information, we also use the original RGB images as one of the inputs to the network. However, the original RGB images are randomly cropped and scaled. In this experiment, the area of the cropped region is randomly selected to be between 10% and 40% of the original RGB image area.
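For concreteness, the decomposition of Equation (1) can be expressed in a few lines of PyTorch. The sketch below assumes batched 256 × 256 RGB tensors; the function name and the specific pooling/interpolation calls are illustrative rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def decompose_frequency(img, k=25):
    """Split a (B, 3, 256, 256) RGB batch into low- and high-frequency images (Equation (1))."""
    # Low frequency: average-pool to 128x128, then bilinearly upsample back to 256x256.
    low = F.interpolate(F.avg_pool2d(img, kernel_size=2), size=(256, 256),
                        mode='bilinear', align_corners=False)
    # High frequency: residual against a coarser 64x64 reconstruction, amplified by k = 25.
    coarse = F.interpolate(F.avg_pool2d(img, kernel_size=4), size=(256, 256),
                           mode='bilinear', align_corners=False)
    high = (img - coarse) * k
    return low, high

# Example: a batch of two images
x = torch.rand(2, 3, 256, 256)
x_low, x_high = decompose_frequency(x)
print(x_low.shape, x_high.shape)  # torch.Size([2, 3, 256, 256]) twice
```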
Figure 3. The process of image decomposition.
The 3DDFA-V2 model [27] is introduced to generate the ground truth depth maps for depth estimation. Figure 4 shows the RGB images, high- and low-frequency images, and depth images, respectively. The images from top to bottom represent a genuine image, a print attack, a replay attack, and a mask attack.
Figure 4. Original, high-frequency, low-frequency, and ground truth depth images.
3.3. Supervised Multi-View Contrastive Learning
Given a mini-batch $\{x_i, y_i\}_{i=1\ldots K}$ of size $K$, the corresponding batch used for training contains $2K$ pairs $\{\tilde{x}_i, \tilde{y}_i\}_{i=1\ldots 2K}$. $\tilde{x}_{2i}$ and $\tilde{x}_{2i-1}$ are different views with the same label, $\tilde{y}_{2i} = \tilde{y}_{2i-1} = y_i$. Within a multi-view batch, $i \in I \equiv \{1 \ldots 2K\}$ is the index of an arbitrary view, $p \in P(i)$ indexes the other views with the same label, and $|P(i)|$ is its cardinality. After encoding, fully connecting, and L2 normalizing, these inputs are projected onto a normalized hyperplane, yielding $z_i$ and $z_p$. $A(i)$ denotes all samples except $\tilde{x}_i$, and $z_a$ refers to a projected sample other than $z_i$. Supervised Contrastive Learning [11] is defined as Equation (2):

$$\mathcal{L}_{Sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \tag{2}$$

$\tau$ is the temperature parameter, affecting learning and feature distribution.
Supervised Contrastive Learning is used to train two different views. On this basis, the full graph method is introduced to accomplish contrastive learning over multiple views. Suppose we have $M$ views, from $V_1$ to $V_M$. Considering all pairs $(i, j)$ with $i \neq j$ yields $M \times (M-1)$ ordered pairs. By involving all pairs of views, the objective function of Supervised Multi-view Contrastive Learning is

$$\mathcal{L}_{SMCL} = \sum_{1 \le i < j \le M} \mathcal{L}_{Sup}(V_i, V_j) \tag{3}$$

where $1 \le i < j \le M$ indexes one pair consisting of two different views among the $M$ views. To further narrow the distance between positive samples, Asymmetric Triplet Loss is used, making the positive sample feature distribution more compact.
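A compact PyTorch sketch of Equations (2) and (3) is given below. It assumes that each view has already been encoded, fully connected, and L2-normalized to shape (B, D); the function names and the default temperature are illustrative choices, not values taken from the released code.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(z_i, z_j, labels, tau=0.1):
    """Supervised contrastive loss (Equation (2)) for one pair of views."""
    z = torch.cat([z_i, z_j], dim=0)        # (2B, D) multi-view batch
    y = torch.cat([labels, labels], dim=0)  # (2B,) duplicated labels
    sim = (z @ z.t()) / tau                 # scaled similarities
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()  # numerical stability
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    # Denominator A(i): every sample except the anchor itself.
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True))
    # Positives P(i): other samples sharing the anchor's label.
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()

def smcl_loss(views, labels, tau=0.1):
    """SMCL (Equation (3)): sum the pairwise loss over all view pairs 1 <= i < j <= M."""
    total = 0.0
    for i in range(len(views)):
        for j in range(i + 1, len(views)):
            total = total + sup_con_loss(views[i], views[j], labels, tau)
    return total

# Example: four views (e.g., high-frequency, low-frequency, RGB, depth) of a batch of 8 samples
views = [F.normalize(torch.randn(8, 128), dim=1) for _ in range(4)]
labels = torch.randint(0, 2, (8,))
print(smcl_loss(views, labels))
```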
3.4. Two-Stage Feature Fusion
As shown in Figure 5, the TFF can be divided into two stages. The first stage involves fusing the frequency features. Though high-frequency and low-frequency images both originate from the same RGB domain, they are distinct inputs. Therefore, traditional multi-view fusion methods, designed for scenarios with identical inputs, are unsuitable for fusing features from different inputs. Hence, in the first stage, which focuses on frequency features, the weights for summation are initialized as independent learnable parameters. These parameters are then multiplied by their corresponding frequency features. Equation (4) is as follows:

$$F_{fre} = \lambda_h F_h + \lambda_l F_l + \lambda_i I \tag{4}$$

$F_{fre}$ refers to the fused frequency features, $\lambda_h$ is the weight of the high-frequency features, $\lambda_l$ is the weight of the low-frequency features, and $\lambda_i$ is the weight of the RGB image features. Given that the frequency images and the RGB image originate from the same domain, the value of $\lambda_i$ is set to 0.1 to complement the image information without introducing excessive redundancy. The values of $\lambda_h$ and $\lambda_l$ are both 1.0.
The second stage involves fusing frequency and depth map features. These features
originate from different domains, and their respective encoder structures differ, raising
compatibility issues. However, bilinear pooling can mitigate these issues. By using the
mutual dot product of features, bilinear pooling enables feature fusion without explicit
weight assignment. Moreover, it is not constrained by feature dimensions and can be
applied to features of arbitrary dimensions, making it well suited for fusing depth maps
and frequency features.
Figure 5. The structure of two-stage feature fusion (TFF).
Kim et al. [16] approximate bilinear pooling by employing the Hadamard product between matrices, which effectively reduces the number of parameters involved. This paper follows their bilinear pooling method [16]. Equation (5) describes the fusion process as follows:

$$F_{bp} = U_{fre} \circ V_{dep}, \qquad F = \lambda_{fre} F_{fre} + \lambda_{dep} F_{dep} + \lambda_{bp} F_{bp} \tag{5}$$

$F$ denotes the fused features, $U$ and $V$ are weight matrices, and $\circ$ denotes the Hadamard product, i.e., element-wise multiplication between two matrices. $F_{bp}$ denotes the features fused via bilinear pooling. Similar to skip connections, the features fused by bilinear pooling are treated as residuals, and a weighted sum is computed over the fused frequency features, the depth map features, and the residuals. In stage two, the weights for the fused frequency features, depth map features, and bilinear pooling features are 0.5, 1.0, and 1.0, respectively.
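The two fusion stages can be summarized as a small module. The sketch below assumes all branch features have been projected to a common dimension; the tanh-wrapped linear projections follow the low-rank bilinear pooling of Kim et al. [16], while the class name and the feature dimension are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    """Sketch of TFF: Equation (4) in stage one, Equation (5) in stage two."""
    def __init__(self, dim):
        super().__init__()
        # Stage-1 weights: independent learnable parameters (RGB weight initialized to 0.1).
        self.lam_h = nn.Parameter(torch.tensor(1.0))
        self.lam_l = nn.Parameter(torch.tensor(1.0))
        self.lam_i = nn.Parameter(torch.tensor(0.1))
        # Projection matrices for Hadamard-product (low-rank bilinear) pooling.
        self.U = nn.Linear(dim, dim, bias=False)
        self.V = nn.Linear(dim, dim, bias=False)

    def forward(self, f_high, f_low, f_rgb, f_depth):
        # Stage 1: weighted sum of the frequency-domain features (Equation (4)).
        f_fre = self.lam_h * f_high + self.lam_l * f_low + self.lam_i * f_rgb
        # Stage 2: bilinear pooling via the Hadamard product, used as a residual (Equation (5)).
        f_bp = torch.tanh(self.U(f_fre)) * torch.tanh(self.V(f_depth))
        return 0.5 * f_fre + 1.0 * f_depth + 1.0 * f_bp

fusion = TwoStageFusion(dim=512)
out = fusion(*[torch.randn(4, 512) for _ in range(4)])
print(out.shape)  # torch.Size([4, 512])
```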
3.5. Losses and Training
The depth maps are generated by a GAN [28], trained with the following objective:

$$\arg\min_G \max_D \; \lambda_{GAN}\,\mathcal{L}_{GAN}(G, D) + \lambda_{L1}\,\mathcal{L}_{L1}(G) \tag{6}$$

where $\lambda_{GAN}$ is 1 and $\lambda_{L1}$ is 200. $\mathcal{L}_{GAN}$ is the GAN loss, and $\mathcal{L}_{L1}$ is the L1 reconstruction loss. In summary, the overall loss for classification is defined as

$$\mathcal{L} = \lambda_{smcl}\,\mathcal{L}_{SMCL} + \lambda_{tri}\,\mathcal{L}_{tri} + \mathcal{L}_{cls} \tag{7}$$

$\mathcal{L}_{SMCL}$ is the loss of Supervised Multi-view Contrastive Learning, $\mathcal{L}_{tri}$ is the Asymmetric Triplet Loss [12], and $\mathcal{L}_{cls}$ is the Binary Cross Entropy loss. $\lambda_{smcl}$ and $\lambda_{tri}$ are weight parameters that avoid bias from the different loss magnitudes during backpropagation. In this experiment, $\lambda_{smcl}$ is 0.1 [24] and $\lambda_{tri}$ is 2 [12]. The pseudocode of our method is shown in Algorithm 1.
Algorithm 1 Supervised Multi-view Contrastive Learning and Multi-view Feature Fusion for Face Anti-Spoofing
Input: Training data $\{x_i, y_i\}_{i=1}^{N}$, ground truth depth maps $\{x_{Depth}\}_{i=1}^{N}$
Output: Prediction value $\hat{y}$
1: for K iterations do
2:   Sample and augment a mini-batch from $\{x_i, y_i\}_{i=1}^{N}$.
3:   Decompose the images of the mini-batch into multi-scale frequency images.
4:   Generate depth maps through the GAN by comparing with $\{x_{Depth}\}_{i=1}^{N}$.
5:   Obtain multi-scale frequency and depth map features via multiple encoders, using the multi-scale frequency images and generated depth maps as inputs.
6:   Fully connect and L2-normalize the multi-view features for SMCL.
7:   Fuse the features through TFF as described in Equation (5).
8:   Predict labels $\hat{y}$ with the classifier.
9:   Compute the depth map generation loss as described in Equation (6).
10:  Calculate the overall loss $\mathcal{L}$ as described in Equation (7).
11:  Backward: update the model's parameters.
12: end for
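A single training iteration combining the losses of Equation (7) might look like the sketch below, where `model`, `smcl_loss` (see Section 3.3), and `asymmetric_triplet_loss` are placeholders for the components described above; the triplet loss of [12] is not reproduced here, so this is an outline rather than the released training loop.

```python
import torch.nn.functional as F

def training_step(model, batch, optimizer, lambda_smcl=0.1, lambda_tri=2.0):
    images, labels, gt_depth = batch
    # Forward pass: per-view projections for SMCL, TFF-fused features, and classifier logits.
    view_feats, fused_feats, logits = model(images, gt_depth)
    loss_smcl = smcl_loss(view_feats, labels)                # Equation (3)
    loss_tri = asymmetric_triplet_loss(fused_feats, labels)  # Asymmetric Triplet Loss [12] (placeholder)
    loss_cls = F.binary_cross_entropy_with_logits(logits.squeeze(1), labels.float())
    loss = lambda_smcl * loss_smcl + lambda_tri * loss_tri + loss_cls  # Equation (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```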
4. Experiments
Multiple datasets involve diverse populations, variable lighting conditions, and un-
seen attack types. Experiments on multiple datasets can well validate the accuracy and
generalization ability of the model. To be consistent with existing works, we evaluated the
proposed network on four publicly available benchmark datasets: the CASIA face anti-spoofing dataset (CASIA-FASD) [29], the Idiap Replay–Attack dataset [30], the MSU-MFSD dataset [31], and the OULU-NPU dataset [32].
• MSU-MFSD includes 280 videos from 35 subjects, recorded using a laptop camera and a smartphone camera with resolutions of 640 × 480 and 720 × 480, respectively. It comprises two main types of spoofing attacks: printed photo attacks and video replay attacks.
• CASIA-FASD contains 50 subjects, three scenes, 150 live videos, and 450 attack videos. All faces are frontal with no additional lighting. The dataset is divided into training and test sets with a 40/60 split (20 subjects for training, 30 for testing). There is no validation set.
• Replay–Attack consists of 1200 videos from 50 subjects. The dataset has two different illumination environments (controlled and adverse) and two different camera conditions (handheld and fixed). It is divided into training (360 videos), validation (360 videos), and testing (480 videos) sets.
• OULU-NPU consists of 4950 real access and attack videos, recorded using the front cameras of six different mobile devices in three sessions with varying illumination conditions and background scenes. The presentation attack types considered in the database are print and video replay attacks.
4.1. Implementation Details
In this paper, the code is implemented using PyTorch 1.10.0, Python 3.8, and CUDA 12.1. Face detection is performed with the MTCNN model [33], and the 3DDFA-V2 model generates the ground truth depth maps. Both the facial images and the depth maps are resized to 256 × 256 pixels. Consistent with most FAS works, common data augmentation techniques such as color jittering, random rotation, and normalization are employed. Additionally, this paper introduces data augmentation methods that have demonstrated good performance in contrastive learning, such as random crop and CutOut, to the RGB image augmentation. The network was trained with a batch size of 25, a learning rate of 0.0003, and 60 epochs. A dynamic learning rate adjustment was implemented, reducing the rate by a factor of 0.1 if the validation loss did not change for 20 epochs. The Adam optimizer was used. Experiments were run on an NVIDIA GeForce RTX 3090 (NVIDIA, Santa Clara, CA, USA).
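Under the description above, the augmentation pipeline and optimization setup could be configured roughly as follows; the exact augmentation magnitudes are illustrative, `RandomErasing` stands in for CutOut, and `model`, `train_loader`, and `train_one_epoch` are placeholders.

```python
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # color jittering
    transforms.RandomRotation(10),                                         # random rotation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5),                                       # CutOut-style occlusion
])

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)                  # learning rate 0.0003
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=20)

for epoch in range(60):                                                    # 60 epochs, batch size 25
    val_loss = train_one_epoch(model, train_loader, optimizer)             # placeholder routine
    scheduler.step(val_loss)                                               # reduce lr on plateau
```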
4.2. Ablation Study on Different View Combinations
To assess individual contributions, an ablation study was conducted. The ablation protocol was training on the CASIA-FASD dataset and testing on the Replay–Attack dataset [34], denoted as C→I. The impact of using individual features, such as RGB images, low-frequency images, and high-frequency images, was assessed, as shown in Table 1. The results indicate that using individual inputs (RGB, low-frequency, or high-frequency images) alone yields poor performance, with HTER values of 25.73%, 33.97%, and 25.72%, respectively. Fusing the high- and low-frequency features with the first TFF stage improved the HTER to 21.5%. Additionally incorporating RGB images into the frequency fusion yielded only a small further improvement, to 21.2%. Second, the effectiveness of adding depth map features was examined, achieving an HTER of 16.69%, which outperforms using high- and low-frequency features alone. Combining depth map features with RGB image features, low-frequency features, and high-frequency features was explored separately. These combinations yield HTERs of 16.56%, 15.98%, and 14.25%, surpassing the performance achieved using depth map features alone. Fusing all features using TFF reduced the HTER to 11.22%, and the best result of 9.97% was achieved by additionally applying SMCL. These findings show that frequency features and depth map features indeed improve multi-view learning.
Table 1. Comparison of the HTER among different view combinations under Protocol C→I.
Various View Combinations HTER (C→I, %)
only RGB 25.73
only LF 33.97
only HF 25.72
LF + HF + TFF Stage1 21.5
RGB + LF + HF + TFF Stage1 21.2
only Depth 16.69
RGB + Depth + TFF Stage2 16.56
LF + Depth + TFF Stage2 15.98
HF + Depth + TFF Stage2 14.25
Fre + Depth + TFF 11.22
Our Model with Multi-view + SMCL + TFF 9.97
4.3. Comparison Among Different Feature Fusion Methods
To improve performance, several feature fusion methods were utilized. The training
set for the comparative experiments is CASIA-FASD, and the test set is Replay–Attack,
denoted as C→I.
Method 1 incorporates the Attention Perceptive Module (APM) [35], using channel and
], using channel and
spatial attention for feature fusion. Method 2 is similar to Method 1, but frequency features
are summed with independent learnable weights. There are two separate weights for the
independent weighted summation of frequency features with depth map features. Method
3 uses a multi-branch network with channel attention via the Squeeze-and-Excitation (SE)
module, followed by concatenation of outputs. Method 4 directly sums image features.
Method 5 directly sums frequency features, but they are multiplied by a factor of 0.5 and
added to depth map features due to their auxiliary role in classification. Method 6 uses
the TFF method, with fused features from bilinear pooling as the final fused multi-view
features. Method 7 also uses the TFF method, but the fused multi-view features from
bilinear pooling are treated as residuals. These residuals are summed with the originally
fused frequency features and depth map features using independent weights. Comparative
experiments showed the best results with the final TFF method, as shown in Table 2.
Table 2. Comparison of the HTER among different feature fusion methods under Protocol C→I.
Various Feature Fusion Methods HTER(C→I, %)
APM [35] 34.92
APM [35] and Independent Weights 32
Concatenation [25] 42
Summation 16.86
Summation with Fixed Weight 13.25
TFF without skip connection 12.21
Our TFF 9.97
4.4. Intra-Set Evaluation
To validate the model, intra-set experiments are conducted using the OULU-NPU
dataset on four protocols. Protocol I assesses the model’s performance across diverse
scenarios. Protocol II evaluates its robustness against various attack methods. Protocol III
examines the model’s accuracy when confronted with different imaging devices, specifically
different cell phones. Protocol IV encompasses a combination of the aforementioned
categories. It comprehensively validates the model’s performance across diverse scenarios,
attack methods, and imaging devices.
Table 3 presents the results obtained on the four protocols. The efficacy of fusing frequency images and depth maps for accurate classification is affirmed in Protocols I, III, and IV. Protocol IV encompasses all the diverse factors in OULU-NPU, and the strong performance on it validates the model's generalizability and accuracy across environments, devices, and scenarios.
Table 3. Comparison with FAS methods on OULU-NPU.
Protocol
Method APCER (%) BPCER (%) ACER (%)
I
STASN 1.2 2.5 1.9
De-spoofing [4] 1.2 1.7 1.5
STDN [36] 0.8 1.3 1.1
CDCN [37] 0.4 1.7 1
Auxiliary [7] 1.6 1.6 1.6
DFA [8] 0.8 1.1 1
Our Multi-view + SMCL + TFF 0.5 0.2 0.4
II
STASN 4.2 0.3 2.2
De-spoofing [4] 4.2 4.4 4.3
STDN [36] 2.3 1.6 1.9
CDCN [37] 0.4 1.7 1.5
Auxiliary [7] 2.7 2.7 2.7
DFA [8] 3.8 2.1 2.9
Our Multi-view + SMCL + TFF 2.9 1.5 2.2
III
STASN 4.7 ± 3.9 0.9 ± 1.2 2.8 ± 1.6
De-spoofing [4] 4.0 ± 1.8 3.8 ± 1.2 3.6 ± 1.6
STDN [36] 1.6 ± 1.6 4.0 ± 5.4 2.8 ± 3.3
CDCN [37] 2.4 ± 1.3 3.1 ± 1.7 2.9 ± 1.5
Auxiliary [7] 2.7 ± 1.3 3.1 ± 1.7 2.9 ± 1.5
DFA [8] 1.9 ± 1.6 3.8 ± 6.4 2.8 ± 2.7
Our Multi-view + SMCL + TFF 1.7 ± 1.9 3.3 ± 3.5 2.5 ± 2.8
IV
STASN 6.7 ± 10.6 8.3 ± 8.4 7.5 ± 4.7
De-spoofing [4] 5.1 ± 6.3 6.1 ± 5.1 5.6 ± 5.7
STDN [36] 2.3 ± 3.6 4.2 ± 5.4 3.6 ± 4.2
CDCN [37] 4.6 ± 4.6 9.2 ± 8.0 6.9 ± 2.9
Auxiliary [7] 9.3 ± 5.6 10.4 ± 6.0 9.5 ± 6.0
DFA [8] 6.7 ± 7.5 3.3 ± 4.1 5.0 ± 2.2
Our Multi-view + SMCL + TFF 4.8 ± 4.0 1.9 ± 1.4 3.4 ± 2.7
4.5. Inter-Set Evaluation
Objective conditions, such as attack types, contexts, and imaging devices, differ across datasets. To evaluate the generalizability of the model, inter-set experiments are necessary. This study adopts two well-known inter-set experiment protocols [19,38] for face anti-spoofing. In Protocol A, the CASIA-FASD, Replay–Attack, MSU-MFSD, and OULU-NPU datasets are chosen. The four datasets interact for training and testing in all possible combinations, totaling 12 experiments. The datasets are abbreviated as C, I, M, and O, respectively. The specific experiments are C→I, C→O, C→M, I→C, I→O, I→M, O→C, O→I, O→M, M→C, M→O, and M→I. The Half Total Error Rate (HTER) is the chosen evaluation metric for these protocols.
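HTER is conventionally computed as the mean of the false acceptance rate (FAR) and the false rejection rate (FRR) at a chosen decision threshold. The snippet below is a minimal sketch of this standard definition and is not tied to the authors' evaluation code.

```python
import numpy as np

def hter(scores, labels, threshold):
    """Half Total Error Rate: mean of FAR and FRR at a threshold (1 = genuine, 0 = spoof)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    accept = scores >= threshold
    far = np.mean(accept[labels == 0])   # spoof samples accepted as genuine
    frr = np.mean(~accept[labels == 1])  # genuine samples rejected
    return 0.5 * (far + frr)

print(hter([0.9, 0.8, 0.3, 0.6], [1, 1, 0, 0], threshold=0.5))  # 0.25
```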
Table 4 presents the results of Protocol A. The proposed model outperformed the compared methods in eight of the twelve experiments and achieved the best average result. In particular, the average HTER of our model is 4 percentage points lower than that of the Adversarial Domain Adaptation (ADA) method. Since the ADA approach does not use multi-scale frequency map features and depth map features as inputs, this demonstrates the usefulness of combining multiple views. However, the relatively less favorable outcomes in the M→C and M→I experiments also shed light on the limitations of our model on datasets with smaller sample sizes. As a result, Protocol B experiments were conducted to address these issues. In Protocol B [19], three of the four datasets were selected for training, and the remaining dataset was used for testing. These four experiments are referred to as CIM→O, COM→I, IOM→C, and IOC→M. We utilize HTER and the Area Under the Curve (AUC) as evaluation metrics for these experiments.
Table 4. Inter-set evaluation results on Protocol A (HTER IN %).
Method C→I C→M C→O I→C I→M I→O M→C M→I M→O O→C O→I O→M Avg.
H&L Frequency [3] 24.3 34
Depth as Aux [7] 27.6 28.4
Generated Depth [8] 16.6 22.9
DR-UDA [39] 15.6 9 28.7 34.2 29 38.5 16.8 3.2 30.2 19.5 25.4 27.4 23.1
MDDR [40] 26.1 20.2 24.7 39.2 23.2 33.6 34.3 8.7 31.7 21.8 27.6 22 26.1
ADA [41] 17.5 9.3 29.1 41.5 30.5 39.6 17.7 5.1 31.2 19.8 26.8 31.5 25
Ours 9.9 13.37 24.08 20.4 17.5 32.4 30.1 10 21.95 28 25 18.9 21.0
The results in Table 5 reveal the following findings. Our proposed model consistently
outperformed other methods in all four experiments. The outcomes of Protocol B validate
the superior generalizability of our model when trained on larger datasets. Moreover, these
results provide compelling evidence for the feasibility and effectiveness of our method,
involving using TFF and SMCL to learn multi-view features.
Table 5. Inter-set evaluation experiment on Protocol B.
Method
OCI→M OMI→C OCM→I ICM→O
HTER(%) AUC(%) HTER(%) AUC(%) HTER(%) AUC(%) HTER(%) AUC(%)
Binary CNN [42] 29.25 82.87 34.88 71.94 34.47 65.88 29.61 77.54
LBP TOP [43] 36.9 70.8 42.6 61.05 49.45 49.54 53.15 44.09
Depth + rppg [7] 22.72 85.88 33.52 73.15 29.14 71.69 30.17 77.61
MADDG [44] 17.69 88.06 24.5 84.51 22.19 84.99 27.98 80.02
DR-UDA [39] 16.1 22.2 22.7 24.7
DFA [8] 19.4 86.87 22.03 87.71 21.43 88.81 18.26 89.4
Ours 10.21 96.18 18.45 88.33 15.92 79.14 15.22 92.02
4.6. Visualization and Analysis
Two-dimensional t-SNE [45] is used to visualize the features across the four experiments, providing a more comprehensive understanding of the experimental results. The results, depicted in Figure 6, showcase distinct feature distributions under each protocol. Figure 7 presents the model inference stage under the IOC→M experimental protocol, where the ground truth depth maps and the model-generated depth maps of random genuine and spoofing samples are compared. The green frames represent genuine samples, and the red frames represent spoofing samples.
Figure 6. Two-dimensional t-SNE visualization of the features.
Genuine Samples Spoofing Samples
Figure 7. The ground truth depth maps and the generated depth maps for the random samples.
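The t-SNE visualization in Figure 6 can be reproduced with a few lines of scikit-learn and matplotlib. The helper below is a hypothetical sketch: `features` is assumed to be an (N, D) array of fused test-set features, `labels` the genuine/spoof labels, and the perplexity and colors are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project (N, D) features to 2-D with t-SNE and color genuine vs. spoofing samples."""
    emb = TSNE(n_components=2, perplexity=30, init='pca', random_state=0).fit_transform(features)
    labels = np.asarray(labels)
    for cls, name, color in [(1, 'genuine', 'tab:green'), (0, 'spoofing', 'tab:red')]:
        idx = labels == cls
        plt.scatter(emb[idx, 0], emb[idx, 1], s=5, color=color, label=name)
    plt.title(title)
    plt.legend()
    plt.show()
```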
The feature distributions in Figure 6 and the results in Table 5 are directly correlated. In Figure 6, the feature boundaries between positive and negative samples are well separated. However, in the case of IOM→C, where the CASIA dataset was not included during training, the distribution boundaries appear ambiguous. This suggests that the proposed model somewhat relies on dataset image quality. Although the weight updates in TFF improve the model's reliance on depth map generalizability when facing unknown attacks, they do not fully address dataset quality in FAS. Therefore, addressing dataset quality remains a significant focus for future research.
5. Conclusions
In this paper, we present a multi-branch network designed to effectively fuse frequency features and depth map features, enhancing the accuracy and generalizability of FAS. To enhance feature compactness, Supervised Multi-view Contrastive Learning is proposed for multi-view feature learning. To address incompatible features from different inputs, a two-stage feature fusion method is proposed. The models are assessed through intra- and inter-set experiments on four public datasets. Comparative experiments show that the proposed method achieves improved results, and the experimental results validate the model's accuracy and generalizability across different scenarios. However, our model has some shortcomings. The CASIA-FASD dataset contains high-quality face images; therefore, in Table 5, the results of the experiments whose training sets contain the CASIA-FASD dataset are better than those of the experiments without it. For example, the HTER of OCI→M is almost 8 percentage points lower than that of OMI→C. This suggests that image quality impacts our classification results. To address this, in the future, we will combine domain generalization with our model, aiming to learn more generalizable features.
Author Contributions: Conceptualization, J.L.; methodology, J.L.; software, J.L. and W.S.; validation, J.L.; formal analysis, J.L. and W.S.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, J.L. and W.S.; visualization, J.L.; supervision, W.S. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The code that supports the findings of this study is available at https://github.com/Gingerate/Multi-branch-Network-based-SMCL-and-TMFF-for-FAS, accessed on 14 February 2024. The datasets used in this study were obtained from the following publicly accessible websites. CASIA-FASD: https://pypi.org/project/bob.db.CASIA-FASD/, accessed on 14 May 2023. MSU-MFSD: https://github.com/sunny3/MSU-MFSD, accessed on 15 May 2023. OULU-NPU: https://sites.google.com/site/oulunpudatabase/ (https://doi.org/10.1109/FG.2017.77), accessed on 15 May 2023. Replay-Attack: https://zenodo.org/records/4593128, accessed on 17 May 2023.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1.
Guo, J.; Zhu, X.; Zhao, C.; Cao, D.; Lei, Z.; Li, S.Z. Learning meta face recognition in unseen domains. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19 June 2020; pp. 6163–6172.
2.
Yu, Z.; Qin, Y.; Li, X.; Zhao, C.; Lei, Z.; Zhao, G. Deep learning for face anti-spoofing: A survey. IEEE Trans. Pattern Anal. Mach.
Intell. 2022,45, 5609–5631. [CrossRef] [PubMed]
3.
Chen, B.; Yang, W.; Wang, S. Face anti-spoofing by fusing high and low frequency features for advanced generalization capability.
In Proceedings of the 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Shenzhen, China,
8 August 2020; pp. 199–204.
4.
Jourabloo, A.; Liu, Y.; Liu, X. Face de-spoofing: Anti-spoofing via noise modeling. In Proceedings of the European Conference on
Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 290–306.
5.
Song, X.; Zhao, X.; Fang, L.; Lin, T. Discriminative representation combinations for accurate face spoofing detection. Pattern
Recognit. 2019,85, 220–231. [CrossRef]
6.
Asim, M.; Ming, Z.; Javed, M.Y. CNN based spatio-temporal feature extraction for face anti-spoofing. In Proceedings of the IEEE
2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017; pp. 234–238.
7.
Liu, Y.; Jourabloo, A.; Liu, X. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 389–398.
8.
Wang, Y.; Song, X.; Xu, T.; Feng, Z.; Wu, X.J. From RGB to depth: Domain transfer network for face anti-spoofing. IEEE Trans. Inf.
Forensics Secur. 2021,16, 4280–4290. [CrossRef]
9.
Liu, Y.; Liu, X. Spoof trace disentanglement for generic face anti-spoofing. IEEE Trans. Pattern Anal. Mach. Intell. 2022,
45, 3813–3830. [CrossRef] [PubMed]
10.
Tian, Y.; Krishnan, D.; Isola, P. Contrastive multiview coding. In Proceedings of the Computer Vision–ECCV 2020: 16th European
Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XI 16; Springer: Cham, Switzerland, 2020; pp. 776–794.
11.
Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning.
Adv. Neural Inf. Process. Syst. 2020,33, 18661–18673.
12.
Jia, Y.; Zhang, J.; Shan, S.; Chen, X. Single-side domain generalization for face anti-spoofing. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8484–8493.
13.
Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569.
14.
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
15.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
16.
Kim, J.H.; On, K.W.; Lim, W.; Kim, J.; Ha, J.W.; Zhang, B.T. Hadamard product for low-rank bilinear pooling. arXiv 2016,
arXiv:1610.04325.
17.
Hu, C.; Feng, Z.; Wu, X.; Kittler, J. Dual encoder-decoder based generative adversarial networks for disentangled facial
representation learning. IEEE Access 2020,8, 130159–130171. [CrossRef]
18.
Jun, F.; Zhiyi, D.; Yichen, S.; Jingjing, H. Domain Adaptation Based on ResADDA Model for Face Anti-Spoofing Detection. In
Proceedings of the IEEE 2021 International Conference on Computer Engineering and Artificial Intelligence (ICCEAI), Shanghai,
China, 27–29 August 2021; pp. 295–299.
19.
Wang, Z.; Wang, Z.; Yu, Z.; Deng, W.; Li, J.; Gao, T.; Wang, Z. Domain generalization via shuffled style assembly for face
anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA,
USA, 18–24 June 2022; pp. 4123–4133.
20.
Wang, Z.; Yu, Z.; Zhao, C.; Zhu, X.; Qin, Y.; Zhou, Q.; Zhou, F.; Lei, Z. Deep spatial gradient and temporal depth learning for face
anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
13–19 June 2020; pp. 5042–5051.
21.
Ganin, Y.; Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. In Proceedings of the International Conference
on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 1180–1189.
22.
Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies
2020,9, 2. [CrossRef]
23.
Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations.
In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607.
24.
Sun, Y.; Liu, Y.; Liu, X.; Li, Y.; Chu, W.S. Rethinking Domain Generalization for Face Anti-spoofing: Separability and Alignment.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June
2023; pp. 24563–24574.
25.
He, D.; He, X.; Yuan, R.; Li, Y.; Shen, C. Lightweight network-based multi-modal feature fusion for face anti-spoofing. Vis. Comput.
2023,39, 1423–1435. [CrossRef]
26.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
27.
Guo, J.; Zhu, X.; Yang, Y.; Yang, F.; Lei, Z.; Li, S.Z. Towards fast, accurate and stable 3d dense face alignment. In Proceedings of
the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020;
pp. 152–168.
28. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial
networks. Commun. ACM 2020,63, 139–144. [CrossRef]
29.
Zhang, Z.; Yan, J.; Liu, S.; Lei, Z.; Yi, D.; Li, S.Z. A face antispoofing database with diverse attacks. In Proceedings of the IEEE
2012 5th IAPR international conference on Biometrics (ICB), New Delhi, India, 29 March–1 April 2012; pp. 26–31.
30.
Chingovska, I.; Anjos, A.; Marcel, S. On the effectiveness of local binary patterns in face anti-spoofing. In Proceedings of the IEEE
2012 BIOSIG-Proceedings of the International Conference of Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany,
6–7 September 2012; pp. 1–7.
31.
Wen, D.; Han, H.; Jain, A.K. Face spoof detection with image distortion analysis. IEEE Trans. Inf. Forensics Secur. 2015,10, 746–761.
[CrossRef]
32.
Boulkenafet, Z.; Komulainen, J.; Li, L.; Feng, X.; Hadid, A. OULU-NPU: A mobile face presentation attack database with
real-world variations. In Proceedings of the IEEE 2017 12th IEEE International Conference on Automatic Face & Gesture
Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 612–618.
33.
Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks.
IEEE Signal Process. Lett. 2016,23, 1499–1503. [CrossRef]
34.
Boulkenafet, Z.; Komulainen, J.; Akhtar, Z.; Benlamoudi, A.; Samai, D.; Bekhouche, S.E.; Ouafi, A.; Dornaika, F.; Taleb-Ahmed, A.;
Qin, L.; et al. A competition on generalized software-based face presentation attack detection in mobile scenarios. In Proceedings
of the 2017 IEEE International Joint Conference on Biometrics (IJCB), Denver, CO, USA, 1–4 October 2017; pp. 688–696.
35.
Liu, J.; Zhang, F.; Zhou, Z.; Wang, J. Bfmnet: Bilateral feature fusion network with multi-scale context aggregation for real-time
semantic segmentation. Neurocomputing 2023,521, 27–40. [CrossRef]
36.
Liu, Y.; Stehouwer, J.; Liu, X. On disentangling spoof trace for generic face anti-spoofing. In Proceedings of the Computer Vision–
ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16; Springer: Berlin/Heidelberg,
Germany, 2020; pp. 406–422.
37.
Yu, Z.; Zhao, C.; Wang, Z.; Qin, Y.; Su, Z.; Li, X.; Zhou, F.; Zhao, G. Searching central difference convolutional networks for face
anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
13–19 June 2020; pp. 5295–5305.
38.
Li, H.; Li, W.; Cao, H.; Wang, S.; Huang, F.; Kot, A.C. Unsupervised domain adaptation for face anti-spoofing. IEEE Trans. Inf.
Forensics Secur. 2018,13, 1794–1809. [CrossRef]
39.
Wang, G.; Han, H.; Shan, S.; Chen, X. Unsupervised adversarial domain adaptation for cross-domain face presentation attack
detection. IEEE Trans. Inf. Forensics Secur. 2020,16, 56–69. [CrossRef]
40.
Wang, G.; Han, H.; Shan, S.; Chen, X. Cross-domain face presentation attack detection via multi-domain disentangled represen-
tation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
13–19 June 2020; pp. 6678–6687.
41.
Wang, G.; Han, H.; Shan, S.; Chen, X. Improving cross-database face presentation attack detection via adversarial domain
adaptation. In Proceedings of the IEEE 2019 International Conference on Biometrics (ICB), Crete, Greece, 4–7 June 2019; pp. 1–8.
42. Yang, J.; Lei, Z.; Li, S.Z. Learn convolutional neural network for face anti-spoofing. arXiv 2014, arXiv:1408.5601.
43.
de Freitas Pereira, T.; Komulainen, J.; Anjos, A.; De Martino, J.M.; Hadid, A.; Pietikäinen, M.; Marcel, S. Face liveness detection
using dynamic texture. EURASIP J. Image Video Process. 2014,2014, 2. [CrossRef]
44.
Shao, R.; Lan, X.; Li, J.; Yuen, P.C. Multi-adversarial discriminative deep domain generalization for face presentation attack
detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
15–20 June 2019; pp. 10023–10031.
45.
Cieslak, M.C.; Castelfranco, A.M.; Roncalli, V.; Lenz, P.H.; Hartline, D.K. t-Distributed Stochastic Neighbor Embedding (t-SNE):
A tool for eco-physiological transcriptomic analysis. Mar. Genom. 2020,51, 100723. [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.