Cross-domain Face Presentation Attack Detection via Multi-domain
Disentangled Representation Learning
Guoqing Wang1,2, Hu Han1,3,*, Shiguang Shan1,2,3, Xilin Chen1,2
1Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS),
Institute of Computing Technology, CAS, Beijing 100190, China,
2University of Chinese Academy of Sciences, Beijing 100049, China
3Peng Cheng Laboratory, Shenzhen, China
guoqing.wang@vipl.ict.ac.cn, {hanhu, sgshan, xlchen}@ict.ac.cn
Abstract
Face presentation attack detection (PAD) has become an urgent problem for face recognition systems. Conventional approaches usually assume that testing and training data come from the same domain; as a result, they may not generalize well to unseen scenarios because the representations learned for PAD may overfit to the subjects in the training set. In light of this, we propose an efficient disentangled representation learning approach for cross-domain face PAD. Our approach consists of disentangled representation
learning (DR-Net) and multi-domain learning (MD-Net).
DR-Net learns a pair of encoders via generative models
that can disentangle PAD informative features from sub-
ject discriminative features. The disentangled features from
different domains are fed to MD-Net which learns domain-
independent features for the final cross-domain face PAD
task. Extensive experiments on several public datasets val-
idate the effectiveness of the proposed approach for cross-
domain PAD.
1. Introduction
Face recognition (FR) is being widely used in a variety of
applications such as smartphone unlock, access control and
pay-with-face. The ease of obtaining a genuine user's face image poses a threat to these FR systems, since such images can be used for presentation attacks (PAs), e.g., printed photo, video replay, or 3D facial mask. In addition, advances in deep learning have significantly promoted the development of face synthesis technologies such as deep facial manipulation attacks [42], which are feasible using Variational AutoEncoders (VAEs) [22] or Generative Adversarial Networks (GANs) [14].
Many approaches have been proposed to tackle various
∗Corresponding author.
Figure 1. 2D visualization of the features learned by ResNet-18
[18] for live vs. spoof classification on CASIA [52]. We observe that the testing face features do not cluster by live vs. spoof as expected; instead, they tend to cluster by individual subject. This suggests that the features learned for PAD are not well disentangled from the features for subject classification. We believe this is one of the reasons for the poor generalization ability of PAD models in new application scenarios.
face PAs; these approaches assume that there are inherent disparities between live and spoof faces, such as texture in color space [4, 6], image distortion [48], temporal variation [39] or deep semantic features [50, 36]. Although these methods show promising performance under intra-database testing scenarios, their performance degrades dramatically when they are used in new application scenarios (cross-domain PAD). To im-
prove the robustness of PAD under unseen scenarios, some
scenario-invariant auxiliary information (such as face depth
and heart rhythm) has been exploited for distinguishing be-
tween live and spoof faces [2, 30, 29]. The robustness of
these methods relies on the accuracy of the estimated auxil-
iary information to some extent. With the success of trans-
fer learning in many object recognition tasks [10], recent
studies for PAD seek to utilize transfer learning to learn end-to-end cross-domain face PAD models or robust features followed by binary classification models [27, 46, 38]. How-
ever, the generalization ability of these methods is still lim-
ited when the data distributions of the training and testing domains differ significantly, particularly when the spoof at-
tack types in the testing domain do not appear in the training
domain. One important reason is that these methods may
not disentangle the domain-independent PAD cues from the
domain-dependent cues very well. As shown in Fig. 1, the
learned features for face PAD may overfit to subject classi-
fication, and thus the live face images (or the spoof face im-
ages) of different subjects may not be close to each other in
the feature space. In this case, we cannot expect the model
learned from a source domain can generalize well to unseen
domains. While it could be helpful to improve the cross-
domain robustness by collecting a huge face PAD dataset
containing various subjects and spoof attack types like the
MS-Celeb-1M dataset [16] for FR, it can be very expensive
and time-consuming in practice.
In this paper, we aim to improve PAD generalization ability without collecting a huge face PAD dataset, and
propose a disentangled representation learning approach for
cross-domain PAD. In particular, the proposed approach
consists of a disentangled representation learning module
(DR-Net) and a multi-domain learning module (MD-Net).
DR-Net leverages generative models to learn a pair of disentangled feature encoders, one for subject classification (denoted as ID-GAN) and one for PAD (denoted as PAD-GAN), in each source domain. The discriminator of PAD-GAN not only distinguishes between generated and real face images, but also between live and spoof face images. Similarly, the discriminator of ID-GAN distinguishes between individual subjects. Thus, with
DR-Net, we can expect to obtain features that are either re-
lated to PAD or subject classification. On the other hand, we
also want to utilize the knowledge about different PA types
to enhance the generalization capability of the model. To
achieve this goal, we propose a multi-domain feature learn-
ing module (MD-Net) to learn domain-independent features
from the disentangled features by DR-Net.
The main contributions of this work are as follows: (i)
we propose a novel disentangled representation learning for
cross-domain PAD, which is able to disentangle features in-
formative for PAD from the features informative for subject
classification, allowing more subject-independent PAD fea-
ture learning; (ii) we propose an effective multi-domain feature learning module to obtain domain-independent PAD features, which enhances the robustness of the PAD model; and (iii) our
approach achieves better performance than the state-of-the-
art face PAD methods in cross-domain PAD.
2. Related Work
Face presentation attack detection. Recent face PAD
approaches can be generally grouped into conventional ap-
proaches, deep learning approaches, and domain generalization approaches. Conventional approaches aim to detect attacks
based on texture or temporal cues. An early work on face
PAD by Li et al. [28] used Fourier spectra analysis to cap-
ture the difference between the live and spoof face images.
After that, various hand-crafted features such as LBP [32],
LPQ [6], SURF [6], HoG [24], SIFT [37] and IDA [48],
have been utilized with traditional classifiers, e.g., SVM
[43] and LDA [13], for live vs. spoof face classification.
To reduce the influence by illumination variations, some
approaches convert face images from RGB color space to
other space such as HSV and YCbCr [4, 6], and then extract
the above features for PAD. Besides, a number of methods
also explored temporal cues of the whole face or individual
face components for PAD, such as eye blink [41], mouth
movement [23] and head rotation [3]. The conventional ap-
proaches are usually computationally efficient and explain-
able, and work well under intra-database testing scenarios.
However, their generalization ability to new application scenarios is still unsatisfactory [36, 37].
To overcome the over-fitting issue, researchers are at-
tempting to utilize deep learning models for face PAD in a
general-to-specific model transfer scheme [50, 36] because
of the great success of deep learning in many other com-
puter vision tasks. In [50], CNNs were fine-tuned with
live and spoof face images for face PAD. Liu et al. [30]
introduced depth map and rPPG signal estimation from
RGB face videos as auxiliary supervision to assist in PAD.
Jourabloo et al. [21] inversely decomposed a spoof face
into a live face and a noise of spoof, and then utilized
the spoof noise for PAD. Zhang et al. [51] introduced a
large-scale multi-modal face anti-spoofing dataset named CASIA-SURF. Based on CASIA-SURF, several end-to-end
approaches [35, 40, 47] were proposed to exploit the com-
plementary information from RGB, depth and IR, and all reported promising results. While deep learning approaches show strong feature learning ability, their generalization ability to new scenarios is still unsatisfactory.
Most recently, deep learning based domain generaliza-
tion approaches for cross-domain face PAD have been pro-
posed. Li et al. [27] proposed an unsupervised domain
adaptation PAD framework to transform the source domain
feature space to the unlabeled target domain feature space
by minimizing MMD [15]. Wang et al. [46] proposed
an end-to-end learning approach to improve cross-domain
PAD generalization capability by utilizing prior knowl-
Figure 2. The overview of our approach for cross-domain PAD. Our approach consists of a disentangled representation learning module
(DR-Net) and a multi-domain feature learning module (MD-Net). With the face images from different domains as inputs, DR-Net can learn
a pair of encoders for disentangled features for PAD and subject classification respectively. The disentangled features are fed to MD-Net
to learn domain-independent representations for robust cross-domain PAD.
edge from source domain via adversarial domain adapta-
tion. Shao et al. [38] proposed to learn a generalized feature space via a novel multi-adversarial discriminative deep domain generalization framework under a dual-force triplet-mining constraint. Liu et al. [31] defined the detection of unknown spoof attacks as zero-shot face anti-spoofing (ZSFA) and proposed a novel deep tree network to partition the spoof samples into semantic sub-groups in an unsupervised fashion. While the prior domain generalization approaches can improve the model's generalization capability, the representations learned for PAD may still overfit to specific subjects, shooting environments, camera sensors, etc. In light
of this, we propose a disentangled representation learning
module and multi-domain feature learning module to obtain
more discriminative PAD features.
Disentangled representation learning. Disentangled
representation learning aims to design the appropriate ob-
jectives to learn disentangled representations. Recently,
more and more methods utilize GANs and VAEs to learn disentangled representations because of their success in generative tasks. Yan et al. [49] studied the novel problem of attribute-conditioned image generation and proposed a solution with CVAEs. Chen et al. [8] proposed InfoGAN, an information-theoretic extension of GAN that is able to learn disentangled representations in an unsupervised manner. Tran et al. [44] proposed DR-GAN for pose-invariant face recognition and face synthesis. Hu et al. [20] proposed an approach to learn image representations that consist of disentangled factors of variation without exploiting any manual labeling or data domain knowledge. The most related work to the problem we study in this paper is InfoGAN. However, unlike InfoGAN, which maximizes mutual information between the generated images and the input codes, we provide explicit disentanglement and control under the PAD and ID constraints.
3. Proposed Method
We aim to build a robust cross-domain face PAD model
that can mitigate the impact of the subjects, shooting envi-
ronment and camera settings from the source domain face
images. To achieve this goal, we propose an efficient disentangled representation learning approach for cross-domain face PAD.
As shown in Fig. 2, our approach consists of a disentan-
gled representation learning module (DR-Net) and a multi-
domain feature learning module (MD-Net). DR-Net lever-
ages generative models (i.e., PAD-GAN and ID-GAN in
Fig. 2) to learn a pair of encoders for learning disentan-
gled features for subject classification and PAD respectively
in each source domain. Then, the disentangled features
from multiple domains are fed to MD-Net to learn domain-
independent features for the final cross-domain face PAD.
3.1. Disentangled Representation Learning via
DR-Net
As shown in Fig. 1, the features learned by CNNs for simple live vs. spoof classification are likely to include the subject's identity information. The main reason is that features related to different tasks can easily become coupled with each other when there are no auxiliary labels for supervising the network learning. Recently, generative models have drawn increasing attention for disentangled representation learning when there are no auxiliary labels to use [8, 12]. The idea is to approximate the real distributions of individual classes by learning a generative model based on the limited data per class. This way, the per-class distributions can be modeled in a continuous manner instead of being defined by the few data points available in the training set. In
light of this, we propose a generative DR-Net consisting of
PAD-GAN and ID-GAN to learn a pair of disentangled rep-
resentation encoders for learning subject classification and
Figure 3. The diagrams of (a) PAD-GAN and (b) ID-GAN in DR-
Net, which can learn disentangled features for PAD and subject
classification, respectively.
PAD features respectively in each source domain. As shown
in Fig. 3, PAD-GAN and ID-GAN have the same structure,
which is a variant of the conventional GAN architecture.
So, we only use PAD-GAN as an example to provide the
details. ID-GAN is the same as PAD-GAN except for the
number of categories in the classification layer. The classi-
fication in PAD-GAN handles two classes (live and spoof),
while the classification in ID-GAN handles as many classes
as the number of subjects in the dataset. Let (X, Y) denote the face images and their corresponding labels (live vs. spoof), and let z_c ∼ P_noise denote the input noise vector of PAD-GAN. The generator G_PAD of PAD-GAN uses z_c to generate face images X_fake = G_PAD(z_c). The discriminator D_PAD not only needs to predict whether its input face image is real or generated, but also needs to distinguish between live and spoof face images. The loss function of D_PAD consists of two parts, i.e., the GAN loss

L_GAN(G_PAD, D_PAD) = E_{(x,y)∼(X,Y)}[log D_PAD(x)] + E_{z_c∼P_noise}[log(1 − D_PAD(G_PAD(z_c)))],   (1)

and the classification loss

L_C(G_PAD, D_PAD) = E_{z_c∼P_noise, (x,y)∼(X,Y)}[log(D_PAD(G_PAD(z_c))) + log(D_PAD(x))].   (2)

The overall objective of PAD-GAN can be written as

min_{G_PAD} max_{D_PAD} L_PAD = L_GAN + λ L_C.   (3)
We empirically set λ = 1 to balance the scales of the GAN loss and the classification loss in our experiments.
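To make Eqs. (1)-(3) concrete, below is a minimal PyTorch-style sketch of one PAD-GAN training step. The two-headed discriminator interface (a real/fake logit plus live/spoof logits), the optimizer handling, and the restriction of the classification term to real images are illustrative assumptions of ours, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def pad_gan_step(G, D, x_real, y_live, opt_g, opt_d, lam=1.0, z_dim=512):
    """One adversarial step of PAD-GAN (Eqs. 1-3). Assumes D(x) returns
    (rf_logit, cls_logits): rf_logit scores real vs. generated, cls_logits
    scores live vs. spoof."""
    z = torch.randn(x_real.size(0), z_dim, device=x_real.device)
    x_fake = G(z)

    # Discriminator update: GAN loss (Eq. 1) + lam * classification loss (Eq. 2).
    # For simplicity the classification term here uses only real images; the
    # paper's Eq. (2) also involves generated images.
    rf_real, cls_real = D(x_real)
    rf_fake, _ = D(x_fake.detach())
    loss_gan = (F.binary_cross_entropy_with_logits(rf_real, torch.ones_like(rf_real)) +
                F.binary_cross_entropy_with_logits(rf_fake, torch.zeros_like(rf_fake)))
    loss_cls = F.cross_entropy(cls_real, y_live)          # live vs. spoof on real faces
    opt_d.zero_grad()
    (loss_gan + lam * loss_cls).backward()
    opt_d.step()

    # Generator update: try to fool the real/fake head (Eq. 3, min over G).
    rf_fake, _ = D(x_fake)
    loss_g = F.binary_cross_entropy_with_logits(rf_fake, torch.ones_like(rf_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_gan.item(), loss_cls.item(), loss_g.item()
```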
With DR-Net, we are able to approximate the real distri-
bution of live vs. spoof face images and the distribution of
individual subjects’ face images via the encoders in PAD-
GAN and ID-GAN, respectively. As a result, the encoders
of PAD-GAN and ID-GAN learned in an adversarial way
are expected to obtain features that are purely related to
PAD and subject classification, respectively. In other words,
each encoder can obtain disentangled representation for a
specific task. Traditional binary classification methods only utilize discrete samples to find a boundary in a high-dimensional feature space, and thus may not yield good disentangled representations. By contrast, a generative model can represent the samples from different categories with continuous distributions. Our ex-
periments in Section 4.4 validate these assumptions.
3.2. Multi-domain Learning via MD-Net
Given DR-Net, we can obtain a pair of encoders for PAD
and subject classification within each domain. Without loss
of generality, we assume there are two source domains, and a simple approach to leveraging the knowledge of the two domains for cross-domain PAD is to concatenate the features extracted by the PAD encoder of each domain. However, such an approach may not make full use of the disentangled ID features extracted by the ID encoders, which are also helpful for obtaining domain-independent representations for cross-domain PAD. In light of this, we propose a multi-domain feature learning module (MD-Net) to fully exploit the disentangled ID and PAD feature representations
from different domains. Fig. 4 shows the training and in-
ference processes of our MD-Net when learning from two
source domains and testing on an unseen domain.
Let (X_A, Y_A) and (X_B, Y_B) denote the face images and the corresponding labels from two domains (A and B), respectively. E^A_ID and E^A_PAD denote the ID and PAD encoders learned from (X_A, Y_A). Similarly, E^B_ID and E^B_PAD denote the ID and PAD encoders learned from (X_B, Y_B). Then, we can get the PAD features

P^B_A = E_{(x,y)∼(X_A,Y_A)}[E^B_PAD(x)],   P^A_B = E_{(x,y)∼(X_B,Y_B)}[E^A_PAD(x)],   (4)

where P^B_A denotes the PAD features extracted by E^B_PAD for the face images in domain A, and P^A_B denotes the PAD features extracted by E^A_PAD for the face images in domain B. Similarly, we can get the disentangled ID feature representations extracted by E^A_ID and E^B_ID:

I^A_A = E_{(x,y)∼(X_A,Y_A)}[E^A_ID(x)],   I^B_B = E_{(x,y)∼(X_B,Y_B)}[E^B_ID(x)].   (5)
We also use the ID features in our MD-Net to enhance the
generalization ability of the features for cross-domain PAD.
Specifically, as shown in Fig. 4, we leverage the information included in P^B_A, P^A_B, I^A_A, and I^B_B in a cross-concatenation manner.
Figure 4. The overview of MD-Net for domain-independent feature learning. During network training, the two encoders (E^A_ID, E^A_PAD) from domain A and the two encoders (E^B_ID, E^B_PAD) from domain B are used to extract disentangled features, i.e., I^A_A, P^A_B, I^B_B, P^B_A. These disentangled features are further enhanced in a cross-verified manner to build the concatenated features U_A and U_B, which are used for both subject classification and PAD. The finally learned encoders (E^A_PAD, E^B_PAD) are used to obtain two PAD features per face image, which are concatenated and used to learn the final live vs. spoof classification model used during network inference. Dashed lines indicate fixed network parameters.
From this cross concatenation, we obtain the following features, i.e.,

U_A = P^B_A ⊕ I^A_A,   U_B = P^A_B ⊕ I^B_B.   (6)
Under the assumption that the encoders E_PAD and E_ID can ideally extract PAD and ID features, we expect that the multi-domain feature U_A retains the identity information from the face image in domain A and the live or spoof information from the face image in domain B. Similarly, the multi-domain feature U_B retains the identity information from the face image in domain B and the live or spoof information from the face image in domain A. These properties can be used to enhance the learning of E_PAD. For the property of retaining the live or spoof information, we introduce a live vs. spoof classification branch using classifiers F_A and F_B with the cross-entropy loss

L_CE(E^B_PAD, F_A) = E_{(x,y)∼(X_A,Y_A)}[−y log(F_A(U_A)) − (1 − y) log(1 − F_A(U_A))],
L_CE(E^A_PAD, F_B) = E_{(x,y)∼(X_B,Y_B)}[−y log(F_B(U_B)) − (1 − y) log(1 − F_B(U_B))].   (7)
For the property of retaining identity information, we introduce an image reconstruction branch using decoders D^A_REC and D^B_REC with the L1 loss

L_1(E^B_PAD, D^A_REC) = E_{x_A∼X_A}[ ||x_A − D^A_REC(U_A)||_1 ],
L_1(E^A_PAD, D^B_REC) = E_{x_B∼X_B}[ ||x_B − D^B_REC(U_B)||_1 ].   (8)
After the cross-verified learning of MD-Net, the encoders E_PAD from DR-Net can be expected to generate more general features that are preferred for cross-domain PAD. All the above optimizations in MD-Net are
performed in a commonly used multi-task learning manner
[17].
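The following is a rough PyTorch-style sketch of one MD-Net training step implementing Eqs. (6)-(8); the module interfaces (encoders returning flat feature vectors, two-logit classifiers F_A/F_B, decoders D_rec_A/D_rec_B) and the decision to freeze the ID encoders are our assumptions based on Fig. 4, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def mdnet_step(E_pad_A, E_pad_B, E_id_A, E_id_B, F_A, F_B, D_rec_A, D_rec_B,
               x_A, y_A, x_B, y_B, optimizer):
    """One cross-verified MD-Net step (Eqs. 6-8). The ID encoders are kept
    fixed here (cf. the dashed lines in Fig. 4); the PAD encoders, the
    classifiers and the reconstruction decoders receive gradients."""
    with torch.no_grad():
        I_AA = E_id_A(x_A)                 # ID features of domain-A images
        I_BB = E_id_B(x_B)                 # ID features of domain-B images
    P_BA = E_pad_B(x_A)                    # PAD features of domain-A images via E^B_PAD
    P_AB = E_pad_A(x_B)                    # PAD features of domain-B images via E^A_PAD

    U_A = torch.cat([P_BA, I_AA], dim=1)   # Eq. (6): cross concatenation
    U_B = torch.cat([P_AB, I_BB], dim=1)

    # Eq. (7): live vs. spoof classification on the concatenated features.
    loss_ce = F.cross_entropy(F_A(U_A), y_A) + F.cross_entropy(F_B(U_B), y_B)
    # Eq. (8): L1 reconstruction so that U_A / U_B keep the identity information.
    loss_rec = F.l1_loss(D_rec_A(U_A), x_A) + F.l1_loss(D_rec_B(U_B), x_B)

    loss = loss_ce + loss_rec
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```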
Finally, for all the face images from domains A and B, we use the learned PAD encoders E^A_PAD and E^B_PAD to extract two PAD features per face image, which are then concatenated into one feature and used for learning a live vs. spoof classifier F_MD. Similarly, during inference, given a face image from an unseen domain X_U, we use E^A_PAD and E^B_PAD to extract two PAD features, i.e., P^A_U and P^B_U, and use their concatenation to perform live vs. spoof classification using F_MD.
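For clarity, a minimal sketch of this inference procedure is given below; the two-logit classifier F_MD and the score index convention are assumptions for illustration.

```python
import torch

@torch.no_grad()
def predict_spoof_score(E_pad_A, E_pad_B, F_MD, x_unseen):
    """Inference on an unseen domain: extract two PAD features per face image
    with the two learned PAD encoders, concatenate them, and classify with F_MD."""
    p_a = E_pad_A(x_unseen)                    # P^A_U
    p_b = E_pad_B(x_unseen)                    # P^B_U
    logits = F_MD(torch.cat([p_a, p_b], dim=1))
    return logits.softmax(dim=1)[:, 1]         # assumed convention: index 1 = spoof
```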
4. Experimental Results
4.1. Databases and Protocols
We provide evaluations on four widely used databases
for cross-domain PAD including Idiap REPLAY-ATTACK
[9] (Idiap), CASIA Face AntiSpoofing [52] (CASIA),
MSU-MFSD [48] (MSU) and OULU-NPU [7] (OULU).
Table 1 provides a summary of the four datasets in terms
of the PA types, display devices, and the number of subjects
in the training and testing sets.
We regard one dataset as one domain in our experiment.
Given the four datasets (Idiap, CASIA, MSU, OULU), we
can define three types of protocols to validate the effective-
ness of our approach. In protocol I, we train the model
Dataset | PA types | No. of Subj. (train / test) | Display devices
CASIA [52] (abbr.: C) | Printed photo, Cut photo, Replayed video | 20 / 30 | iPad
Idiap [9] (abbr.: I) | Printed photo, Display photo, Replayed video | 30 / 20 | iPhone 3GS, iPad
MSU [48] (abbr.: M) | Printed photo, Replayed video | 18 / 17 | iPad Air, iPhone 5S
OULU [7] (abbr.: O) | Printed photo, Display photo, Replayed video | 35 / 20 | Dell 1905FP, MacBook Retina
Table 1. A summary of the PAD databases used in our experiments.
using all the images from three datasets, and test the model
on the fourth dataset (denoted as [A, B, C]→D). Then, we
have four tests in total for protocol I: [O, C, I]→M, [O, M,
I]→C, [O, C, M]→I, [I, C, M]→O. In protocol II, we train
the model using all the images from two datasets, and test
the model on a different dataset (denoted as [A, B]→C).
So, there are 12 tests in total for protocol II, e.g., [M, I]→C,
[M, I]→O, [M, C]→I and [M, C]→O, etc. In protocol III,
we train the model using only the images from one dataset,
and test the model on a different dataset (denoted as A→B).
Then, there are also 12 tests in total for protocol III, e.g.,
I→C, I→M, C→I and C→M, etc.
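As a sanity check on the counts above, the following small Python sketch enumerates the train/test splits of the three protocols from the dataset abbreviations in Table 1.

```python
from itertools import combinations, permutations

DATASETS = ["O", "C", "I", "M"]   # OULU, CASIA, Idiap, MSU (abbreviations from Table 1)

# Protocol I: train on three datasets, test on the remaining one (4 tests).
protocol_1 = [(sorted(set(DATASETS) - {t}), t) for t in DATASETS]

# Protocol II: train on two datasets, test on one of the other two (12 tests).
protocol_2 = [(list(pair), t)
              for pair in combinations(DATASETS, 2)
              for t in set(DATASETS) - set(pair)]

# Protocol III: train on one dataset, test on a different one (12 tests).
protocol_3 = [([a], b) for a, b in permutations(DATASETS, 2)]

assert len(protocol_1) == 4 and len(protocol_2) == 12 and len(protocol_3) == 12
```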
4.2. Baselines and Evaluation Metric
We compare our approach with several state-of-the-art methods that can work for cross-domain PAD, e.g., Auxiliary-supervision (Aux) [30], which uses a CNN and an RNN [19] to jointly estimate the face depth and rPPG signal from face video; MMD-AAE [26], which learns a feature representation by jointly optimizing a multi-domain autoencoder regularized by the MMD [15] distance, a discriminator and a classifier in an adversarial training manner; and MADDG [38], which learns a generalized feature space by training one feature generator to compete with multiple domain discriminators simultaneously.
We also use a number of conventional PAD methods as
baselines, such as Multi-Scale LBP (MS LBP) [32], Binary CNN (CNN) [50], Image Distortion Analysis (IDA) [48],
Color Texture (CT) [5] and LBPTOP [11].
We follow the state-of-the-art methods for cross-domain
face PAD [27, 25, 46] and report Half Total Error Rate
(HTER) [1] (the average of false acceptance rate and false
rejection rate) for cross-domain testing. We also report
Area Under Curve (AUC) in order to compare with base-
line methods such as [38].
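The two metrics can be computed as in the sketch below; we assume liveness scores with live labeled 1 and spoof labeled 0, and leave the choice of the HTER threshold (e.g., fixed on development data) open, since it is a protocol detail.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hter(scores, labels, threshold):
    """Half Total Error Rate: average of the false acceptance rate (spoof
    accepted as live) and the false rejection rate (live rejected as spoof)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    far = np.mean(scores[labels == 0] >= threshold)   # spoof accepted as live
    frr = np.mean(scores[labels == 1] < threshold)    # live rejected as spoof
    return 0.5 * (far + frr)

def auc(scores, labels):
    """Area Under the ROC Curve for live vs. spoof classification."""
    return roc_auc_score(labels, scores)
```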
4.3. Implementation Details
Network Structure. The PAD-GAN and ID-GAN in
DR-Net share the same structure. To be specific, the gener-
ators have one FC layer and seven transposed convolution
layers to generate fake images of 256 × 256 × 3 from the 512-D noise vectors. There are four residual blocks in the discriminators and each block has four convolution layers, which have the same settings as the convolution part of ResNet-18 [18]. The discriminators learned from DR-Net are then used as the disentangled feature encoders to extract the features used by MD-Net. The decoders in MD-Net also have seven transposed convolution layers to generate the reconstructed images of 256 × 256 × 3 from the 2048-D multi-domain features. We use a kernel size of 3 for all transposed convolution layers and use ReLU [33] for activation.
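As a rough sketch of the generator described above (one FC layer followed by seven transposed convolutions with kernel size 3 and ReLU, mapping a 512-D noise vector to a 256 × 256 × 3 image), the following PyTorch module matches those constraints; the channel widths, the 2 × 2 spatial seed and the final Tanh are our own guesses.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """DR-Net generator sketch: FC layer + seven stride-2 transposed convolutions."""
    def __init__(self, z_dim=512, widths=(512, 256, 128, 64, 64, 32, 16)):
        super().__init__()
        self.fc = nn.Linear(z_dim, widths[0] * 2 * 2)           # 2x2 spatial seed
        layers, in_ch = [], widths[0]
        for out_ch in list(widths[1:]) + [3]:                   # seven up-sampling stages
            layers += [nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3,
                                          stride=2, padding=1, output_padding=1),
                       nn.ReLU(inplace=True) if out_ch != 3 else nn.Tanh()]
            in_ch = out_ch
        self.deconv = nn.Sequential(*layers)

    def forward(self, z):
        h = self.fc(z).view(z.size(0), -1, 2, 2)
        return self.deconv(h)                                   # (N, 3, 256, 256)
```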
Training Details. We use an open source SeetaFace 1
algorithm to do face detection and landmark localization.
All the detected faces are then normalized to 256 × 256 based on five facial keypoints (two eye centers, nose, and two mouth corners). We also use the open source imgaug2 library to perform data augmentation, i.e., random flipping, rotation, resizing, cropping and color distortion. The proposed approach is trained end-to-end, with the training divided into two stages to keep the process stable. At stage 1, we use the face images from different domains to train DR-Net with the joint GAN and classification losses. We choose Adam with a fixed learning rate of 2e-3 and a batch size of 128 to train the generators and discriminators. At stage 2, we use the discriminators learned from DR-Net as the disentangled feature encoders to learn MD-Net. We also choose Adam as the optimizer for the encoders and decoders, with initial learning rates of 1e-5 and 1e-4. We train 300 and 100 epochs for the two stages, respectively.
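The optimizer setup for the two stages might look like the sketch below; how the modules are grouped into generators, discriminators, encoders and decoders is an assumption on our part.

```python
import torch

def build_optimizers(generators, discriminators, pad_encoders, md_heads):
    """Two-stage optimizer setup following the hyper-parameters stated above."""
    params = lambda mods: [p for m in mods for p in m.parameters()]
    # Stage 1 (DR-Net): Adam with a fixed learning rate of 2e-3.
    opt_g = torch.optim.Adam(params(generators), lr=2e-3)
    opt_d = torch.optim.Adam(params(discriminators), lr=2e-3)
    # Stage 2 (MD-Net): Adam with initial learning rates of 1e-5 for the
    # encoders and 1e-4 for the decoders/classifiers.
    opt_enc = torch.optim.Adam(params(pad_encoders), lr=1e-5)
    opt_head = torch.optim.Adam(params(md_heads), lr=1e-4)
    return opt_g, opt_d, opt_enc, opt_head
```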
4.4. Results
Protocol I. The cross-domain PAD performance by the
proposed approach and the state-of-the-art methods under
protocol I is shown in Table 2. We can observe that con-
ventional PAD methods usually perform worse than the domain generalization methods like Auxiliary-supervision [30], MMD-AAE [26] and MADDG [38]. The possible reason why conventional PAD methods do not work well is that the data distributions of the training and testing domains differ significantly, particularly when the spoof attack types, illumination or display devices in the testing domain do not appear in the training domain. Auxiliary supervision like using
rPPG signal [34] in a multi-task way [45] was introduced
in [30], and reported better performance than simply using
CNN for binary classification [50]. This is understandable if
we look at the feature visualization in Fig. 1. MADDG [38]
and our approach perform better than Auxiliary-supervision
[30], which suggests that deep domain generalization is useful for cross-domain PAD. Moreover, our approach performs better than MADDG [38] and MMD-AAE [26] because of
1https://github.com/seetaface/SeetaFaceEngine
2https://github.com/aleju/imgaug
Method | [O, C, I]→M HTER / AUC (%) | [O, M, I]→C HTER / AUC (%) | [O, C, M]→I HTER / AUC (%) | [I, C, M]→O HTER / AUC (%)
MS LBP [32] | 29.76 / 78.50 | 54.28 / 44.98 | 50.30 / 51.64 | 50.29 / 49.31
CNN [50] | 29.25 / 82.87 | 34.88 / 71.95 | 34.47 / 65.88 | 29.61 / 77.54
IDA [48] | 66.67 / 27.86 | 55.17 / 39.05 | 28.35 / 78.25 | 54.20 / 44.59
CT [5] | 28.09 / 78.47 | 30.58 / 76.89 | 40.40 / 62.78 | 63.59 / 32.71
LBPTOP [11] | 36.90 / 70.80 | 42.60 / 61.05 | 49.45 / 49.54 | 53.15 / 44.09
Aux (Depth only) [30] | 22.72 / 85.88 | 33.52 / 73.15 | 29.14 / 71.69 | 30.17 / 66.61
Aux (All) [30] | - / - | 28.4 / - | 27.6 / - | - / -
MMD-AAE [26] | 27.08 / 83.19 | 44.59 / 58.29 | 31.58 / 75.18 | 40.98 / 63.08
MADDG [38] | 17.69 / 88.06 | 24.5 / 84.51 | 22.19 / 84.99 | 27.98 / 80.02
Proposed approach | 17.02 / 90.10 | 19.68 / 87.43 | 20.87 / 86.72 | 25.02 / 81.47
Table 2. Cross-domain PAD performance of the proposed approach and the baseline methods under protocol I.
Method | [M, I]→C HTER / AUC (%) | [M, I]→O HTER / AUC (%)
MS LBP [32] | 51.16 / 52.09 | 43.63 / 58.07
IDA [48] | 45.16 / 58.80 | 54.52 / 42.17
CT [5] | 55.17 / 46.89 | 53.31 / 45.16
LBPTOP [11] | 45.27 / 54.88 | 47.26 / 50.21
MADDG [38] | 41.02 / 64.33 | 39.35 / 65.10
Proposed approach | 31.67 / 75.23 | 34.02 / 72.65
Table 3. Cross-domain PAD performance of the proposed approach and the baseline methods under protocol II.
multi-domain disentangled representation learning.
Protocol II. Part of the cross-domain PAD performance of the proposed approach and the baseline methods under protocol II is shown in Table 3. We notice that our approach also shows promising results compared to the conventional approaches and MADDG [38]. This suggests that our approach is better able to exploit domain-independent features for cross-domain PAD than MADDG [38]. The possible reason is that our approach utilizes a generative model to disentangle the identity information from the face images, which may be more effective than the dual-force triplet-mining constraint proposed in [38] for eliminating identity information. By comparing the cases in Table 2 and Table 3 in which CASIA and OULU are the testing sets, we conclude that when more source domains are available, the advantage of domain generalization can be better exploited by our approach.
Protocol III. The cross-domain PAD performance by the
proposed approach and baseline domain adaptation meth-
ods under protocol III is shown in Table 4. We can ob-
serve that PAD-GAN has a comparable performance with
the domain adaptation methods and a better performance
than simply using CNN for binary classification [50]. This further demonstrates that disentangled representation learning has more potential than pure representation learning when the images are coupled with diverse information. However, the domain adaptation methods perform better than PAD-GAN in some cases, especially when CASIA is used as a training or testing set. The possible reason is that CASIA contains cut-photo attacks, which do not appear in the other datasets. The do-
main adaptation methods can alleviate this issue by learn-
ing a common feature space shared by both the source and
target domains.
Visualization and Analysis. We also visualize the gen-
erated fake face images by PAD-GAN and ID-GAN in DR-
Net. As shown in Fig. 5, we notice that it is hard to tell the
identity of the fake face images generated by PAD-GAN
(Fig. 3 (a)), and we can only see a rough face shape. By
contrast, the fake face images generated by ID-GAN can
better keep the identity information than PAD-GAN. This
demonstrates the usefulness of DR-Net, which can disen-
tangle the ID and PAD features well from the face im-
ages. This again validates that PAD-GAN and ID-GAN
in our DR-Net are effective in learning disentangled fea-
tures for live vs. spoof classification and subject classifi-
cation, respectively. Such a characteristic is very important
for improving cross-domain PAD robustness. Fig. 6 shows
some examples of incorrect PAD results by the proposed
approach on the four datasets. We notice that most errors are caused by challenging appearance variations due to over-saturated or under-exposed illumination, color distortions, or image blurriness that diminishes the differences between live and spoof face images.
4.5. Ablation Study
Effectiveness of DR-Net and MD-Net. We study the
influences of DR-Net and MD-Net by gradually dropping
them from the proposed approach, and denote the corre-
sponding models as ‘Proposed w/o DR’, ‘Proposed w/o
MD’ and ‘Proposed w/o DR & MD’. The results for cross-
domain PAD are given in Table 5. We can consider CNN in Table 2 as the method without disentangling identity information, while 'Proposed w/o MD' and the full method in Table 5 are methods with partial and complete identity information disentanglement, respectively. We can see that identity information disentanglement is useful for improving generalization ability. Furthermore, dropping either component leads to a performance drop, which suggests that both components are useful in the proposed face PAD approach.
Effectiveness of L_REC and L_CE in MD-Net. We also
Method | C→I | C→M | I→C | I→M | M→C | M→I | O→I | O→M | O→C | I→O | M→O | C→O
CNN [50] | 45.8 | 25.6 | 44.4 | 48.6 | 50.1 | 49.9 | 47.4 | 30.2 | 41.2 | 45.4 | 31.4 | 36.4
SA§ [27] | 39.2 | 14.3 | 26.3 | 33.2 | 10.1 | 33.3 | - | - | - | - | - | -
KSA§ [27] | 39.3 | 15.1 | 12.3 | 33.3 | 9.1 | 34.9 | - | - | - | - | - | -
ADA [46] | 17.5 | 9.3 | 41.6 | 30.5 | 17.7 | 5.1 | 26.8 | 31.5 | 19.8 | 39.6 | 31.2 | 29.1
Proposed (PAD-GAN) | 26.1 | 20.2 | 39.2 | 23.2 | 34.3 | 8.7 | 27.6 | 22.0 | 21.8 | 33.6 | 31.7 | 24.7
Table 4. HTER in (%) of cross-domain PAD by the proposed approach and baseline domain adaptation approaches under protocol III.
Method | [O, C, I]→M HTER / AUC (%) | [O, M, I]→C HTER / AUC (%) | [O, C, M]→I HTER / AUC (%) | [I, C, M]→O HTER / AUC (%)
Proposed w/o DR & MD | 30.25 / 78.43 | 33.97 / 72.32 | 35.21 / 65.88 | 28.74 / 74.27
Proposed w/o DR | 26.43 / 84.52 | 31.43 / 77.62 | 29.56 / 72.38 | 27.23 / 74.67
Proposed w/o MD | 18.07 / 87.26 | 24.85 / 85.34 | 22.44 / 85.96 | 26.78 / 76.82
Proposed w/o L_CE | 18.07 / 87.84 | 24.27 / 85.64 | 22.21 / 86.23 | 26.77 / 76.23
Proposed w/o L_REC | 18.14 / 88.41 | 21.03 / 86.30 | 21.47 / 86.26 | 26.21 / 78.45
Proposed (full method) | 17.02 / 90.10 | 19.68 / 87.43 | 20.87 / 86.72 | 25.02 / 81.47
Table 5. Ablation study of the proposed approach in terms of DR-Net, MD-Net, and the L_CE and L_REC losses.
Figure 5. Examples of generated results by the proposed DR-Net
on (a) Idiap, (b) CASIA, (c) MSU and (d) OULU. The first two
columns list the results generated by PAD-GAN and ID-GAN. The
third column lists the corresponding ground-truth face images.
conduct experiments by gradually removing L_CE and L_REC in MD-Net to validate their usefulness. We denote the corresponding models as 'Proposed w/o L_CE' and 'Proposed w/o L_REC'. The results for cross-domain PAD are given in Table 5. We can see that removing either loss leads to a performance drop; however, L_CE has a greater influence on the cross-domain PAD performance than L_REC.
5. Conclusions
Cross-domain face presentation attack detection remains
a challenging problem due to diverse presentation attack
types and environmental factors. We propose an effective
Figure 6. Examples of incorrect PAD results by the proposed ap-
proach on Idiap, CASIA, MSU and OULU. The label “S, G” (or “G, S”) denotes that a spoof (or live) face image is incorrectly classified as a live (or spoof) face image.
disentangled representation learning approach for cross-domain presentation attack detection, which consists of disentangled representation learning (DR-Net) and multi-domain feature learning (MD-Net). DR-Net leverages generative models to better model the live and spoof class distributions, allowing effective learning of disentangled features
for PAD. MD-Net further learns domain-independent fea-
ture representation from the disentangled features for the
final cross-domain face PAD task. The proposed approach
outperforms the state-of-the-art methods on several public
datasets.
6. Acknowledgement
This research was supported in part by the National Key
R&D Program of China (grant 2018AAA0102501), Nat-
ural Science Foundation of China (grants 61672496 and
61702481), and Youth Innovation Promotion Association
CAS (grant 2018135).
References
[1] A. Anjos and S. Marcel. Counter-measures to photo attacks
in face recognition: a public database and a baseline. In Proc.
IJCB, pages 1–7, 2011.
[2] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu. Face anti-
spoofing using patch and depth-based CNNs. In Proc. IJCB,
pages 319–328, 2017.
[3] W. Bao, H. Li, N. Li, and W. Jiang. A liveness detection
method for face recognition based on optical flow field. In
Proc. IASP, pages 233–236, 2009.
[4] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face anti-
spoofing based on color texture analysis. In Proc. ICIP,
pages 2636–2640, 2015.
[5] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face spoof-
ing detection using colour texture analysis. IEEE Trans. Inf.
Forensics Security, 11(8):1818–1830, 2016.
[6] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face anti-
spoofing using speeded-up robust features and fisher vector
encoding. IEEE Signal Proc. Let., 24(2):141–145, 2017.
[7] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid.
OULU-NPU: A mobile face presentation attack database
with real-world variations. In Proc. FG, pages 612–618,
2017.
[8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever,
and P. Abbeel. Infogan: Interpretable representation learning
by information maximizing generative adversarial nets. In
Proc. NIPS, pages 2172–2180, 2016.
[9] I. Chingovska, A. Anjos, and S. Marcel. On the effective-
ness of local binary patterns in face anti-spoofing. In Proc.
BIOSIG, 2012.
[10] G. Csurka. Domain adaptation for visual applications: A
comprehensive survey. arXiv preprint, arXiv:1702.05374,
2017.
[11] T. de Freitas Pereira, J. Komulainen, A. Anjos, J. M. De Martino, A. Hadid, M. Pietikäinen, and S. Marcel. Face liveness detection using dynamic texture. EURASIP Journal on Image and Video Processing, 2014(1):1–15, 2014.
[12] E. L. Denton et al. Unsupervised learning of disentangled
representations from video. In Proc. NIPS, pages 4414–
4423, 2017.
[13] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classifica-
tion. John Wiley & Sons, 2012.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D.
Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Gen-
erative adversarial nets. In Proc. NIPS, pages 2672–2680,
2014.
[15] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, 2012.
[16] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m:
A dataset and benchmark for large-scale face recognition. In
Proc. ECCV, pages 87–102, 2016.
[17] H. Han, J. Li, A. K. Jain, S. Shan, and X. Chen. Tattoo
image search at scale: Joint detection and compact repre-
sentation learning. IEEE Trans. Pattern Anal. Mach. Intell,
41(10):2333–2348, Oct. 2019.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In Proc. CVPR, pages 770–778, 2016.
[19] S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.
[20] Q. Hu, A. Szabó, T. Portenier, P. Favaro, and M. Zwicker. Disentangling factors of variation by mixing them. In Proc. CVPR, pages 3399–3407, 2018.
[21] A. Jourabloo, Y. Liu, and X. Liu. Face de-spoofing: Anti-
spoofing via noise modeling. In Proc. ECCV, pages 290–
306, 2018.
[22] D. P. Kingma and M. Welling. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.
[23] K. Kollreider, H. Fronthaler, M. Faraj, and J. Bigun. Real-
time face detection and motion analysis with application in
liveness assessment. IEEE Trans. Inf. Forensics Security,
2(3):548–558, 2007.
[24] J. Komulainen, A. Hadid, and M. Pietikainen. Context based
face anti-spoofing. In Proc. BTAS, pages 1–8, 2013.
[25] H. Li, P. He, S. Wang, A. Rocha, X. Jiang, and A. C.
Kot. Learning generalized deep feature representation for
face anti-spoofing. IEEE Trans. Inf. Forensics Security,
13(10):2639–2652, 2018.
[26] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot. Domain gen-
eralization with adversarial feature learning. In Proc. CVPR,
pages 5400–5409, 2018.
[27] H. Li, W. Li, H. Cao, S. Wang, F. Huang, and A. C. Kot. Un-
supervised domain adaptation for face anti-spoofing. IEEE
Trans. Inf. Forensics Security, 13(7):1794–1809, 2018.
[28] J. Li, Y. Wang, T. Tan, and A. K. Jain. Live face detection
based on the analysis of fourier spectra. In Proc. SPIE, vol-
ume 5404, pages 296–304, 2004.
[29] S. Liu, X. Lan, and P. C. Yuen. Remote photoplethysmog-
raphy correspondence feature for 3d mask face presentation
attack detection. In Proc. ECCV, pages 558–573, 2018.
[30] Y. Liu, A. Jourabloo, and X. Liu. Learning deep models for
face anti-spoofing: Binary or auxiliary supervision. In Proc.
CVPR, pages 389–398, 2018.
[31] Y. Liu, J. Stehouwer, A. Jourabloo, and X. Liu. Deep tree
learning for zero-shot face anti-spoofing. In Proc. CVPR,
pages 4680–4689, 2019.
[32] J. Määttä, A. Hadid, and M. Pietikäinen. Face spoofing detection from single images using micro-texture analysis. In Proc. IJCB, pages 1–7, 2011.
[33] V. Nair and G. Hinton. Rectified linear units improve re-
stricted boltzmann machines. In Proc. ICML, pages 807–
814, 2010.
[34] X. Niu, H. Han, S. Shan, and X. Chen. Synrhythm: Learning
a deep heart rate estimator from general to specific. In Proc.
ICPR, pages 3580–3585, 2018.
[35] A. Parkin and O. Grinchuk. Recognizing multi-modal face
spoofing with face recognition networks. In Proc. CVPRW,
pages 1–8, 2019.
[36] K. Patel, H. Han, and A. K. Jain. Cross-database face anti-
spoofing with robust feature representation. In Proc. CCBR,
pages 611–619, 2016.
[37] K. Patel, H. Han, and A. K. Jain. Secure face unlock: Spoof
detection on smartphones. IEEE Trans. Inf. Forensics Secu-
rity, 11(10):2268–2283, 2016.
[38] R. Shao, X. Lan, J. Li, and P. C. Yuen. Multi-adversarial dis-
criminative deep domain generalization for face presentation
attack detection. In Proc. CVPR, pages 10023–10031, 2019.
[39] R. Shao, X. Lan, and P. C. Yuen. Deep convolu-
tional dynamic texture learning with adaptive channel-
discriminability for 3d mask face anti-spoofing. In Proc.
IJCB, pages 748–755, 2017.
[40] T. Shen, Y. Huang, and Z. Tong. Facebagnet: Bag-of-local-
features model for multi-modal face anti-spoofing. In Proc.
CVPRW, pages 1–8, 2019.
[41] L. Sun, G. Pan, Z. Wu, and S. Lao. Blinking-based live
face detection using conditional random fields. In Proc. ICB,
pages 252–260, 2007.
[42] S. Suwajanakorn, S. Seitz, and I. Kemelmacher-Shlizerman.
Synthesizing obama: learning lip sync from audio. ACM
Trans. Graph., 36(4):95, 2017.
[43] J. A. Suykens and J. Vandewalle. Least squares support vec-
tor machine classifiers. Neural Proc. Lett., 9(3):293–300,
1999.
[44] L. Tran, X. Yin, and X. Liu. Disentangled representation
learning gan for pose-invariant face recognition. In Proc.
CVPR, pages 1415–1424, 2017.
[45] F. Wang, H. Han, S. Shan, and X. Chen. Deep multi-task
learning for joint prediction of heterogeneous face attributes.
In Proc. FG, pages 173–179, 2017.
[46] G. Wang, H. Han, S. Shan, and X. Chen. Improving cross-
database face presentation attack detection via adversarial
domain adaptation. In Proc. ICB, pages 1–8, 2019.
[47] G. Wang, C. Lan, H. Han, S. Shan, and X. Chen. Multi-
modal face presentation attack detection via spatial and chan-
nel attentions. In Proc. CVPRW, pages 1–8, 2019.
[48] D. Wen, H. Han, and A. K. Jain. Face spoof detection with
image distortion analysis. IEEE Trans. Inf. Forensics Secu-
rity, 10(4):746–761, 2015.
[49] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image:
Conditional image generation from visual attributes. In Proc.
ECCV, pages 776–791, 2016.
[50] J. Yang, Z. Lei, and S. Z. Li. Learn convolutional
neural network for face anti-spoofing. arXiv preprint,
arXiv:1408.5601, 2014.
[51] S. Zhang, X. Wang, A. Liu, C. Zhao, J. Wan, S. Escalera,
H. Shi, Z. Wang, and S. Z. Li. A dataset and benchmark for
large-scale multi-modal face anti-spoofing. In Proc. CVPR,
pages 919–928, 2019.
[52] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li. A
face antispoofing database with diverse attacks. In Proc. ICB,
pages 26–31, 2012.