Myope Models - Are face presentation attack detection models short-sighted?
Pedro C. Neto1,2, Ana F. Sequeira1, and Jaime S. Cardoso2,1
1INESC TEC, Porto, Portugal
2Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
Abstract
Presentation attacks are recurrent threats to biometric systems, where impostors attempt to bypass these systems.
Humans often use background information as contextual
cues for their visual system. Yet, regarding face-based sys-
tems, the background is often discarded, since face pre-
sentation attack detection (PAD) models are mostly trained
with face crops. This work presents a comparative study of
face PAD models (including multi-task learning, adversar-
ial training and dynamic frame selection) in two settings:
with and without crops. The results show that the perfor-
mance is consistently better when the background is present
in the images. The proposed multi-task methodology beats
the state-of-the-art results on the ROSE-Youtu dataset by
a large margin with an equal error rate of 0.2%. Fur-
thermore, we analyze the models’ predictions with Grad-
CAM++ with the aim to investigate to what extent the mod-
els focus on background elements that are known to be use-
ful for human inspection. From this analysis we conclude that the background cues are not relevant across all the attacks, showing the capability of the model to leverage the background information only when necessary.
1. Introduction
Presentation attacks are one of the weaknesses and dangers posed to face recognition systems (FRS). Apart from a few exceptions, most of the methods used in presentation attack detection (PAD) rely on tight face crops. In
other words, cropping the faces removes everything in the
image that is not part of a face. Thus, the information pro-
vided as input for these systems is somewhat limited, leading to myopic (i.e. short-sighted) face PAD models. This
pre-processing step has several advantages: it allows several faces per frame to be processed, and first-generation face PAD methods were designed to rely only on the face region in order to remain backwards compatible with FRS. On the other hand, it removes spatial and contextual information present in the original image. The human visual cortex can process this spatial
and contextual information to identify some attacks meant
to fool human inspection. Moreover, in some cases, a human can still be fooled by face crops of replay attacks if the resolution of the replay device is high enough. Similarly, machine vision systems can likely learn to leverage
that information when it is available. Furthermore, they can
decide if it is more important to focus on the contextual in-
formation or on the face itself.
In this work, we aim to further study an alternative to discarding the background upfront, an approach already adopted in some previous works in the literature. Thus, we investigate how different approaches perform in the presence of contextual background. In the experiments, we perform a comparative study of a state-of-the-art supervised binary classification model and its combination with an adversarial approach (in which two embeddings are produced: one containing information useful for the prediction and the other containing nuisances present in the input, optimized to minimize the mutual information between them), on three datasets, using face images with and without crops. For a more thorough evaluation, we elected the ROSE-Youtu dataset because it offers a high variety of attacks not available in the other datasets.
Thus, we evaluated a multi-task model optimized to also distinguish between the different attacks, and the former combined with an adversarial approach. Furthermore, we designed a novel experiment based on multiple instance learning methods. With this, we attempted to create a dynamic frame selection system, passing to the model the responsibility of selecting the frame most likely to include an attack. Unlike the previous approaches, this method requires more processing power, since it goes through all frames in a video.
It has been shown in the literature that black-box mod-
els, such as deep neural networks, can learn unpredictable
patterns and focus their decision on “unexpected” regions
of interest in the input. Therefore, we also evaluated the ex-
periments from the perspective of explainable artiﬁcial in-
telligence (xAI). These evaluations are necessary to better
understand the models’ decisions and the errors due to their
opaqueness. We use methods for the visualization of
the elements that were important for the decision, such as Grad-CAM++. The output of these methods allowed us to analyze the visual cues found by the model to detect attacks. We also note that, in some cases, the models follow the same cues used by a human inspector. Thus, we reflect upon the influence of the background in the choice of future face PAD algorithms.
This study contributes to the awareness around the need
to incorporate interpretability in face PAD methodologies.
The models’ performance improvements are attributed to
the use of background and this can be corroborated by ob-
serving that, in fact, the models’ decisions are made using
the background cues.
The experiments use three datasets: ROSE-Youtu, NUAA, and Replay-Attack. The state-of-the-art supervised binary classification model (BC), and BC combined with an adversarial approach, are evaluated on the three datasets in two settings: with and without background.
Furthermore, we evaluate the multi-task (MT) and dynamic
frame selection (DFS) approaches using only the ROSE-
Youtu dataset which includes a high diversity of attacks
comprising both two-dimensional and three-dimensional
information. Hence, it was used to study the impact of the background on the model performance and whether the background affects the capability of generalizing between attacks. The experiments were defined with two goals in mind: generalization between attacks, and generalization between subjects. The first goal is addressed through the study of the performance of the model on attacks not seen previously during training. The second is addressed by using 50% of the subjects for testing. The BC, MT and adversarial approaches are evaluated
with and without background in a cross-dataset scenario.
The major contributions of this work are: i) the evaluation (on three widely used datasets) of a state-of-the-art supervised binary classification model and its combination with an adversarial strategy in two alternative scenarios: face images with and without background; ii) a multi-task face PAD approach that leverages background and achieves state-of-the-art results on the ROSE-Youtu dataset; iii) a proposed frame selection methodology for the ROSE-Youtu dataset. It was not the focus of this study to see whether specific models in the literature perform better with or without background; instead, it proposes simple and distinct approaches and analyzes whether the results are consistent across them with regard to the presence of background.
Besides this introduction and the conclusion, this document contains four major sections. First, Section 2 discusses the related work and how it led to the current study. Afterwards, details on the experiments conducted are given in Section 3. The description of the datasets is given in Section 4. Finally, in Section 5 we present and discuss the obtained results and how they impact the future of PAD methods.
2. Related Work
Typically, previous works on face presentation attack detection do not leverage background information, and thus removing it is a common practice and a frequent step in the preprocessing stage. The most common approach is to use a face crop, usually obtained through deep learning-based face detection algorithms such as MTCNN and RetinaFace. Earlier methods used more traditional techniques, for instance, the Viola-Jones cascade detector.
Within the published works, it is possible to find reinforcement learning approaches, 3D-CNNs, a two-stage approach relying on blinking, and several other colour-based methods [20, 2]. The background usage is addressed in some works [27, 1, 22, 30, 16]; however, they did not perform comparative studies, with and without the background, of the performance of several approaches. It is possible to find this comparison in other works, but the proposed methods are based on conventional machine learning instead of end-to-end deep learning [31, 18]. Due to the nature of the ROSE-Youtu dataset, which contains three-dimensional and two-dimensional attacks, there are fewer methods tested on this dataset than on others. The variability of attacks included in the dataset significantly increases the difficulty of finding a model capable of performing well on all of them. For this reason, even methods that achieve almost zero error on other datasets have worse performance on ROSE-Youtu.
To the best of our knowledge, there has not been any method inspired by multiple instance learning applied to face PAD. However, there is an article on a similar technique used for the detection of deepfakes. Despite being a slightly different problem, the detection methodology has a significant overlap. The adversarial approach followed in our experiments was first described in prior work, and its capability to work with face PAD systems was evaluated one year later by Jaiswal et al.
Producing and visualizing explanations of the predic-
tions for face presentation attack detection is a relatively
new topic. Sequeira et al. have explored the challenge of
interpreting face PAD methods [25, 24]. Their work de-
scribed how the current evaluation metrics for PAD lack in-
formation regarding the elements that are being used for the
prediction. In a sense, they argue that models can make ac-
curate predictions but still base their decision on parts of
the image that do not correspond to real face features or
presentation artifacts as a human inspector would. In this
work, we follow a similar approach to produce and analyze explanations. However, we use them to infer whether the presence of contextual background leads to the use of certain visual cues in the image. At the same time, we want to see whether other contextual elements, for instance reflections, are used to make correct predictions. Humans often use these elements in their analysis.
3. Methodology
Myriads of attacks constantly threaten biometric systems. In practice, however, we do not aim to identify the type of attack; the main goal is to infer whether the image given to the sensor is an attack or comes from a genuine person. The problem is, in its essence, formulated as a binary classification task. On top of the binary task, we also applied different training processes. However, these do not affect the network at test time. The purpose of these distinct approaches is to understand whether the background effect generalizes between approaches.
Figure 1. Architecture of the multi-task learning model. It receives an image (yellow box) and includes a CNN (blue figure) and two output heads (green figures), where the first is a binary head and the second has 8 output nodes. The output node for the genuine samples on both heads ensures that the main goal remains the detection of attacks.
• Binary classification training (BC) - In the first approach, the only task that the backbone network is optimized for at training time is to classify between attacks and genuine images. For this, we use a MobileNet v2 that outputs two values, which are activated with softmax. The optimization of the weights is done using the binary cross-entropy loss:

BCE(y, p) = −(y log(p) + (1 − y) log(1 − p)) (1)
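As a concrete illustration, Eq. 1 can be sketched in a few lines of NumPy. The `bce` helper, its label convention, and the example probabilities below are our own illustrative choices, not code from the paper:

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Binary cross-entropy of Eq. 1 for a single prediction.
    y is the ground-truth label (0 or 1) and p is the predicted
    probability of the positive class."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident correct prediction yields a small loss,
# a confident wrong one a large loss.
low = bce(1, 0.99)
high = bce(1, 0.01)
```

In practice this loss would be applied to the softmax outputs of the MobileNet v2 backbone described above.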
• Multi-task classification training (MT) - Whenever a model is optimized to distinguish between attacks and genuine images, it treats all the attacks equally. However, in practice, the attacks are not the same, and each possesses distinctive characteristics. Thus, we also formulated the training stage of a MobileNet v2 so that it learns to distinguish between the seven different attacks. It is also possible that learning to discriminate between attacks boosts the performance whenever the attack is unknown. Instead of having an output layer with two classes, the network has two output layers. The first has two output classes, whereas the other has eight (seven attacks and one genuine). In both cases, they are activated with softmax. Both layers are simultaneously updated with the binary cross-entropy (Eq. 1) and cross-entropy (Eq. 2) losses, respectively. These losses are combined as seen in Equation 3. Due to the risk of the second term of the equation being larger than the first, it was necessary to add an output node for genuine samples in both heads. Figure 1 is a simplified visualization of the architecture of the model.
CE(y, p) = −∑i yi log(pi) (2)

LossMulti(y1, p1, y2, p2) = BCE(y1, p1) + CE(y2, p2) (3)
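A minimal NumPy sketch of the combined loss in Eq. 3, assuming one-hot targets on the 8-way head; the sample values and the placement of "genuine" at class index 0 are illustrative assumptions:

```python
import numpy as np

def cross_entropy(y_onehot, p, eps=1e-12):
    """Categorical cross-entropy of Eq. 2 over the 8-way head
    (seven attack types plus one genuine class)."""
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)))

def multi_task_loss(y_bin, p_bin, y_cls, p_cls, eps=1e-12):
    """Eq. 3: sum of the binary and the 8-way cross-entropy terms."""
    p_bin = np.clip(p_bin, eps, 1 - eps)
    bce = -(y_bin * np.log(p_bin) + (1 - y_bin) * np.log(1 - p_bin))
    return bce + cross_entropy(y_cls, p_cls)

# Hypothetical genuine sample: label 1 on the binary head and
# one-hot class 0 ("genuine") on the 8-way head.
y_cls = np.zeros(8); y_cls[0] = 1.0
p_cls = np.full(8, 0.1); p_cls[0] = 0.3   # softmax-like, sums to 1
loss = multi_task_loss(1, 0.9, y_cls, p_cls)
```

Because the genuine class appears in both heads, both terms pull in the same direction for bona fide samples, which is the motivation given above for the extra genuine node.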
• Adversarial training (Adv.) - In the images shown to the system, there is background information that is useful for the prediction, for instance, reflections. Nevertheless, not all the background information is useful; part of it can be considered a nuisance. Hence, we also explored an approach that attempts to remove those parts of the image from the feature vector used for the classification task. This approach, known as Unsupervised Adversarial Invariance, produces two distinct embeddings (i.e. feature vectors). The first vector, e1, represents the features that are relevant to the prediction of the model, whereas the second, e2, comprises the information that should not be used for the prediction. Constructing the loss of such an architecture requires four terms. The first two are maximization terms (Eq. 6 and 7); they attempt to reconstruct e1 from e2 and vice-versa, which removes any potential mutual information between both embeddings. The other two are minimization terms (Eq. 5). The first uses e1 to perform the classification task, whereas the second applies some noise to e1, in the form of a dropout layer, and tries to reconstruct the input image from the noisy e1 and from e2. For the reconstruction terms the loss used is the mean squared error (Eq. 4), while for the classification term we use either the binary cross-entropy (Eq. 1) or the multi-task loss (Eq. 3). The term α controls the impact of the reconstruction loss on the overall loss. We start with α = 0.025, and we increase it by 0.025 at the end of each epoch. The architecture is represented in Figure 2.
MSE(x, y) = (1/n) ∑i (xi − yi)² (4)

LossAdv(e1, e′1, e2, e′2) = −MSE(e1, e′1) − MSE(e2, e′2) (6, 7)

LossClass(y, p, x, x′) = BCE(y, p) + α MSE(x, x′) (5)

LossClass(y1, p1, y2, p2, x, x′) = LossMulti(y1, p1, y2, p2) + α MSE(x, x′)
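The classification term of Eq. 5 and the paper's α schedule can be sketched as follows; `adv_class_loss` and the toy inputs are our own names for illustration, not the authors' implementation:

```python
import numpy as np

def mse(x, y):
    """Eq. 4: mean squared error between two arrays."""
    return float(np.mean((np.asarray(x) - np.asarray(y)) ** 2))

def adv_class_loss(y, p, x, x_rec, alpha, eps=1e-12):
    """Eq. 5 (sketch): classification from e1 plus the input
    reconstruction from noisy e1 and e2, weighted by alpha."""
    p = np.clip(p, eps, 1 - eps)
    bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return bce + alpha * mse(x, x_rec)

# The schedule described above: alpha starts at 0.025 and
# grows by 0.025 at the end of every epoch.
alphas = [0.025 * (epoch + 1) for epoch in range(4)]
```

The maximization terms (Eq. 6 and 7) would be added with a negative sign on the cross-reconstruction MSEs, so the disentanglers are penalized for recovering one embedding from the other.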
Figure 2. Architecture of the adversarial learning model. It receives an input image (yellow box) and includes a CNN (blue figure) and two feature vectors e1 and e2 (green boxes). These are used to reconstruct each other, decode (purple figure) the input, and classify the input. This architecture is closely based on Unsupervised Adversarial Invariance.
• Dynamic frame selection training (DFS) - Finally, we also propose an architecture to select the best frame for the detection of attacks. In state-of-the-art approaches, the training frames were fixed and previously selected from the list of possible frames. This, however, raises a couple of questions: 1) do all the frames contain the same information for the prediction? 2) if not, are we selecting the best frames to optimize the network? We structured the optimization of this method in two stages: frame selection and learning. The frame selection stage processes all the frames in a video and computes their output with the model. From the outputs, if the video is from an attack, it selects the three frames that have the lowest probability of being an attack, and the opposite if the video is from a genuine individual. Afterwards, the selected frames are used in the learning stage to optimize the network towards the video labels. At testing time the process is similar to the frame selection: the frame with the highest probability of being an attack is used for the classification. Perhaps one of the most interesting aspects of this approach is that it can be integrated with the previously mentioned approaches: the frame selection step remains unchanged, while the training stage integrates the changes related to the other approaches. Figure 3 shows the behavior of the described method for both frame selection and testing. The frame selection probability is the attack probability produced by the binary classification layer of the model.
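The selection rule above can be sketched with NumPy, standing in the per-frame attack probabilities that the CNN would produce (the example values are hypothetical):

```python
import numpy as np

def select_training_frames(attack_probs, is_attack, k=3):
    """Frame-selection stage (sketch): for an attack video keep the k
    frames with the LOWEST attack probability (the hardest ones); for a
    genuine video the k with the HIGHEST, mirroring the rule above."""
    order = np.argsort(attack_probs)  # ascending attack probability
    return order[:k] if is_attack else order[-k:]

def select_test_frame(attack_probs):
    """At test time the frame with the highest attack probability decides."""
    return int(np.argmax(attack_probs))

probs = np.array([0.9, 0.1, 0.8, 0.3, 0.6])  # hypothetical per-frame outputs
train_idx = select_training_frames(probs, is_attack=True)
test_idx = select_test_frame(probs)
```

Selecting the hardest frames for training is a form of hard-example mining, which is consistent with the extra processing cost noted above: every frame of every video must be scored before learning.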
Figure 3. Architecture of the method for dynamic frame selection. It includes a CNN (blue figure) with shared weights that processes all the frames (yellow boxes) of a video and selects one (lower yellow box) based on a specific criterion.
To evaluate and compare the performance of these PAD models, we collected the following metrics: the Bona fide Presentation Classification Error Rate (BPCER), the proportion of bona fide presentations erroneously classified as attacks, and the Attack Presentation Classification Error Rate (APCER), the proportion of presentation attacks wrongly classified as bona fide. Finally, we also collected the Equal Error Rate (EER), which is the error at the operating point where the APCER and BPCER have the same value. For the APCER and the BPCER we used a threshold of 0.5.
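These metrics can be computed from raw scores as below; the label encoding (1 = attack, 0 = bona fide) and the simple threshold-sweep approximation of the EER are our own assumptions for the sketch:

```python
import numpy as np

def pad_metrics(scores, labels, threshold=0.5):
    """APCER and BPCER at a fixed threshold (0.5 in this work).
    `scores` are attack probabilities; `labels` use an assumed
    encoding of 1 for attack and 0 for bona fide."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    attacks = scores[labels == 1]
    bonafide = scores[labels == 0]
    apcer = float(np.mean(attacks < threshold))    # attacks accepted as bona fide
    bpcer = float(np.mean(bonafide >= threshold))  # bona fide rejected as attacks
    return apcer, bpcer

def eer(scores, labels, steps=1001):
    """EER (sketch): sweep thresholds and return the mean of APCER and
    BPCER at the operating point where they are closest."""
    gaps = []
    for t in np.linspace(0.0, 1.0, steps):
        apcer, bpcer = pad_metrics(scores, labels, t)
        gaps.append((abs(apcer - bpcer), (apcer + bpcer) / 2))
    return min(gaps)[1]

scores = [0.9, 0.8, 0.1, 0.2]   # hypothetical attack probabilities
labels = [1, 1, 0, 0]           # perfectly separated toy example
apcer, bpcer = pad_metrics(scores, labels)
```

For perfectly separated scores, as in the toy example, all three error rates are zero, matching the near-zero values reported in the result tables below.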
4. Datasets
The datasets used for the experimental evaluation are: ROSE-Youtu, NUAA, and Replay-Attack.
ROSE-Youtu: contains, in its public version, 3350 videos of 20 different subjects. On average, video clips have a duration of 10 seconds. For each of the subjects, it contains around 150 to 200 videos captured with five mobile devices (all with different camera resolutions) under five lighting conditions. The front-facing camera was used with a distance between face and camera of about 30 to 50 centimeters.
Figure 4. Samples collected from the ROSE-YOUTU dataset containing images from attacks and genuine captures. On the top row, cropped images are displayed, whereas the bottom row contains the exact same images but with all the background information.
Table 1. List of attacks present in the ROSE-YOUTU dataset.
- Genuine (bona ﬁde)
#1 Still printed paper
#2 Quivering printed paper
#3 Video which records a Lenovo LCD display
#4 Video which records a Mac LCD display
#5 Paper mask with two eyes and mouth cropped out
#6 Paper mask without cropping
#7 Paper mask with the upper part cut in the middle
There are eight different types of videos, which trans-
lates into eight classes. The ﬁrst class represents the gen-
uine samples, whereas each of the following seven represent
an attack. The ﬁrst two attacks are print attacks, while the
third and fourth are replay attacks on a Lenovo and an Apple
laptop, respectively. The remaining three are based on paper
masks and are responsible for including three-dimensional
information in the dataset. These attacks are described in Table 1.
We preprocessed the dataset into two different copies. For both, the frames of the videos were extracted and stored; for the second, a face was additionally cropped with the MTCNN algorithm in every frame. Examples of these images are shown in Figure 4. We used the videos from the first 10 indexed subjects (2, 3, 4, 5, 6, 7, 9, 10, 11, 12) for training and the remaining 10 for testing.
NUAA: was one of the first public databases for training and evaluating the performance of face PAD methods. This database simulates a simple and general attack that re-captures a printed photograph of users to attack a face recognition system. The NUAA database contains real and presentation attack face images of 15 persons. For each person, both real and presentation attack images were captured in three different sessions using generic cheap webcams, capturing the real face and a printed photograph of the user. The database contains 5105 real and 7509 presentation attack color face images with a resolution of 640 × 480 pixels. The training and testing sub-databases are predefined, so that the performances of various PAD methods can be compared. In detail, the training set contains 1743 real and 1748 presentation attack face images, while the testing set contains 3362 real and 5761 presentation attack face images.
Replay-Attack: this database for face PAD consists of 1300 video clips of photo and video attack attempts on 50 clients, under different lighting conditions. All videos are generated by either having a (real) client trying to access a laptop through a built-in webcam or by displaying a photo or a video recording of the same client for at least 9 seconds.
5. Results
In this section, we present and discuss the evaluation results of the methods described. Regarding implementation details, all the methods were optimized with Adam and a fixed learning rate of 0.001. The model used is a MobileNet v2 pre-trained on the ImageNet dataset. The images are resized to a resolution of 224x224 in RGB. Grad-CAM++ was the visualization tool used to analyze the parts of the image relevant to the models' decisions. Train and test splits were the same as described in the publication of each dataset, so that the results can be compared with other works.
Table 2 shows the results obtained with the binary classification (BC) model and its combination with adversarial training (Adv.+BC) for the NUAA and Replay-Attack datasets. In both evaluations, the performance improvement when the background is present in the images is evident, with a decrease of the EER to 0.00%.
The variety of attacks present in the ROSE-Youtu dataset motivates a multi-task learning approach. The experiments were intended to evaluate the BC and the MT approaches with the inclusion and exclusion of contextual background. We further attempted to integrate them with the adversarial approach described in the previous section. The results for both BC and MT classification and their combination with the adversarial training strategy (Adv.+BC; Adv.+MT) can be seen in Table 3. While the adversarial training did not lead to the expected improvement, in all four scenarios the models' performance improved with the background. This seems to indicate that the background provides more information and improves the error rates. On multi-task classification, the improvement on the EER is as high as 81%.
Table 2. Comparison of four different approaches with their ver-
sions with and without background. The columns represent the
dataset used for both training and testing. The reported values rep-
resent the EER in %.
Method Background NUAA Replay-Attack
Table 3. Comparison of four different approaches with their ver-
sions with and without background on the ROSE-Youtu dataset.
APCER, BPCER and EER are displayed as %. In bold is the best
result per column.
Method Background APCER BPCER EER
In a cross-dataset approach, we performed experiments in which the models trained with the ROSE-Youtu dataset were evaluated on the other two datasets. The results of these experiments are presented in Table 4 and are in line with the results depicted in Tables 2 and 3. The comparison of the same-database and cross-database results shows that the models' performance consistently improves when the background information is used.
Considering the performance gains from the use of back-
ground, we used this scenario to explore the proposed multi-
task (MT) and dynamic frame selection (DFS) strategies.
The results of all approaches evaluated for the ROSE-Youtu
dataset are presented in Table 5. Unexpectedly, the DFS methods and the ones that used adversarial training produced worse results than the simple binary and multi-task classification. The BC and MT approaches performed well at detecting attacks, as can be seen by the low values of the APCER, 0.25% and 0.15%, respectively. Regarding the detection of bona fide images, the BC performed worse than several of the other methods, while the MT had the lowest BPCER of them all, 0.40%.
Table 4. Comparison of four different approaches with their ver-
sions with and without background. Results for models trained on
the ROSE-Youtu dataset and tested on the datasets of each column.
The reported values represent the EER in %.
Method Background NUAA Replay-Attack
Table 5. Comparison of all the seven different approaches explored
in the ROSE-Youtu dataset. All of the approaches leveraged back-
ground information. In bold is the best result per column.
Method APCER (%) BPCER (%) EER (%)
BC 0.25 2.03 0.73
MT 0.15 0.40 0.24
Adv. + BC 0.52 1.29 0.76
Adv. + MT 0.29 1.11 0.60
DFS 0.54 4.68 1.62
MT + DFS 0.31 2.23 0.69
Adv. + DFS 2.15 1.78 1.78
We extended the experiments of the multi-task classification approach to two different evaluation configurations: one attack and unseen attack (the results can be seen in Table 6 and Table 7, respectively). The multi-task classification was also integrated with the adversarial training and the dynamic frame selection for a better and more complete comparison. The one-attack configuration selects one attack to be used for both training and testing. This is intended to see how hard it is to overfit the model to that attack and to distinguish it from genuine images. The results for this configuration are visible in Table 6; while the three approaches are capable of overfitting to the majority of the attacks, attack #4 remains challenging for the adversarial and DFS approaches.
The unseen configuration selects one attack to be removed from training and to be the only one used for testing. This evaluates the capability of the model to generalize to novel attacks and the challenges that each attack presents to the network. The results for this configuration are shown in Table 7; both DFS and adversarial approaches have difficulties in generalizing to unseen attacks, especially attack #4. The low performance on this attack can be explained by the high resolution of the replay device, which increases the difficulty of the task.
Table 6. Evaluation of three approaches in the setting of one attack in the ROSE-Youtu dataset. In this setting, the attack in the first column is the only one used for training and testing. APCER, BPCER and EER are displayed as %. In bold is the best result per column.
Attack MT Adversarial MT DFS MT
APCER BPCER EER APCER BPCER EER APCER BPCER EER
#1 0.00 0.20 0.05 0.20 0.29 0.27 0.50 0.22 0.50
#2 0.00 0.02 0.02 0.00 0.07 0.00 0.00 0.00 0.00
#3 0.00 0.16 0.11 0.00 0.29 0.25 0.00 0.45 0.00
#4 0.30 1.27 0.71 1.16 1.98 1.46 0.50 1.34 1.01
#5 0.00 0.02 0.00 0.00 0.09 0.05 0.00 0.45 0.00
#6 0.00 0.07 0.00 0.00 0.16 0.11 0.00 0.00 0.00
#7 0.00 0.05 0.00 0.00 0.11 0.00 0.00 0.00 0.00
Table 7. Evaluation of three approaches in the setting of unseen attack in the ROSE-Youtu dataset. In this setting, the attack in the first column is excluded from training and is the only one used for testing. APCER, BPCER and EER are displayed as %. In bold is the best result per column.
Attack MT Adversarial MT DFS MT
APCER BPCER EER APCER BPCER EER APCER BPCER EER
#1 1.00 0.85 0.95 1.00 6.90 2.87 1.00 6.90 2.30
#2 0.00 0.49 0.25 0.00 1.18 0.27 0.00 4.90 0.67
#3 3.09 3.41 3.27 1.44 3.07 2.61 2.14 10.24 5.79
#4 13.82 3.88 7.57 12.66 6.61 9.15 17.59 16.70 17.09
#5 0.00 1.98 0.65 0.25 0.58 0.45 0.50 4.90 1.49
#6 0.00 0.89 0.33 0.10 0.58 0.10 0.00 4.01 1.00
#7 0.35 3.63 1.77 2.32 7.68 5.20 0.51 9.58 1.78
The multi-task approach excelled at detecting both attacks and bona fide samples, achieving an equal error rate better than any other approach. Thus, it was the approach used to compare with the state-of-the-art for the ROSE-YOUTU dataset. The methods compared report their results on a similar train/test split, as specified in the database publication document. Our results, as seen in Table 8, are better than the state-of-the-art when we include background and slightly better when there is no background. Despite the good results, it is important to note that the methods presented in this document were not designed to be the best performing methods in a cross-dataset configuration. Instead, they are intended to allow a relevant study regarding the presence of background for this specific dataset.
Figure 5 depicts the ROC curve of the MT model, with the x-axis displayed at log-scale for better visualization. The model is indeed nearly perfect at detecting the attacks and the bona fide images in the ROSE-Youtu dataset.
Finally, we produced explanations of our model for an example of each category of attacks. For the replay attack, we produced the explanations in Figures 6a and 6d. In these figures, it is possible to observe that the model leveraged the presence of reflections in the attack image whenever there is background. When the background is not present there are no cues to justify the decision of the model, which
Table 8. Comparison of the best proposed approaches, both with and without background, with the state-of-the-art. In bold is the best result per column.
Method EER (%)
CoALBP (YCBCR)  17.1
CoALBP (HSV)  16.4
Color [2, 7] 13.9
De-Spoofing [15, 7] 12.3
RCTR-all spaces  10.7
ResNet-18  9.3
SE-ResNet18  8.6
AlexNet  8.0
DR-UDA (SE-ResNet18)  8.0
DR-UDA (ResNet-18)  7.2
3D-CNN  7.0
Blink-CNN  4.6
DRL-FAS  1.8
Ours w/ Background 0.2
is in fact wrong. Figures 6b and 6e show the explanations for a paper mask attack; as expected, the explanations do not rely on the background. Instead, the model directs its focus to the mask area for the final prediction. The area is similar in both versions of the model, with and without
Figure 5. Receiver operating characteristic curve for the multi-task model on the ROSE-Youtu dataset with background. The x-axis is at log-scale.
(a) Replay - B (b) Paper Mask
(c) Print - B
(d) Replay (e) Paper Mask (f) Print
Figure 6. Explanations generated with Grad-CAM++ for different attacks of subject #23. "-B" indicates that the model was trained and tested on images with background. Image (d) was wrongly classified as genuine.
background. Finally, the print attack explanations are shown in Figures 6c and 6f. These figures show that, once more, the model is capable of understanding the conditions of the given image and directs its focus to an important background artefact, the pin holding the image. Again, the version without background does not highlight any relevant cue that explains the prediction of the model. Hence, we further argue in favor of the better explainability of the models that include background.
6. Conclusion
This work explored how consistently the background impacts the performance of distinct methods for face presentation attack detection. The experiments corroborated the view that a face PAD model is capable of leveraging both background and face elements to make a correct prediction.
Our approach surpassed the state-of-the-art results for
the ROSE-YOUTU dataset by a large margin. The multi-
task model leverages background artefacts to improve the
detection of speciﬁc attacks. Moreover, we also present
some alternative approaches, dynamic frame selection and
adversarial training, that we believe were limited by the lack
of a large database of face presentation attacks. Their results
were consistent with those of the multi-task model.
We further contribute to improving the explainability of these models by analyzing their predictions. This analysis, conducted with the Grad-CAM++ algorithm, highlighted that models that include the background of the images can leverage the presence of certain artifacts. On the other hand, when the background is not present, the generated explanations seem to be non-informative. Hence, given their similarity to human vision with regard to the areas used for prediction, models that leverage the background provide more informative explanations for their predictions.
The authors would like to acknowledge the reviewers' comments, which were crucial for the improvement of this work. This work was financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project UIDB/50014/2020 and within the PhD grant "2021.06872.BD".
Conflict of Interest
The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.
 Yashasvi Baweja, Poojan Oza, Pramuditha Perera, and
Vishal M Patel. Anomaly detection-based unknown face pre-
sentation attack detection. In 2020 IEEE International Joint
Conference on Biometrics (IJCB), pages 1–9. IEEE, 2020.
 Zinelabidine Boulkenafet, Jukka Komulainen, and Abdenour
Hadid. Face spoofing detection using colour texture analysis.
IEEE Transactions on Information Forensics and Security,
 Rizhao Cai, Haoliang Li, Shiqi Wang, Changsheng Chen,
and Alex C Kot. Drl-fas: A novel framework based on deep
reinforcement learning for face anti-spoofing. IEEE Trans-
actions on Information Forensics and Security, 16:937–951,
 Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader,
and Vineeth N Balasubramanian. Grad-cam++: General-
ized gradient-based visual explanations for deep convolu-
tional networks. In 2018 IEEE Winter Conference on Appli-
cations of Computer Vision (WACV), pages 839–847, 2018.
 Ivana Chingovska, André Anjos, and Sébastien Marcel. On the effectiveness of local binary patterns in face anti-spoofing. In 2012 BIOSIG-proceedings of the international
conference of biometrics special interest group (BIOSIG),
pages 1–7. IEEE, 2012.
 Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kot-
sia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-
level face localisation in the wild. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2020.
 Yuting Du, Tong Qiao, Ming Xu, and Ning Zheng. Towards
Face Presentation Attack Detection Based on Residual Color
Texture Representation. Security and Communication Net-
works, 2021:6652727, 2021.
 Md. Mehedi Hasan, Md. Salah Uddin Yusuf, Tanbin Islam
Rohan, and Shidhartho Roy. Efficient two stage approach to detect face liveness: Motion based and deep learning
based. In 2019 4th International Conference on Electrical
Information and Communication Technology (EICT), pages
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016.
 Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net-
works. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 7132–7141, 2018.
 Matthew Hutson. The opacity of artificial intelligence makes
it hard to tell when decision-making is biased. IEEE Spec-
trum, 58(2):40–45, 2021.
 ISO/IEC JTC1 SC37. Information Technology - Biometrics
- Presentation attack detection Part 3: Testing and Reporting.
ISO Int. Organization for Standardization, 2017.
 Ayush Jaiswal, Rex Yue Wu, Wael Abd-Almageed, and Prem
Natarajan. Unsupervised Adversarial Invariance. In S Ben-
gio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi,
and R Garnett, editors, Advances in Neural Information Pro-
cessing Systems, volume 31. Curran Associates, Inc., 2018.
 Ayush Jaiswal, Shuai Xia, Iacopo Masi, and Wael Ab-
dAlmageed. Ropad: Robust presentation attack detection
through unsupervised adversarial invariance. In 2019 Inter-
national Conference on Biometrics (ICB), pages 1–8. IEEE,
 Amin Jourabloo, Yaojie Liu, and Xiaoming Liu. Face de-spoofing: Anti-spoofing via noise modeling. In Proceedings
of the European Conference on Computer Vision (ECCV),
pages 290–306, 2018.
 Alperen Kantarcı, Hasan Dertli, and Hazım Kemal Ekenel.
Shuffled patch-wise supervision for presentation attack de-
tection. In 2021 International Conference of the Biometrics
Special Interest Group (BIOSIG), pages 1–5. IEEE, 2021.
 Dakshina Ranjan Kisku and Rinku Datta Rakshit. Face spoofing and counter-spoofing: a survey of state-of-the-art algorithms. Transactions on Machine Learning and Artificial Intelligence, 5(2):31, 2017.
 Jukka Komulainen, Abdenour Hadid, and Matti Pietikäinen. Context based face anti-spoofing. In 2013 IEEE Sixth In-
ternational Conference on Biometrics: Theory, Applications
and Systems (BTAS), pages 1–8. IEEE, 2013.
 Haoliang Li, Peisong He, Shiqi Wang, Anderson Rocha,
Xinghao Jiang, and Alex C. Kot. Learning generalized deep
feature representation for face anti-spoofing. IEEE Transac-
tions on Information Forensics and Security, 13(10):2639–
 Haoliang Li, Wen Li, Hong Cao, Shiqi Wang, Feiyue Huang,
and Alex C. Kot. Unsupervised domain adaptation for face
anti-spoofing. IEEE Transactions on Information Forensics
and Security, 13(7):1794–1809, 2018.
 Xiaodan Li, Yining Lang, Yuefeng Chen, Xiaofeng Mao,
Yuan He, Shuhui Wang, Hui Xue, and Quan Lu. Sharp mul-
tiple instance learning for deepfake video detection. In Pro-
ceedings of the 28th ACM International Conference on Mul-
timedia, pages 1864–1872, 2020.
 Ewa Magdalena Nowara, Ashutosh Sabharwal, and Ashok
Veeraraghavan. Ppgsecure: Biometric presentation attack
detection using photopletysmograms. In 2017 12th IEEE In-
ternational Conference on Automatic Face & Gesture Recog-
nition (FG 2017), pages 56–62. IEEE, 2017.
 Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh-
moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
residuals and linear bottlenecks. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 4510–4520, 2018.
 Ana F. Sequeira, Tiago Gonçalves, Wilson Silva, João Ribeiro Pinto, and Jaime S. Cardoso. An ex-
ploratory study of interpretability for face presentation
attack detection. IET Biometrics, n/a(n/a).
 A. F. Sequeira, W. Silva, J. R. Pinto, T. Gonçalves, and J. S.
Cardoso. Interpretable biometrics: Should we rethink how
presentation attack detection is evaluated? In 2020 8th In-
ternational Workshop on Biometrics and Forensics (IWBF),
pages 1–6, 2020.
 Xiaoyang Tan, Yi Li, Jun Liu, and Lin Jiang. Face liveness
detection from a single image with sparse low rank bilinear
discriminative model. In European Conference on Computer
Vision, pages 504–517. Springer, 2010.
 Rafael Henrique Vareto and William Robson Schwartz.
Face spoofing detection via ensemble of classifiers toward
low-power devices. Pattern Analysis and Applications,
 P. Viola and M. Jones. Rapid object detection using a boosted
cascade of simple features. In Proceedings of the 2001 IEEE
Computer Society Conference on Computer Vision and Pat-
tern Recognition. CVPR 2001, volume 1, pages I–I, 2001.
 Guoqing Wang, Hu Han, Shiguang Shan, and Xilin Chen.
Unsupervised adversarial domain adaptation for cross-
domain face presentation attack detection. IEEE Trans-
actions on Information Forensics and Security, 16:56–69,
 Jianwei Yang, Zhen Lei, and Stan Z Li. Learn convolu-
tional neural network for face anti-spoofing. arXiv preprint
 Jianwei Yang, Zhen Lei, Shengcai Liao, and Stan Z Li. Face
liveness detection with component dependent descriptor. In
2013 International Conference on Biometrics (ICB), pages
1–6. IEEE, 2013.
 Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao.
Joint face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Processing Letters,