Conference Paper

Cross-Domain Face Presentation Attack Detection via Multi-Domain Disentangled Representation Learning

... Hence, the developed experiments are divided into three groups based on the scale of the data available for training, following established evaluation protocols: triple-source (3 training datasets), double-source (2 training datasets) and single-source (1 training dataset). We perform five triple-source experiments (training dataset(s) → testing dataset): O&C&I → M, O&M&I → C, O&C&M → I, I&C&M → O, O&C&M → CA, following previous works [10,15,27,33,52]. For the double-source scenario, two cases are considered: M&I → C and M&I → O [15,16,38,59,60]. ...
... For the double-source scenario, two cases are considered: M&I → C and M&I → O [15,16,38,59,60]. The single-source scenario includes a set of twelve experiments where one of the M, C, I, and O datasets is used to train the network and the remaining three are separately used for testing, following previous works on cross-dataset PAD [15,27,46,52,53]. The SynthASpoof dataset is used to adapt models from the synthetic domain to the authentic domain. ...
Preprint
Full-text available
Although face recognition systems have seen a massive performance enhancement in recent years, they are still targeted by threats such as presentation attacks, leading to the need for generalizable presentation attack detection (PAD) algorithms. Current PAD solutions suffer from two main problems: low generalization to unknown scenarios and large training data requirements. Foundation models (FM) are pre-trained on extensive datasets, achieving remarkable results when generalizing to unseen domains and allowing for efficient task-specific adaptation even when little training data is available. In this work, we recognize the potential of FMs to address common PAD problems and tackle the PAD task with an adapted FM for the first time. The FM under consideration is adapted with LoRA weights while simultaneously training a classification header. The resultant architecture, FoundPAD, is highly generalizable to unseen domains, achieving competitive results in several settings under different data availability scenarios and even when using synthetic training data. To encourage reproducibility and facilitate further research in PAD, we publicly release the implementation of FoundPAD at https://github.com/gurayozgur/FoundPAD.
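As an illustration of the adaptation recipe this abstract describes, the following is a minimal sketch of LoRA-style fine-tuning with a trainable classification head; the module names, dimensions, and hyperparameters are hypothetical and not taken from the FoundPAD implementation:

```python
# Minimal LoRA sketch: freeze a pre-trained linear layer, add trainable
# low-rank updates, and train a small binary classification head.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # frozen foundation-model weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = Wx + b + scale * B(Ax); only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

backbone_layer = LoRALinear(nn.Linear(768, 768))   # e.g. one ViT projection
head = nn.Linear(768, 2)                           # bona fide vs. attack
logits = head(backbone_layer(torch.randn(8, 768)))
```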
... Domain generalization (DG) has emerged as a promising approach to address this challenge by leveraging source data to train models that generalize well to new, unseen domains. Several DG techniques have been developed to improve model generalization, including domain alignment [37,50], data augmentation [74,84], ensemble learning [45,54], and disentangled representation learning [58,72], among others. ...
... DG methods [73,87] aim to train models on source data in a way that enables them to generalize well to other out-of-distribution data. Existing DG approaches can be broadly categorized into four groups based on their methodology and motivation: (1) domain alignment [37,44,50,65,68], which measures the distance between distinct distributions and learns domain-invariant representations to enhance the robustness of the model; (2) data augmentation [12,74,84,85], which enhances the diversity of the training data to prevent the model from over-fitting; (3) ensemble learning [45,51,54,86], which trains multiple models and uses their ensemble to improve predictions and reduce bias; and (4) disentangled representation learning [58,59,72], which separates features into domain-variant and domain-invariant components, using the domain-invariant features for more robust predictions. Although these DG methods have achieved significant success in improving model generalization, they may still underperform on out-of-distribution data due to a lack of constraints on the sharpness of the loss landscape. ...
Preprint
Domain generalization (DG) aims to enhance the ability of models trained on source domains to generalize effectively to unseen domains. Recently, Sharpness-Aware Minimization (SAM) has shown promise in this area by reducing the sharpness of the loss landscape to obtain more generalized models. However, SAM and its variants sometimes fail to guide the model toward a flat minimum, and their training processes exhibit limitations, hindering further improvements in model generalization. In this paper, we first propose an improved model training process aimed at encouraging the model to converge to a flat minimum. To achieve this, we design a curvature metric that has minimal effect when the model is far from convergence but becomes increasingly influential in indicating the curvature of the minima as the model approaches a local minimum. We then derive a novel algorithm from this metric, called Meta Curvature-Aware Minimization (MeCAM), to minimize the curvature around local minima. Specifically, the optimization objective of MeCAM simultaneously minimizes the regular training loss, the surrogate gap of SAM, and the surrogate gap of meta-learning. We provide theoretical analysis of MeCAM's generalization error and convergence rate, and demonstrate its superiority over existing DG methods through extensive experiments on five benchmark DG datasets: PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet. Code will be available on GitHub.
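For context, the base SAM update that MeCAM extends can be sketched as follows; this shows only the standard two-step perturb-and-descend mechanism, not MeCAM's curvature or meta-learning terms:

```python
# Sketch of one SAM step: ascend to a nearby "sharp" point along the
# gradient, take the gradient there, undo the perturbation, then descend.
import torch

def sam_step(model, loss_fn, x, y, opt, rho=0.05):
    loss_fn(model(x), y).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / (norm + 1e-12))    # perturb toward the sharp point
    model.zero_grad()
    loss_fn(model(x), y).backward()             # gradient at perturbed weights
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / (norm + 1e-12))    # undo the perturbation
    opt.step()                                  # descend with the SAM gradient
    opt.zero_grad()

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sam_step(model, torch.nn.functional.cross_entropy,
         torch.randn(4, 10), torch.randint(0, 2, (4,)), opt)
```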
... More recent FAS works mainly focus on the cross-domain (cross-dataset) scenario, where the training and testing data are drawn from different distributions (Li et al., 2018a). To learn domain-invariant features, more advanced techniques are utilized, such as meta-learning (Shao et al., 2020; Qin et al., 2021), adversarial learning (Shao et al., 2019; Jia et al., 2020), disentanglement learning (Wang et al., 2020a; Wu et al., 2021), etc. Face Anti-Spoofing algorithms in other scenarios, such as domain adaptation (Huang et al., 2022; Li et al., 2018b; Wang et al., 2020; Cai et al., 2024b), continual learning, multi-modal learning (Lin et al., 2024), and multi-task learning, have also been studied. By contrast, the data side for FAS is relatively less explored. ...
... We follow prior work to abbreviate this protocol as MICO for short. Also, we follow (Wang et al., 2020a; Cai et al., 2022) ...
Article
Full-text available
Face Anti-Spoofing (FAS) research is challenged by the cross-domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model-centric, focusing on developing domain generalization algorithms for improving cross-domain performance, data-centric research for face anti-spoofing, improving generalization from data quality and quantity, is largely ignored. Therefore, our work starts with data-centric FAS by conducting a comprehensive investigation from the data perspective for improving cross-domain generalization of FAS models. More specifically, at first, based on physical procedures of capturing and recapturing, we propose task-specific FAS data augmentation (FAS-Aug), which increases data diversity by synthesizing data of artifacts, such as printing noise, color distortion, moiré pattern, etc. Our experiments show that using our FAS augmentation can surpass traditional image augmentation in training FAS models to achieve better cross-domain performance. Nevertheless, we observe that models may rely on the augmented artifacts, which are not environment-invariant, and using FAS-Aug may have a negative effect. As such, we propose Spoofing Attack Risk Equalization (SARE) to prevent models from relying on certain types of artifacts and improve the generalization performance. Last but not least, our proposed FAS-Aug and SARE with recent Vision Transformer backbones can achieve state-of-the-art performance on the FAS cross-domain generalization protocols. The implementation is available at https://github.com/RizhaoCai/FAS-Aug.
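As a rough illustration of one FAS-Aug-style artifact, the sketch below synthesizes a moiré-like interference pattern and blends it into an image; the frequencies, angles, and blending strength are invented for illustration and do not reflect the paper's physically grounded recapturing model:

```python
# Hedged sketch: overlay a synthetic moiré-like pattern built from two
# slightly misaligned sinusoidal gratings that interfere with each other.
import numpy as np

def add_moire(img: np.ndarray, freq=0.4, angle=0.6, strength=0.15):
    """img: HxWx3 float array in [0, 1]."""
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    g1 = np.sin(freq * (xx * np.cos(angle) + yy * np.sin(angle)))
    g2 = np.sin(freq * 1.05 * (xx * np.cos(angle + 0.05)
                               + yy * np.sin(angle + 0.05)))
    pattern = ((g1 * g2) * 0.5 + 0.5)[..., None]   # interference fringes
    return np.clip(img * (1 - strength) + pattern * strength, 0.0, 1.0)

spoof_like = add_moire(np.random.rand(256, 256, 3))
```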
... where the training and testing data are drawn from different distributions (Li, He, et al., 2018). To learn domain-invariant features, more advanced techniques are utilized, such as meta-learning (Qin et al., 2021; Shao, Lan, & Yuen, 2020; Yu, Wan, et al., 2021), adversarial learning (Jia et al., 2020; Shao et al., 2019), disentanglement learning (Y. G. Wang, Han, Shan, & Chen, 2020a; Wu, Zeng, Hu, Shi, & Mei, 2021), etc. Face Anti-Spoofing algorithms in other scenarios, such as domain adaptation (Huang et al., 2022; G. Wang, Han, Shan, & Chen, 2020b; J. Wang et al., 2021), continual learning, multi-modal learning (Lin et al., 2024), and multi-task learning, have also been studied. By contrast, the data side for FAS is relatively less explored. ...
... MFSD (M) (Wen et al., 2015), OULU-NPU (O) (Boulkenafet et al., 2017), NTU ROSE-YOUTU (Y), and SiW (S) (Y. Liu et al., 2018). Following (Shao et al., 2019), we utilize the leave-one-out cross-domain protocol (Shao et al., 2019), which uses the four datasets M, I, C, and O. We follow prior work to abbreviate this protocol as MICO for short. Also, we follow (G. Wang et al., 2020a) ...
Preprint
Face Anti-Spoofing (FAS) research is challenged by the cross-domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model-centric, focusing on developing domain generalization algorithms for improving cross-domain performance, data-centric research for face anti-spoofing, improving generalization from data quality and quantity, is largely ignored. Therefore, our work starts with data-centric FAS by conducting a comprehensive investigation from the data perspective for improving cross-domain generalization of FAS models. More specifically, at first, based on physical procedures of capturing and recapturing, we propose task-specific FAS data augmentation (FAS-Aug), which increases data diversity by synthesizing data of artifacts, such as printing noise, color distortion, moiré pattern, etc. Our experiments show that using our FAS augmentation can surpass traditional image augmentation in training FAS models to achieve better cross-domain performance. Nevertheless, we observe that models may rely on the augmented artifacts, which are not environment-invariant, and using FAS-Aug may have a negative effect. As such, we propose Spoofing Attack Risk Equalization (SARE) to prevent models from relying on certain types of artifacts and improve the generalization performance. Last but not least, our proposed FAS-Aug and SARE with recent Vision Transformer backbones can achieve state-of-the-art performance on the FAS cross-domain generalization protocols. The implementation is available at https://github.com/RizhaoCai/FAS_Aug.
... Concretely, we employ two networks to extract liveness and identity features separately. These features are then treated as dissimilar and expressed as orthogonal from a subspace perspective, instead of utilizing generative adversarial network and pixel reconstruction approaches like [29,39], which exhibit heavy computational overhead. To enhance the efficiency and scalability of our framework, we propose two plug-and-play modules: Style Cross (SC) and Channel-wise Style Attention (CWSA). ...
... [41,39] divide the representation of an image into content and liveness parts to solve FAS problems. Wang et al. [29] explicitly disentangle identity from liveness. Liu et al. [18] disentangle a spoof face into a live counterpart and a spoof trace, aiming to explicitly extract the spoof traces from faces. ...
Preprint
Face anti-spoofing techniques based on domain generalization have recently been studied widely. Adversarial learning and meta-learning techniques have been adopted to learn domain-invariant representations. However, prior approaches often consider the dataset gap as the primary factor behind domain shifts. This perspective is not fine-grained enough to reflect the intrinsic gap among the data accurately. In our work, we redefine domains based on identities rather than datasets, aiming to disentangle liveness and identity attributes. We mitigate the adverse effect of identity shift by learning identity-invariant liveness representations through orthogonalizing liveness and identity features. To cope with style shifts, we propose a Style Cross module to expand stylistic diversity and a Channel-wise Style Attention module to weaken sensitivity to style shifts, aiming to learn robust liveness representations. Furthermore, acknowledging the asymmetry between live and spoof samples, we introduce a novel contrastive loss, Asymmetric Augmented Instance Contrast. Extensive experiments on four public datasets demonstrate that our method achieves state-of-the-art performance under cross-dataset and limited source dataset scenarios. Additionally, our method has good scalability when expanding the diversity of identities. The code will be released soon.
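The orthogonality constraint described above can be sketched as a simple loss that penalizes correlation between liveness and identity features; the encoder outputs and dimensions here are assumed:

```python
# Minimal sketch: drive the cross-correlation between liveness and
# identity features toward zero so the two feature sets span
# (approximately) orthogonal subspaces.
import torch
import torch.nn.functional as F

def orthogonality_loss(f_live: torch.Tensor, f_id: torch.Tensor) -> torch.Tensor:
    """f_live, f_id: (batch, dim) features from the two encoders."""
    f_live = F.normalize(f_live, dim=1)
    f_id = F.normalize(f_id, dim=1)
    # squared Frobenius norm of the cross-correlation matrix
    cross = f_live.T @ f_id / f_live.shape[0]
    return (cross ** 2).sum()

loss = orthogonality_loss(torch.randn(32, 128), torch.randn(32, 128))
```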
... However, in most realistic scenarios, collecting sufficient target data for training is often impractical. To address this, some methods introduce domain generalization (DG) techniques, often utilizing domain labels to facilitate adversarial learning or disentanglement learning that removes domain-specific information (Shao et al., 2019; Jia et al., 2020; Wang et al., 2020; Huang et al., 2022). Despite improved generalization, they are still sensitive to varying face quality caused by various factors (Biometrics, 2020; Schlett et al., 2022), such as camera, imaging media, and illumination conditions, which exacerbate the distribution discrepancies (see the left panel of Fig. 1a) and further hinder the application of FAS. ...
... Moreover, they established a domain generalization benchmark on four FAS datasets (Zhang et al., 2012; Wen et al., 2015; Boulkenafet et al., 2017), which is now widely used. Considering that such domain-invariant features might still contain task-irrelevant factors (e.g., subject and capture device), Wang et al. (2020) disentangled the generalized features from subject-discriminative and domain-invariant features, and achieved better performance. Jia et al. (2020) proposed to learn a compact distribution for living faces, while spoofing faces were dispersed among domains but compact within each domain. ...
Article
Full-text available
Face Anti-Spoofing (FAS) plays a critical role in safeguarding face recognition systems, while previous FAS methods suffer from poor generalization when applied to unseen domains. Although recent methods have made progress via domain generalization technology, they are still sensitive to variations in face quality caused by task-irrelevant factors like camera and illumination. In this paper, we propose a novel Quality-Invariant Domain Generalization method (QIDG) with a teacher-student architecture, which aligns liveness features into a quality-invariant space to alleviate interference from task-irrelevant factors. Specifically, QIDG utilizes the teacher model to produce face quality representations, which serve as guidance for the student model to explore the quality-invariant space. To seek this space, the student model devises two novel modules, i.e., a dual adversarial learning module (DAL) and a quality feature assembly module (QFA). The former produces domain-invariant liveness features and task-irrelevant quality features, while the latter assembles these two features from the same faces into complete quality representations, as well as assembling these two features from living faces in different domains. In this way, QIDG not only achieves alignment of the domain-invariant liveness features to the quality-invariant space, but also promotes compactness of living faces from different domains in the feature space. Extensive cross-domain experiments demonstrate the superiority of our method on five public databases.
... On the other hand, multi-source methods utilize domain labels and often take advantage of the statistical differences in the sample distributions. Specifically, the most popular algorithms include data augmentation [14,95], which proves beneficial for regularizing over-parameterized neural networks and improving generalization; meta-learning [3,24,48,91], which exposes models to domain shifts during training; and disentangled representation learning [5,60,78,89], where models most commonly include modules that focus on decomposing learned representations into domain-specific and domain-invariant parts. Additionally, domain alignment [29,55,83] and causal representation learning algorithms [51,52] have also been proposed in the literature toward producing robust models that retain their generalization capabilities on unseen data. ...
Preprint
Full-text available
Domain Generalization (DG) research has gained considerable traction as of late, since the ability to generalize to unseen data distributions is a requirement that eludes even state-of-the-art training algorithms. In this paper we observe that the initial iterations of model training play a key role in domain generalization effectiveness, since the loss landscape may be significantly different across the training and test distributions, contrary to the case of i.i.d. data. Conflicts between gradients of the loss components of each domain lead the optimization procedure to undesirable local minima that do not capture the domain-invariant features of the target classes. We propose alleviating domain conflicts in model optimization, by iteratively annealing the parameters of a model in the early stages of training and searching for points where gradients align between domains. By discovering a set of parameter values where gradients are updated towards the same direction for each data distribution present in the training set, the proposed Gradient-Guided Annealing (GGA) algorithm encourages models to seek out minima that exhibit improved robustness against domain shifts. The efficacy of GGA is evaluated on five widely accepted and challenging image classification domain generalization benchmarks, where its use alone is able to establish highly competitive or even state-of-the-art performance. Moreover, when combined with previously proposed domain-generalization algorithms it is able to consistently improve their effectiveness by significant margins.
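The gradient-alignment signal that GGA searches for can be illustrated as follows; this computes only the pairwise cosine similarity between per-domain gradients, not the paper's full annealing procedure:

```python
# Sketch: measure how well the loss gradients of different source
# domains agree; high mean cosine similarity means the domains pull the
# parameters in the same direction.
import torch

def domain_gradient_alignment(model, loss_fn, domain_batches):
    """domain_batches: list of (x, y) pairs, one per source domain."""
    grads = []
    for x, y in domain_batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    sims = []
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            sims.append(torch.nn.functional.cosine_similarity(
                grads[i], grads[j], dim=0))
    return torch.stack(sims).mean()   # higher = domains agree on the update

model = torch.nn.Linear(10, 2)
align = domain_gradient_alignment(
    model, torch.nn.functional.cross_entropy,
    [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(3)])
```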
... Examples of these techniques include the use of pre-trained networks [9], fusion of pre-trained networks [10], attention models [11], [12], multichannel CNN [13], pixel-based supervision [14], and transformer models [15]. (f) Advanced machine-learning techniques, including domain adaptation [16], [17], [18], [19], [20], [21], self-supervised learning [22], and meta-learning [23], have been introduced to improve the generalization of face PAD. Despite considerable research on face PAD, the issue remains intricate due to the development of novel PAIs and the lack of robustness across image-capture quality. ...
Article
Full-text available
Face recognition systems that are commonly used in access control settings are vulnerable to presentation attacks, which pose a significant security risk. Therefore, it is crucial to develop a robust and reliable face presentation attack detection system that can automatically detect these types of attacks. In this paper, we present a novel technique called Point Cloud Graph Attention Network (PCGattnNet) to detect face presentation attacks using 3D point clouds captured from a smartphone. The innovative nature of the proposed technique lies in its ability to dynamically represent point clouds as graphs that effectively capture discriminant information, thereby facilitating robust presentation attack detection. To evaluate the efficacy of the proposed method, we introduce newly collected 3D face point clouds captured using two different smartphones. The newly collected dataset comprises bona fide samples from 100 unique data subjects and six different 3D face presentation attack instruments. Extensive experiments were conducted to evaluate the generalizability of the proposed and existing methods to unknown attack instruments. The outcomes of these experiments demonstrate the reliability of the proposed method for detecting unknown attack instruments.
... Most DG-based FAS works adopt the idea of feature alignment [16], [18], [20], making the FAS feature indistinguishable across multiple training datasets. Subsequent works introduce disentanglement methods, selecting task-irrelevant features, such as identity features [23], [26] or style features [25], and disentangling them from the FAS feature to reduce bias. Additionally, meta-learning is also an important research direction in FAS domain generalization. ...
Preprint
Current Face Anti-spoofing (FAS) models tend to make overly confident predictions even when encountering unfamiliar scenarios or unknown presentation attacks, which leads to serious potential risks. To solve this problem, we propose a Confidence Aware Face Anti-spoofing (CA-FAS) model, which is aware of its capability boundary, thus achieving reliable liveness detection within this boundary. To enable the CA-FAS to "know what it doesn't know", we propose to estimate its confidence during the prediction of each sample. Specifically, we build Gaussian distributions for both the live faces and the known attacks. The prediction confidence for each sample is subsequently assessed using the Mahalanobis distance between the sample and the Gaussians for the "known data". We further introduce the Mahalanobis distance-based triplet mining to optimize the parameters of both the model and the constructed Gaussians as a whole. Extensive experiments show that the proposed CA-FAS can effectively recognize samples with low prediction confidence and thus achieve much more reliable performance than other FAS models by filtering out samples that are beyond its reliable range.
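A minimal sketch of the confidence mechanism this abstract describes, assuming simple Gaussian fits to live and known-attack features; the triplet-mining optimization is omitted:

```python
# Sketch: fit Gaussians to live and known-attack features, then score a
# test sample by its minimum Mahalanobis distance to the "known data".
import numpy as np

def fit_gaussian(feats):
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis(x, mu, cov_inv):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

live_stats = fit_gaussian(np.random.randn(500, 64))     # live-face features
attack_stats = fit_gaussian(np.random.randn(500, 64))   # known-attack features
sample = np.random.randn(64)
confidence = -min(mahalanobis(sample, *live_stats),
                  mahalanobis(sample, *attack_stats))    # low = unfamiliar
```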
... Zhang et al. [1] introduced a training strategy called Lookahead, which involves weight interpolation, to explore flat minima for DG issues. In [32], two separate encoders were learned in an adversarial way to capture identity and domain information respectively. Low-rank decomposition on weight matrices is applied in [33] to identify features that are more generalizable. ...
Preprint
Fine-grained domain generalization (FGDG) is a more challenging task due to its small inter-class variations and relatively large intra-class disparities. When the domain distribution changes, the fragility of subtle features leads to a pronounced deterioration in model performance. Nevertheless, humans inherently demonstrate the capacity for generalizing to out-of-distribution data, leveraging structured multi-granularity knowledge that emerges from discerning both the commonality and specificity within categories. Likewise, we propose a Feature Structuralized Domain Generalization (FSDG) model, wherein features are structuralized into common, specific, and confounding segments, harmoniously aligned with their relevant semantic concepts, to elevate performance in FGDG. Specifically, feature structuralization (FS) is achieved through a decorrelation function on disentangled segments, constraints on common feature consistency, specific feature distinctiveness, and a prediction calibration operation across granularities. By imposing these stipulations, FSDG is prompted to disentangle and align features based on multi-granularity knowledge, facilitating robust subtle distinctions among categories. Extensive experimentation on three benchmarks consistently validates the superiority of FSDG over state-of-the-art counterparts, with an average improvement of 6.1% in terms of FGDG performance. Beyond that, the explainability analysis and experiments on various mainstream model architectures confirm the validity of FS.
Article
Full-text available
Face anti-spoofing (FAS) techniques are crucial for safeguarding face recognition systems against an ever-evolving landscape of spoofing attacks. While existing methods have made strides in detecting known attacks, robustness against unknown, potentially more sophisticated attack types remains a critical and unresolved challenge in the field. In this paper, we propose a novel framework integrating token-wise asymmetric contrastive learning with an angular margin loss to enhance the robustness of FAS models against unknown attack types. The key idea is to learn a feature space where live face features are densely distributed, whereas spoof face features are more dispersed. This is achieved through two novel strategies: (1) asymmetric contrastive learning, encouraging the FAS model to learn a compact distribution for live face features while relaxing the constraint on the distribution of spoof face features, and (2) token-wise learning, focusing on capturing intrinsic liveness cues from local regions rather than identity- or facial-related features. Additionally, an angular margin loss is incorporated to enhance the discriminative power of the learned features. Extensive experiments on public benchmark datasets demonstrate the superiority of our FAS model over state-of-the-art methods in cross-attack scenarios, showcasing its strong robustness to unknown attacks while maintaining unseen domain generalization capability.
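Strategy (1) above can be sketched as a simplified loss that compacts live features around a single center while only pushing spoof features away; this is a stand-in for the paper's token-wise formulation, with shapes and the margin chosen arbitrarily:

```python
# Sketch of the asymmetry: pull live features toward a compact cluster,
# only repel spoof features from it, and leave spoofs otherwise dispersed.
import torch
import torch.nn.functional as F

def asymmetric_live_compact_loss(feats, labels, margin=0.5):
    """feats: (batch, dim); labels: 1 = live, 0 = spoof."""
    feats = F.normalize(feats, dim=1)
    live = feats[labels == 1]
    spoof = feats[labels == 0]
    center = F.normalize(live.mean(dim=0, keepdim=True), dim=1)
    compact = (1 - (live @ center.T)).mean()            # pull live together
    push = F.relu((spoof @ center.T) - margin).mean()   # only repel spoof
    return compact + push

loss = asymmetric_live_compact_loss(
    torch.randn(16, 128), torch.tensor([1] * 8 + [0] * 8))
```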
Article
Recent Face Anti-Spoofing (FAS) methods have improved generalization to unseen domains by leveraging domain generalization techniques. However, they overlooked the semantic relationships between local features, resulting in suboptimal feature alignment and limited performance. To this end, pixel-wise supervision has been introduced to offer contextual guidance for better feature alignment. Unfortunately, the semantic ambiguity in coarsely designed pixel-wise supervision often leads to misalignment. This paper proposes a novel Dual Consistency Regularization Network (DCRN). It promotes the fine-grained alignment of local features with dense semantic correspondence for FAS. Specifically, a Dual Consistency Learning module (DCL) is devised to capture the inter- and intra-similarity between each region of sample pairs. In this module, a dual consistency regularization learning objective enhances the semantic consistency of local features by minimizing both the variance of inter-similarity and the distance between inter- and intra-similarity. Further, a weight matrix is estimated based on the inter-similarity, representing the possibility that each region belongs to the living class. Based on this weight matrix, WMSE loss is designed to guide the model in avoiding mapping the live regions to the spoofing class, thus alleviating semantic ambiguity in pixel-wise supervision. Extensive experiments on four widely used datasets clearly demonstrate the superiority and high generalization of the proposed DCRN.
Article
Full-text available
As one of the most crucial parts of face detection, the accuracy and generalization of face anti-spoofing are particularly important. We therefore propose a multi-branch network to improve the accuracy and generalization of the detection of unknown spoofing attacks. The branches consist of several frequency map encoders and one depth map encoder, trained together; the network leverages multiple frequency features and generates depth map features. High-frequency edge texture is beneficial for capturing moiré patterns, while low-frequency features are sensitive to color distortion. Depth maps are more discriminative than RGB images at the visual level and serve as useful auxiliary information. Supervised multi-view contrastive learning enhances multi-view feature learning. Moreover, a two-stage feature fusion method effectively integrates the multi-branch features. Experiments on four public datasets, namely CASIA-FASD, Replay-Attack, MSU-MFSD, and OULU-NPU, demonstrate the model's effectiveness. The average Half Total Error Rate (HTER) of our model is 4 percentage points (25% to 21%) lower than that of the Adversarial Domain Adaptation method in inter-set evaluations.
Article
Face recognition systems have raised concerns due to their vulnerability to different presentation attacks, and system security has become an increasingly critical concern. Although many face anti-spoofing (FAS) methods perform well in intra-dataset scenarios, their generalization remains a challenge. To address this issue, some methods adopt domain adversarial training (DAT) to extract domain-invariant features. In contrast, in this paper, we propose a domain adversarial attack (DAA) method that adds perturbations to the input images, making them indistinguishable across domains and enabling domain alignment. Moreover, since models trained on limited data and types of attacks cannot generalize well to unknown attacks, we propose a dual perceptual and generative knowledge distillation framework for face anti-spoofing that utilizes pre-trained face-related models containing rich face priors. Specifically, we adopt two different face-related models as teachers to transfer knowledge to the target student model. The pre-trained teacher models are not from the task of face anti-spoofing but from perceptual and generative tasks, respectively, which implicitly augments the data. By combining both DAA and dual-teacher knowledge distillation, we develop a dual teacher knowledge distillation with domain alignment framework (DTDA) for face anti-spoofing. The advantage of our proposed method has been verified through extensive ablation studies and comparison with state-of-the-art methods on public datasets across multiple protocols.
Article
Full-text available
Unsupervised domain adaptation-based face anti-spoofing methods have attracted more and more attention due to their promising generalization abilities. To mitigate domain bias, existing methods generally attempt to align the marginal distributions of samples from source and target domains. However, the label and pseudo-label information of the samples from the source and target domains is ignored. To solve this problem, this paper proposes a Weighted Joint Distribution Optimal Transport unsupervised multi-source domain adaptation method for cross-scenario face anti-spoofing (WJDOT-FAS). WJDOT-FAS consists of three modules: joint distribution estimation, joint distribution optimal transport, and domain weight optimization. Specifically, the joint distributions of the features and pseudo-labels of the multi-source and target domains are first estimated based on a pre-trained feature extractor and a randomly initialized classifier. Then, we compute the cost matrices and the optimal transportation mappings from the joint distributions related to each source domain and the target domain by solving Lp-L1 optimal transport problems. Finally, based on the loss functions of the different source domains, the target domain, and the optimal transportation losses from each source domain to the target domain, we estimate the weights of each source domain, and meanwhile, the parameters of the feature extractor and classifier are also updated. All the learnable parameters and the computations of the three modules are updated alternately. Extensive experimental results on four widely used 2D attack datasets and three recently published 3D attack datasets under both single- and multi-source domain adaptation settings (including both closed-set and open-set) show the advantages of our proposed method for cross-scenario face anti-spoofing.
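The optimal-transport core of this pipeline can be sketched with the POT library as below; the joint cost weighting, the Lp-L1 solver, and the alternating updates of the full method are simplified away, and all arrays are random placeholders:

```python
# Sketch: solve an exact OT problem over a joint feature/pseudo-label
# cost between one source domain and the target domain.
import numpy as np
import ot  # Python Optimal Transport (pip install pot)

src_feats = np.random.randn(100, 64)      # features from one source domain
tgt_feats = np.random.randn(120, 64)      # unlabeled target-domain features
src_labels = np.random.rand(100, 2)       # source one-hot labels
tgt_pseudo = np.random.rand(120, 2)       # classifier pseudo-labels

# joint cost: feature distance plus a label-disagreement term
C = (ot.dist(src_feats, tgt_feats, metric='sqeuclidean')
     + 10.0 * ot.dist(src_labels, tgt_pseudo, metric='sqeuclidean'))
a = np.full(100, 1 / 100)                 # uniform source weights
b = np.full(120, 1 / 120)                 # uniform target weights
plan = ot.emd(a, b, C)                    # exact transport plan
transport_cost = float((plan * C).sum())  # would feed the domain weighting
```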
Article
Deepfake techniques, represented by face swapping and face reenactment, can transfer the appearance and behavioral expressions of a face in one video image to another face in a different video. In recent years, with the advancement of deep learning techniques, deepfake technology has developed rapidly, achieving increasingly realistic effects. Therefore, many researchers have begun to study deepfake detection. However, most existing studies on deepfake detection are limited to binary classification of real and fake images, rather than identifying different methods in an open-world scenario, leading to failures in dealing with unknown deepfake categories in practice. In this paper, we propose an unsupervised domain adaptation method for fine-grained open-set deepfake detection. Our method first uses labeled data from the source domain for model pre-training to establish the ability to recognize the different deepfake methods of the source domain. Then, the method uses Network Memorization based Adaptive Clustering (NMAC) to cluster unlabeled images in the target domain and designs Pseudo-Label Generation (PLG) to generate virtual class labels for unknown deepfake categories by matching the adaptive clustering results with the known deepfake categories of the source domain. Finally, we retrain the initial multi-class deepfake detection model using labeled data from the source domain and pseudo-labeled data from the target domain to improve its generalization ability to unknown deepfake classes present in the target domain. We validate the effectiveness of the proposed method on multiple open-set fine-grained deepfake detection tasks based on three deepfake datasets (ForgeryNet, FaceForensics++, and FakeAVCeleb). Experimental results show that our method has better domain generalization ability than the state-of-the-art methods and achieves promising performance in fine-grained open-set deepfake detection.
Article
Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a face recognition system by presenting spoofed faces. State-of-the-art FAS techniques predominantly rely on deep learning models, but their cross-domain generalization capabilities are often hindered by the domain shift problem, which arises due to different distributions between training and testing data. In this study, we develop a generalized FAS method under the Efficient Parameter Transfer Learning (EPTL) paradigm, where we adapt pre-trained Vision Transformer models for the FAS task. During training, adapter modules are inserted into the pre-trained ViT model, and the adapters are updated while the other pre-trained parameters remain fixed. We identify a limitation of previous vanilla adapters: they are based on linear layers, which lack a spoofing-aware inductive bias and thus restrict cross-domain generalization. To address this limitation and achieve cross-domain generalized FAS, we propose a novel Statistical Adapter (S-Adapter) that gathers local discriminative and statistical information from localized token histograms. To further improve the generalization of the statistical tokens, we propose a novel Token Style Regularization (TSR), which aims to reduce domain style variance by regularizing Gram matrices extracted from tokens across different domains. Our experimental results demonstrate that our proposed S-Adapter and TSR provide significant benefits in both zero-shot and few-shot cross-domain testing, outperforming state-of-the-art methods on several benchmark tests. We will release the source code upon acceptance.
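Token Style Regularization as described, regularizing Gram matrices of tokens across domains, might look roughly like this; the token shapes and the variance-style penalty are assumptions, not the paper's exact formulation:

```python
# Sketch: compute a Gram matrix over token features per domain and
# penalize how much the Gram matrices differ across domains, which
# discourages domain-specific style statistics.
import torch

def token_style_regularization(domain_tokens):
    """domain_tokens: list of (batch, tokens, dim) tensors, one per domain."""
    grams = []
    for t in domain_tokens:
        t = t.reshape(-1, t.shape[-1])          # flatten batch and tokens
        grams.append((t.T @ t) / t.shape[0])    # (dim, dim) Gram matrix
    grams = torch.stack(grams)
    return ((grams - grams.mean(dim=0, keepdim=True)) ** 2).mean()

loss_tsr = token_style_regularization(
    [torch.randn(4, 196, 256) for _ in range(3)])   # three source domains
```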
Article
Full-text available
Face presentation attack detection (PAD) plays a pivotal role in securing face recognition systems against spoofing attacks. Although great progress has been made in designing face PAD methods, developing a model that can generalize well to unseen test domains remains a significant challenge. Moreover, due to the different types of spoofing attacks, creating a dataset with a sufficient number of samples for training deep neural networks is a laborious task. This work proposes a comprehensive solution that combines synthetic data generation and deep ensemble learning to enhance the generalization capabilities of face PAD. Specifically, synthetic data is generated by blending a static image with spatiotemporal-encoded images using alpha composition and video distillation. In this way, we simulate motion blur with varying alpha values, thereby generating diverse subsets of synthetic data that contribute to a more enriched training set. Furthermore, multiple base models are trained on each subset of synthetic data using stacked ensemble learning. This allows the models to learn complementary features and representations from different synthetic subsets. The meta-features generated by the base models are used as input for a new model called the meta-model. The latter combines the predictions from the base models, leveraging their complementary information to better handle unseen target domains and enhance overall performance. Experimental results from seven datasets—WMCA, CASIA-SURF, OULU-NPU, CASIA-MFSD, Replay-Attack, MSU-MFSD, and SiW-Mv2—highlight the potential to enhance presentation attack detection by using large-scale synthetic data and a stacking-based ensemble approach.
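The alpha-composition step described above can be sketched as follows, using a simple temporal mean as a stand-in for the paper's video distillation; the alpha values are illustrative:

```python
# Sketch: alpha-blend a static frame with a spatiotemporally encoded
# image to simulate motion blur at varying strengths, yielding diverse
# synthetic training subsets.
import numpy as np

def blend_subset(static_img, video_frames, alpha):
    """static_img: HxWx3 in [0,1]; video_frames: TxHxWx3 in [0,1]."""
    temporal = video_frames.mean(axis=0)   # crude spatiotemporal encoding
    return np.clip(alpha * static_img + (1 - alpha) * temporal, 0.0, 1.0)

frames = np.random.rand(16, 128, 128, 3)
subsets = {a: blend_subset(frames[0], frames, a) for a in (0.3, 0.5, 0.7)}
```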
Preprint
This paper presents a novel perspective for enhancing anti-spoofing performance in zero-shot data domain generalization. Unlike traditional image classification tasks, face anti-spoofing datasets display unique generalization characteristics, necessitating novel zero-shot data domain generalization. Going a step beyond previous frame-wise spoofing prediction, we introduce a nuanced metric calculation that aggregates frame-level probabilities into a video-wise prediction, to tackle the gap between reported frame-wise accuracy and instability in real-world use cases. This approach enables the quantification of bias and variance in model predictions, offering a more refined analysis of model generalization. Our investigation reveals that simply scaling up the backbone of models does not inherently improve the mentioned instability, leading us to propose an ensembled backbone method from a Bayesian perspective. The probabilistically ensembled backbone improves both model robustness, as measured by the proposed metric, and spoofing accuracy, and also leverages the advantages of measuring uncertainty, allowing for enhanced sampling during training that contributes to model generalization across new datasets. We evaluate the proposed method on the benchmark OMIC dataset as well as the public CelebA-Spoof and SiW-Mv2. Our final model outperforms existing state-of-the-art methods across the datasets, showcasing advancements in Bias, Variance, HTER, and AUC metrics.
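The frame-to-video aggregation this abstract motivates can be sketched minimally as below; the paper's actual metric and Bayesian ensembling are not reproduced:

```python
# Sketch: average frame-level spoof probabilities into one video-wise
# prediction, and use the spread of frame scores as an instability signal.
import numpy as np

def video_prediction(frame_probs, threshold=0.5):
    """frame_probs: (num_frames,) spoof probabilities for one video."""
    video_score = frame_probs.mean()    # aggregate frames into one score
    instability = frame_probs.var()     # high variance = unstable predictions
    return int(video_score > threshold), video_score, instability

label, score, var = video_prediction(np.random.rand(30))
```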
Article
Full-text available
Although the generalization of face anti-spoofing (FAS) is of increasing concern, solving it based on Vision Transformers (ViT) is still in its initial stage. In this paper, we present a cross-domain FAS framework, dubbed the Transformer with dual Cross-Attention and semi-fixed Mixture-of-Experts (CA-MoEiT), for stimulating the generalization of FAS from three aspects: (1) Feature augmentation. We insert a MixStyle after the PatchEmbed layer to synthesize diverse patch embeddings from novel domains and enhance the generalizability of the trained model. (2) Feature alignment. We design a dual cross-attention mechanism which extends self-attention to align the common representation from multiple domains. (3) Feature complement. We design a semi-fixed MoE (SFMoE) to selectively replace the MLP by introducing a fixed super expert. Benefiting from the gate mechanism in SFMoE, professional experts are adaptively activated to independently learn domain-specific information, which is used as a supplement to the domain-invariant features learned by the super expert to further improve generalization. Importantly, the above three techniques are compatible with any ViT variant as plug-and-play modules. Extensive experiments show that the proposed CA-MoEiT is effective and outperforms state-of-the-art methods on several public datasets.
Preprint
Full-text available
Protecting digital identities of human faces from various attack vectors is paramount, and face anti-spoofing plays a crucial role in this endeavor. Current approaches primarily focus on detecting spoofing attempts within individual frames to detect presentation attacks. However, the emergence of hyper-realistic generative models capable of real-time operation has heightened the risk of digitally generated attacks. In light of these evolving threats, this paper aims to address two key aspects. First, it sheds light on the vulnerabilities of state-of-the-art face anti-spoofing methods against digital attacks. Second, it presents a comprehensive taxonomy of common threats encountered in face anti-spoofing systems. Through a series of experiments, we demonstrate the limitations of current face anti-spoofing detection techniques and their failure to generalize to novel digital attack scenarios. Notably, the existing models struggle with digital injection attacks, including adversarial noise, realistic deepfake attacks, and digital replay attacks. To aid in the design and implementation of robust face anti-spoofing systems resilient to these emerging vulnerabilities, the paper proposes key design principles from model accuracy and robustness to pipeline robustness and even platform robustness. In particular, we suggest implementing proactive face anti-spoofing systems using active sensors to significantly reduce the risks of unseen attack vectors and improve the user experience.
Conference Paper
Full-text available
Face anti-spoofing detection is a crucial procedure in biometric face recognition systems. State-of-the-art approaches based on Convolutional Neural Networks (CNNs) present good results in this field. However, previous works focus on single-modal data with a limited number of subjects. The recently published CASIA-SURF dataset is the largest dataset, consisting of 1000 subjects and 21000 video clips with 3 modalities (RGB, Depth and IR). In this paper, we propose a multi-stream CNN architecture called FaceBagNet to make full use of this data. The input of FaceBagNet is patch-level images, which helps extract spoof-specific discriminative information. In addition, to prevent overfitting and to better learn the fused features, we design a Modal Feature Erasing (MFE) operation on the multi-modal features, which erases features from one randomly selected modality during training. As a result, our approach won second place in the CVPR 2019 ChaLearn Face Anti-spoofing Attack Detection challenge. Our final submission achieves a score of 99.8052% (TPR@FPR = 10e-4) on the test set.
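Modal Feature Erasing as described, zeroing the features of one randomly chosen modality during training, can be sketched like this; the feature shapes and the fusion step are assumed:

```python
# Sketch: randomly erase one modality's features so the fusion layers
# cannot over-rely on any single stream.
import random
import torch

def modal_feature_erasing(feats_by_modality, training=True):
    """feats_by_modality: dict like {'rgb': T, 'depth': T, 'ir': T}."""
    if training:
        erased = random.choice(list(feats_by_modality))
        feats_by_modality = {
            k: torch.zeros_like(v) if k == erased else v
            for k, v in feats_by_modality.items()
        }
    return torch.cat(list(feats_by_modality.values()), dim=1)

fused = modal_feature_erasing(
    {m: torch.randn(8, 128) for m in ('rgb', 'depth', 'ir')})
```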
Conference Paper
Full-text available
Face anti-spoofing is essential to prevent face recognition systems from a security breach. Much of the progress has been made possible by the availability of face anti-spoofing benchmark datasets in recent years. However, existing face anti-spoofing benchmarks have a limited number of subjects (≤ 170) and modalities (≤ 2), which hinders the further development of the academic community. To facilitate face anti-spoofing research, we introduce a large-scale multi-modal dataset, namely CASIA-SURF, which is the largest publicly available dataset for face anti-spoofing in terms of both subjects and visual modalities. Specifically, it consists of 1,000 subjects with 21,000 videos, and each sample has 3 modalities (i.e., RGB, Depth and IR). We also provide a measurement set, evaluation protocol and training/validation/testing subsets, developing a new benchmark for face anti-spoofing. Moreover, we present a new multi-modal fusion method as a baseline, which performs feature re-weighting to select the more informative channel features while suppressing the less useful ones for each modality. Extensive experiments have been conducted on the proposed dataset to verify its significance and generalization capability. The dataset is available at https://sites.google.com/qq.com/chalearnfacespoofingattackdete/.
Conference Paper
Full-text available
Face recognition (FR) is being widely used in many applications, from access control to smartphone unlock. As a result, face presentation attack detection (PAD) has drawn increasing attention to secure FR systems. Traditional approaches for PAD mainly assume that training and testing scenarios are similar in imaging conditions (illumination, scene, camera sensor, etc.), and thus may lack good generalization capability to new application scenarios. In this work, we propose an end-to-end learning approach to improve PAD generalization capability by utilizing prior knowledge from the source domain via adversarial domain adaptation. We first build a source-domain PAD model optimized with triplet loss. Subsequently, we perform adversarial domain adaptation w.r.t. the target domain to learn a shared embedding space for both the source and target domain models, in which the discriminator cannot reliably predict whether a sample is from the source or target domain. Finally, PAD in the target domain is performed with a k-nearest neighbors (k-NN) classifier in the embedding space. The proposed approach shows promising generalization capability in a number of public-domain face PAD databases.
Conference Paper
Full-text available
Face presentation attack detection (PAD) has drawn increasing attention to secure face recognition (FR) systems, which are being widely used in many applications from access control to smartphone unlock. Traditional approaches for PAD may lack good generalization capability to new application scenarios due to the limited number of subjects and data modalities. In this work, we propose an end-to-end multi-modal fusion approach via spatial and channel attention to improve PAD performance on CASIA-SURF. Specifically, we first build four branches integrated with spatial and channel attention modules to obtain uniform features of the different modalities, i.e., RGB, Depth, IR, and a fused 9-channel modality obtained by concatenating the three modalities. Subsequently, the features extracted from the four branches are concatenated and fed into shared layers to learn more discriminative features from the fusion perspective. Finally, we obtain classification confidence scores for attack detection. The entire network is optimized jointly with center loss and softmax loss, using the SGRD solver to update the parameters. The proposed approach shows promising results on the CASIA-SURF dataset.
Conference Paper
Full-text available
In this paper, we tackle the problem of domain generalization: how to learn a generalized feature representation for an "unseen" target domain by taking advantage of multiple seen source-domain data. We present a novel framework based on adversarial autoencoders to learn a generalized latent feature representation across domains for domain generalization. To be specific, we extend adversarial autoencoders by imposing the Maximum Mean Discrepancy (MMD) measure to align the distributions among different domains, and matching the aligned distribution to an arbitrary prior distribution via adversarial feature learning. In this way, the learned feature representation is supposed to be universal to the seen source domains because of the MMD regularization, and is expected to generalize well on the target domain because of the introduction of the prior distribution. We propose an algorithm to jointly train the different components of our framework. Extensive experiments on various vision tasks demonstrate that our proposed framework can learn better generalized features for the unseen target domain compared with state-of-the-art domain generalization methods.
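The MMD regularizer at the heart of this framework can be sketched as a simple (biased) RBF-kernel estimate between two feature batches; the adversarial autoencoder machinery is omitted:

```python
# Sketch: RBF-kernel Maximum Mean Discrepancy between feature batches
# drawn from two domains; smaller values mean better-aligned distributions.
import torch

def rbf_mmd(x, y, sigma=1.0):
    """x: (n, d), y: (m, d) feature batches from two domains."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

mmd = rbf_mmd(torch.randn(64, 128), torch.randn(64, 128))
```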
Article
Full-text available
In this work, we propose a novel framework leveraging the advantages of the representational ability of deep learning and domain generalization for face spoofing detection. In particular, the generalized deep feature representation is achieved by taking both spatial and temporal information into consideration, and a 3D Convolutional Neural Network (3D CNN) architecture tailored for the spatial-temporal input is proposed. The network is first initialized by training with augmented facial samples based on cross-entropy loss and further enhanced with a specifically designed generalization loss, which coherently serves as the regularization term. The training samples from different domains can seamlessly work together for learning the generalized feature representation by manipulating their feature distribution distances. We evaluate the proposed framework with different experimental setups using various databases. Experimental results indicate that our method can learn more discriminative and generalized information compared with the state-of-the-art methods.
Article
Full-text available
Face anti-spoofing is the crucial step to prevent face recognition systems from a security breach. Previous deep learning approaches formulate face anti-spoofing as a binary classification problem. Many of them struggle to grasp adequate spoofing cues and generalize poorly. In this paper, we argue the importance of auxiliary supervision to guide the learning toward discriminative and generalizable cues. A CNN-RNN model is learned to estimate the face depth with pixel-wise supervision, and to estimate rPPG signals with sequence-wise supervision. Then we fuse the estimated depth and rPPG to distinguish live vs. spoof faces. In addition, we introduce a new face anti-spoofing database that covers a large range of illumination, subject, and pose variations. Experimental results show that our model achieves the state-of-the-art performance on both intra-database and cross-database testing.
Conference Paper
Full-text available
The face image is the most accessible biometric modality and is used for highly accurate face recognition systems, while it is vulnerable to many different types of presentation attacks. Face anti-spoofing is a very critical step before feeding the face image to biometric systems. In this paper, we propose a novel two-stream CNN-based approach for face anti-spoofing, by extracting local features and holistic depth maps from the face images. The local features facilitate CNN to discriminate the spoof patches independent of the spatial face areas. On the other hand, the holistic depth map examines whether the input image has a face-like depth. Extensive experiments are conducted on challenging databases (CASIA-FASD, MSU-USSA, and Replay-Attack), with comparison to the state of the art.
Conference Paper
Full-text available
The vulnerabilities of face-based biometric systems to presentation attacks have been finally recognized, yet we lack generalized software-based face presentation attack detection (PAD) methods performing robustly in practical mobile authentication scenarios. This is mainly due to the fact that the existing public face PAD datasets are beginning to cover a variety of attack scenarios and acquisition conditions, but their standard evaluation protocols do not encourage researchers to assess the generalization capabilities of their methods across these variations. In this work, we introduce a new public face PAD database, OULU-NPU, aiming at evaluating the generalization of PAD methods in more realistic mobile authentication scenarios across three covariates: unknown environmental conditions (namely illumination and background scene), acquisition devices and presentation attack instruments (PAI). This publicly available database consists of 5940 videos corresponding to 55 subjects recorded in three different environments using the high-resolution frontal cameras of six different smartphones. The high-quality print and video-replay attacks were created using two different printers and two different display devices. Each of the four unambiguously defined evaluation protocols introduces at least one previously unseen condition to the test set, which enables a fair comparison of the generalization capabilities of new and existing approaches. The baseline results using a color-texture-analysis-based face PAD method demonstrate the challenging nature of the database.
Conference Paper
Full-text available
The large pose discrepancy between two face images is one of the key challenges in face recognition. Conventional approaches for pose-invariant face recognition either perform face frontalization on, or learn a pose-invariant representation from, a non-frontal face image. We argue that it is more desirable to perform both tasks jointly to allow them to leverage each other. To this end, this paper proposes Disentangled Representation learning-Generative Adversarial Network (DR-GAN) with three distinct novelties. First, the encoder-decoder structure of the generator allows DR-GAN to learn a generative and discriminative representation, in addition to image synthesis. Second, this representation is explicitly disentangled from other face variations such as pose, through the pose code provided to the decoder and pose estimation in the discriminator. Third, DR-GAN can take one or multiple images as the input, and generate one unified representation along with an arbitrary number of synthetic images. Quantitative and qualitative evaluation on both controlled and in-the-wild databases demonstrate the superiority of DR-GAN over the state of the art.
Chapter
Full-text available
The aim of this paper is to give an overview of domain adaptation and transfer learning with a specific view on visual applications. After a general motivation, we first position domain adaptation within the larger transfer learning problem. Second, we briefly address and analyze the state-of-the-art methods for different types of scenarios, first describing the historical shallow methods, addressing both homogeneous and heterogeneous domain adaptation. Third, we discuss the effect of the success of deep convolutional architectures, which led to a new type of domain adaptation methods that integrate adaptation within the deep architecture. Fourth, we overview methods that go beyond image categorization, such as object detection, image segmentation, video analysis, and learning visual attributes. Finally, we conclude the paper with a section relating domain adaptation to other machine learning solutions.
Article
Full-text available
The vulnerabilities of face biometric authentication systems to spoofing attacks have received significant attention in recent years. Some of the proposed countermeasures have achieved impressive results when evaluated in intra-tests, i.e., the system is trained and tested on the same database. Unfortunately, most of these techniques fail to generalize well to unseen attacks, e.g., when the system is trained on one database and then evaluated on another database. This is a major concern in biometric anti-spoofing research that is mostly overlooked. In this paper, we propose a novel solution based on describing the facial appearance by applying Fisher Vector encoding on Speeded-Up Robust Features (SURF) extracted from different color spaces. The evaluation of our countermeasure on three challenging benchmark face spoofing databases, namely the CASIA Face Anti-Spoofing Database, the Replay-Attack Database and the MSU Mobile Face Spoof Database, showed excellent and stable performance across all three datasets. Most importantly, in inter-database tests, our proposed approach outperforms the state of the art and yields very promising generalization capabilities, even when only limited training data is used.
Conference Paper
Full-text available
With the wide application of user authentication based on face recognition, face spoof attacks against face recognition systems are drawing increasing attention. While emerging approaches to face anti-spoofing have been reported in recent years, most of them are limited to non-realistic intra-database testing scenarios rather than cross-database testing scenarios. We propose a robust representation integrating deep texture features and a face movement cue, such as eye-blink, as countermeasures for presentation attacks like photos and replays. We learn deep texture features from both aligned facial images and whole frames, and use a frame-difference-based approach for eye-blink detection. A face video clip is classified as live if it is categorized as live using both cues. Cross-database testing on public-domain face databases shows that the proposed approach significantly outperforms the state-of-the-art.
Article
Full-text available
In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and link them to corresponding entity keys in a knowledge base. More specifically, we propose a benchmark task to recognize one million celebrities from their face images, by using all the possibly collected face images of this individual on the web as training data. The rich information provided by the knowledge base helps to conduct disambiguation and improve the recognition accuracy, and contributes to various real-world applications, such as image captioning and news video analysis. Associated with this task, we design and provide concrete measurement set, evaluation protocol, as well as training data. We also present in details our experiment setup and report promising baseline results. Our benchmark task could lead to one of the largest classification problems in computer vision. To the best of our knowledge, our training dataset, which contains 10M images in version 1, is the largest publicly available one in the world.
Article
Full-text available
Research on non-intrusive software-based face spoofing detection schemes has mainly focused on the analysis of the luminance information of face images, hence discarding the chroma component, which can be very useful for discriminating fake faces from genuine ones. This paper introduces a novel and appealing approach to detecting face spoofing using colour texture analysis. We exploit the joint colour-texture information from the luminance and the chrominance channels by extracting complementary low-level feature descriptions from different colour spaces. More specifically, the feature histograms are computed over each image band separately. Extensive experiments on the three most challenging benchmark data sets, namely the CASIA Face Anti-Spoofing Database, the Replay-Attack Database, and the MSU Mobile Face Spoof Database, showed excellent results compared with the state of the art. More importantly, unlike most of the methods proposed in the literature, our proposed approach is able to achieve stable performance across all three benchmark data sets. The promising results of our cross-database evaluation suggest that the facial colour texture representation is more stable in unknown conditions than its gray-scale counterpart.
Conference Paper
Full-text available
Research on face spoofing detection has mainly focused on analyzing the luminance of face images, hence discarding the chrominance information, which can be useful for discriminating fake faces from genuine ones. In this work, we propose a new face anti-spoofing method based on color texture analysis. We analyze the joint color-texture information from the luminance and the chrominance channels using a color local binary pattern descriptor. More specifically, the feature histograms are extracted from each image band separately. Extensive experiments on two benchmark datasets, namely the CASIA Face Anti-Spoofing and Replay-Attack databases, showed excellent results compared to the state of the art. Most importantly, our inter-database evaluation demonstrates that the proposed approach has very promising generalization capabilities.
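A minimal sketch of per-band color LBP histograms follows, assuming OpenCV and scikit-image; the use of YCbCr and HSV spaces reflects the description above, while the (P, R) parameters are illustrative.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def color_lbp_histogram(bgr, P=8, R=1):
    """Concatenate uniform-LBP histograms computed separately on each
    band of the YCbCr and HSV representations of a face crop."""
    bands = list(cv2.split(cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb))) + \
            list(cv2.split(cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)))
    hists = []
    for band in bands:
        codes = local_binary_pattern(band, P, R, method='uniform')
        h, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        hists.append(h)
    return np.concatenate(hists)   # fed to an SVM in a typical pipeline
```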
Article
Full-text available
User authentication is an important step to protect information, and in this context, face biometrics is potentially advantageous. Face biometrics is natural, intuitive, easy to use, and less intrusive than many alternatives. Unfortunately, recent work has revealed that face biometrics is vulnerable to spoofing attacks using cheap, low-tech equipment. This paper introduces a novel and appealing approach to detecting face spoofing using the spatiotemporal (dynamic texture) extensions of the highly popular local binary pattern operator. The key idea of the approach is to learn and detect the structure and the dynamics of the facial micro-textures that characterise real faces but not fake ones. We evaluated the approach with two publicly available databases (Replay-Attack Database and CASIA Face Anti-Spoofing Database). The results show that our approach performs better than state-of-the-art techniques following the provided evaluation protocols of each database.
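Full dynamic-texture LBP (LBP-TOP) pools codes over every position in three orthogonal planes; the simplified sketch below is only an approximation that computes LBP histograms on the central XY, XT and YT planes of a grayscale video volume.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top_sketch(volume, P=8, R=1):
    """Simplified LBP-TOP: LBP histograms from the central XY, XT and YT
    planes of a grayscale video volume with shape (T, H, W)."""
    T, H, W = volume.shape
    planes = [volume[T // 2],          # XY: appearance
              volume[:, H // 2, :],    # XT: horizontal motion texture
              volume[:, :, W // 2]]    # YT: vertical motion texture
    hists = []
    for p in planes:
        codes = local_binary_pattern(p.astype(float), P, R, method='uniform')
        h, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        hists.append(h)
    return np.concatenate(hists)
```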
Article
Full-text available
Automatic face recognition is now widely used in applications ranging from deduplication of identity to authentication of mobile payment. This popularity of face recognition has raised concerns about face spoof attacks (also known as biometric sensor presentation attacks), where a photo or video of an authorized person’s face could be used to gain access to facilities or services. While a number of face spoof detection techniques have been proposed, their generalization ability has not been adequately addressed. We propose an efficient and rather robust face spoof detection algorithm based on image distortion analysis (IDA). Four different features (specular reflection, blurriness, chromatic moment, and color diversity) are extracted to form the IDA feature vector. An ensemble classifier, consisting of multiple SVM classifiers trained for different face spoof attacks (e.g., printed photo and replayed video), is used to distinguish between genuine (live) and spoof faces. The proposed approach is extended to multiframe face spoof detection in videos using a voting-based scheme. We also collect a face spoof database, MSU mobile face spoofing database (MSU MFSD), using two mobile devices (Google Nexus 5 and MacBook Air) with three types of spoof attacks (printed photo, replayed video with iPhone 5S, and replayed video with iPad Air). Experimental results on two public-domain face spoof databases (Idiap REPLAY-ATTACK and CASIA FASD), and the MSU MFSD database show that the proposed approach outperforms the state-of-the-art methods in spoof detection. Our results also highlight the difficulty in separating genuine and spoof faces, especially in cross-database and cross-device scenarios.
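Two of the four IDA cues can be approximated as below; the concrete measures (Laplacian-variance blurriness, HSV channel moments) are stand-ins for the paper's exact definitions, not its implementation.

```python
import cv2
import numpy as np

def blurriness(gray):
    """Blurriness cue: low variance of the Laplacian suggests a
    recaptured, defocused spoof image (a common proxy measure)."""
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def chromatic_moments(bgr):
    """Chromatic moment cue: mean, std and skewness of the hue and
    saturation channels, which tend to shift after print/replay recapture."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float64)
    feats = []
    for c in (0, 1):                        # hue and saturation
        ch = hsv[:, :, c].ravel()
        mu, sd = ch.mean(), ch.std() + 1e-12
        feats += [mu, sd, np.mean(((ch - mu) / sd) ** 3)]
    return np.asarray(feats)

# The full IDA vector adds specular-reflection and color-diversity features
# and feeds an ensemble of per-attack-type SVMs.
```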
Article
Full-text available
Despite some progress, hand-crafted texture features, e.g., LBP [23] and LBP-TOP [11], are still unable to capture the most discriminative cues between genuine and fake faces. In this paper, instead of designing features by hand, we rely on a deep convolutional neural network (CNN) to learn highly discriminative features in a supervised manner. Combined with some data pre-processing, the face anti-spoofing performance improves drastically. In the experiments, over a 70% relative decrease in Half Total Error Rate (HTER) is achieved on two challenging datasets, CASIA [36] and REPLAY-ATTACK [7], compared with the state of the art. Meanwhile, the experimental results from inter-tests between the two datasets indicate that the CNN can obtain features with better generalization ability. Moreover, nets trained using combined data from the two datasets show less bias between them.
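A minimal supervised setup in this spirit is sketched below, assuming a recent torchvision; the backbone, two-class head and hyperparameters are stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

# Two-class (live vs. spoof) head on an ImageNet-pretrained backbone;
# the original work trained its own CNN, so this is only analogous.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):            # images: N x 3 x 224 x 224
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```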
Conference Paper
Full-text available
The face recognition community has finally started paying more attention to the long-neglected problem of spoofing attacks, and the number of countermeasures is gradually increasing. Fairly good results have been reported on the publicly available databases, but it is reasonable to assume that no single anti-spoofing technique is superior, owing to the varying nature of attack scenarios and acquisition conditions. Therefore, we propose to approach face spoofing as a set of attack-specific subproblems that are solvable with a proper combination of complementary countermeasures. Inspired by how humans can perform reliable spoofing detection based only on the available scene and context information, this work provides the first investigation in the research literature that attempts to detect the presence of a spoofing medium in the observed scene. We experiment with two publicly available databases consisting of several fake face attacks of different nature under varying conditions and imaging qualities. The experiments show excellent results beyond the state of the art. More importantly, our cross-database evaluation shows that the proposed approach has promising generalization capabilities.
Conference Paper
Full-text available
Current face biometric systems are vulnerable to spoofing attacks. A spoofing attack occurs when a person tries to masquerade as someone else by falsifying data and thereby gaining illegitimate access. Inspired by image quality assessment, characterization of printing artifacts, and differences in light reflection, we propose to approach the problem of spoofing detection from a texture analysis point of view. Indeed, face prints usually contain printing quality defects that can be well detected using texture features. Hence, we present a novel approach based on analyzing facial image textures to detect whether there is a live person in front of the camera or a face print. The proposed approach analyzes the texture of the facial images using multi-scale local binary patterns (LBP). Compared to many previous works, our proposed approach is robust, computationally fast, and does not require user cooperation. In addition, the texture features used for spoofing detection can also be used for face recognition. This provides a unique feature space for coupling spoofing detection and face recognition. Extensive experimental analysis on a publicly available database showed excellent results compared to existing works.
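A sketch of multi-scale LBP histogram extraction follows; the (P, R) scale set is an assumption rather than the paper's exact operator configuration.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def multiscale_lbp(gray, scales=((8, 1), (16, 2), (24, 3))):
    """Concatenated uniform-LBP histograms at several (P, R) scales,
    capturing printing artifacts at different spatial frequencies."""
    hists = []
    for P, R in scales:
        codes = local_binary_pattern(gray, P, R, method='uniform')
        h, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        hists.append(h)
    return np.concatenate(hists)   # classified with an SVM in the paper
```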
Article
Full-text available
A common technique to bypass 2-D face recognition systems is to use photographs of spoofed identities. Unfortunately, research into countermeasures to this type of attack has not kept up: even though such threats have been known for nearly a decade, there seems to be no consensus on best practices, techniques, or protocols for developing and testing spoofing detectors for face recognition. We attribute this delay, in part, to the unavailability of public databases and protocols for studying solutions and comparing results. To this end, we introduce the publicly available PRINT-ATTACK database and exemplify how to use its companion protocol with a motion-based algorithm that detects correlations between the person's head movements and the scene context. The results are to be used as a basis for comparison with other countermeasure techniques. The PRINT-ATTACK database contains 200 videos of real accesses and 200 videos of spoof attempts using printed photographs of 50 different identities.
Article
The explosive growth of digital images in video surveillance and social media has led to the significant need for efficient search of persons of interest in law enforcement and forensic applications. Despite tremendous progress in primary biometric traits (e.g., face and fingerprint) based person identification, a single biometric trait alone can not meet the desired recognition accuracy in forensic scenarios. Tattoos, as one of the important soft biometric traits, have been found to be valuable for assisting in person identification. However, tattoo search in a large collection of unconstrained images remains a difficult problem, and existing tattoo search methods mainly focus on matching cropped tattoos, which is different from real application scenarios. To close the gap, we propose an efficient tattoo search approach that is able to learn tattoo detection and compact representation jointly in a single convolutional neural network (CNN) via multi-task learning. While the features in the backbone network are shared by both tattoo detection and compact representation learning, individual latent layers of each sub-network optimize the shared features toward the detection and feature learning tasks, respectively. We resolve the small batch size issue inside the joint tattoo detection and compact representation learning network via random image stitch and preceding feature buffering. We evaluate the proposed tattoo search system using multiple public-domain tattoo benchmarks, and a gallery set with about 300K distracter tattoo images compiled from these datasets and images from the Internet. In addition, we also introduce a tattoo sketch dataset containing 300 tattoos for sketch-based tattoo search. Experimental results show that the proposed approach has superior performance in tattoo detection and tattoo search at scale compared to several state-of-the-art tattoo retrieval algorithms.
Conference Paper
3D mask face presentation attack, as a new challenge to face recognition, has been attracting increasing attention. Recently, remote photoplethysmography (rPPG) has been employed as an intrinsic liveness cue that is independent of mask appearance. Although existing rPPG-based methods achieve promising results in both intra- and cross-dataset scenarios, they may not be robust enough when rPPG signals are contaminated by noise. In this paper, we propose a new liveness feature, called the rPPG correspondence feature (CFrPPG), to precisely identify the heartbeat vestige in the observed noisy rPPG signals. To further overcome global interferences, we propose a novel learning strategy that incorporates the global noise within the CFrPPG feature. Extensive experiments indicate that the proposed feature not only outperforms state-of-the-art rPPG-based methods on 3D mask attacks but can also handle practical scenarios with dim light and camera motion.
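CFrPPG builds its correspondence feature on top of raw rPPG traces; the sketch below only shows how such a trace might be extracted (mean green channel over a face ROI, band-pass filter, dominant frequency), with the band limits and ROI as assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def rppg_heart_rate(frames, face_box, fps=30.0):
    """Crude rPPG estimate: mean green-channel intensity of the face ROI
    over time, band-pass filtered to a plausible heart-rate range."""
    x0, y0, x1, y1 = face_box
    # Channel index 1 is green in both RGB and BGR layouts.
    trace = np.array([f[y0:y1, x0:x1, 1].mean() for f in frames])
    b, a = butter(3, [0.7 / (fps / 2), 4.0 / (fps / 2)], btype='band')
    pulse = filtfilt(b, a, trace - trace.mean())
    freqs = np.fft.rfftfreq(len(pulse), 1.0 / fps)
    spectrum = np.abs(np.fft.rfft(pulse))
    return 60.0 * freqs[np.argmax(spectrum)]   # dominant rate in bpm
```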
Article
We propose an unsupervised approach to learning image representations that consist of disentangled factors of variation. A factor of variation corresponds to an image attribute that can be discerned consistently across a set of images, such as the pose or color of objects. Our disentangled representation consists of a concatenation of feature chunks, each chunk representing one factor of variation. It supports applications such as transferring attributes from one image to another by simply swapping feature chunks, and classification or retrieval based on one or several attributes by considering a user-specified subset of feature chunks. We learn our representation in an unsupervised manner, without any labeling or knowledge of the data domain, using an autoencoder architecture with two novel training objectives: first, an invariance objective encourages the encoding of each attribute, and the decoding of each chunk, to be invariant to changes in the other attributes and chunks, respectively; second, a classification objective ensures that each chunk corresponds to a consistently discernible attribute in the represented image, avoiding the shortcut where chunks are ignored completely. We demonstrate the effectiveness of our approach on the MNIST, Sprites, and CelebA datasets.
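Attribute transfer by chunk swapping reduces to copying slices of the latent code; in the sketch below, `encoder`, `decoder` and the chunk count are hypothetical placeholders, not the paper's architecture.

```python
import torch

def swap_chunks(z_a, z_b, chunk_ids, n_chunks=8):
    """Attribute transfer by feature-chunk swapping: copy the selected
    chunks of image B's code into image A's code before decoding."""
    z = z_a.clone()
    size = z.shape[-1] // n_chunks          # each chunk encodes one factor
    for k in chunk_ids:
        z[..., k * size:(k + 1) * size] = z_b[..., k * size:(k + 1) * size]
    return z

# x_swapped = decoder(swap_chunks(encoder(x_a), encoder(x_b), chunk_ids=[2]))
```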
Article
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
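In symbols, this minimax game is the value function from the paper: D maximizes it by separating data from samples, while G minimizes it by making D misclassify its samples.

```latex
\min_{G}\max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```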
Book
This comprehensive text/reference presents a broad review of diverse domain adaptation (DA) methods for machine learning, with a focus on solutions for visual applications. The book collects solutions and perspectives proposed by an international selection of pre-eminent experts in the field, addressing not only classical image categorization, but also other computer vision tasks such as detection, segmentation and visual attributes. Topics and features:
• Surveys the complete field of visual DA, including shallow methods designed for homogeneous and heterogeneous data as well as deep architectures
• Presents a positioning of the dataset bias in the CNN-based feature arena
• Proposes detailed analyses of popular shallow methods that address landmark data selection, kernel embedding, feature alignment, joint feature transformation and classifier adaptation, or the case of limited access to the source data
• Discusses more recent deep DA methods, including discrepancy-based adaptation networks and adversarial discriminative DA models
• Addresses domain adaptation problems beyond image categorization, such as a Fisher encoding adaptation for vehicle re-identification, semantic segmentation and detection trained on synthetic images, and domain generalization for semantic part detection
• Describes a multi-source domain generalization technique for visual attributes and a unifying framework for multi-domain and multi-task learning
This authoritative volume will be of great interest to a broad audience ranging from researchers and practitioners to students involved in computer vision, pattern recognition and machine learning. Dr. Gabriela Csurka is a Senior Scientist in the Computer Vision Team at Xerox Research Centre Europe, Meylan, France.
Article
Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.
Article
We present a new model, DrNET, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos, demonstrating the ability to coherently generate hundreds of steps into the future.
Conference Paper
This paper investigates a novel problem of generating images from visual attributes. We model the image as a composite of foreground and background and develop a layered generative model with disentangled latent variables that can be learned end-to-end using a variational auto-encoder. We experiment with natural images of faces and birds and demonstrate that the proposed models are capable of generating realistic and diverse samples with disentangled latent representations. We use a general energy minimization algorithm for posterior inference of latent variables given novel images. Therefore, the learned generative models show excellent quantitative and visual results in the tasks of attribute-conditioned image reconstruction and completion.
Conference Paper
This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound on the mutual information objective that can be optimized efficiently. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting in 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence or absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing supervised methods.
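The mutual-information term is handled through a variational lower bound, optimized jointly with the standard GAN value function V(D, G) via an auxiliary distribution Q(c | x):

```latex
\min_{G,Q}\max_{D}\; V(D, G) - \lambda\, L_I(G, Q),
\qquad
L_I(G, Q) = \mathbb{E}_{c \sim P(c),\, x \sim G(z, c)}\!\left[\log Q(c \mid x)\right] + H(c)
\;\le\; I\!\left(c;\, G(z, c)\right)
```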
Article
With the wide deployment of face recognition systems in applications from de-duplication to mobile device unlocking, security against face spoofing attacks requires increased attention; such attacks can be easily launched via printed photos, video replays and 3D masks of a face. We address the problem of face spoof detection against print (photo) and replay (photo or video) attacks based on the analysis of image distortion (e.g., surface reflection, moiré pattern, color distortion, and shape deformation) in spoof face images (or video frames). The application domain of interest is smartphone unlock, given that a growing number of smartphones have face unlock and mobile payment capabilities. We build an unconstrained smartphone spoof attack database (MSU USSA) containing more than 1,000 subjects. Both print and replay attacks are captured using the front and rear cameras of a Nexus 5 smartphone. We analyze the image distortion of print and replay attacks using different (i) intensity channels (R, G, B and grayscale), (ii) image regions (entire image, detected face, and facial component between the nose and chin), and (iii) feature descriptors. We also develop an efficient face spoof detection system on an Android smartphone. Experimental results on the public-domain Idiap Replay-Attack, CASIA FASD, and MSU-MFSD databases, and the MSU USSA database show that the proposed approach is effective in face spoof detection for both cross-database and intra-database testing scenarios. User studies of our Android face spoof detection system involving 20 participants show that the proposed approach works very well in real application scenarios.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
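The core idea admits a very small sketch: a block whose stacked layers fit the residual F(x) and whose output is F(x) + x. This identity-shortcut version is a simplification; real ResNets insert projection shortcuts when the spatial size or channel count changes.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual building block: the stacked layers learn F(x) and the
    output is F(x) + x, so only the residual needs to be fit."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut
```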
Conference Paper
Face antispoofing has now attracted intensive attention, aiming to assure the reliability of face biometrics. We notice that currently most face antispoofing databases focus on data with little variation, which may limit the generalization performance of trained models, since potential attacks in the real world are probably more complex. In this paper, we release a face antispoofing database that covers a diverse range of potential attack variations. Specifically, the database contains 50 genuine subjects, and fake faces are made from high-quality recordings of the genuine faces. Three imaging qualities are considered, namely low quality, normal quality and high quality. Three fake face attacks are implemented: warped photo attack, cut photo attack and video attack. Each subject therefore contains 12 videos (3 genuine and 9 fake), and the final database contains 600 video clips. A test protocol is provided, consisting of 7 scenarios for a thorough evaluation from all possible aspects. A baseline algorithm is also given for comparison, which explores the high-frequency information in the facial region to determine liveness. We hope such a database can serve as an evaluation platform for future research in the literature.
Article
Can we efficiently learn the parameters of directed probabilistic models in the presence of continuous latent variables with intractable posterior distributions, and in the case of large datasets? We introduce a novel learning and approximate inference method that works efficiently, under some mild conditions, even in the online and intractable case. The method involves optimization of a stochastic objective function that can be straightforwardly optimized w.r.t. all parameters using standard gradient-based optimization methods. The method does not require the typically expensive sampling loops per datapoint required for Monte Carlo EM, and all parameter updates correspond to optimization of the variational lower bound of the marginal likelihood, unlike the wake-sleep algorithm. These theoretical advantages are reflected in experimental results.
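The two ingredients most associated with this method, the reparameterization trick and the closed-form Gaussian KL term of the variational lower bound, can be sketched as follows (PyTorch, with the reconstruction term supplied by the caller):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the
    sampling step differentiable w.r.t. the encoder outputs."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def neg_elbo(recon_loss, mu, logvar):
    """Negative ELBO: reconstruction term plus the closed-form KL
    divergence between q(z|x) = N(mu, sigma^2 I) and the prior N(0, I)."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```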
Article
Biometrics is a rapidly developing technology that identifies a person based on his or her physiological or behavioral characteristics. To ensure the correctness of authentication, a biometric system must be able to detect and reject the use of a copy of a biometric trait instead of the live trait. This function is usually termed "liveness detection". This paper describes a new method for live face detection. Using the structure and movement information of a live face, an effective live face detection algorithm is presented. Compared to existing approaches, which concentrate on the measurement of 3D depth information, this method is based on the analysis of Fourier spectra of a single face image or of face image sequences. Experimental results show that the proposed method has encouraging performance.
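In this spirit, a high-frequency energy ratio over the 2D Fourier spectrum can be sketched as below; the disc radius and the ratio form are illustrative assumptions, not the paper's exact descriptor.

```python
import numpy as np

def high_freq_ratio(gray, radius_frac=0.25):
    """Share of spectral energy outside a low-frequency disc; recaptured
    photographs tend to lose high-frequency detail relative to live faces."""
    F = np.fft.fftshift(np.fft.fft2(gray.astype(float)))
    energy = np.abs(F) ** 2
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    r2 = (yy - h / 2) ** 2 + (xx - w / 2) ** 2
    low = r2 <= (radius_frac * min(h, w)) ** 2
    return energy[~low].sum() / (energy.sum() + 1e-12)
```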
Article
We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine whether two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear-time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
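A biased quadratic-time estimate of the squared MMD with an RBF kernel is short enough to sketch directly; the bandwidth is a free parameter:

```python
import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    """Biased estimate of squared MMD with an RBF kernel:
    mean k(x,x') + mean k(y,y') - 2 * mean k(x,y)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Values near zero suggest the two samples come from the same distribution,
# which is why MMD is widely used for domain alignment in adaptation methods.
```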