
CA-MoEiT: Generalizable Face Anti-spoofing via Dual Cross-Attention and Semi-fixed Mixture-of-Expert


Figures

The architecture of the proposed Transformer with dual Cross-Attention and semi-fixed MoE (CA-MoEiT). It is built on a standard ViT (Dosovitskiy et al., 2021a) and consists of a Tokenization Module, a Transformer Encoder, and a Classification Head. Note that the dotted line represents its parallel operation with the solid line. First, MixStyle is inserted after the PatchEmbed layer to process live patch tokens $\textbf{X}^{l}_{pat}$ into more diverse patch tokens $\hat{\textbf{X}}^{l}_{pat}$. Then, taking the red sequences as an example (similar for the green sequences), we divide them into source-domain sequences $\textbf{Z}_{S}$, which participate in Multi-headed Self-Attention (MSA), and cross-domain sequences $\textbf{Z}_{C}$, which are generated by Dual Cross-Attention (DCA) from 2 randomly selected sequences of $\textbf{Z}_{S}$, such as $\textbf{Z}_{d1}$ and $\textbf{Z}_{d2}$. Finally, a shared classifier is provided for all sequences and optimized by $\mathcal{L}_{cls}$, and an Asymmetric Triplet loss $\mathcal{L}_{trip}$ is introduced only on the source-domain sequences.
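To make the token flow described in this caption concrete, here is a minimal PyTorch-style sketch of a dual cross-attention step between two randomly selected source-domain sequences. It is reconstructed from the caption alone rather than from the authors' code; the module name DualCrossAttention, the use of nn.MultiheadAttention, and the exact way the two sequences exchange queries and keys/values are assumptions.

```python
import torch
import torch.nn as nn


class DualCrossAttention(nn.Module):
    """Illustrative sketch: cross-attention between two domain sequences.

    Queries from one source-domain sequence attend to the keys/values of the
    other (and vice versa), producing cross-domain sequences Z_C from Z_d1 and
    Z_d2. This mirrors the caption's description; the exact formulation in the
    paper may differ.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, z_d1: torch.Tensor, z_d2: torch.Tensor):
        # z_d1, z_d2: (batch, num_tokens, dim), drawn from two different source domains.
        c_12, _ = self.attn(query=z_d1, key=z_d2, value=z_d2)  # d1 queries attend to d2
        c_21, _ = self.attn(query=z_d2, key=z_d1, value=z_d1)  # d2 queries attend to d1
        return c_12, c_21  # cross-domain sequences Z_C


if __name__ == "__main__":
    dca = DualCrossAttention(dim=768)
    z1, z2 = torch.randn(4, 197, 768), torch.randn(4, 197, 768)
    zc_12, zc_21 = dca(z1, z2)
    print(zc_12.shape, zc_21.shape)  # torch.Size([4, 197, 768]) each
```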
International Journal of Computer Vision (2024) 132:5439–5452
https://doi.org/10.1007/s11263-024-02135-2
CA-MoEiT: Generalizable Face Anti-spoofing via Dual Cross-Attention
and Semi-fixed Mixture-of-Expert
Ajian Liu1
Received: 31 July 2023 / Accepted: 30 May 2024 / Published online: 15 June 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Although the generalization of face anti-spoofing (FAS) has attracted increasing attention, solving it with the Vision Transformer (ViT) is still at an early stage. In this paper, we present a cross-domain FAS framework, dubbed the Transformer with dual Cross-Attention and semi-fixed Mixture-of-Expert (CA-MoEiT), which stimulates the generalization of FAS from three aspects: (1) Feature augmentation. We insert a MixStyle after the PatchEmbed layer to synthesize diverse patch embeddings from novel domains and enhance the generalizability of the trained model. (2) Feature alignment. We design a dual cross-attention mechanism that extends self-attention to align the common representation across multiple domains. (3) Feature complement. We design a semi-fixed MoE (SFMoE) that selectively replaces the MLP by introducing a fixed super expert. Benefiting from the gate mechanism in SFMoE, professional experts are adaptively activated to independently learn domain-specific information, which supplements the domain-invariant features learned by the super expert and further improves generalization. Importantly, the above three techniques are compatible with any ViT variant as plug-and-play modules. Extensive experiments show that the proposed CA-MoEiT is effective and outperforms state-of-the-art methods on several public datasets.
Keywords Face anti-spoofing · Domain generalization · Vision transformer · Mixture-of-experts
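As a rough illustration of the semi-fixed MoE (SFMoE) described in the abstract, the sketch below pairs a fixed "super expert" MLP with gated professional experts whose outputs supplement it. This is a hedged reconstruction from the abstract only: the number of experts, the soft gating, freezing the super expert's parameters, and the additive combination are all assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemiFixedMoE(nn.Module):
    """Sketch of a semi-fixed MoE: a fixed super expert plus gated professional experts."""

    def __init__(self, dim: int, hidden: int, num_experts: int = 3):
        super().__init__()

        def make_mlp() -> nn.Module:
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

        self.super_expert = make_mlp()  # intended to capture domain-invariant features
        for p in self.super_expert.parameters():
            p.requires_grad = False     # "fixed" interpreted here as frozen parameters (assumption)

        self.experts = nn.ModuleList([make_mlp() for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) token features from the preceding attention block.
        weights = F.softmax(self.gate(x), dim=-1)                     # (B, T, E) gate over experts
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        specific = (expert_out * weights.unsqueeze(2)).sum(dim=-1)    # domain-specific complement
        return self.super_expert(x) + specific                        # invariant + specific features
```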
1 Introduction
Face Anti-Spoofing (FAS) plays a vital role in protecting face recognition systems from malicious Presentation Attacks (PAs), ranging from print attacks (Zhang et al., 2012) and replay attacks (Chingovska et al., 2012) to mask attacks (Erdogmus & Marcel, 2014). Although existing methods (Yang et al., 2014; Patel et al., 2016; Liu et al., 2018; George & Marcel, 2019; Yu et al., 2020c; Zhang et al., 2020a; Liu et al., 2020; Yu et al., 2020b) obtain remarkable performance in intra-dataset experiments, where training and testing data come from the same domain, cross-dataset evaluation remains an unsolved challenge due to the large distribution discrepancies among different domains.
Communicated by Sergio Escalera.
✉ Ajian Liu
ajianliu92@gmail.com
1 The State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China
There are two schemes for improving the generalization of Presentation Attack Detection (PAD) technology: (1) Domain Adaptation (DA). It aims to minimize the distribution discrepancy between the source and target domains by leveraging unlabeled target data. However, the target data is difficult to collect, or even unknown during training, which limits the utilization of DA methods (Li et al., 2018a; Tu et al., 2019; Wang et al., 2019). (2) Domain Generalization (DG). It overcomes this limitation by taking advantage of multiple source domains without seeing any target data.
A straightforward strategy is to collect diverse source data from multiple relevant domains to train a model with more domain-invariant and generalizable representations. Some methods (Menon, 2019; Yang et al., 2021) directly apply augmentation at the data level to improve the diversity of the training data, while Wang et al. (2022) and Huang et al. (2022) suggest that more diverse data can be obtained by extending augmentation from the image level to multiple feature levels.
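As one concrete instance of feature-level augmentation in the MixStyle vein, the sketch below mixes the per-instance mean and standard deviation of patch embeddings from randomly paired samples; the abstract describes inserting such a module after the PatchEmbed layer. The Beta-distributed mixing weight and permutation-based pairing follow the common MixStyle recipe, but computing statistics over the token axis of ViT embeddings is an assumption made here for illustration.

```python
import torch
import torch.nn as nn


class MixStyle(nn.Module):
    """Sketch of MixStyle on ViT patch tokens: mix per-instance token statistics."""

    def __init__(self, alpha: float = 0.1, eps: float = 1e-6):
        super().__init__()
        self.beta = torch.distributions.Beta(alpha, alpha)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch embeddings; statistics taken over the token axis.
        if not self.training:
            return x
        mu = x.mean(dim=1, keepdim=True)                          # (B, 1, D)
        sig = (x.var(dim=1, keepdim=True) + self.eps).sqrt()      # (B, 1, D)
        x_norm = (x - mu) / sig

        lam = self.beta.sample((x.size(0), 1, 1)).to(x.device)    # per-instance mixing weight
        perm = torch.randperm(x.size(0), device=x.device)         # pair each instance with another
        mu_mix = lam * mu + (1 - lam) * mu[perm]
        sig_mix = lam * sig + (1 - lam) * sig[perm]
        return x_norm * sig_mix + mu_mix                          # re-style with mixed statistics
```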
In addition to implicitly synthesizing samples from the novel domains, some methods (Shao et al., 2019; Saha et al., 2020; Jia et al., 2020; Kim & Kim, 2021; Wang et al., 2022)
... Both physical and digital attacks pose substantial threats to the security of face recognition systems. Due to the distinct characteristics of these attacks, existing studies typically treat them as separate issues [17], [18], as shown in Fig. 1. However, real-world attacks often involve a combination of both, rendering single detection methods ineffective in real-world scenarios [17], [19], [20]. Furthermore, treating these attacks independently necessitates deploying multiple models, leading to increased computational costs, which also fails to provide a unified framework for detecting fake faces from both attack modalities simultaneously. ...
Preprint
Face recognition systems are vulnerable to physical attacks (e.g., printed photos) and digital threats (e.g., DeepFake), which are currently being studied as independent visual tasks, such as Face Anti-Spoofing and Forgery Detection. The inherent differences among various attack types present significant challenges in identifying a common feature space, making it difficult to develop a unified framework for detecting data from both attack modalities simultaneously. Inspired by the efficacy of Mixture-of-Experts (MoE) in learning across diverse domains, we explore utilizing multiple experts to learn the distinct features of various attack types. However, the feature distributions of physical and digital attacks overlap and differ. This suggests that relying solely on distinct experts to learn the unique features of each attack type may overlook shared knowledge between them. To address these issues, we propose SUEDE, the Shared Unified Experts for Physical-Digital Face Attack Detection Enhancement. SUEDE combines a shared expert (always activated) to capture common features for both attack types and multiple routed experts (selectively activated) for specific attack types. Further, we integrate CLIP as the base network to ensure the shared expert benefits from prior visual knowledge and align visual-text representations in a unified space. Extensive results demonstrate SUEDE achieves superior performance compared to state-of-the-art unified detection methods.
... To address this issue, recent methods have employed DA-based techniques (Liu et al. 2022b; Yue et al. 2023; Liu et al. 2024c) and DG-based approaches (Zheng et al. 2024b, c; Liu et al. 2023b; Cai et al. 2024; Liu et al. 2024b; Liu 2024) that aim to learn domain-invariant features across multiple source domains. Also, incremental learning (IL) methods (Guo et al. 2022; Wang et al. 2024) are considered to tackle the catastrophic forgetting problem in the context of domain discontinuity in FAS. ...
Article
Unified detection of digital and physical attacks in facial recognition systems has become a focal point of research in recent years. However, current multi-modal methods typically ignore the intra-class and inter-class variability across different types of attacks, leading to degraded performance. To address this limitation, we propose MoAE-CR, a framework that effectively leverages class-aware information for improved attack detection. Our improvements manifest at two levels, i.e., the feature and loss level. At the feature level, we propose Mixture-of-Attack-Experts (MoAEs) to capture more subtle differences among various types of fake faces. At the loss level, we introduce Class Regularization (CR) through the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their class centers while maintaining separation from other classes. Moreover, specific attacks that significantly deviate from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Extensive experiments on two unified physical-digital attack datasets demonstrate the state-of-the-art performance of the proposed method.
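A rough sketch of the class-aware regularization idea summarized in this abstract (pushing the live and fake class centers apart while clustering features around their own center) might look as follows. The margin value, Euclidean distance, and equal weighting of the two terms are illustrative assumptions, not the paper's actual loss.

```python
import torch


def class_regularization(features: torch.Tensor, labels: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Illustrative class-aware regularizer: separate class centers, pull features to their own center.

    features: (N, D) embeddings; labels: (N,) with 0 = live, 1 = fake (assumed binary).
    Assumes both classes are present in the batch.
    """
    c_live = features[labels == 0].mean(dim=0)
    c_fake = features[labels == 1].mean(dim=0)

    # Disentanglement-style term: encourage a margin between the live and fake centers.
    sep = torch.relu(margin - torch.norm(c_live - c_fake))

    # Cluster-style term: pull each feature toward the center of its own class.
    centers = torch.stack([c_live, c_fake])                       # (2, D)
    cluster = ((features - centers[labels]) ** 2).sum(dim=1).mean()

    return sep + cluster
```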
Preprint
Facial recognition systems in real-world scenarios are susceptible to both digital and physical attacks. Previous methods have attempted to achieve classification by learning a comprehensive feature space. However, these methods have not adequately accounted for the inherent characteristics of physical and digital attack data, particularly the large intra-class variation in attacks and the small inter-class variation between live and fake faces. To address these limitations, we propose the Fine-Grained MoE with Class-Aware Regularization CLIP framework (FG-MoE-CLIP-CAR), incorporating key improvements at both the feature and loss levels. At the feature level, we employ a Soft Mixture of Experts (Soft MoE) architecture to leverage different experts for specialized feature processing. Additionally, we refine the Soft MoE to capture more subtle differences among various types of fake faces. At the loss level, we introduce two constraint modules: the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their respective class centers while maintaining separation from other classes. Moreover, specific attacks that significantly deviate from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Experimental results on two unified physical-digital attack datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance.
... However, the presence of unseen attacks and domain shifts can significantly degrade the performance of FAS systems. To address this limitation, domain adaptation [26,7,51] and domain generalization [52,53,54,55,56] are proposed to enhance robustness across attack variations. Apart from these mainstream approaches, researchers also investigate adversarial learning [53], meta learning [5] and continual learning [6,57,58] to handle novel or unexpected spoof scenarios. ...
Preprint
Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to simultaneously detect physical and digital attacks due to: 1) significant intra-class variations between these attack types, and 2) the inadequacy of spatial information alone to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA³-CLIP), which introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract corresponding generic representations from both live and fake faces, guiding the model to learn a unified feature space for unified attack detection. Meanwhile, the module adaptively generates the live/fake conditional bias from the original spatial and frequency information to optimize the generic prompts accordingly, reducing the impact of intra-class variations. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, a frequency compression block is utilized in the frequency stream, which reduces redundancy in frequency features while preserving the diversity of crucial cues. We also establish new challenging protocols to facilitate unified face attack detection effectiveness. Experimental results demonstrate that the proposed method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results.
Preprint
The challenge of Domain Generalization (DG) in Face Anti-Spoofing (FAS) is the significant interference of domain-specific signals on subtle spoofing clues. Recently, some CLIP-based algorithms have been developed to alleviate this interference by adjusting the weights of visual classifiers. However, our analysis shows that this class-wise prompt engineering suffers from two shortcomings for DG FAS: (1) The facial categories, such as real or spoof, have no semantics for the CLIP model, making it difficult to learn accurate category descriptions. (2) A single form of prompt cannot portray the various types of spoofing. In this work, instead of class-wise prompts, we propose a novel Content-aware Composite Prompt Engineering (CCPE) that generates instance-wise composite prompts, including both fixed template and learnable prompts. Specifically, our CCPE constructs content-aware prompts from two branches: (1) Inherent content prompt explicitly benefits from abundant transferred knowledge from the instruction-based Large Language Model (LLM). (2) Learnable content prompts implicitly extract the most informative visual content via Q-Former. Moreover, we design a Cross-Modal Guidance Module (CGM) that dynamically adjusts unimodal features for fusion to achieve better generalized FAS. Finally, our CCPE has been validated for its effectiveness in multiple cross-domain experiments and achieves state-of-the-art (SOTA) results.
Article
Face Anti-spoofing (FAS) is essential to secure face recognition systems from various physical attacks. However, recent research generally focuses on short-distance applications (i.e., phone unlocking) while lacking consideration of long-distance scenes (i.e., surveillance security checks). In order to promote relevant research and fill this gap in the community, we collect a large-scale Surveillance High-Fidelity Mask (SuHiFiMask) dataset captured under 40 surveillance scenes, which has 101 subjects from different age groups with 232 3D attacks (high-fidelity masks), 200 2D attacks (posters, portraits, and screens), and 2 adversarial attacks. In this scene, low image resolution and noise interference are new challenges faced in surveillance FAS. Together with the SuHiFiMask dataset, we propose a Contrastive Quality-Invariance Learning (CQIL) network to alleviate the performance degradation caused by image quality from three aspects: (1) An Image Quality Variable module (IQV) is introduced to recover image information associated with discrimination by combining the super-resolution network. (2) Using generated sample pairs to simulate quality variance distributions to help contrastive learning strategies obtain robust feature representation under quality variation. (3) A Separate Quality Network (SQN) is designed to learn discriminative features independent of image quality. Finally, a large number of experiments verify the quality of the SuHiFiMask dataset and the superiority of the proposed CQIL.
Article
The availability of handy multi-modal (i.e., RGB-D) sensors has brought about a surge of face anti-spoofing research. However, the current multi-modal face presentation attack detection (PAD) has two defects: (1) The framework based on multi-modal fusion requires providing modalities consistent with the training input, which seriously limits the deployment scenario. (2) The performance of ConvNet-based models on high-fidelity datasets is increasingly limited. In this work, we present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT), for face anti-spoofing to flexibly target any single-modal (i.e., RGB) attack scenarios with the help of available multi-modal data. Specifically, FM-ViT retains a specific branch for each modality to capture different modal information and introduces the Cross-Modal Transformer Block (CMTB), which consists of two cascaded attentions named Multi-headed Mutual-Attention (MMA) and Fusion-Attention (MFA) to guide each modal branch to mine potential features from informative patch tokens, and to learn modality-agnostic liveness features by enriching the modal information of its own CLS token, respectively. Experiments demonstrate that the single model trained based on FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin, and approaches the multi-modal frameworks introduced with smaller FLOPs and model parameters.
Chapter
Convolutional Neural Networks (CNNs) have dominated computer vision for years, due to their ability to capture locality and translation invariance. Recently, many vision transformer architectures have been proposed and they show promising performance. A key component in vision transformers is the fully-connected self-attention which is more powerful than CNNs in modelling long range dependencies. However, since the current dense self-attention uses all image patches (tokens) to compute the attention matrix, it may neglect the locality of image patches and involve noisy tokens (e.g., cluttered background and occlusion), leading to a slow training process and potential degradation of performance. To address these problems, we propose the k-NN attention for boosting vision transformers. Specifically, instead of involving all the tokens for attention matrix calculation, we only select the top-k similar tokens from the keys for each query to compute the attention map. The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar than others. In addition, the k-NN attention allows for the exploration of long range correlation and at the same time filters out irrelevant tokens by choosing the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that k-NN attention is powerful in speeding up training and distilling noise from input tokens. Extensive experiments are conducted by using 11 different vision transformer architectures to verify that the proposed k-NN attention can work with any existing transformer architectures to improve their prediction performance. The codes are available at https://github.com/damo-cv/KVT.
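The k-NN attention described here is straightforward to sketch: for each query, keep only the top-k most similar keys before the softmax. The mask-with-negative-infinity implementation below is one common way to realize this and is an assumption about the implementation detail; the official code is in the linked repository.

```python
import torch


def knn_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, top_k: int) -> torch.Tensor:
    """Sketch of k-NN attention: each query attends only to its top-k most similar keys.

    q, k, v: (batch, heads, tokens, head_dim).
    """
    scale = q.size(-1) ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale              # (B, H, Tq, Tk) similarity scores
    topk_vals, _ = scores.topk(top_k, dim=-1)               # per-query threshold = k-th largest score
    mask = scores < topk_vals[..., -1:]                     # True where a key falls outside the top-k
    scores = scores.masked_fill(mask, float("-inf"))        # drop non-top-k keys before the softmax
    attn = scores.softmax(dim=-1)
    return attn @ v


if __name__ == "__main__":
    q = k = v = torch.randn(2, 8, 197, 64)
    out = knn_attention(q, k, v, top_k=32)
    print(out.shape)  # torch.Size([2, 8, 197, 64])
```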
Chapter
While recent face anti-spoofing methods perform well under the intra-domain setups, an effective approach needs to account for much larger appearance variations of images acquired in complex scenes with different sensors for robust performance. In this paper, we present adaptive vision transformers (ViT) for robust cross-domain face anti-spoofing. Specifically, we adopt ViT as a backbone to exploit its strength to account for long-range dependencies among pixels. We further introduce the ensemble adapters module and feature-wise transformation layers in the ViT to adapt to different domains for robust performance with a few samples. Experiments on several benchmark datasets show that the proposed models achieve both robust and competitive performance against the state-of-the-art methods for cross-domain face anti-spoofing using a few samples.