Fig 1 - uploaded by Haibin Wu
The proposed framework. X, Z, H, and A are the acoustic features, hidden features, bottleneck features, and the output of the question-answering layer, respectively. f and g are the SENet feature extractor and the self-attention layer, corresponding to (1)-(8) and (9) in Table 1, respectively. QA and AF are the question-answering (fake span discovery) and anti-spoofing layers, respectively, each shown with its loss calculation procedure.
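
The data flow named in the caption (X to Z via f, Z to H via g, then H into the QA and AF layers) can be sketched as below. This is only an illustrative sketch, not the authors' implementation. Every shape, weight, and function body is an assumption: f is a single linear map with ReLU standing in for the SENet extractor, g is single-head self-attention without learned projections, QA is assumed to emit per-frame start/end logits, and AF is assumed to mean-pool H over time before producing one logit.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(X, W_f):
    # Stand-in for the SENet feature extractor (assumption): linear map + ReLU per frame.
    return np.maximum(X @ W_f, 0.0)

def g(Z):
    # Single-head scaled dot-product self-attention over time,
    # without learned query/key/value projections (assumption).
    scores = Z @ Z.T / np.sqrt(Z.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ Z

def qa_layer(H, W_qa):
    # Assumed form of the QA output A: per-frame start and end logits, shape (T, 2).
    return H @ W_qa

def af_layer(H, w_af):
    # Assumed anti-spoofing head: mean-pool over time, then a single logit.
    return float(H.mean(axis=0) @ w_af)

# Hypothetical sizes: T frames, D acoustic dimensions, d hidden dimensions.
T, D, d = 50, 40, 16
X = rng.standard_normal((T, D))          # acoustic features X
W_f = rng.standard_normal((D, d)) * 0.1
W_qa = rng.standard_normal((d, 2)) * 0.1
w_af = rng.standard_normal(d) * 0.1

Z = f(X, W_f)                            # hidden features Z
H = g(Z)                                 # bottleneck features H
A = qa_layer(H, W_qa)                    # QA layer output A
spoof_logit = af_layer(H, w_af)          # utterance-level anti-spoofing logit

# Predicted fake-span boundaries under the assumed start/end formulation.
start, end = int(A[:, 0].argmax()), int(A[:, 1].argmax())
print(Z.shape, H.shape, A.shape)
```

In this sketch both heads read the same H, which mirrors the figure's layout of QA and AF branching from the shared bottleneck features; how the two losses are combined is not stated in the caption and is left out here.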

Source publication
Preprint
The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of widely deployed biometric identification models and can be exploited by in-the-wild attackers for illegal purposes. The ASVspoof challenge mainly focuses on synthesized audios b...