Figure - uploaded by Haibin Wu
The EERs using MSTFT features. w/ and w/o denote with and without, respectively. w/ or w/o re-synthesis indicates whether the audios re-synthesised by Griffin-Lim and WORLD are used.
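The equal error rate (EER) reported in the table is the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of how it can be estimated from detector scores is below; the score distributions and the helper name `compute_eer` are illustrative, not taken from the paper.

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: the threshold where the false-acceptance
    rate (spoof accepted) equals the false-rejection rate
    (genuine rejected), approximated on finite data."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    fars = np.array([np.mean(spoof_scores >= t) for t in thresholds])
    frrs = np.array([np.mean(genuine_scores < t) for t in thresholds])
    idx = np.argmin(np.abs(fars - frrs))   # closest crossing point
    return (fars[idx] + frrs[idx]) / 2.0

# Synthetic scores: genuine trials score higher on average.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)
spoof = rng.normal(0.0, 1.0, 1000)
eer = compute_eer(genuine, spoof)
```

With these two unit-variance Gaussians separated by 2, the EER lands near 16%; a table entry such as 11.1% corresponds to better-separated score distributions.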

Source publication
Preprint
The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of broadly deployed biometric identification models and can be harnessed by in-the-wild attackers for illegal uses. The ASVspoof challenge mainly focuses on synthesized audios b...

Contexts in source publication

Context 1
... spectral features, some additional experiments are conducted on cepstral and NN-based features to increase diversity and achieve better performance in the fusion stage. The FFT window size, hop size, and number of output bins are fixed to 384, 128, and 80, respectively, for Mel-frequency cepstral coefficients (MFCC), linear frequency cepstral coefficients (LFCC), and SincNet [34], as we find that an FFT window size of 384 performs well, as shown in Table 3. ...
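The cepstral pipeline with these settings (FFT window 384, hop 128, 80 bins) can be sketched in plain numpy; the sample rate, helper names, and the test tone are our assumptions for illustration, not details from the paper.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 16000, 384, 128, 80  # SR assumed; rest from the text

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal):
    # Frame, apply a Hann window, take the power spectrum.
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    window = np.hanning(N_FFT)
    frames = np.stack([signal[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, N_FFT)) ** 2
    # Mel filterbank + log + DCT-II -> cepstral coefficients.
    logmel = np.log(power @ mel_filterbank(SR, N_FFT, N_MELS).T + 1e-10)
    n = np.arange(N_MELS)
    dct = np.cos(np.pi / N_MELS * (n[:, None] + 0.5) * n[None, :])
    return logmel @ dct

x = np.sin(2 * np.pi * 440 * np.arange(SR) / SR)  # 1 s of a 440 Hz tone
feats = mfcc(x)  # (frames, 80) cepstral matrix
```

An LFCC front end differs only in the filterbank (linearly spaced rather than mel-spaced), while SincNet replaces the fixed filterbank with learnable sinc filters.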
Context 2
... the model with self-attention will be adopted for the following experiments, unless specified otherwise. In the main experiments, shown in Table 3, the input representations are MSTFTs with a hop size of 128, 80 output bins, and FFT window sizes ranging from 384 to 768. Table 3 exhausts the experimental settings under four different window sizes, three pooling strategies, whether to use data augmentation, and whether to use the fake audios re-synthesised by Griffin-Lim and WORLD. ...
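With the hop size fixed at 128, sweeping the FFT window mostly trades frequency resolution against the analysis span of each frame, while the frame count stays nearly constant. A small sketch of that arithmetic is below; only 384 and 768 are stated as endpoints in the text, so the intermediate sizes and the 4-second clip length are illustrative assumptions.

```python
# Effect of FFT window size at a fixed hop of 128 samples.
# 512 and 640 are illustrative intermediate values, not from Table 3.
SR, HOP, N_SAMPLES = 16000, 128, 4 * 16000  # assumed 16 kHz, 4 s clip

rows = []
for n_fft in (384, 512, 640, 768):
    n_freq_bins = n_fft // 2 + 1              # rfft output size
    freq_res = round(SR / n_fft, 1)           # Hz per frequency bin
    n_frames = 1 + (N_SAMPLES - n_fft) // HOP # no padding assumed
    rows.append((n_fft, n_freq_bins, freq_res, n_frames))

for n_fft, bins, res, frames in rows:
    print(f"n_fft={n_fft}: {bins} bins, {res} Hz/bin, {frames} frames")
```

Whatever the window size, the linear-frequency bins are then compressed to the fixed 80 mel output bins, so the network input shape stays comparable across settings.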
Context 3
... the main experiments, shown in Table 3, the input representations are MSTFTs with a hop size of 128, 80 output bins, and FFT window sizes ranging from 384 to 768. Table 3 exhausts the experimental settings under four different window sizes, three pooling strategies, whether to use data augmentation, and whether to use the fake audios re-synthesised by Griffin-Lim and WORLD. We have the following observations. ...
Context 4
... the SAP and ASP pooling strategies significantly improve the EERs when both data re-synthesis and augmentation are applied. We can also observe that the best EER for a single model is 11.1%, as shown in Table 3. ...
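Attentive statistics pooling (ASP) summarises frame-level features with an attention-weighted mean and standard deviation rather than a plain average. A minimal numpy sketch is below; the random features and the single-vector attention parameter are stand-ins for trained network values.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 123, 80                       # frames x feature dim (e.g. 80 bins)
H = rng.normal(size=(T, D))          # frame-level representations (random)
W = rng.normal(size=(D,)) * 0.1      # attention projection (assumed shape)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

alpha = softmax(H @ W)                         # one weight per frame, sums to 1
mu = (alpha[:, None] * H).sum(axis=0)          # attentive mean
var = (alpha[:, None] * (H - mu) ** 2).sum(axis=0)
sigma = np.sqrt(np.maximum(var, 1e-8))         # attentive standard deviation
utt_embedding = np.concatenate([mu, sigma])    # 2*D utterance-level vector
```

Self-attentive pooling (SAP) keeps only the weighted mean `mu`; ASP appends `sigma`, letting the classifier use frame-level variability, which plausibly explains the gains when re-synthesis and augmentation enlarge the training distribution.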
Context 5
... order to increase the diversity of models for better performance in the model fusion stage, we further take MFCC, LFCC, and SincNet as input features to train the models. We cannot exhaust all the settings due to limited computing resources, so we refer to Table 3 to select the settings for these experiments. We fix the FFT window size at 384, apply only ASP pooling, and adopt data augmentation and the re-synthesised data. ...