... Several methods have been proposed for detecting synthetic speech [11, 20, 27-29, 63, 66]. Some of these methods use either hand-crafted features such as Linear Frequency Cepstral Coefficients (LFCCs) and Constant Q Cepstral Coefficients (CQCCs) [52], the time-domain speech signal [20], or an image representation [41] of the speech signal [39, 66, 69]. These methods have shown promising detection accuracy but lack interpretability. ...
... The Fourier Transform can be used to convert a time-domain speech signal into an image representation known as a spectrogram [41]. The spectrogram has been used for speech forensics using a transformer neural network [54] or a Convolutional Neural Network (CNN) [27, 39, 60, 66, 69]. ...
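As an illustration of the spectrogram representation mentioned above, the following is a minimal Python sketch (assuming NumPy and SciPy are available; the file name and window parameters are illustrative, not taken from the cited works) that converts a time-domain speech signal into a log-magnitude spectrogram via the short-time Fourier transform:

    # Minimal sketch: time-domain speech -> log-magnitude spectrogram image.
    # File name and STFT parameters are illustrative, not from the cited works.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    sample_rate, speech = wavfile.read("speech.wav")     # mono 16-bit PCM assumed
    speech = speech.astype(np.float32) / 32768.0          # normalize to [-1, 1]

    # Short-time Fourier transform: 25 ms windows with 10 ms hop (common choices).
    frequencies, times, spectrum = stft(
        speech,
        fs=sample_rate,
        nperseg=int(0.025 * sample_rate),
        noverlap=int(0.015 * sample_rate),
    )

    # Log-magnitude spectrogram, usable as an image-like input to a CNN or transformer.
    log_spectrogram = 20.0 * np.log10(np.abs(spectrum) + 1e-10)
    print(log_spectrogram.shape)  # (frequency_bins, time_frames)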
Tools that generate high-quality synthetic speech, perceptually indistinguishable from speech recorded from human speakers, are easily available. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods as a black box without providing reasoning for the decisions they make, which limits their interpretability. In this paper, we propose the Disentangled Spectrogram Variational Auto Encoder (DSVAE), a variational autoencoder trained in two stages that processes spectrograms of speech using disentangled representation learning to generate interpretable representations of a speech signal for detecting synthetic speech. DSVAE also creates an activation map to highlight the spectrogram regions that discriminate synthetic and bona fide human speech signals. We evaluated the representations obtained from DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (>98%) in detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers. We also visualize the representations obtained from DSVAE for 17 different speech synthesizers and verify that they are indeed interpretable and discriminate bona fide and synthetic speech from each of the synthesizers.
... Moreover, using this command, voice clips created with Deep Voice software can easily be mixed into voice files recorded in a physical environment. Recently, research on detecting voice-forged files created through editing techniques such as splicing and copy-move [5, 8-10] and voice-forged files created through Deep Voice software and audio deepfakes [11-13] has been actively conducted using deep learning models, which mainly convert voice signals into spectrograms and use them as a dataset. ...
This study focuses on the field of voice forgery detection, which is increasing in importance owing to the introduction of advanced voice editing technologies and the proliferation of smartphones. This study introduces a unique dataset that was built specifically to identify forgeries created using the “Mix Paste” technique. This editing technique can overlay audio segments from similar or different environments without creating a new timeframe, making it nearly infeasible to detect forgeries using traditional methods. The dataset consists of 4665 and 45,672 spectrogram images from 1555 original audio files and 15,224 forged audio files, respectively. The original audio was recorded using iPhone and Samsung Galaxy smartphones to ensure a realistic sampling environment. The forged files were created from these recordings and subsequently converted into spectrograms. The dataset also provided the metadata of the original voice files, offering additional context and information that could be used for analysis and detection. This dataset not only fills a gap in existing research but also provides valuable support for developing more efficient deep learning models for voice forgery detection. By addressing the “Mix Paste” technique, the dataset caters to a critical need in voice authentication and forensics, potentially contributing to enhancing security in society.
... As high-permission devices, they can control other IoT devices, exchange data under the cover of valid identities, and even purchase goods or services with the users' accounts (e.g., Amazon Echo). [Comparison table excerpt: ..., single, 98%+ (ACC); TE-ResNet [13], 2021, multiple, 81%+ (AUC); Camacho et al. [14], 2021, single, 95%+ (AUC); RES-EfficientCNN [15], 2020, multiple, 97%+ (F1); Farid et al. [6], 2019, single, 99%+ (AUC); Hafiz et al. [16], 2019, single, 100% (ACC)] ...
Voice is an essential medium for human communication and collaboration, and its trustworthiness is of great importance to humans. Synthesizing fake voices and detecting synthesized voices are two sides of a coin. Both sides have made great strides with the recent rapid progress of deep learning techniques. Attackers have started using AI techniques to synthesize, and even clone, human voices. Researchers have also proposed a series of AI-synthesized voice detection approaches and achieved promising results in laboratory environments.
In this paper, we introduced the concept of speaker-irrelative features (SiFs) and a novel detection-bypass idea to camouflage AI-synthesized voices: replacing SiFs of AI-synthesized voices with crafted ones. We implemented a proof-of-concept framework named SiF-DeepVC based on our detection-bypass idea. Experiments show that the existing detection systems would consider the voices output by SiF-DeepVC more human-like than human voices, proving our detection-bypass idea is effective and SiFs are noteworthy in camouflaging AI-synthesized voices.
... For the FSD task, many studies Chettri, Kinnunen and Benetos (2020b); Yang, Das and Li (2019a); Zhang, Yi and Zhao (2021c) have shown that different frequency bands have different effects. In Zhang et al. (2021b), the authors focus on global channel attention using squeeze-and-excitation blocks and explore the impact of the high-frequency and low-frequency subbands for the FSD task. ...
The rhythm of synthetic speech is usually too smooth, which causes the fundamental frequency (F0) of synthetic speech to differ significantly from that of real speech. It is therefore expected that the F0 feature contains discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. In addition, to effectively model the F0 subband and thereby improve the performance of FSD, the spatial reconstructed local attention Res2Net (SR-LA Res2Net) is proposed. Specifically, Res2Net is used as a backbone network to obtain multiscale information and is enhanced with a spatial reconstruction mechanism to avoid losing important information when the channel groups are repeatedly superimposed. In addition, local attention is designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that our proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving state-of-the-art performance among all single systems.
... In the back-end, three different architectures were defined: LCNN-LSTM, LCNN-Attention, and LCNN-trim-pad. In [19], the characteristics of the voice spectrum were extracted using MFCC, CQCC, and the Log Power Spectrum (LPS). The extracted characteristics are sent to a transformer encoder to extract deeper features; then, a residual network (ResNet) computes a composite score. ...
... However, the proposed detection strategy did not achieve ideal performance, with an Equal Error Rate (EER) of 40.50% on the ADD2022 dataset [22]. Zhang et al. [23] proposed a new detection scheme based on a transformer encoder combined with a ResNet network (TE-ResNet). In that work, the efficacy of the proposed method was improved through the use of five augmentation strategies, and it was evaluated on the ASVspoof 2019 [24] and Fake or Real (FOR) [25] datasets. ...
One of the most significant discussions in forensics is Audio Deepfake, where AI-generated tools are used to clone the audio content of people's voices. Although the technology was intended to improve people's lives, attackers have utilized it maliciously, compromising public safety. Thus, Machine Learning (ML) and Deep Learning (DL) methods have been developed to detect imitated or synthetically faked voices. However, the developed methods suffer from requiring massive training data or excessive pre-processing. To the authors' best knowledge, Arabic speech has not yet been explored with synthetic fake audio, and existing work is largely limited to the imitation type of fakeness. This paper proposes a new Audio Deepfake detection method called Arabic-AD based on self-supervised learning techniques to detect both synthetic and imitated voices. Additionally, it contributes to the literature by creating the first synthetic dataset of a single speaker who perfectly speaks Modern Standard Arabic (MSA). Accent was also considered by collecting Arabic recordings from non-Arabic speakers to evaluate the robustness of Arabic-AD. Three extensive experiments were conducted to measure the proposed method and compare it to well-known benchmarks in the literature. As a result, Arabic-AD outperformed other state-of-the-art methods with the lowest EER (0.027%) and high detection accuracy (97%) while avoiding the need for excessive training.
... Traditional methods typically rely on the extraction of acoustic features in the spectral domain. In fact, through time-frequency analysis it is possible to obtain a rich and expressive representation of the voice, based on the well-known frequency cepstral coefficients or constant Q cepstral coefficients [2], [3]. Also effective are first- and second-order spectral features [4] as well as higher-order polyspectral features [5], exploiting the observation that synthesized speech can introduce uncommon spectral correlations. ...
Thanks to recent advances in deep learning, sophisticated generation tools exist, nowadays, that produce extremely realistic synthetic speech. However, malicious uses of such tools are possible and likely, posing a serious threat to our society. Hence, synthetic voice detection has become a pressing research topic, and a large variety of detection methods have been recently proposed. Unfortunately, they hardly generalize to synthetic audios generated by tools never seen in the training phase, which makes them unfit to face real-world scenarios. In this work, we aim at overcoming this issue by proposing a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations. Since the detector is trained only on real data, generalization is automatically ensured. The proposed approach can be implemented based on off-the-shelf speaker verification tools. We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairment.
... Transformer. The Transformer architecture has also found its way into the field of audio spoof detection [21]. We use four self-attention layers with 256 hidden dimensions and skip-connections, and encode time with positional encodings [22]. ...
... The so-called "DeepSonar" architecture was applied in three datasets of two different languages (English and Chinese), specifically the publicly available FoR dataset, a self-generated English dataset based on the VC Sprocket toolkit [29] and a self-generated Chinese dataset based on the Baidu speech synthesis system [30], and showed promising generalization capabilities. Zhang et al. [31] introduced a fake speech detection architecture based on a ResNet combined with a transformer encoder. They applied five popular audio data augmentation methods, namely Gaussian noise addition, signal-to-noise ratio noise addition, time shifting, pitch shifting, and time stretching, and extracted spectral features based on magnitude Short-time Fourier Transform (STFT) representations to train the transformer encoder ResNet. ...
In this paper the current status and open challenges of synthetic speech detection are addressed. The work comprises an initial analysis of available open datasets and of existing detection methods, a description of the requirements for new research datasets compliant with regulations and better representing real-case scenarios, and a discussion of the desired characteristics of future trustworthy detection methods in terms of both functional and non-functional requirements. Compared to other works, based on specific detection solutions or presenting a single dataset of synthetic speech, our paper is meant to orient future state-of-the-art research in the domain, to quickly lessen the current gap between synthesis and detection approaches.
... DL methods were successfully applied to various tasks in audio forensics, including double compression detection [16], audio recapture detection [15], speech presentation attack detection [12], and fake speech detection [35]. Surprisingly, the detection of audio splicing received little attention so far. ...
Freely available and easy-to-use audio editing tools make it straightforward to perform audio splicing. Convincing forgeries can be created by combining various speech samples from the same person. Detection of such splices is important both in the public sector when considering misinformation, and in a legal context to verify the integrity of evidence. Unfortunately, most existing detection algorithms for audio splicing use handcrafted features and make specific assumptions. However, criminal investigators are often faced with audio samples from unconstrained sources with unknown characteristics, which raises the need for more generally applicable methods. With this work, we aim to take a first step towards unconstrained audio splicing detection to address this need. We simulate various attack scenarios in the form of post-processing operations that may disguise splicing. We propose a Transformer sequence-to-sequence (seq2seq) network for splicing detection and localization. Our extensive evaluation shows that the proposed method outperforms existing dedicated approaches for splicing detection [3, 10] as well as the general-purpose networks EfficientNet [28] and RegNet [25].
... This model has better generalization ability; however, its performance degrades on noisy samples. Zhang et al. [287] propose a model to detect fake speech by using a ResNet model with a transformer encoder (TE-ResNet). Initially, a transformer encoder is employed to compute a contextual representation of the acoustic keypoints by considering the correlation between audio signal frames. ...
Easy access to audio-visual content on social media, combined with the availability of modern tools such as Tensorflow or Keras, open-source trained models, economical computing infrastructure, and the rapid evolution of deep-learning (DL) methods, has heralded a new and frightening trend. In particular, the advent of easily available and ready-to-use Generative Adversarial Networks (GANs) has made it possible to generate deepfake media, partially or completely fabricated with the intent to deceive, to disseminate disinformation and revenge porn, to perpetrate financial fraud and other hoaxes, and to disrupt government functioning. Existing surveys have mainly focused on the detection of deepfake images and videos; this paper provides a comprehensive review and detailed analysis of existing tools and machine learning (ML) based approaches for deepfake generation, and of the methodologies used to detect such manipulations in both audio and video. For each category of deepfake, we discuss information related to manipulation approaches, current public datasets, and key standards for the evaluation of the performance of deepfake detection techniques, along with their results. Additionally, we discuss open challenges and enumerate future directions to guide researchers on issues which need to be considered in order to improve the domains of both deepfake generation and detection. This work is expected to assist readers in understanding how deepfakes are created and detected, along with their current limitations and where future research may lead.
... Transformer. The Transformer architecture has also found its way into the field of audio spoof detection [21]. We use four self-attention layers with 256 hidden dimensions and skip-connections, and encode time with positional encodings [22]. ...
Current text-to-speech algorithms produce realistic fakes of human voices, making deepfake detection a much-needed area of research. While researchers have presented various techniques for detecting audio spoofs, it is often unclear exactly why these architectures are successful: Preprocessing steps, hyperparameter settings, and the degree of fine-tuning are not consistent across related work. Which factors contribute to success, and which are accidental? In this work, we address this problem: We systematize audio spoofing detection by re-implementing and uniformly evaluating architectures from related work. We identify overarching features for successful audio deepfake detection, such as using cqtspec or logspec features instead of melspec features, which improves performance by 37% EER on average, all other factors constant. Additionally, we evaluate generalization capabilities: We collect and publish a new dataset consisting of 37.9 hours of found audio recordings of celebrities and politicians, of which 17.2 hours are deepfakes. We find that related work performs poorly on such real-world data (performance degradation of up to one thousand percent). This may suggest that the community has tailored its solutions too closely to the prevailing ASVSpoof benchmark and that deepfakes are much harder to detect outside the lab than previously thought.
... With the development of fake audio detection, many kinds of acoustic features have been proposed to improve detection performances. Linear frequency cepstral coefficient (LFCC) and Mel-frequency cepstral coefficient (MFCC) are two of the most used acoustic features for the detection of fake speech [18][19][20]. In this paper, we extract LFCC and MFCC features from the audio signals as the input data of the ASLNet. ...
Audio splicing means inserting an audio segment into another audio recording, which presents a great challenge for audio forensics. In this paper, a novel audio splicing detection and localization method based on an encoder-decoder architecture (ASLNet) is proposed. Firstly, an audio clip is divided into several small audio segments according to the size of the smallest localization region L_slr, and the acoustic feature matrix and corresponding binary ground-truth mask are created for each audio segment. Then, we concatenate the acoustic feature matrices from all segments of an audio clip into a single acoustic feature matrix and send it to a fully convolutional network (FCN) based encoder-decoder architecture, which consists of a series of convolutional, pooling and transposed convolutional layers, to obtain a binary output mask. Next, the binary output mask is divided into small segments according to L_slr, and the ratio ρ of the number of elements equal to one to the number of all elements in a small segment is calculated. Finally, we compare ρ with the predetermined threshold T to determine whether the corresponding audio segment is spliced. We evaluate the effectiveness of the proposed ASLNet on four datasets produced from publicly available speech corpora. Extensive experiments show that the best detection accuracy of ASLNet for the intra-database and cross-database evaluation reaches 0.9965 and 0.9740, respectively, which outperforms the state-of-the-art method.
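The final decision step described above (splitting the binary output mask into segments of length L_slr, computing the ratio ρ per segment, and comparing it with a threshold T) can be sketched as follows; this is only an illustrative Python sketch, and the mask, segment length, and threshold values are hypothetical rather than those used in the paper:

    # Sketch of the final decision step: split the binary output mask into
    # segments of length L_slr, compute the ratio rho of ones per segment,
    # and flag a segment as spliced when rho exceeds a threshold T.
    # Values of l_slr and threshold below are illustrative only.
    import numpy as np

    def splicing_decisions(binary_mask: np.ndarray, l_slr: int, threshold: float):
        decisions = []
        for start in range(0, len(binary_mask), l_slr):
            segment = binary_mask[start:start + l_slr]
            rho = float(np.mean(segment))        # fraction of elements equal to one
            decisions.append(rho > threshold)    # True -> segment flagged as spliced
        return decisions

    mask = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0])  # toy encoder-decoder output
    print(splicing_decisions(mask, l_slr=4, threshold=0.5))  # [True, False, False]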
This paper presents a review of techniques involved in the creation and detection of audio deepfakes. The first section provides information about deepfakes in general. In the second section, the main methods for audio deepfakes are outlined and subsequently compared. The results discuss various methods for detecting audio deepfakes, including analyzing statistical properties, examining media consistency, and utilizing machine learning and deep learning algorithms. Major methods used to detect fake audio in these studies included Support Vector Machines (SVMs), Decision Trees (DTs), Convolutional Neural Networks (CNNs), Siamese CNNs, Deep Neural Networks (DNNs), and a combination of CNNs and Recurrent Neural Networks (RNNs). The accuracy of these methods varied, with the highest accuracy being 99% for SVM and the lowest being 73.33% for DT. The Equal Error Rate (EER) was reported in a few of the studies, with the lowest being 2% for DeepSonar and the highest being 12.24% for DNN-HLLs. The t-DCF was also reported in some of the studies, with the Siamese CNN performing best with a 55% improvement in min-t-DCF and EER compared to other methods.
Generative deep learning techniques have invaded the public discourse recently. Despite their advantages, their applications to disinformation are concerning, as countermeasures advance slowly. As the manipulation of multimedia content becomes easier, faster, and more credible, developing effective forensics becomes invaluable. Other works have identified this need but neglect that disinformation is inherently multimodal. Overall, in this survey we exhaustively describe modern manipulation and forensic techniques from the lens of video, audio, and their multimodal fusion. For manipulation techniques, we give a classification of the most commonly applied manipulations. Generative techniques can be exploited to generate datasets; we provide a list of current datasets useful for forensics. We have reviewed forensic techniques from 2018 to 2023, examined the usage of datasets, and given a comparative analysis of each modality. Finally, we give another comparison of end-to-end forensics tools for end users. From our analysis, clear trends emerge involving diffusion models, dataset granularity, explainability techniques, synchronisation improvements, and learning task diversity. We find a roadmap of deep challenges ahead, including multilinguality, multimodality, and improving data quality (and variety), all in an adversarial, ever-changing environment.
Freely available and easy-to-use audio editing tools make it straightforward to perform audio splicing. Convincing forgeries can be created by combining various speech samples from the same person. Detection of such splices is important both in the public sector when considering misinformation, and in a legal context to verify the integrity of evidence. Unfortunately, most existing detection algorithms for audio splicing use handcrafted features and make specific assumptions. However, criminal investigators are often faced with audio samples from unconstrained sources with unknown characteristics, which raises the need for more generally applicable methods. With this work, we aim to take a first step towards unconstrained audio splicing detection to address this need. We simulate various attack scenarios in the form of post-processing operations that may disguise splicing. We propose a Transformer sequence-to-sequence (seq2seq) network for splicing detection and localization. Our extensive evaluation shows that the proposed method outperforms existing dedicated approaches for splicing detection [3, 10] as well as the general-purpose networks EfficientNet [28] and RegNet [25]. Our source code is available at: https://www.cs1.tf.fau.de/research/multimedia-security/code Keywords: Audio Splicing Detection, Forensics, Deep Learning
With the increasing development of text-to-speech (TTS) and voice conversion (VC) technologies, the detection of synthetic speech has suffered dramatically. In order to promote the development of synthetic speech detection models against Mandarin TTS and VC technologies, we have constructed a challenging Mandarin dataset and organized the accompanying audio track of the first fake media forensic challenge of the China Society of Image and Graphics (FMFCC-A). The FMFCC-A dataset is by far the largest publicly available Mandarin dataset for synthetic speech detection; it contains 40,000 synthesized Mandarin utterances generated by 11 Mandarin TTS systems and two Mandarin VC systems, and 10,000 genuine Mandarin utterances collected from 58 speakers. The FMFCC-A dataset is divided into training, development and evaluation sets, which are used for research on the detection of synthesized Mandarin speech under various previously unknown speech synthesis systems or audio post-processing operations. In addition to describing the construction of the FMFCC-A dataset, we provide a detailed analysis of two baseline methods and the top-performing submissions from the FMFCC-A challenge, which illustrates the usefulness and difficulty of the FMFCC-A dataset. We hope that the FMFCC-A dataset can fill the gap caused by the lack of Mandarin datasets for synthetic speech detection. Keywords: Mandarin dataset, Synthetic speech detection, Text-to-speech, Voice conversion, Audio post-processing operation
The threat of spoofing can pose a risk to the reliability of automatic speaker verification. Results from the biannual ASVspoof evaluations show that effective countermeasures demand front-ends designed specifically for the detection of spoofing artefacts. Given the diversity in spoofing attacks, ensemble methods are particularly effective. The work in this paper shows that a bank of very simple classifiers, each with a front-end tuned to the detection of different spoofing attacks and combined at the score level through non-linear fusion, can deliver better performance than more sophisticated ensemble solutions that rely upon complex neural network architectures. Our comparatively simple approach outperforms all but 2 of the 48 systems submitted to the logical access condition of the most recent ASVspoof 2019 challenge.
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as “presentation attacks.” These vulnerabilities are generally unacceptable and call for spoofing countermeasures or “presentation attack detection” systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks.
The ASVspoof challenge initiative was created to foster research on anti-spoofing and to provide common platforms for the assessment and comparison of spoofing countermeasures. The first edition, ASVspoof 2015, focused upon the study of countermeasures for the detection of text-to-speech synthesis (TTS) and voice conversion (VC) attacks. The second edition, ASVspoof 2017, focused instead upon replay spoofing attacks and countermeasures. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access. It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona fide utterances even by human subjects. It is expected that the ASVspoof 2019 database, with its varied coverage of different types of spoofing data, could further foster research on anti-spoofing.
In recent years, automatic speaker verification (ASV) has been used extensively for voice biometrics. This has led to increased interest in securing these voice biometric systems for real-world applications. The ASV systems are vulnerable to various kinds of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins, and impersonation. This paper provides a literature review of ASV spoof detection, novel acoustic feature representations, deep learning, end-to-end systems, etc. Furthermore, the paper also summarizes previous studies of spoofing attacks with emphasis on SS, VC, and replay, along with recent efforts to develop countermeasures for the spoof speech detection (SSD) task. The limitations and challenges of the SSD task are also presented. While several countermeasures were reported in the literature, they are mostly validated on a particular database and, furthermore, their performance is far from perfect. The security of voice biometric systems against spoofing attacks remains a challenging topic. This paper is based on a tutorial presented at the APSIPA Annual Summit and Conference 2017 to serve as a quick start for those interested in the topic.
The Information and Communication Technology (ICT) revolution heralding the emergence and dominance of social media has always been viewed as a turning point in free speech and communication. Indeed, social media ordinarily represents the freedom of all people to speech and information. But there is also a side of social media that has often been ignored: that it serves as a platform for all and sundry to express themselves with little, if any, regulation or legal consequence. This, as a result, has led to a global explosion of hate speech and fake news. Hate speech normally leads to tension and holds the potential for national or even international crises of untold proportions. It is also likely to scare people away from expressing themselves for fear of hate-filled responses and to become a source of fake news. Using doctrinal as well as comparative methodologies, this paper appraises the trend among states of passing or proposing laws to regulate hate speech and fake news; it also appraises the contents of such laws from different countries with the aim of identifying how they may be used to suppress free speech under the guise of regulating hate speech and fake news. It argues that the alarming trend of hate speech and fake news has presented an opportunity for leaders across the globe to curb free speech. The paper concludes that the advancement in ICT helped a great deal to advance free speech; it may as well, because of the spread of hate speech and fake news, lead to a reversal of that success story.
Nowadays, spoofing detection is one of the priority research areas in the field of automatic speaker verification. The success of the Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) Challenge 2015 confirmed the impressive perspective in detection of unforeseen spoofing trials based on speech synthesis and voice conversion techniques. However, there are few studies addressing replay spoofing attacks, which are more likely to be used by non-professional impersonators. This paper describes the Speech Technology Center (STC) anti-spoofing system submitted for ASVspoof 2017, which is focused on replay attack detection. Here we investigate the efficiency of a deep learning approach for the solution of the above-mentioned task. Experimental results obtained on the Challenge corpora demonstrate that the selected approach outperforms current state-of-the-art baseline systems in terms of spoofing detection quality. Our primary system produced an EER of 6.73% on the evaluation part of the corpora, which is a 72% relative improvement over the ASVspoof 2017 baseline system.
Concerns regarding the vulnerability of automatic speaker verification (ASV) technology against spoofing can undermine confidence in its reliability and form a barrier to exploitation. The absence of competitive evaluations and the lack of common datasets has hampered progress in developing effective spoofing countermeasures. This paper describes the ASV Spoofing and Countermeasures (ASVspoof) initiative, which aims to fill this void. Through the provision of a common dataset, protocols, and metrics, ASVspoof promotes a sound research methodology and fosters technological progress. This paper also describes the ASVspoof 2015 dataset, evaluation, and results with detailed analyses. A review of post-evaluation studies conducted using the same dataset illustrates the rapid progress stemming from ASVspoof and outlines the need for further investigation. Priority future research directions are presented in the scope of the next ASVspoof evaluation planned for 2017.
Efforts to develop new countermeasures in order to protect automatic speaker verification from spoofing have intensified over recent years. The ASVspoof 2015 initiative showed that there is great potential to detect spoofing attacks, but also that the detection of previously unforeseen spoofing attacks remains challenging. This paper argues that there is more to be gained from the study of features rather than classifiers and introduces a new feature for spoofing detection based on the constant Q transform, a perceptually inspired time-frequency analysis tool popular in the study of music. Experimental results obtained using the standard ASVspoof 2015 database show that, when coupled with a standard Gaussian mixture model-based classifier, the proposed constant Q cepstral coefficients (CQCCs) outperform all previously reported results by a significant margin. In particular, the error rate for a subset of unknown spoofing attacks (for which no matched training data was used) is 0.46%, a relative improvement of 72% over the best previously reported results.
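For orientation, a rough Python sketch of a constant-Q cepstral pipeline is shown below (constant-Q transform, log power, then a DCT along frequency). It omits the uniform resampling step of the full CQCC definition, and the file name and parameter values are illustrative assumptions rather than the configuration used in the cited paper:

    # Rough sketch of a constant-Q cepstral pipeline: constant-Q transform,
    # log power, then a DCT along the frequency axis. The original CQCC
    # front-end also uniformly resamples the geometric frequency scale,
    # which is omitted here; parameter values are illustrative only.
    import numpy as np
    import librosa
    from scipy.fft import dct

    speech, sample_rate = librosa.load("speech.wav", sr=16000)  # placeholder file

    cqt = librosa.cqt(speech, sr=sample_rate, hop_length=512,
                      n_bins=84, bins_per_octave=12)
    log_power = np.log(np.abs(cqt) ** 2 + 1e-10)

    # Keep the first 20 cepstral coefficients per frame.
    cqcc_like = dct(log_power, type=2, axis=0, norm="ortho")[:20, :]
    print(cqcc_like.shape)  # (20, n_frames)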
The performance of biometric systems based on automatic speaker recognition technology is severely degraded due to spoofing attacks with synthetic speech generated using different voice conversion (VC) and speech synthesis (SS) techniques. Various countermeasures have been proposed to detect this type of attack, and in this context, choosing an appropriate feature extraction technique for capturing relevant information from speech is an important issue. This paper presents a concise experimental review of different features for the synthetic speech detection task. The wide variety of features considered in this study includes previously investigated features as well as some other potentially useful features for characterizing real and synthetic speech. The experiments are conducted on the recently released ASVspoof 2015 corpus containing speech data from a large number of VC and SS techniques. Comparative results using two different classifiers indicate that features representing spectral information in the high-frequency region, dynamic information of speech, and detailed information related to subband characteristics are considerably more useful in detecting synthetic speech.
A representation and interpretation of the area under a receiver operating characteristic (ROC) curve obtained by the "rating" method, or by mathematical predictions based on patient characteristics, is presented. It is shown that in such a setting the area represents the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject. Moreover, this probability of a correct ranking is the same quantity that is estimated by the already well-studied nonparametric Wilcoxon statistic. These two relationships are exploited to (a) provide rapid closed-form expressions for the approximate magnitude of the sampling variability, i.e., standard error that one uses to accompany the area under a smoothed ROC curve, (b) guide in determining the size of the sample required to provide a sufficiently reliable estimate of this area, and (c) determine how large sample sizes should be to ensure that one can statistically detect differences in the accuracy of diagnostic techniques.
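The equivalence stated above, that the ROC area equals the probability that a randomly chosen positive (diseased) case is ranked above a randomly chosen negative (non-diseased) case, i.e. the normalized Wilcoxon/Mann-Whitney statistic, can be illustrated with a minimal Python sketch using toy scores:

    # Minimal illustration: the area under the ROC curve equals the probability
    # that a randomly chosen positive case receives a higher score than a
    # randomly chosen negative case (the normalized Wilcoxon / Mann-Whitney
    # statistic). Scores are toy values.
    import numpy as np

    positive_scores = np.array([0.9, 0.8, 0.75, 0.6])       # ratings of positive cases
    negative_scores = np.array([0.7, 0.5, 0.4, 0.3, 0.2])   # ratings of negative cases

    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half, per the Wilcoxon convention
    auc = wins / (len(positive_scores) * len(negative_scores))
    print(auc)  # 0.95 for these toy scores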
Recent works on the vulnerability of automatic speaker verification (ASV) systems confirm that malicious spoofing attacks using synthetic speech can provoke significant increase in false acceptance rate. A reliable detection of synthetic speech is key to develop countermeasure for synthetic speech based spoofing attacks. In this paper, we targeted that by focusing on three major types of artifacts related to magnitude, phase and pitch variation, which are introduced during the generation of synthetic speech. We proposed a new approach to detect synthetic speech using score-level fusion of front-end features namely, constant Q cepstral coefficients (CQCCs), all-pole group delay function (APGDF) and fundamental frequency variation (FFV). CQCC and APGDF were individually used earlier for spoofing detection task and yielded the best performance among magnitude and phase spectrum related features, respectively. The novel FFV feature introduced in this paper to extract pitch variation at frame-level, provides complementary information to CQCC and APGDF. Experimental results show that the proposed approach produces the best stand-alone spoofing detection performance using Gaussian mixture model (GMM) based classifier on ASVspoof 2015 evaluation dataset. An overall equal error rate of 0.05% with a relative performance improvement of 76.19% over the next best-reported results is obtained using the proposed method. In addition to outperforming all existing baseline features for both known and unknown attacks, the proposed feature combination yields superior performance for ASV system (GMM with universal background model/i-vector) integrated with countermeasure framework. Further, the proposed method is found to have relatively better generalization ability when either one or both of copy-synthesized data and limited spoofing data are available a priori in the training pool.
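The score-level fusion described above amounts to combining the scores produced by separate per-feature back-ends (e.g., GMM log-likelihood ratios for CQCC, APGDF, and FFV) into a single detection score. The following Python sketch illustrates the idea with a weighted sum; the weights and scores are hypothetical and are not the fusion parameters of the cited paper:

    # Illustration of score-level fusion of front-end features: each feature
    # stream has its own GMM back-end producing a log-likelihood-ratio score,
    # and the per-feature scores are combined by a weighted sum before the
    # final bona fide / spoofed decision. Weights and scores are hypothetical.
    import numpy as np

    def log_likelihood_ratio(features, gmm_bonafide, gmm_spoof):
        # The GMMs are assumed to expose score_samples (per-frame log-likelihoods),
        # as in sklearn's GaussianMixture; average over frames.
        return gmm_bonafide.score_samples(features).mean() - \
               gmm_spoof.score_samples(features).mean()

    def fuse_scores(per_feature_scores, weights):
        per_feature_scores = np.asarray(per_feature_scores, dtype=float)
        weights = np.asarray(weights, dtype=float)
        return float(np.dot(weights, per_feature_scores) / weights.sum())

    # Hypothetical per-feature scores for one utterance (CQCC, APGDF, FFV).
    fused = fuse_scores([1.8, 0.9, 0.4], weights=[0.5, 0.3, 0.2])
    print("spoofed" if fused < 0.0 else "bona fide", fused)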
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
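The core operation of the Transformer, scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, can be sketched in a few lines of Python with toy dimensions:

    # Sketch of the Transformer's core operation, scaled dot-product attention.
    # Dimensions are toy values.
    import numpy as np

    def scaled_dot_product_attention(queries, keys, values):
        d_k = queries.shape[-1]
        scores = queries @ keys.T / np.sqrt(d_k)              # (n_queries, n_keys)
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ values                                # (n_queries, d_v)

    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))   # 4 query positions, model dimension 8
    k = rng.normal(size=(6, 8))   # 6 key positions
    v = rng.normal(size=(6, 8))
    print(scaled_dot_product_attention(q, k, v).shape)  # (4, 8)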
Recent advancements in voice conversion (VC) and speech synthesis (SS) research make speech-based biometric systems highly prone to spoofing attacks. This can provoke an increase in the false acceptance rate of such systems and requires countermeasures to mitigate such spoofing attacks. In this work, we first study the characteristics of synthetic speech vis-à-vis natural speech and then propose a set of novel short-term spectral features that can efficiently capture the discriminative information between them. The proposed features are computed using an inverted frequency warping scale and an overlapped block transformation of filter bank log energies. Our study presents a detailed analysis of anti-spoofing performance with respect to variations in the warping scale for inverted frequency and the block size for the block transform. For performance analysis, a Gaussian mixture model (GMM) based synthetic speech detector is used as the classifier on a stand-alone basis and also integrated with automatic speaker verification (ASV) systems. For the ASV systems, standard mel-frequency cepstral coefficients (MFCCs) are used as features, while GMM with universal background model (GMM-UBM) and i-vector are used as classifiers. The experiments are conducted on ten different kinds of synthetic data from the ASVspoof 2015 corpus. The results show that the countermeasures based on the proposed features outperform other spectral features for both known and unknown attacks. An average equal error rate (EER) of 0.00% is achieved for the nine attacks that use VC or SS speech, and the best performance of 7.12% EER is obtained for the remaining spoofing attack, which is based on natural speech concatenation.
Recent evaluations such as ASVspoof 2015 and the similarly-named AVspoof have stimulated a great deal of progress to develop spoofing countermeasures for automatic speaker verification. This paper reports an approach which combines speech signal analysis using the constant Q transform with traditional cepstral processing. The resulting constant Q cepstral coefficients (CQCCs) were introduced recently and have proven to be an effective spoofing countermeasure. An extension of previous work, the paper reports an assessment of CQCCs generalisation across three different databases and shows that they deliver state-of-the-art performance in each case. The benefit of CQCC features stems from a variable spectro-temporal resolution which, while being fundamentally different to that used by most automatic speaker verification system front-ends, also captures reliably the tell-tale signs of manipulation artefacts which are indicative of spoofing attacks. The second contribution relates to a cross-database evaluation. Results show that CQCC configuration is sensitive to the general form of spoofing attack and use case scenario. This finding suggests that the past single-system pursuit of generalised spoofing detection may need rethinking.
Any biometric recognizer is vulnerable to spoofing attacks and hence voice biometrics, also called automatic speaker verification (ASV), is no exception; replay, synthesis, and conversion attacks all provoke false acceptances unless countermeasures are used. We focus on voice conversion (VC) attacks, considered among the most challenging for modern recognition systems. To detect spoofing, most existing countermeasures assume explicit or implicit knowledge of a particular VC system and focus on designing discriminative features. In this paper, we explore back-end generative models for more generalized countermeasures. In particular, we model the synthesis-channel subspace to perform speaker verification and antispoofing jointly in the i-vector space, which is a well-established technique for speaker modeling. This enables us to integrate speaker verification and antispoofing tasks into one system without any fusion techniques. To validate the proposed approach, we study vocoder-matched and vocoder-mismatched ASV and VC spoofing detection on the NIST 2006 speaker recognition evaluation data set. Promising results are obtained for standalone countermeasures as well as their combination with ASV systems using score fusion and the joint approach.
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
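A minimal Python sketch of the Adam update rule described above, applied to a toy quadratic objective with the commonly used default hyper-parameters, is shown below; it is an illustration, not the authors' reference implementation:

    # Minimal sketch of the Adam update rule: exponential moving averages of the
    # gradient (first moment) and squared gradient (second moment), bias-corrected,
    # drive a per-parameter adaptive step. Applied to a toy quadratic objective.
    import numpy as np

    def adam_minimize(grad_fn, theta0, steps=2000, lr=0.01,
                      beta1=0.9, beta2=0.999, eps=1e-8):
        theta = np.array(theta0, dtype=float)
        m = np.zeros_like(theta)   # first-moment estimate
        v = np.zeros_like(theta)   # second-moment estimate
        for t in range(1, steps + 1):
            g = grad_fn(theta)
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            m_hat = m / (1.0 - beta1 ** t)       # bias correction
            v_hat = v / (1.0 - beta2 ** t)
            theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta

    # Toy objective f(theta) = ||theta - [1, -2]||^2 with gradient 2 * (theta - target).
    target = np.array([1.0, -2.0])
    print(adam_minimize(lambda th: 2.0 * (th - target), theta0=[0.0, 0.0]))  # approaches [1, -2]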
Rohan Kumar Das. The attacker's perspective on automatic speaker verification: An overview. Jan 2020, 4213.
Jee-Weon Jung, Hye-Jin Shim, Hee-Soo Heo, Ha-Jin Yu. Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge. Jan 2019.
Bhusan Chettri, Daniel Stoller, Veronica Morfi, Marco A. Martínez Ramírez, Emmanouil Benetos, Bob L. Sturm. Jan 2019.
Tijana Nosek, Siniša Suzić, Boris Papić. Synthesized speech detection based on spectrogram and convolutional neural networks. Jan 2019.
Anagha Sonawane, Inamdar. Synthetic speech spoofing detection using MFCC and SVM. Jan 2017.
Shengyun Wei, Shun Zou, Feifan Liao, et al. A comparison on data augmentation methods based on deep learning for audio classification. Jan 2020.
A study on spoofing attack in state-of-the-art speaker verification: The telephone speech case