
Detecting AI-Synthesized Speech Using Bispectral Analysis

Ehab A. AlBadawy and Siwei Lyu
University at Albany, SUNY
Albany NY, USA
{ealbadawy, slyu}@albany.edu
Hany Farid
University of California, Berkeley
Berkeley CA, USA
{hfarid}@berkeley.edu
Abstract
From speech to images and videos, advances in machine learning have led to dramatic improvements in the quality and realism of so-called AI-synthesized content. While there are many exciting and interesting applications, this type of content can also be used to create convincing and dangerous fakes. We seek to develop forensic techniques that can distinguish a real human voice from a synthesized voice. We observe that the deep neural networks used to synthesize speech introduce specific and unusual spectral correlations not typically found in human speech. Although not necessarily audible, these correlations can be measured using tools from bispectral analysis and used to distinguish human from synthesized speech.
1. Introduction
Recent advances in AI-synthesized content generation are leading to the creation of highly realistic audio [11, 4], images [6, 5], and video [10, 7, 14, 13, 1]. While there are many interesting and artistic applications for this type of synthesized content, these same techniques can also be weaponized: a video of a world leader threatening another nation could lead to an international crisis; a video of a presidential candidate saying something inappropriate, released 24 hours before an election, could interfere with a democratic election; and a video of a CEO privately claiming that her company's profits are down could enable global stock manipulation. Advances in deep learning have led to the development of synthesis tools for creating the video and audio needed for these types of fakes.
As these synthesis tools become more powerful and readily available, there is a growing need to develop forensic techniques to detect the resulting synthesized content. We describe a technique for distinguishing human speech from synthesized speech that leverages higher-order spectral correlations revealed by bispectral analysis. We show that these correlations are not present in a wide variety of recorded human speech, but are present in speech synthesized with several state-of-the-art AI systems. We also show that these correlations are likely the result of fundamental properties of the synthesis process, which would be difficult to eliminate as a countermeasure.
In the general area of audio forensics, there are a number of techniques for detecting various forms of audio spoofing [15]. These techniques, however, do not explicitly address the detection of synthesized speech. Previous work [3] showed that certain forms of audio tampering can introduce the same type of higher-order artifacts that we exploit here. This previous work, however, did not address the issue of synthesized content.
In comparing different features and techniques for synthetic-speech detection, the authors of [12] found that features based on high-frequency spectral magnitudes and phases are most effective for distinguishing human from synthesized speech. These features are based on first-order Fourier coefficients or their second-order power-spectrum correlations. In contrast to these first- and second-order spectral features, which might be easy to adjust to match human speech, we explore higher-order polyspectral features that are both discriminating and should prove more difficult for a synthesizer to adjust.
2. Methods
We begin by describing the data set of human and synthesized content that we recorded and created. We then describe the polyspectral analysis tools that underlie our technique, followed by a qualitative assessment of the differences in the bispectral properties of human and synthesized content. We conclude this section with a description of a simple classifier that characterizes these differences for the purpose of automatically distinguishing between human and synthesized speech.
2.1. Data set
We collected a data set consisting of 1,845 human and synthesized speech recordings. The human speech was obtained from nine people (five male and four female). These recordings were extracted from various high-quality podcasts. Each recording averaged 10.5 seconds in length.
The same texts spoken by the human subjects (transcribed from the recordings) were used to synthesize audio samples using various automatic text-to-speech synthesis methods, including Amazon Polly, Apple text-to-speech, Baidu DeepVoice, and Google WaveNet¹. We also include samples generated using the Lyrebird.ai API which, unlike the other synthesis methods, generates personalized speech styles (because of limited access to this API, the texts spoken were not matched to the human and other synthesized speech). In synthesizing these recordings, a range of speaker profiles was selected to increase the diversity of the synthesized voices.
2.2. Bispectral Analysis
In this section, we describe the basic statistical tools used
to analyze audio recordings. The bispectrum of a signal
represents higher-order correlations in the Fourier domain.
An audio signal y(k) is first decomposed according to the Fourier transform:

    Y(ω) = Σ_{k=−∞}^{∞} y(k) e^{−ikω},    (1)

with ω ∈ [−π, π]. It is common practice to use the power spectrum of the signal, P(ω), to detect the presence of second-order correlations, which is defined as:

    P(ω) = Y(ω) Y*(ω),    (2)

where * denotes the complex conjugate. The power spectrum is, however, blind to higher-order correlations, which are of primary interest to us. These third-order correlations can be detected by turning to higher-order spectral analysis [9]. The bispectrum, for example, is used to detect the presence of third-order correlations:

    B(ω1, ω2) = Y(ω1) Y(ω2) Y*(ω1 + ω2).    (3)

Unlike the power spectrum, the bispectral response reveals correlations between the triples of harmonics [ω1, ω1, ω1 + ω1], [ω2, ω2, ω2 + ω2], [ω1, ω2, ω1 + ω2], and [ω1, −ω2, ω1 − ω2]. Note that, unlike the power spectrum, the bispectrum in Equation (3) is a complex-valued quantity. From an interpretive stance it will be convenient to express the complex bispectrum with respect to its magnitude:

    |B(ω1, ω2)| = |Y(ω1)| · |Y(ω2)| · |Y*(ω1 + ω2)|,    (4)
¹Sources: Amazon Polly aws.amazon.com/polly/, Apple text-to-speech API developer.apple.com/documentation/appkit/nsspeechsynthesizer, Baidu DeepVoice r9y9.github.io/deepvoice3_pytorch/, and Google WaveNet r9y9.github.io/wavenet_vocoder/.
and phase:

    ∠B(ω1, ω2) = ∠Y(ω1) + ∠Y(ω2) − ∠Y(ω1 + ω2).    (5)

Also from an interpretive stance it is helpful to work with the normalized bispectrum [2], the bicoherence:

    Bc(ω1, ω2) = [Y(ω1) Y(ω2) Y*(ω1 + ω2)] / √( |Y(ω1) Y(ω2)|² |Y(ω1 + ω2)|² ).    (6)

This normalized bispectrum yields magnitudes in the range [0, 1].
In the absence of noise, the bicoherence can be estimated from a single realization as in Equation (6). In the presence of noise, however, some form of averaging is required to ensure stable estimates. A common form of averaging is to divide the signal into multiple segments. For example, the signal y(n) with n ∈ [1, N] can be divided into K segments of length M = N/K, or K overlapping segments with M > N/K. The bicoherence is then estimated from the average of each segment's bicoherence spectrum:

    B̂c(ω1, ω2) = [ (1/K) Σ_k Yk(ω1) Yk(ω2) Yk*(ω1 + ω2) ] / √( (1/K) Σ_k |Yk(ω1) Yk(ω2)|² · (1/K) Σ_k |Yk(ω1 + ω2)|² ).    (7)

Throughout, we compute the bicoherence with a segment length of N = 64 with an overlap of 32 samples.
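The segment-averaged estimator in Equation (7) can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code; the wrap-around indexing of ω1 + ω2 and the small epsilon in the denominator are our own choices:

```python
import numpy as np

def bicoherence(y, seg_len=64, overlap=32):
    """Estimate the bicoherence of a 1-D signal by averaging the triple
    product Y(w1) Y(w2) Y*(w1 + w2) over overlapping segments (Eq. 7)."""
    hop = seg_len - overlap
    # Fourier transform each overlapping segment.
    starts = range(0, len(y) - seg_len + 1, hop)
    Y = [np.fft.fft(y[s:s + seg_len]) for s in starts]

    idx = np.arange(seg_len)
    w12 = (idx[:, None] + idx[None, :]) % seg_len  # index of w1 + w2 (wrapped)
    num = np.zeros((seg_len, seg_len), dtype=complex)
    d1 = np.zeros((seg_len, seg_len))
    d2 = np.zeros((seg_len, seg_len))
    for Yk in Y:
        triple = Yk[:, None] * Yk[None, :]          # Y(w1) Y(w2)
        num += triple * np.conj(Yk[w12])            # numerator of Eq. (7)
        d1 += np.abs(triple) ** 2                   # first denominator term
        d2 += np.abs(Yk[w12]) ** 2                  # second denominator term
    K = len(Y)
    # Normalize; the epsilon guards against division by zero.
    return (num / K) / np.sqrt((d1 / K) * (d2 / K) + 1e-12)
```

By the Cauchy–Schwarz inequality, the magnitudes of the resulting map lie in [0, 1], matching the normalization property noted for Equation (6).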
2.3. Bispectral Artifacts
Shown in Figure 1 are the bicoherent magnitude and phase for three different human speakers. Shown in the second through sixth rows are the bicoherent magnitude and phase for five different synthesized voices, as described in Section 2.1. Each bicoherent magnitude and phase panel is displayed on the same intensity scale. At first glance, there are some glaring differences in the bicoherent magnitudes (with the exception of Apple) between the human and synthesized speech. There are also strong differences in the bicoherent phases across all synthesized speech.
Because most synthesis methods use deep neural networks, we hypothesize that these bicoherence differences are due to the underlying speech-synthesis network architecture and, in particular, that long-range temporal connections give rise to the unusual spectral correlations. To determine if this might be the case, we created three "clipped" WaveNet network architectures in which the network connectivity was effectively reduced. This was done by first noticing that WaveNet employs 3-tap filters in its convolutional layers. We therefore truncated the full WaveNet models by fixing the left-most value of the convolution filter in one of three layers at zero². With a
²A more direct approach is to simply use a 2-tap filter. This, however, would require retraining the entire model, and so we adopted the simpler approach of zeroing out one of the filter values.
[Figure 1 panels: bicoherent magnitude and phase for speakers 1-3; rows: Human, Amazon, Apple, Baidu, Lyrebird, WaveNet, WaveNet (low clipped), WaveNet (medium clipped), WaveNet (high clipped).]

Figure 1. Bicoherent magnitude and phase for human speakers and five different synthesized voices. Shown in the lower three rows are the results for three different clipped versions of the WaveNet architecture. The magnitude plots are displayed on an intensity scale of [0, 1] and the phase plots are displayed on a scale of [−π, π]. Note the generally larger magnitudes and the stronger phase correlations in the synthesized speech as compared to the human speech, and the reduction in magnitude for the clipped WaveNet architectures.
total of 24 convolutional layers, we performed this manipulation at layer 24 (closest to the output), layer 12, or layer 1 (closest to the input). The effective network clipping was more pronounced for manipulations at layers closest to the input, as this clipping propagates through the entire network.
Shown in the last three rows of Figure 1 are the resulting bicoherence magnitudes and phases for three recordings synthesized with these three networks with increasing amounts of "clipping". As can clearly be seen, the bicoherence magnitude decreases with an increasing reduction in network connectivity, and begins to appear more like that of the human speakers in the first row of Figure 1. At the same time, there is little impact on the bicoherence phase, most likely because our network manipulation did not remove all of the long-range connections. Although this does not prove that the network architecture is solely responsible for the increased bicoherence, it provides preliminary evidence to suggest that this is the case. We note that the artifacts from Apple are more subdued than the others. This may be related to the fact that the quality of Apple's speech is significantly less realistic than Google's and Amazon's, possibly because the underlying technique is not based on the same type of network architecture that we believe is introducing the polyspectral correlations. Regardless of precisely why these correlations are introduced, we next show that the bicoherence differences can be used to automatically distinguish between human and synthesized speech.
2.4. Bispectral Classification
The bicoherence, Equation (7), is computed for each human and synthesized speech recording, from which the bicoherence magnitude and phase are extracted. These two-dimensional quantities are normalized such that, for each frequency ω1, the magnitude and phase are mapped into the range [0, 1] by subtracting the minimum value and dividing by the resulting maximum value.
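The per-frequency normalization described above can be sketched as follows (a minimal NumPy version; the guard for constant rows is our own addition):

```python
import numpy as np

def normalize_rows(A):
    """Map each row (each frequency w1) of a bicoherence magnitude or phase
    map into [0, 1]: subtract the row minimum, then divide by the resulting
    row maximum."""
    A = A - A.min(axis=1, keepdims=True)
    m = A.max(axis=1, keepdims=True)
    return A / np.where(m == 0.0, 1.0, m)  # avoid dividing a constant row by 0
```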
The normalized magnitude and phase are each characterized using the first four statistical moments. Let the random variables M and P denote the underlying distributions for the bicoherence magnitude and phase. The first four statistical moments are given by:

    mean:     μX = E_X[X]
    variance: σX = E_X[(X − μX)²]
    skewness: γX = E_X[((X − μX)/σX)³]
    kurtosis: κX = E_X[((X − μX)/σX)⁴]

where E_X[·] is the expected-value operator with respect to the random variable X. From the magnitude X = M and phase
Figure 2. A 2-D slice of the full 8-D statistical characterization of the bicoherence magnitude and phase. The open blue circles correspond to human speech and the remaining filled colored circles correspond to synthesized speech. Even in this reduced-dimensional space, the human speech is clearly distinct from the synthesized speech.
X = P, these four moments are estimated by replacing the expected-value operator with a sample average. With this statistical characterization, each recording is reduced to an 8-D feature vector.
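The reduction of each normalized map to its four moments, and the concatenation into the 8-D feature vector, can be sketched as follows (an illustrative implementation; the constant-input guard is our own addition):

```python
import numpy as np

def moment_features(x):
    """First four moments (mean, variance, skewness, kurtosis) of a map,
    with the expected value replaced by a sample average."""
    x = np.asarray(x, dtype=float).ravel()
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    std = np.sqrt(var)
    if std == 0.0:                      # constant input: higher moments vanish
        return np.array([mu, 0.0, 0.0, 0.0])
    skew = (((x - mu) / std) ** 3).mean()
    kurt = (((x - mu) / std) ** 4).mean()
    return np.array([mu, var, skew, kurt])

def feature_vector(bicoh):
    """8-D feature vector: four moments of the bicoherence magnitude
    followed by four moments of the phase."""
    return np.concatenate([moment_features(np.abs(bicoh)),
                           moment_features(np.angle(bicoh))])
```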
Shown in Figure 2 is a scatter plot of the mean bicoherence magnitude versus the mean bicoherence phase for the human speech and each type of synthesized speech. This figure illustrates some interesting aspects of the bicoherence statistics of the human and synthesized recordings. Even in this reduced-dimensional space, which does not account for variance, skewness, or kurtosis, each type of signal is well clustered and (with the exception of Amazon and WaveNet) distinct from the other types. This suggests that it will be relatively straightforward to distinguish between these different recordings.
Also shown in Figure 2 are six speech samples synthesized with a more recent generative adversarial network (GAN) based model [8]³. Although the GAN-based model has a different synthesis mechanism, the synthesized content still exhibits distinct bispectral statistics.
The scatter plot in Figure 2 suggests two possible approaches to building a classifier: a one-class non-linear support vector machine (SVM), or a collection of linear classifiers. Primarily for simplicity, we choose the latter. In particular, we train a linear classifier to distinguish each
³No code is publicly available; the six samples were downloaded from fangfm.github.io/crosslingualvc.html.
Figure 3. ROC curve for binary classification of human versus synthetic speech (solid red line). The dashed and dotted lines correspond to the accuracies for these same recordings with varying amounts of additive noise. See also Figure 4.
category of recording (human, Amazon, Apple, Baidu, Google, and Lyrebird) from all other recordings. Following this strategy, five separate logistic-regression classifiers are trained to distinguish each type of synthesized audio from all other categories. For example, the first classifier is trained to distinguish Amazon recordings from Apple, Baidu, Google, Lyrebird, and human recordings. Our full data set consists of 100 human recordings, along with 800 Amazon (8 speaker profiles), 400 Apple (4 speaker profiles), 100 Baidu (1 speaker profile), 400 Google (4 speaker profiles), and 45 Lyrebird recordings (5 recordings for each of 9 speaker profiles). Because of the across-class imbalance, the training data set consisted of 70% of these samples, with a maximum of 90 samples per category; the remaining data were used for testing.
The logistic-regression classifier is implemented using scikit-learn⁴. At test time, a speech sample is scored by each classifier (Amazon, Apple, Baidu, Google, and Lyrebird). If the maximum classification score across all five classifiers is above a specified threshold, then the recording is classified as synthesized; otherwise it is classified as human.
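The one-vs-rest training and max-score thresholding described above can be sketched with scikit-learn. The label encoding (0 = human, 1-5 = the five synthesis engines) and the threshold value here are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_detectors(X, y):
    """Train one logistic-regression classifier per synthesis engine,
    each separating that engine from all other recordings."""
    detectors = []
    for engine in range(1, 6):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, (y == engine).astype(int))  # engine vs. rest
        detectors.append(clf)
    return detectors

def is_synthesized(detectors, x, threshold=0.5):
    """Classify as synthesized if any engine-specific score exceeds
    the threshold; otherwise classify as human."""
    scores = [clf.predict_proba(x.reshape(1, -1))[0, 1] for clf in detectors]
    return max(scores) > threshold
```

Sweeping the threshold over [0, 1] traces out an ROC curve of the kind reported in the next section.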
3. Results
⁴scikit-learn.org

Figure 4. Confusion matrix for classifying a recording as human or as synthesized by one of five techniques. See also Figure 3.

We test the performance of distinguishing human speech from synthesized speech based on the 8-D summary bicoherence statistics. Shown in Figure 3 are the receiver operating characteristic (ROC) curves for this binary classification. The solid curve, with an area under the curve (AUC) of 0.99, corresponds to the original-quality recordings. The remaining dashed/dotted colored curves correspond to recordings that were laundered with varying amounts of additive noise (signal-to-noise ratio (SNR) between 20 and 40 dB) followed by re-compression at a quality of 128 kilobits per second (kbit/s). At high SNR, the AUC remains above 0.98, and the AUC decreases with increasing amounts of additive noise.

When the original recordings are recompressed at a lower quality of 64 kbit/s, the overall AUC remains high at 0.99, suggesting that the bispectral statistics are robust to recompression.
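The noise laundering used in this experiment amounts to adding white Gaussian noise scaled to a target SNR; a minimal sketch (our own reconstruction of the procedure, with a fixed seed for reproducibility):

```python
import numpy as np

def add_noise(y, snr_db, seed=0):
    """Add white Gaussian noise to a recording at a target SNR in dB."""
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    rng = np.random.default_rng(seed)
    return y + rng.normal(0.0, np.sqrt(noise_power), y.shape)
```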
Shown in Figure 4 is the confusion matrix for the multi-class classification, showing that the differences in bicoherence statistics are sufficient not only to distinguish human from synthesized speech but also, with a reasonable degree of accuracy, to distinguish between the different types of synthesized speech.
4. Discussion
We have developed a forensic technique that can distinguish human from synthesized speech. This technique is based on the observation that current speech-synthesis algorithms introduce specific and unusual higher-order bispectral correlations that are not typically found in human speech. We have provided preliminary evidence that these correlations are the result of the long-range correlations introduced by the underlying network architectures used to synthesize speech. This bodes well for the forensic community, as these network architectures also appear to be what gives rise to more realistic-sounding speech (despite the unusual bispectral correlations). More work, however, remains to be done to more precisely understand the specific source of the unusual bispectral correlations.
As with any forensic technique, thought must be given to countermeasures that an adversary might adopt. While it would be straightforward to match first-order spectral correlations between human and synthesized speech, the higher-order spectral correlations are not so easily matched. In particular, we know of no closed-form solution for inverting the bispectrum or bicoherence. It remains to be seen whether other techniques, such as generative adversarial networks, can synthesize audio while matching the bispectral artifacts that currently can be used to distinguish human from synthesized speech.
Acknowledgment
This research was developed with funding from Microsoft, a Google Faculty Research Award, and the Defense Advanced Research Projects Agency (DARPA FA8750-16-C-0166). The views, opinions, and findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
References
[1] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody dance now. arXiv preprint arXiv:1808.07371, 2018.
[2] J. W. A. Fackrell and Stephen McLaughlin. Detecting nonlinearities in speech sounds using the bicoherence. Proceedings of the Institute of Acoustics, 18(9):123-130, 1996.
[3] Hany Farid. Detecting digital forgeries using bispectral analysis. Technical Report AI Memo 1657, MIT, June 1999.
[4] Yu Gu and Yongguo Kang. Multi-task WaveNet: A multi-task generative model for statistical parametric speech synthesis without fundamental frequency conditions. In Interspeech, Hyderabad, India, 2018.
[5] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[6] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[7] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. Deep video portraits. ACM Transactions on Graphics, 37(4):163, 2018.
[8] Jaime Lorenzo-Trueba, Fuming Fang, Xin Wang, Isao Echizen, Junichi Yamagishi, and Tomi Kinnunen. Can we steal your vocal identity from the internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data. In The Speaker and Language Recognition Workshop (Odyssey), Les Sables d'Olonne, France, 2018.
[9] Jerry M. Mendel. Tutorial on higher order statistics (spectra) in signal processing and system theory: Theoretical results and some applications. Proceedings of the IEEE, 79:278-305, 1991.
[10] Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, Hao Li, Richard Roberts, et al. paGAN: Real-time avatars using dynamic textures. In SIGGRAPH Asia 2018 Technical Papers, page 258. ACM, 2018.
[11] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: 2000-speaker neural text-to-speech. arXiv preprint arXiv:1710.07654, 2017.
[12] Md Sahidullah, Tomi Kinnunen, and Cemal Hanilçi. A comparison of features for synthetic speech detection. In Interspeech, Dresden, Germany, 2015.
[13] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics, 36(4):95, 2017.
[14] Justus Thies, Michael Zollhöfer, Christian Theobalt, Marc Stamminger, and Matthias Nießner. HeadOn: Real-time reenactment of human portrait videos. ACM Transactions on Graphics, 36(4):95, 2018.
[15] Mohammed Zakariah, Muhammad Khurram Khan, and Hafiz Malik. Digital multimedia audio forensics: Past, present and future. Multimedia Tools and Applications, 77(1):1009-1040, 2018.
109
... Existing approaches have either targeted spatial and temporal artifacts left during the generation or datadriven classification. The spatial artifacts include inconsistencies [78,81,114,188,193,[201][202][203], abnormalities in background [160,194,198], and GAN fingerprints [74,163,204,205]. The temporal artifacts involve detecting variation in a person's behavior [83,88,200], physiological signals [77,78,85,89], coherence [190,199,206], or video frame synchronization [33,75,91,138,207,208]. ...
... Due to handcrafted features, however, this work is not generalized to unseen manipulations. In [202] bispectral analysis is performed in order to identify specific and unusual spectral correlations present in GAN generated speech samples. Similarly, in [281] bispectral and Melcepstral analysis are performed in order to detect missing durable power components in synthesized speech. ...
... The computed features are then used to train several ML-based classifiers and attained the best performance using a Quadratic SVM. These approaches [202,281] are robust to TTS synthesized audio, however, they may not be able to detect highquality synthesized speech. Chen et al. [285] propose a DLbased framework for audio deepfake detection. ...
Article
Full-text available
Easy access to audio-visual content on social media, combined with the availability of modern tools such as Tensorflow or Keras, and open-source trained models, along with economical computing infrastructure, and the rapid evolution of deep-learning (DL) methods have heralded a new and frightening trend. Particularly, the advent of easily available and ready to use Generative Adversarial Networks (GANs), have made it possible to generate deepfakes media partially or completely fabricated with the intent to deceive to disseminate disinformation and revenge porn, to perpetrate financial frauds and other hoaxes, and to disrupt government functioning. Existing surveys have mainly focused on the detection of deepfake images and videos; this paper provides a comprehensive review and detailed analysis of existing tools and machine learning (ML) based approaches for deepfake generation, and the methodologies used to detect such manipulations in both audio and video. For each category of deepfake, we discuss information related to manipulation approaches, current public datasets, and key standards for the evaluation of the performance of deepfake detection techniques, along with their results. Additionally, we also discuss open challenges and enumerate future directions to guide researchers on issues which need to be considered in order to improve the domains of both deepfake generation and detection. This work is expected to assist readers in understanding how deepfakes are created and detected, along with their current limitations and where future research may lead.
... To combat this growing threat, the research community has recently focused significant effort on developing algorithms to detect deepfakes. Several approaches have been proposed to identify both video [25,36,32,3,12] and audio [34,5] deepfakes, and multiple deepfake databases have been created to support this research [46,33,19,56]. Since deepfake technology continues to evolve, developing a wide variety of detection methods is essential to address this problem. ...
... Alternatively, in [28] the authors show that the use of long-term audio features has benefits over short-term ones. In [5], audio bicoherence is used instead. More recent methods explore fully data-driven approaches [18]. ...
... There are so many cloned speeches and dangerous deep fakes flooding everywhere, that it causes an urgent concern for authenticating the digital data prior to putting trust in it's content. Though the research in speech forensics has expedited in the last decade, still the literature presents limited research that deals with synthetic speech generated using well known applications like Baidu's text to speech, Amazon's Alexa, Google's wave-net, Apple's Siri, etc. [7]. The speech generation methods using deep neural nets has become so common that free open source code are readily available for generation of synthetic audios. ...
... We tested the accuracies of these features over different ML algorithms when taken individually which are listed in Table 1 to Table 5. • Sub-Task 2: In this task, we combined two of the features i.e Bicoherence Magnitude, and Bicoherence Phase and tested their accuracies over different ML algorithms. These were the same features as considered by [7] in Table 6. ...
Preprint
The recent developments in technology have re-warded us with amazing audio synthesis models like TACOTRON and WAVENETS. On the other side, it poses greater threats such as speech clones and deep fakes, that may go undetected. To tackle these alarming situations, there is an urgent need to propose models that can help discriminate a synthesized speech from an actual human speech and also identify the source of such a synthesis. Here, we propose a model based on Convolutional Neural Network (CNN) and Bidirectional Recurrent Neural Network (BiRNN) that helps to achieve both the aforementioned objectives. The temporal dependencies present in AI synthesized speech are exploited using Bidirectional RNN and CNN. The model outperforms the state-of-the-art approaches by classifying the AI synthesized audio from real human speech with an error rate of 1.9% and detecting the underlying architecture with an accuracy of 97%.
... Witkowski et al. [202] proposed a model that exploits the high-frequency signal of the replayed recording for spoof detection. AlBadawy et al. [203] did a bi-spectral analysis to find the high order spectral correlation introduced by the speech synthesis mechanism. ...
Article
Full-text available
In the last few years, with the advancement of deep learning methods, especially Generative Adversarial Networks (GANs) and Variational Auto-encoders (VAEs), fabricated content has become more realistic and believable to the naked eye. Deepfake is one such emerging technology that allows the creation of highly realistic, believable synthetic content. On the one hand, Deepfake has paved the way for highly advanced applications in various fields like advertising, creative arts, and film productions. On the other hand, it poses a threat to various Multimedia Information Retrieval Systems (MIPR) such as face recognition and speech recognition systems and has more significant societal implications in spreading misleading information. This paper aims to assist an individual in understanding the deepfake technology (along with its application), current state-of-the-art methods and gives an idea about the future pathway of this technology. In this paper, we have presented a comprehensive literature survey on the application of deepfakes, followed by discussions on state-of-the-art methods for deepfake generation and detection for three media: Image, Video, and Audio. Next, we have extensively discussed the architectural components and dataset used for various methods of deepfakes. Furthermore, we discuss the various limitations and open challenges of deepfakes to identify the research gaps in this field. Finally, discuss the conclusion and future directions to explore the potential of this technology in the coming years.
... Some of them using the FoR or ASVSpoof dataset, while others use H-Voice [24] and Imitation datasets [25]. In DeepSonar [26], the authors use a neural network architecture, whose input corresponds to bispectral images obtained from the audios of the FoR dataset, similar to that suggested by [27]. The authors proposed two schemes (i.e., TKAN and ACN) with different results for the same input images. ...
Chapter
The increase in the number of algorithms and commercial tools for creating synthetic audio has led to a high level of misinformation, especially on social media. As a consequence, efforts have been focused in recent years on detecting this type of content. However, this task is far from being successfully addressed, as the naturalness of fake audios is increasing. In this paper we present a model to classify audios between natural and fake, using an audio preparation stage that includes raw audio transformation, and a modelling stage by means of a custom Convolutional Neural Network (CNN) architecture. Our model is trained on data from the FoR dataset, which contains natural and synthetic audios obtained from several algorithms for deepfake content generation. The performance of the model is evaluated with different metrics such as F1 score, precision (P) and recall (R). According to the results, the audios are successfully classified in 88.9% of the cases.
Chapter
One particular disconcerting form of disinformation are the impersonating audios/videos backed by advanced AI technologies, in particular, deep neural networks (DNNs). These media forgeries are commonly known as the DeepFakes. The AI-based tools are making it easier and faster than ever to create compelling fakes that are challenging to spot. While there are interesting and creative applications of this technology, it can be weaponized to cause negative consequences. In this chapter, we survey the state-of-the-art DeepFake detection methods.
Article
Full-text available
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
Article
Full-text available
Digital audio forensics is used for a variety of applications ranging from authenticating audio files to link an audio recording to the acquisition device (e.g., microphone), and also linking to the acoustic environment in which the audio recording was made, and identifying traces of coding or transcoding. This survey paper provides an overview of the current state-of-the-art (SOA) in digital audio forensics and highlights some open research problems and future challenges in this active area of research. The paper categorizes the audio file analysis into container and content-based analysis in order to detect the authenticity of the file. Existing SOA, in audio forensics, is discussed based on both container and content-based analysis. The importance of this research topic has encouraged many researchers to contribute in this area; yet, further scopes are available to help researchers and readers expand the body of knowledge. The ultimate goal of this paper is to introduce all information on audio forensics and encourage researchers to solve the unanswered questions. Our survey paper would contribute to this critical research area, which has addressed many serious cases in the past, and help solve many more cases in the future by using advanced techniques with more accurate results.
Conference Paper
The performance of biometric systems based on automatic speaker recognition technology is severely degraded by spoofing attacks with synthetic speech generated using different voice conversion (VC) and speech synthesis (SS) techniques. Various countermeasures have been proposed to detect this type of attack, and in this context, choosing an appropriate feature extraction technique for capturing relevant information from speech is an important issue. This paper presents a concise experimental review of different features for the synthetic speech detection task. The wide variety of features considered in this study includes previously investigated features as well as other potentially useful features for characterizing real and synthetic speech. The experiments are conducted on the recently released ASVspoof 2015 corpus, which contains speech data from a large number of VC and SS techniques. Comparative results using two different classifiers indicate that features representing spectral information in the high-frequency region, dynamic information of speech, and detailed information related to subband characteristics are considerably more useful in detecting synthetic speech.
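The review's headline finding, that spectral information in the high-frequency region is especially discriminative, suggests a very simple feature one might feed a classifier. The sketch below is our own illustration (the function name, band count, and frame parameters are assumptions, not taken from the cited paper): it averages log energies in subbands of the upper half of the spectrum across overlapping frames.

```python
import numpy as np

def highband_log_energies(x, nfft=512, hop=256, bands=4, cutoff=0.5):
    """Average log energy in subbands of the upper part of the spectrum.

    Splits the top (1 - cutoff) fraction of the rfft band into `bands`
    equal subbands, computes per-frame subband energies, and averages
    their logs over frames, yielding a length-`bands` feature vector.
    """
    win = np.hanning(nfft)
    # Overlapping, Hanning-windowed frames of the signal.
    frames = [win * x[s:s + nfft] for s in range(0, len(x) - nfft + 1, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1))   # (num_frames, nfft//2 + 1)
    lo = int(cutoff * mags.shape[1])             # first high-band bin
    sub = np.array_split(np.arange(lo, mags.shape[1]), bands)
    feats = [np.mean(np.log(np.sum(mags[:, idx] ** 2, axis=1) + 1e-10))
             for idx in sub]
    return np.array(feats)
```

A signal with most of its energy above the cutoff frequency yields much larger values in this vector than a low-frequency signal, which is exactly the kind of separation the cited experiments exploit.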
Article
A Style-Based Generator Architecture for Generative Adversarial Networks
We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
Article
Progressive Growing of GANs for Improved Quality, Stability, and Variation
We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CelebA images at 1024^2. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.
Article
Synthesizing Obama: Learning Lip Sync from Audio
Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.
Article
Detecting Digital Forgeries Using Bispectral Analysis
In this paper we address the problem of authenticating digital signals assuming no explicit prior knowledge of the original. The basic approach that we take is to assume that, in the frequency domain, a "natural" signal has weak higher-order statistical correlations. We then show that "un-natural" correlations are introduced if this signal is passed through a non-linearity (which would almost surely occur in the creation of a forgery). Techniques from polyspectral analysis are then used to detect the presence of these correlations. We review the basics of polyspectral analysis, show how and why these tools can be used in detecting forgeries, and demonstrate their effectiveness in analyzing human speech.
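The polyspectral machinery described in this abstract can be sketched in a few lines. The bicoherence estimator below (the function name, FFT length, and hop size are our own illustrative choices, not from the cited work) averages the triple product X(w1) X(w2) X*(w1+w2) over overlapping Hanning-windowed segments and normalizes so that magnitudes near 1 indicate strong quadratic phase coupling and magnitudes near 0 indicate none.

```python
import numpy as np

def bicoherence(x, nfft=64, hop=32):
    """Estimate the bicoherence (normalized bispectrum) of a 1-D signal.

    Returns an (nfft//2, nfft//2) array whose (i, j) entry measures the
    phase coupling between frequency bins i, j, and i + j; by the
    Cauchy-Schwarz inequality every entry lies in [0, 1].
    """
    win = np.hanning(nfft)
    k = np.arange(nfft // 2)
    num = np.zeros((nfft // 2, nfft // 2), dtype=complex)
    den1 = np.zeros((nfft // 2, nfft // 2))
    den2 = np.zeros((nfft // 2, nfft // 2))
    # Average the triple product over overlapping windowed segments.
    for s in range(0, len(x) - nfft + 1, hop):
        X = np.fft.fft(win * x[s:s + nfft])
        Xk = X[k]                                    # X(w1), X(w2)
        Xsum = X[(k[:, None] + k[None, :]) % nfft]   # X(w1 + w2)
        pair = Xk[:, None] * Xk[None, :]
        num += pair * np.conj(Xsum)
        den1 += np.abs(pair) ** 2
        den2 += np.abs(Xsum) ** 2
    return np.abs(num) / (np.sqrt(den1 * den2) + 1e-12)
```

A signal containing tones at f1, f2, and f1 + f2 whose phases are coupled (the third phase equals the sum of the first two) yields bicoherence near 1 at the (f1, f2) bin, whereas Gaussian noise yields values near 0 there; this is the kind of higher-order correlation the forgery detector looks for.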
J.W.A. Fackrell and Stephen McLaughlin. Detecting nonlinearities in speech sounds using the bicoherence. Proceedings of the Institute of Acoustics, 18(9):123-130, 1996.