Journal Pre-proof
A dataset of histograms of original and fake voice recordings (H-Voice)
Dora M. Ballesteros, Yohanna Rodriguez, Diego Renza
PII: S2352-3409(20)30225-0
DOI: https://doi.org/10.1016/j.dib.2020.105331
Reference: DIB 105331
To appear in: Data in Brief
Received Date: 2 December 2019
Revised Date: 4 February 2020
Accepted Date: 17 February 2020
Please cite this article as: D.M. Ballesteros, Y. Rodriguez, D. Renza, A dataset of histograms of original
and fake voice recordings (H-Voice), Data in Brief, https://doi.org/10.1016/j.dib.2020.105331.
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.
© 2020 The Author(s). Published by Elsevier Inc.
A dataset of histograms of original and fake voice recordings (H-Voice)

Dora M. Ballesteros¹, Yohanna Rodriguez¹, Diego Renza¹

¹ Universidad Militar Nueva Granada

Corresponding author: Dora M. Ballesteros (dora.ballesteros@unimilitar.edu.co)
Abstract
This paper presents H-Voice, a dataset of 6672 histograms of original and fake voice recordings, the latter obtained by the Imitation [1, 2] and Deep Voice [3] methods. The dataset is organized into six directories: Training_fake, Training_original, Validation_fake, Validation_original, External_test1, and External_test2. The training directories include 2088 histograms of fake voice recordings and 2020 histograms of original voice recordings. The validation directories contain 864 histograms each, from fake and original voice recordings, respectively. Finally, External_test1 has 760 histograms (380 from fake voice recordings obtained by the Imitation method and 380 from original voice recordings), and External_test2 has 76 histograms (72 from fake voice recordings obtained by the Deep Voice method and 4 from original voice recordings). With this dataset, researchers can train, cross-validate, and test classification models using machine learning techniques to identify fake voice recordings.
Keywords
Fake voice; Machine Learning; Convolutional Neural Networks; binary classification; Imitation; Deep
Voice; H-Voice.
Specifications Table

Subject: Computer Vision and Pattern Recognition
Specific subject area: Image processing related to identifying/classifying tampered data
Type of data: Images
How data were acquired: The images were obtained by calculating the histograms of original and fake voice recordings from repositories of the Deep Voice (https://audiodemos.github.io/) and Imitation (http://dx.doi.org/10.17632/ytkv9w92t6.1) methods.
Data format: Raw: histograms (PNG)
Parameters for data collection: The voice recordings are re-quantized to 16 bits. Histograms with 2^16 bins are calculated from the voice recordings (original or fake).
Description of data collection: The dataset is composed of six directories, organized as follows:
1. Training_fake: 2088 histograms from fake voice recordings (by the Imitation and Deep Voice methods)
2. Training_original: 2020 histograms from original voice recordings
3. Validation_fake: 864 histograms from fake voice recordings (by the Imitation method)
4. Validation_original: 864 histograms from original voice recordings
5. External_test1: 760 histograms from original and fake voice recordings (by the Imitation method)
6. External_test2: 76 histograms from original and fake voice recordings (by the Deep Voice method)
Data source location: City: Bogotá; Country: Colombia
Data accessibility: Repository name: Mendeley; Data name: H-Voice: Fake voice histograms (Imitation+DeepVoice) [4]; Direct URL to data: http://dx.doi.org/10.17632/k47yd3m28w.4
Value of the Data

• This is the first dataset of histograms of original and fake voice recordings. The histograms are obtained from real signals, original and fake, the latter produced by the Imitation [1, 2] and Deep Voice [3] methods.
• This dataset of histograms allows fake voice classifiers to be trained, cross-validated, and tested with machine learning techniques such as convolutional neural networks, much as anti-spoofing speaker verification systems are trained on spectrogram features [5, 6]; a minimal training sketch is given after this list.
• The dataset is balanced between original and fake voice recordings, a desirable condition for obtaining a good trade-off between precision and recall.
• This dataset can be used to compare the performance of different fake voice classification models.
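As a concrete illustration, the following MATLAB sketch trains a small convolutional network on the two training directories. It is a minimal sketch, not the authors' published pipeline: it requires the Deep Learning Toolbox, and the folder path, input size, layer sizes, and training options are illustrative assumptions.

% Hypothetical training sketch; 'H-Voice' is an assumed local folder name
imds = imageDatastore(fullfile('H-Voice', {'Training_fake', 'Training_original'}), ...
    'LabelSource', 'foldernames'); % label each histogram by its folder name
augTrain = augmentedImageDatastore([227 227], imds, 'ColorPreprocessing', 'gray2rgb'); % resize on the fly
layers = [
    imageInputLayer([227 227 3]) % input size is an assumption
    convolution2dLayer(3, 16, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(2) % two classes: fake and original
    softmaxLayer
    classificationLayer];
options = trainingOptions('adam', 'MaxEpochs', 5, 'Verbose', false);
net = trainNetwork(augTrain, layers, options); % train the binary classifier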
Data Description
This dataset is composed of histograms (images) of original and fake voice recordings obtained by two methods: Imitation [1, 2] and Deep Voice [3]. The dataset has four versions in Mendeley, which differ in the number of histograms: version 1 has 3432 histograms, version 2 has 3792 histograms, and version 3 has 6672 histograms; in version 4, corrupted images have been fixed. The latest version (i.e., version 4) is the one described in this document. It is organized into six directories: Training_original, Training_fake, Validation_original, Validation_fake, External_test1, and External_test2 [4].
Figure 1. H-Voice dataset structure.
Figure 1 shows the structure of the dataset, which is explained below; a short MATLAB sketch for inspecting the layout follows the list:
1. Training_original: 2020 histograms from original voice recordings.
2. Training_fake: 2088 histograms from fake voice recordings, of which 2016 histograms are
obtained by the Imitation method, and 72 by the Deep Voice method.
3. Validation_original: 864 histograms from original voice recordings.
4. Validation_fake: 864 histograms from fake voice recordings obtained by the Imitation method.
5. External_test1: this is composed of 380 histograms of original voice recordings and 380
histograms of fake voice recordings obtained by the Imitation method.
6. External_test2: this is composed of 4 histograms of original voice recordings and 72 histograms
of fake voice recordings obtained by the Deep Voice method.
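The layout can be checked programmatically once downloaded. The following is a minimal sketch, assuming the dataset has been retrieved from the Mendeley repository into a local folder (here called 'H-Voice'; the folder name is an assumption, not part of the dataset):

% Hypothetical layout check; 'H-Voice' is an assumed local folder name
dirs = {'Training_original', 'Training_fake', 'Validation_original', ...
        'Validation_fake', 'External_test1', 'External_test2'};
for k = 1:numel(dirs)
    imds = imageDatastore(fullfile('H-Voice', dirs{k})); % one datastore per directory
    fprintf('%s: %d histograms\n', dirs{k}, numel(imds.Files)); % expected counts as listed above
end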
Figure 2 shows examples of histograms of original and fake voice recordings from the training and validation directories. Figure 3 and Figure 4 show examples from the External_test1 and External_test2 directories, respectively.
Figure 2. First example of histograms, located in the a) Training_original, b) Training_fake, c) Validation_original, and d) Validation_fake directories.
Figure 3. Second example of histograms, located in the External_test1 directory: a) original voice recording, b) fake voice recording obtained by the Imitation method.
Figure 4. Third example of histograms, located in the External_test2 directory: a) original voice recording, b) fake voice recording obtained by the Deep Voice method.
Experimental Design, Materials, and Methods
Fake voice files are created entirely by a machine, either by machine learning (e.g., the Deep Voice method) or by signal processing techniques (e.g., the Imitation method). This is unlike false voice recordings, which are obtained by spoofing a voice or by manipulating an original voice signal through insertion, deletion, or splicing operations. In the case of the Deep Voice method, a convolutional neural network is trained with original voice recordings to create new (fake) voice recordings whose text differs from the original. The Imitation method, on the other hand, re-orders the wavelet coefficients of the original voice signal so as to imitate the gender, intonation, and rhythm of another speaker.
The first step in creating our histograms was to obtain examples of fake voice recordings produced by the Deep Voice and Imitation methods. In the case of Deep Voice, we used the voice recordings publicly available at https://audiodemos.github.io/. In the case of Imitation, we created the fake voice recordings ourselves with the following MATLAB code (based on the algorithm proposed in [2]):
% Inputs: original.wav, target.wav
% Outputs: fake.wav, key
[original, FS] = audioread('original.wav'); % read the original voice recording
[target, FS] = audioread('target.wav'); % read the target voice recording (to be imitated)
[C1, L1] = wavedec(target, 4, 'db10'); % obtain the wavelet coefficients of the target voice recording
[C2, L2] = wavedec(original, 4, 'db10'); % obtain the wavelet coefficients of the original voice recording
[B1, IX1] = sort(C1, 'descend'); % sort the wavelet coefficients of the target voice recording
[B2, IX2] = sort(C2, 'descend'); % sort the wavelet coefficients of the original voice recording
C2m(IX1) = C2(IX2); % re-order the original coefficients following the target's ordering
key(IX1) = IX2; % obtain the key to reverse the process
fake = waverec(C2m, L1, 'db10'); % create the fake voice from the re-ordered coefficients
audiowrite('fake.wav', fake, FS, 'BitsPerSample', 16); % save the fake voice recording
Examples of original and fake voice recordings obtained with the above algorithm are available at
http://dx.doi.org/10.17632/ytkv9w92t6.1
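The key returned by the algorithm makes the re-ordering reversible. The following is a minimal reversal sketch, an assumption derived from the permutation defined above rather than code published by the authors; it presumes the receiver holds both fake.wav and key, and the 16-bit quantization of fake.wav makes the reconstruction approximate:

% Hypothetical reversal sketch; assumes fake.wav and key are available
[fake, FS] = audioread('fake.wav'); % read the fake voice recording
[C2m, L2m] = wavedec(fake, 4, 'db10'); % recover the re-ordered wavelet coefficients
C2r = zeros(size(C2m)); % preallocate the restored coefficient vector
C2r(key) = C2m; % undo the permutation: C2(key(j)) = C2m(j)
recovered = waverec(C2r, L2m, 'db10'); % approximate reconstruction of the original voice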
Once the fake voice recordings have been generated, the following MATLAB code allows us to draw the histograms (original/fake):
% Input: name.wav
% Output: histogram of the voice recording
[voice, FS] = audioread('name.wav'); % read the original/fake voice recording
nbins = 65536; % number of bins of the histogram (2^16)
h = histogram(voice, nbins); % plot the histogram
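The dataset stores each histogram as a PNG image. One possible export step is sketched below; the output file name is illustrative, not the dataset's actual naming scheme:

saveas(gcf, 'name.png'); % save the current histogram figure as a PNG image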
It is important to note that the examples of fake voice recordings obtained by Deep Voice and published at https://audiodemos.github.io/ were re-quantized to 16 bits before their histograms were computed.
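A minimal sketch of that re-quantization step (the file names are hypothetical; audiowrite performs the quantization when writing at 16 bits per sample):

[voice, FS] = audioread('demo.wav'); % read a Deep Voice demo recording
audiowrite('demo_16bit.wav', voice, FS, 'BitsPerSample', 16); % rewrite at 16 bits per sample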
Acknowledgments
This work is supported by the “Universidad Militar Nueva Granada Vicerrectoría de Investigaciones”
under the grant IMP-ING-2936 of 2019.
Competing Interests
The authors declare that they have no known competing financial interests or personal relationships
which have, or could be perceived to have, influenced the work reported in this article.
References
[1] DM. Ballesteros L, JM. Moreno A, Highly transparent steganography model of speech signals using
Efficient Wavelet Masking, Expert Systems with Applications. 39 (2012) 9141-9149.
https://doi.org/10.1016/j.eswa.2012.02.066.
[2] DM. Ballesteros L, JM. Moreno A, On the ability of adaptation of speech signals and data hiding,
Expert Systems with Applications. 39 (2012) 12574-12579. https://doi.org/10.1016/j.eswa.2012.05.027.
[3] S.O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J.
Raiman, S. Sengupta, M. Shoeybi. Deep Voice: Real-time Neural Text-to-Speech, In Proceedings of the
34th International Conference on Machine Learning. 70 (2017) 195-
204. https://arxiv.org/abs/1702.07825
[4] DM. Ballesteros, YP. Rodriguez, D. Renza, H-Voice: Fake voice histograms (Imitation+DeepVoice), v4,
2020. http://dx.doi.org/10.17632/k47yd3m28w.4
[5] I. Himawan, F. Villavicencio, S. Sridharan, C. Fookes, Deep domain adaptation for anti-spoofing in
speaker verification systems, Computer Speech and Language. 58(2019) 377-402.
https://doi.org/10.1016/j.csl.2019.05.007
[6] C. Zhang, C. Yu, JHL. Hansen, An investigation of deep-learning frameworks for speaker verification
antispoofing, IEEE J. Sel. Top. Signal Process. 11 (2017) 684-694.
https://doi.org/10.1109/JSTSP.2016.2647199