A dataset of histograms of original and fake voice recordings (H-Voice)

Dora M. Ballesteros¹, Yohanna Rodriguez¹, Diego Renza¹

1. Universidad Militar Nueva Granada

Corresponding author: Dora M. Ballesteros (dora.ballesteros@unimilitar.edu.co)
Abstract
This paper presents H-Voice, a dataset of 6672 histograms of original and fake voice recordings, the latter obtained by the Imitation [1, 2] and Deep Voice [3] methods. The dataset is organized into six directories: Training_fake, Training_original, Validation_fake, Validation_original, External_test1, and External_test2. The training directories include 2088 histograms of fake voice recordings and 2020 histograms of original voice recordings. The validation directories contain 864 histograms each, from fake and original voice recordings respectively. Finally, External_test1 has 760 histograms (380 from fake voice recordings obtained by the Imitation method and 380 from original voice recordings), and External_test2 has 76 histograms (72 from fake voice recordings obtained by the Deep Voice method and 4 from original voice recordings). With this dataset, researchers can train, cross-validate, and test classification models based on machine learning techniques to identify fake voice recordings.
Keywords
Fake voice; machine learning; convolutional neural networks; binary classification; Imitation; Deep Voice; H-Voice.
Specifications Table

Subject: Computer Vision and Pattern Recognition
Specific subject area: Image processing related to identifying/classifying tampered data
Type of data: Images
How data were acquired: The images were obtained by calculating the histograms of original and fake voice recordings from repositories of the Deep Voice (https://audiodemos.github.io/) and Imitation (http://dx.doi.org/10.17632/ytkv9w92t6.1) methods.
Data format: Raw: histograms (PNG)
Parameters for data collection: The voice recordings are re-quantized to 16 bits. Histograms with 2^16 bins are calculated from each voice recording (original or fake).
Description of data collection: The dataset is composed of six directories, organized as follows:
1. Training_fake: 2088 histograms from fake voice recordings (by the Imitation and the Deep Voice methods)
2. Training_original: 2020 histograms from original voice recordings
3. Validation_fake: 864 histograms from fake voice recordings (by the Imitation method)
4. Validation_original: 864 histograms from original voice recordings
5. External_test1: 760 histograms from original and fake voice recordings (by the Imitation method)
6. External_test2: 76 histograms from original and fake voice recordings (by the Deep Voice method)
Data source location: City: Bogotá; Country: Colombia
Data accessibility: Repository name: Mendeley Data. Data name: H-Voice: Fake voice histograms (Imitation+DeepVoice) [4]. Direct URL to data: http://dx.doi.org/10.17632/k47yd3m28w.4
Value of the Data
• This is the first dataset of histograms of original and fake voice recordings. The histograms are obtained from real signals (original and fake) using the Imitation [1, 2] and Deep Voice [3] methods.
• This dataset of histograms allows fake voice classifiers to be trained, cross-validated, and tested using machine learning techniques such as convolutional neural networks, similar to how spectrograms are used as features in anti-spoofing speaker verification systems [5, 6].
• The dataset is balanced between original and fake voice recordings, which is a desirable condition for obtaining a good trade-off between precision and recall.
• This dataset can be used for comparing the performance of different fake voice classification models.
Data Description
This dataset is composed of histograms (images) of original and fake voice recordings obtained by two methods: Imitation [1, 2] and Deep Voice [3]. The dataset has four versions in Mendeley, which differ in the number of histograms: version 1 has 3432 histograms, version 2 has 3792, and version 3 has 6672. Version 4 fixes corrupted images. The latest version (i.e., version 4) is the one described in this document; it is organized into six directories: Training_original, Training_fake, Validation_original, Validation_fake, External_test1, and External_test2 [4].
Figure 1. H-Voice dataset structure.
Figure 1 shows the structure of the dataset, which is described below:
1. Training_original: 2020 histograms from original voice recordings.
2. Training_fake: 2088 histograms from fake voice recordings, of which 2016 histograms are
obtained by the Imitation method, and 72 by the Deep Voice method.
3. Validation_original: 864 histograms from original voice recordings.
4. Validation_fake: 864 histograms from fake voice recordings obtained by the Imitation method.
5. External_test1: this is composed of 380 histograms of original voice recordings and 380
histograms of fake voice recordings obtained by the Imitation method.
6. External_test2: this is composed of 4 histograms of original voice recordings and 72 histograms
of fake voice recordings obtained by the Deep Voice method.
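As an illustration of how this directory structure can be used, the following MATLAB sketch loads the training and validation directories into labeled image datastores. This is a minimal sketch under the assumption that the six directories sit in the current MATLAB folder; it is not part of the published dataset.

% Minimal sketch (assumption: the six dataset directories are in the
% current MATLAB folder); imageDatastore labels each image by folder name.
trainDS = imageDatastore({'Training_original','Training_fake'}, ...
    'LabelSource','foldernames'); % training histograms with original/fake labels
valDS = imageDatastore({'Validation_original','Validation_fake'}, ...
    'LabelSource','foldernames'); % validation histograms with original/fake labels
countEachLabel(trainDS) % check the original/fake class balance reported above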
Figure 2 shows examples of histograms of original and fake voice recordings of the training and
validation directories. Figure 3 and Figure 4 show examples of the External_test1 and External_test2
directories, respectively.
Figure 2. First example of histograms, located in the a) Training_original, b) Training_fake, c) Validation_original, and d) Validation_fake directories.
Figure 3. Second example of histograms, located in the External_test1 directory: a) original voice recording, b) fake voice recording obtained by the Imitation method.
Figure 4. Third example of histograms, located in the External_test2 directory: a) original voice recording, b) fake voice recording obtained by the Deep Voice method.
Experimental Design, Materials, and Methods
Fake voice files are created entirely by a machine, either by machine learning (e.g., the Deep Voice method) or by signal processing techniques (e.g., the Imitation method). This is unlike false voice recordings obtained by spoofing a voice, or by manipulating an original voice signal through insertion, deletion, or splicing. In the case of the Deep Voice method, a convolutional neural network is trained with original voice recordings to create new (fake) voice recordings whose plain text differs from the original. The Imitation method, on the other hand, re-orders the wavelet coefficients of the original voice signal so as to imitate the gender, intonation, and rhythm of another speaker.
The first step in creating our histograms was to obtain examples of fake voice recordings from the Deep Voice and Imitation methods. In the case of Deep Voice, we used the voice recordings publicly available at https://audiodemos.github.io/. In the case of Imitation, we created the fake voice recordings ourselves with the following code (based on the algorithm proposed in [2]):
% Inputs: original.wav, target.wav
% Outputs: fake.wav, key
[original, FS]=audioread('original.wav'); % read the original voice recording
[target, FS]=audioread('target.wav'); % read the target voice recording (to be imitated)
[C1,L1] = wavedec(target,4,'db10'); % obtain the wavelet coefficients of the target voice recording
[C2,L2] = wavedec(original,4,'db10'); % obtain the wavelet coefficients of the original voice recording
[B1,IX1] = sort(C1,'descend'); % sort the wavelet coefficients of the target voice recording
[B2,IX2] = sort(C2,'descend'); % sort the wavelet coefficients of the original voice recording
C2m(IX1)=C2(IX2); % re-order the original coefficients into the target's ordering
key(IX1)=IX2; % obtain the key to reverse the process
fake=waverec(C2m,L1,'db10'); % create the fake voice from the re-ordered original coefficients
audiowrite('fake.wav',fake,FS,'BitsPerSample',16); % save the fake voice recording
Examples of original and fake voice recordings obtained with the above algorithm are available at
http://dx.doi.org/10.17632/ytkv9w92t6.1
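Although it is not needed to use the dataset, the saved key can in principle invert the re-ordering. The following is a minimal sketch under two assumptions: the original and target recordings have the same length (so the wavelet bookkeeping vectors match), and the variable key from the code above is still available. Because fake.wav is quantized to 16 bits, the recovery is approximate.

% Sketch (assumptions: same recording lengths, 'key' available); recovery
% is approximate due to the 16-bit quantization of fake.wav.
[fake, FS] = audioread('fake.wav'); % read the fake voice recording
[Cf, Lf] = wavedec(fake,4,'db10'); % wavelet coefficients of the fake signal
Crec = zeros(size(Cf));
Crec(key) = Cf; % undo the re-ordering: C2(key(p)) = C2m(p)
recovered = waverec(Crec,Lf,'db10'); % reconstruct (approximately) the original voice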
Once the fake voice recordings have been generated, the following MATLAB code allows us to draw the histograms (original/fake):
% Input: name.wav
% Output: histogram of the voice recording
[voice, FS]=audioread('name.wav'); % read the original/fake voice recording
nbins = 65536; % number of bins of the histogram (2^16)
h = histogram(voice, nbins); % plot the histogram
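Since the dataset stores each histogram as a PNG image, the plotted figure must also be exported. The exact export routine used is not stated in this document; one standard MATLAB option is:

saveas(gcf, 'name.png'); % export the current histogram figure as a PNG image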
It is important to note that the examples of fake voice recordings obtained by Deep Voice and published at https://audiodemos.github.io/ were re-quantized to 16 bits before their histograms were obtained.
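As a hedged sketch of this re-quantization step (the file name demo.wav is hypothetical), reading the audio as normalized doubles and writing it back with 16 bits per sample performs the quantization:

% Sketch: re-quantize a downloaded recording to 16 bits ('demo.wav' is a
% hypothetical file name standing in for a Deep Voice demo file).
[x, FS] = audioread('demo.wav'); % samples as doubles in [-1, 1]
audiowrite('demo16.wav', x, FS, 'BitsPerSample',16); % write with 16-bit quantization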
Acknowledgments
This work is supported by the “Universidad Militar Nueva Granada – Vicerrectoría de Investigaciones”
under the grant IMP-ING-2936 of 2019.
Competing Interests
The authors declare that they have no known competing financial interests or personal relationships
which have, or could be perceived to have, influenced the work reported in this article.
References
[1] DM. Ballesteros L, JM. Moreno A, Highly transparent steganography model of speech signals using
Efficient Wavelet Masking, Expert Systems with Applications. 39 (2012) 9141-9149.
https://doi.org/10.1016/j.eswa.2012.02.066.
[2] DM. Ballesteros L, JM. Moreno A, On the ability of adaptation of speech signals and data hiding,
Expert Systems with Applications. 39 (2012) 12574-12579. https://doi.org/10.1016/j.eswa.2012.05.027.
[3] S.O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, M. Shoeybi, Deep Voice: Real-time Neural Text-to-Speech, Proceedings of the 34th International Conference on Machine Learning. 70 (2017) 195-204. https://arxiv.org/abs/1702.07825
[4] DM. Ballesteros, YP. Rodriguez, D. Renza, H-Voice: Fake voice histograms (Imitation+DeepVoice), v4,
2020. http://dx.doi.org/10.17632/k47yd3m28w.4
[5] I. Himawan, F. Villavicencio, S. Sridharan, C. Fookes, Deep domain adaptation for anti-spoofing in
speaker verification systems, Computer Speech and Language. 58(2019) 377-402.
https://doi.org/10.1016/j.csl.2019.05.007
[6] C. Zhang, C. Yu, JHL. Hansen, An investigation of deep-learning frameworks for speaker verification
antispoofing, IEEE J. Sel. Top. Signal Process. 11 (2017) 684-694.
https://doi.org/10.1109/JSTSP.2016.2647199