Overcoming Data Scarcity in Speaker Identification: Dataset Augmentation with Synthetic MFCCs via Character-level RNN

Jordan J. Bird¹, Diego R. Faria², Cristiano Premebida³, Anikó Ekárt⁴, Pedro P. S. Ayrosa⁵
Abstract—Autonomous speaker identification suffers from data scarcity, since it is unrealistic to gather hours of audio from each speaker to form a dataset; this inevitably leads to class imbalance against the non-speaker class, for which large-scale speech datasets are freely available online. In this study, we explore the possibility of improving speaker recognition by augmenting the dataset with synthetic data produced by training a Character-level Recurrent Neural Network (Char-RNN) on a short clip of five spoken sentences. A deep neural network is trained on a selection of the Flickr8k dataset together with the real and synthetic speaker data (all in the form of MFCCs) as a binary classification problem, in order to discern the speaker from the Flickr speakers. With between 2,500 and 10,000 synthetic data objects, the network weights are then transferred to the original dataset of only Flickr8k and the real speaker data, in order to discern whether useful rules can be learnt from the synthetic data. Results for all three subjects show that fine-tune learning from datasets augmented with synthetic speech improves the classification accuracy, F1 score, precision, and recall when applied to the scarce real data versus non-speaker data. We conclude that, even with just five short spoken sentences, data augmentation via synthetic speech data generated by a Char-RNN can improve the speaker classification process. Accuracy and related metrics improve from around 93% to 99% for three subjects classified against thousands of others when fine-tuning from exposure to 2,500-10,000 synthetic data points. High F1 scores, precision, and recall also show that issues arising from class imbalance are mitigated.
Index Terms—Data Augmentation, Speaker Identification, Speech Recognition, Generative Models, Human-robot Interaction, Autonomous Systems
I. INTRODUCTION
Although a large amount of speech-related audio data is available in the form of public datasets, autonomous speaker classification suffers from data scarcity because users are understandably unwilling to provide multiple hours of speech to form a dataset, which inevitably leads to a heavy class imbalance. Methods such as weighting of classes during the learning process often help with the issues posed by unbalanced datasets, but they can also be detrimental depending on the severity of the under-representation. This is one of the reasons that autonomous machines, for example the Pepper and Nao robots shown in Fig. 1, often have the ability of speech recognition in the form of transcription of words and phrases, but do not possess the ability to classify a specific speaker as a form of biometric measurement.

¹,² ARVIS Lab, Aston University, Birmingham, UK. Emails: {birdj1, d.faria}@aston.ac.uk
³ Institute of Systems and Robotics, Department of Electrical and Computer Engineering, University of Coimbra, Coimbra, Portugal. Email: cpremebida@isr.uc.pt
⁴ School of Engineering and Applied Science, Aston University, Birmingham, UK. Email: a.ekart@aston.ac.uk
⁵ Department of Computer Science, Universidade Estadual de Londrina, Londrina, Brazil. Email: ayrosa@uel.br

Fig. 1: Pepper (left) and Nao (right) are state-of-the-art robots that can perform general speech recognition but not speaker classification.
A relatively new idea, based on the development of generative models, is that of dataset augmentation: learning rules within data in order to produce new data that bears similarity to it. The most famous example of this, at the time of writing, is the field of ‘AI Art’, where models such as the Generative Adversarial Network learn to generalise a set of artworks in order to produce new images. Though such experiments are the most famous, dataset augmentation is a rapidly growing line of thought in multiple fields, where the question asked is: “can the synthetic data produced by a generative model aid in the classification of the original data?”. If so, then problems encountered due to class imbalance and under-representation may be mitigated by exposing algorithms to synthetic data produced by a generative model trained on a limited set of scarce data points. In terms of contribution, this work performs the first benchmarks of data augmentation fine-tune learning of Mel-Frequency Cepstral Coefficients (MFCCs) for speaker classification¹. These original findings and results support the hypothesis that synthetic MFCCs are useful in improving the classification of an unbalanced speaker classification dataset when knowledge gained from deep learning is transferred from the augmented data to the original set in question. This is attempted for 2,500, 5,000, 7,500 and 10,000 synthetic data objects for three subjects from the United Kingdom, the Republic of Ireland, and the United States of America.

¹ To the best of our knowledge and based on literature review.
The remainder of this work is structured as follows. Section II explores the background philosophy of the processes followed by this work and related experiments, and discusses the state of the art in the field. Section III outlines the method, including data collection, synthetic data generation, feature extraction to form the datasets, and the learning processes used to discern whether augmentation of MFCCs improves speaker recognition. The results of the experiments are discussed in Section IV, before future work is outlined and conclusions are drawn in Section V.
II. BACKGROUND AND RELATED WORK
A. Speaker Identification
Speaker identification, also known as speaker recognition or verification, is a pattern recognition task in which an individual's voice data is classified at a personal level, i.e., “is person A speaking?” [1]. The task is useful in multiple domains, such as human-robot interaction [2], forensics [3], and biometrics [4]. Identifying speakers has been shown to be a relatively easy problem when a small sample size is considered; for example, it is possible to perfectly classify a database of 21 speakers' MFCC data extracted from audio [5]. On the other hand, researchers have pointed out an open issue in the state of the art where far more data is present, noting that speaker identification then becomes a much more difficult problem [6]–[8]. In this work, we attempt to improve the classification of a speaker when many thousands of alternative speakers are present, through pattern matching against a large-scale dataset. The idea behind this, as well as related work, is further explored in Section II-B.
B. Dataset Augmentation through Synthesis
Many philosophical, psychological, and sociological studies have explored the idea of learning from imagined situations and actions [9]–[12]. Researchers argue that the ability to imagine is paramount in the learning process, improving abilities and skills through visualisation and logical dissemination. Research has also shown that imagined situations are not a perfect reflection of their counterparts in reality [13]–[15]. The conclusion, thus, is that humans regularly learn from imagined data that does not truly reflect reality, and yet this process is important for effective learning regardless of how realistic the imagination is or, most importantly, is not.

The idea of data synthesis and dataset augmentation for fine-tune learning on basis data is generally inspired by the above psychological phenomenon. It is the process of generating (or ‘imagining’) new data that is inspired by the real data. Although the generated data is not a reflection of the basis dataset, since it does not technically exist, the idea is that patterns and rules within the basis data will be reflected and explored in a more abstract sense in the synthetic data, and that this exercise can further improve learning processes when applied to the original basis dataset prior to any synthesis or augmentation. This idea is a very young field of thought in machine learning, becoming prominent only during the latter part of the 2010s, where success has been shown in several preliminary experiments.
Though the state of the art is young, multiple published works demonstrate the philosophy behind this experiment in action. Xu et al. found that data augmentation leads to an overall best F1 score for relation classification of the SemEval dataset when implemented as part of the training process for a Recurrent Neural Network [16]; augmentation was performed by leveraging the direction of the relation. A related experiment shows that NLP-based word augmentation helps to improve the classification of sentences by both CNN and RNN models [17]. Of the small number of works in the field, many focus on medical data, since many such classification experiments suffer from an extreme lack of data. Frid-Adar et al. showed that the classification of liver lesions could be improved by also generating synthetic images of lesions using a convolutional generative adversarial network [18]. Following this, Shin et al. argued the same hypothesis for the image classification of Alzheimer's Disease via neuroimaging and for multimodal brain tumour image segmentation via a set of differing MRI images [19].

In terms of audio, related works have also argued in favour of the hypothesis tested in this experiment. A closely related work showed that acoustic scene classification of mel-spectrograms could be improved through synthetic data augmentation [20]. Models such as Tacotron [21] learn to produce audio spectrograms from training data in order to perform realistic text-to-speech from either a textual representation or internationally recognised phonemes. In terms of the problem faced by this study, speaker recognition, limited work with i-vector representations of utterances has shown promise in terms of classification after augmentation has been performed via a GAN [22].
In this work, we implement a Character-level RNN in order to generate synthetic speech data by learning from the speaker's utterances. Character-level RNNs have been shown to be effective for generating natural written text [23], [24], composing music [25], and creating artwork [26]. More importantly for this study, RNNs have also been effective in generating accurate timeseries [27] and, moreover, MFCC data [28]. Recurrence in deep learning has proved revolutionary in the field of speech processing [29]–[31], and, most importantly for the idea behind this work, the technology has proved similarly useful for the synthesis of new speech based on statistical rules within audio [28], [32], [33]. The idea is that we generate synthetic speech based on short utterances by a subject and attempt to increase identification ability by also learning from the synthetic data generated by the RNN. Should this be possible, it would reduce data requirements by introducing similar speech autonomously, without the need for more extensive audio recordings.

Fig. 2: Overall diagram of the experiment, in which a neural network is compared to another which has been exposed to synthetic data prior to weight transfer to the clean dataset. Note that metrics regarding models trained with synthetic data are not considered.

Though some works have considered human speech in the related state-of-the-art, this work is, to the best of our knowledge, the first preliminary exploration of synthetic data augmentation in MFCC classification for speaker identification.
III. METHOD
In this section, the method of the experiment is described; Fig. 2 shows a diagram of the experimental process. To gain a set of results, three networks are trained: one without synthetic data, one with synthetic data, and a third without synthetic data but with weights transferred from the training run that was exposed to synthetic data. The comparison of the first and third models is then performed, since they derive directly comparable results.
A. Real and Synthetic Data Collection
The data for each experiment is split into a binary classification problem. Class 0 denotes ‘not the speaker’ whereas class 1 denotes ‘the speaker’.
In order to gather a large corpus of speech for class 0, the Flickr8k dataset is used [34]. The dataset contains 40,000 spoken captions of 8,000 images, produced by many speakers (the number is unspecified by the dataset authors). The process in Subsection III-B is followed to generate features, and 100,000 data objects are selected: 50,000 are selected in contiguous blocks of 1,000 and the remaining 50,000 are selected at random. This produces a set populated both by lengthier spoken passages and by short samples from many thousands of speakers.
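A minimal sketch of this selection procedure is given below, assuming the Flickr8k MFCC rows have already been extracted into a single NumPy array; the array name, block-offset strategy, and random seed are illustrative assumptions rather than the authors' published procedure.

```python
# Hypothetical sketch of the class-0 selection: 50,000 rows taken in contiguous
# blocks of 1,000 (longer spoken passages) plus 50,000 rows sampled at random.
import numpy as np

def select_flickr_subset(flickr_mfccs: np.ndarray, seed: int = 42) -> np.ndarray:
    """flickr_mfccs: (n_rows, 26) array of MFCC vectors from the Flickr8k captions."""
    rng = np.random.default_rng(seed)
    n_rows = flickr_mfccs.shape[0]

    # 50 blocks of 1,000 consecutive rows, starting at random offsets (assumption).
    block_starts = rng.choice(n_rows - 1000, size=50, replace=False)
    block_rows = np.concatenate([flickr_mfccs[s:s + 1000] for s in block_starts])

    # 50,000 further rows sampled uniformly at random without replacement.
    random_rows = flickr_mfccs[rng.choice(n_rows, size=50_000, replace=False)]

    return np.vstack([block_rows, random_rows])  # (100_000, 26)
```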
To gather real data for class 1, the three subjects described in Table I were asked to speak five random Harvard Sentences, based on the IEEE Recommended Practice for Speech Quality Measurements [35]. This short process gathers several seconds of speech in a user-friendly manner. Subjects were asked to record the data via their smartphone microphone: subjects 1 and 2 used an iPhone 7, whereas subject 3 used a Samsung Galaxy S7. Although the same sentences were spoken by all three subjects, subjects 2 and 3 spoke at a much quicker pace and thus provided far fewer data objects than subject 1.
Synthetic data for class 1 is generated by a Character-level Recurrent Neural Network (Char-RNN) [36], [37]; the topology of the network for subject 1 is shown in Table II. The Char-RNN learns to model the probability distribution of the next character in a sequence after observing a sequence of previous characters, where the previous characters are those that the RNN has itself generated. By operating one character at a time, the model initially learns the CSV formatting of the dataset (26 comma-separated numerical values followed by the class label ‘1’ and a line-break character) and then learns to form MFCC data based on observing the provided dataset.

An RNN is trained on each individual subject's MFCC data (see Subsection III-B) for 100 epochs before producing 10,000 synthetic data objects, which corresponds to a sequence of approximately 2,000,000 characters per subject. An example of generated data can be seen in Fig. 3: what appears to be sound-wave behaviour can be observed in the synthetic data, but its nature is also noticeably different from the real data. Behaviours such as the peaks observed in the synthetic data may aid in the classification of real data through augmentation, provided there are useful patterns within the probability distribution learnt from the real data. This argues the need for fine-tune learning rather than direct transfer, since fine-tuning allows the neural network to discard information in the synthetic data that is unnatural, while possibly carrying forward useful rules found in the generative model's output.
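The paper does not publish its implementation, so the following is a minimal Keras sketch of a generator matching the Table II topology; the vocabulary size, sequence length, and the sampling helper are assumptions made for illustration, not the authors' code.

```python
# Minimal sketch (assumption) of the Char-RNN generator summarised in Table II:
# character embedding, three stacked LSTM(256) layers with dropout, and a
# per-character softmax over the small vocabulary of the CSV-formatted MFCC text.
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 15   # approximate: digits, '-', '.', ',', '1' and the newline character
SEQ_LEN = 64      # characters observed per training sequence
BATCH_SIZE = 16

def build_char_rnn() -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, 512),
        tf.keras.layers.LSTM(256, return_sequences=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(256, return_sequences=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(256, return_sequences=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

def sample_characters(model: tf.keras.Model, seed_ids: list, n_chars: int) -> list:
    """Generate character ids one at a time, feeding each prediction back in."""
    generated = list(seed_ids)
    for _ in range(n_chars):
        window = np.array(generated[-SEQ_LEN:])[None, :]           # (1, <=SEQ_LEN)
        probs = model.predict(window, verbose=0)[0, -1].astype("float64")
        probs /= probs.sum()                                        # guard against float drift
        generated.append(int(np.random.choice(VOCAB_SIZE, p=probs)))
    return generated
```

In training, the inputs would be windows of 64 character ids from the CSV text and the targets the same windows shifted by one character; after 100 epochs, roughly two million characters are sampled per subject and parsed back into 26-value MFCC rows.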
B. Feature Extraction
The non-stationary nature of audio poses a difficult classi-
fication problem when single data points are considered [38],
[39]. To overcome this, temporal statistical features are ex-
tracted from the wave. In this work, we extract the first 26
Mel-Frequency Cepstral Coefficients (MFCC) [40], [41] of the
audio clips through a set of sliding windows of 0.025 seconds
in length at a step of 0.01 seconds.
TABLE I: Information regarding the data collection from the three subjects

Subject    Sex   Age   Nationality   Dialect/Accent    Time Taken (s)   Data Objects Captured
1          M     23    British       Birmingham        24               4978
2          M     24    American      Tampa, Florida    13               2421
3          F     28    Irish         Dublin            12               2542
Flickr8K   —     —     —             —                 —                100,000

Fig. 3: Two sets of values from 26 MFCCs over 2,500 time windows for subject 1; one set is real whereas the other is generated by the Char-RNN. A difference in patterns can be seen between the two, since synthetic human speech is imperfect. The X axis is temporal (each 0.025 s window) and the Y axis is the MFCC value. (a) 2,500 real MFCCs from subject 1; (b) 2,500 synthetic MFCCs from subject 1.

TABLE II: Topology of the Character-level Recurrent Neural Network

Layer                                   Output Shape    Parameters
Embedding                               (16, 64, 512)   7680
CuDNN LSTM                              (16, 64, 256)   788,480
Dropout (0.2)                           (16, 64, 256)   0
CuDNN LSTM                              (16, 64, 256)   526,336
Dropout (0.2)                           (16, 64, 256)   0
CuDNN LSTM                              (16, 64, 256)   526,336
Dropout (0.2)                           (16, 64, 256)   0
Time Distributed (size of vocabulary)   (16, 64, 15)    3855
Softmax                                 (16, 64, 15)    0

The MFCC extraction process is as follows:
1) The Fourier Transform (FT) of the time-window data $x(t)$ is calculated:
$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt. \qquad (1)$$
2) The powers from the FT are mapped to the Mel scale, the psychological scale of audible pitch [42], via a triangular temporal window.
3) The Mel-Frequency Cepstrum (MFC), or power spectrum of the sound, is considered, and the logarithm of each of the Mel powers is taken.
4) The derived Mel-log powers are treated as a signal, and a Discrete Cosine Transform (DCT) is computed:
$$X_k = \sum_{n=0}^{N-1} x_n \cos\!\left[\frac{\pi}{N}\left(n + \tfrac{1}{2}\right)k\right], \quad k = 0, \ldots, N-1, \qquad (2)$$
where $x$ is the array of length $N$, $k$ is the index of the output coefficient being calculated, and the $N$ real numbers $x_0 \ldots x_{N-1}$ are transformed into the $N$ real numbers $X_0 \ldots X_{N-1}$ by the formula.
The amplitudes of the resulting spectrum are the MFCCs. The resultant data provides a mathematical description of wave behaviour in terms of sound; each data object, made up of the 26 attributes produced from a sliding window, is treated as the input to the neural networks. This process is performed for all of the selected Flickr8k data as well as the real data recorded from the subjects. The MFCC data from each of the three subjects' audio recordings is used as input to the Char-RNN generative model.
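As an illustration only (the paper does not state which toolkit performed the extraction), the 26-coefficient MFCC features with a 0.025 s window and 0.01 s step could be computed with a library such as librosa:

```python
# Hedged sketch of the MFCC feature extraction: 26 coefficients per 0.025 s
# window with a 0.01 s step, giving one 26-dimensional data object per window.
import librosa
import numpy as np

def extract_mfcc_objects(wav_path: str, n_mfcc: int = 26) -> np.ndarray:
    signal, sr = librosa.load(wav_path, sr=None)       # keep the native sample rate
    mfcc = librosa.feature.mfcc(
        y=signal,
        sr=sr,
        n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),                          # 0.025 s analysis window
        hop_length=int(0.010 * sr),                     # 0.01 s step
    )
    return mfcc.T                                       # (n_windows, 26) data objects
```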
C. Speaker Classification Learning Process
Datasets are organised into the following for each subject:
1) Flickr data + recorded audio
2) Flickr data + recorded audio + 2,500 synthetic data
3) Flickr data + recorded audio + 5,000 synthetic data
4) Flickr data + recorded audio + 7,500 synthetic data
5) Flickr data + recorded audio + 10,000 synthetic data
A baseline is given by the classification of set 1. Following this, models are trained on sets 2-5 in order to produce models that have been exposed to the base dataset as well as the synthetic data produced by the subject's RNN. Finally, results are gathered by transferring the weights trained on sets 2-5, each individually, to set 1 through fine-tune learning. Should the classification metrics on set 1 be improved by initialising with weights trained on sets 2-5, this supports the hypothesis that synthetic data allows for better classification of the speaker.
TABLE III: Classification metrics for the three subjects with regards to fine-tune learning from synthetic data (scores are given for transfer learning, NOT for classification of synthetic data)

Subject   Synthetic Data   Accuracy   F1     Precision   Recall
1         0                93.57      0.94   0.93        0.93
1         2,500            98.31      0.98   0.98        0.98
1         5,000            98.56      0.99   0.99        0.98
1         7,500            99.03      0.99   0.99        0.99
1         10,000           98.33      0.98   0.98        0.98
2         0                95.13      0.95   0.95        0.95
2         2,500            98.43      0.98   0.98        0.98
2         5,000            99.19      0.99   0.99        0.99
2         7,500            99.11      0.99   0.99        0.99
2         10,000           97.37      0.97   0.97        0.97
3         0                96.58      0.97   0.97        0.97
3         2,500            97.77      0.97   0.97        0.97
3         5,000            97.83      0.98   0.98        0.98
3         7,500            98.35      0.98   0.98        0.98
3         10,000           98.83      0.99   0.99        0.99
This process is shown as a flow diagram in Fig. 2. Note that the synthetic data is not part of the two models that are compared to derive results; rather, it provides the weights to be transferred to the third network. Thus, the two networks compared are trained with identical data and differ only in the starting weight distribution used for fine-tune learning. The hyperparameters and topology of the deep neural network are selected based on an evolutionary search approach, which found that three hidden layers of 30, 7, and 29 neurons were a strong solution for the classification of MFCC attributes. The activation function of the hidden layers is ReLU and the Adam optimiser [43] is used. Training is not limited to a set number of epochs; rather, early stopping is applied with a threshold of 25 epochs without improvement before training ceases. This allows all networks to stabilise to an asymptote. Classification errors are weighted by the prominence of each class in the dataset.

All of the deep learning experiments performed in this work were executed on an Nvidia GTX 980 Ti GPU.
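For illustration, a hedged Keras sketch of the 30-7-29 classifier, the class-weighted training with early stopping, and the weight-transfer step is given below; the function names, monitored quantity, and data variables are assumptions rather than the authors' code.

```python
# Illustrative sketch (not the authors' implementation) of the 30-7-29 ReLU
# classifier, class-weighted training with early stopping (patience 25), and
# fine-tuning on the original dataset from weights learnt on the augmented one.
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

def build_classifier(n_features: int = 26) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(30, activation="relu"),
        tf.keras.layers.Dense(7, activation="relu"),
        tf.keras.layers.Dense(29, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # speaker vs. non-speaker
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def fit_weighted(model: tf.keras.Model, X: np.ndarray, y: np.ndarray) -> tf.keras.Model:
    # Weight errors by class prominence, as described in the text.
    weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
    model.fit(
        X, y,
        epochs=1000,                                       # effectively unbounded
        class_weight={0: weights[0], 1: weights[1]},
        callbacks=[tf.keras.callbacks.EarlyStopping(monitor="loss", patience=25)],
        verbose=0,
    )
    return model

# Pre-train on the augmented set (Flickr + real + synthetic), then fine-tune the
# same weights on the original set (Flickr + real only) and evaluate there.
# X_aug, y_aug, X_orig, y_orig are assumed to be prepared elsewhere.
# pretrained = fit_weighted(build_classifier(), X_aug, y_aug)
# fine_tuned = fit_weighted(pretrained, X_orig, y_orig)
```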
IV. PRELIMINARY RESULTS
The results for the three subjects can be seen in Table III. Introducing synthetic data and then transferring the weights to the non-synthetic dataset network improves over no data augmentation (the rows with 0 synthetic data) in every case for all subjects; that is, all transfer networks outperform all non-transfer networks for each of the subjects. In each of the 12 fine-tuning experiments, classification metrics improved when the learnt knowledge was applied to the non-synthetic dataset, which argues in favour of the hypothesis that the synthetic data produced by the RNN helps to improve the speaker classification process and to overcome the difficulties posed by data scarcity and imbalance. Interestingly, the two male subjects reach peak performance at 5,000 to 7,500 synthetic data objects, whereas the female subject peaks at 10,000, which suggests a possible difference based on either gender or accent; this must be explored further in order to identify the cause (provided that it is not a fluke occurrence), and should be performed with a larger range of subjects in order to show why statistical differences may occur in the improvement of classification ability. The performance for the first two subjects begins to decrease when fine-tuning from exposure to 10,000 synthetic data objects, suggesting that the generative model could be improved to prevent confusion in the classifier.

In terms of the best improvements to classification accuracy, Subject 1 increased by 5.46 percentage points (93.57% to 99.03%) with the introduction of transfer learning from 7,500 synthetic data objects. Subject 2 increased by 4.06 points (95.13% to 99.19%) with transfer learning from 5,000 synthetic data objects, and classification of Subject 3 improved by 2.25 points (96.58% to 98.83%) with transfer learning from 10,000 synthetic data objects. On average, this is a classification accuracy improvement of 3.92 points. Table III also shows improvements to the F1 score, precision, and recall for each of the models when transfer learning from the synthetic datasets.
V. FUTURE WORK AND CONCLUSION
Since this work has provided an argument in favour of the hypothesis that exposing a speech classification network to synthetic data improves speaker recognition, further work is enabled to explore this in more detail. A large limitation of the RNN was the time spent simply learning the format of the data, that is, exactly 26 comma-separated numerical values followed by a new-line character. Models such as a Generative Adversarial Network (GAN) could have these rules built in as standard (i.e., 26 generator outputs, 26 discriminator inputs, 1 discriminator output), which would allow the learning to focus purely on the values of the attributes and their relationships with one another as well as with the class label. As an extension, this experiment should be repeated with data produced by a GAN in order to compare the two methods of data synthesis.

Additionally, more subjects should be considered in future, covering a wider range of languages and locales for further comparison. This study was somewhat varied, with American, Irish, and English subjects, but further exploration should be performed in order to discern whether the effects of augmentation change for some accents, as well as the effects observed should the speakers and the comparison dataset speakers use a language other than English.

On a more general view of augmentation for learning, the literature review revealed that the majority of work has been performed only in the latter part of the last decade. There are many fields of machine learning in which augmentation has been explored only slightly or not at all, and as such cross-field co-operation is needed to further exploit the possibilities of generative augmentation processes.

To conclude, every experiment in this work that was augmented to any extent by synthetic data achieved a measurably better classification of the original dataset than the learning process on the original data alone. This preliminary work enables much future exploration in terms of both learning models and application of findings, since the issues arising from dataset imbalance are somewhat mitigated by exposure to the new data.
REFERENCES
[1] A. Poddar, M. Sahidullah, and G. Saha, “Speaker verification with short utterances: a review of challenges, trends and opportunities,” IET Biometrics, vol. 7, no. 2, pp. 91–101, 2017.
[2] E. Mumolo and M. Nolich, “Distant talker identification by nonlinear
programming and beamforming in service robotics,” in IEEE-EURASIP
Workshop on Nonlinear Signal and Image Processing, pp. 8–11, 2003.
[3] P. Rose, Forensic Speaker Identification. CRC Press, 2002.
[4] N. K. Ratha, A. Senior, and R. M. Bolle, “Automated biometrics,” in
International Conference on Advances in Pattern Recognition, pp. 447–
455, Springer, 2001.
[5] M. R. Hasan, M. Jamil, M. Rahman, et al., “Speaker identification using mel frequency cepstral coefficients,” Variations, vol. 1, no. 4, 2004.
[6] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale
speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[7] S. Yadav and A. Rai, “Learning discriminative features for speaker
identification and verification.,” in Interspeech, pp. 2237–2241, 2018.
[8] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, “BUT system description to VoxCeleb speaker recognition challenge 2019,” arXiv preprint arXiv:1910.12592, 2019.
[9] K. Egan, “Memory, imagination, and learning: Connected by the story,”
Phi Delta Kappan, vol. 70, no. 6, pp. 455–459, 1989.
[10] G. Heath, “Exploring the imagination to establish frameworks for learn-
ing,” Studies in Philosophy and Education, vol. 27, no. 2-3, pp. 115–123,
2008.
[11] P. MacIntyre and T. Gregersen, “Emotions that facilitate language
learning: The positive-broadening power of the imagination,” 2012.
[12] K. Egan, Imagination in teaching and learning: The middle school years.
University of Chicago Press, 2014.
[13] D. Beres, “Perception, imagination, and reality,” International Journal of Psycho-Analysis, vol. 41, pp. 327–334, 1960.
[14] E. F. Loftus, “Creating false memories,” Scientific American, vol. 277, no. 3, pp. 70–75, 1997.
[15] H. L. Roediger III, D. A. Balota, and J. M. Watson, “Spreading activation
and arousal of false memories.,” 2001.
[16] Y. Xu, R. Jia, L. Mou, G. Li, Y. Chen, Y. Lu, and Z. Jin, “Improved
relation classification by deep recurrent neural networks with data
augmentation,” arXiv preprint arXiv:1601.03651, 2016.
[17] S. Kobayashi, “Contextual augmentation: Data augmentation by words
with paradigmatic relations,” arXiv preprint arXiv:1805.06201, 2018.
[18] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan,
“Synthetic data augmentation using gan for improved liver lesion clas-
sification,” in 2018 IEEE 15th international symposium on biomedical
imaging (ISBI 2018), pp. 289–293, IEEE, 2018.
[19] H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L.
Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski, “Medical image
synthesis for data augmentation and anonymization using generative
adversarial networks,” in International workshop on simulation and
synthesis in medical imaging, pp. 1–11, Springer, 2018.
[20] J. H. Yang, N. K. Kim, and H. K. Kim, “Se-resnet with gan-based data
augmentation applied to acoustic scene classification,” in DCASE 2018
workshop, 2018.
[21] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly,
Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-
to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
[22] J.-T. Chien and K.-T. Peng, “Adversarial learning and augmentation for
speaker recognition.,” in Odyssey, pp. 342–348, 2018.
[23] D. Pawade, A. Sakhapara, M. Jain, N. Jain, and K. Gada, “Story scrambler – automatic text generation using word level RNN-LSTM,” International Journal of Information Technology and Computer Science (IJITCS), vol. 10, no. 6, pp. 44–53, 2018.
[24] L. Sha, L. Mou, T. Liu, P. Poupart, S. Li, B. Chang, and Z. Sui, “Order-
planning neural text generation from structured data,” in Thirty-Second
AAAI Conference on Artificial Intelligence, 2018.
[25] D. Eck and J. Schmidhuber, “Finding temporal structure in music: Blues
improvisation with lstm recurrent networks,” in Proceedings of the 12th
IEEE workshop on neural networks for signal processing, pp. 747–756,
IEEE, 2002.
[26] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, “DRAW: A recurrent neural network for image generation,” arXiv preprint arXiv:1502.04623, 2015.
[27] T. Senjyu, A. Yona, N. Urasaki, and T. Funabashi, “Application of recur-
rent neural network to long-term-ahead generating power forecasting for
wind power generator,” in 2006 IEEE PES Power Systems Conference
and Exposition, pp. 1260–1265, IEEE, 2006.
[28] X. Wang, S. Takaki, and J. Yamagishi, “An rnn-based quantized f0
model with multi-tier feedback links for text-to-speech synthesis.,” in
INTERSPEECH, pp. 1059–1063, 2017.
[29] S. Fernández, A. Graves, and J. Schmidhuber, “An application of recurrent neural networks to discriminative keyword spotting,” in International Conference on Artificial Neural Networks, pp. 220–229, Springer, 2007.
[30] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao,
D. Rybach, A. Kannan, Y. Wu, R. Pang, et al., “Streaming end-to-end
speech recognition for mobile devices,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 6381–6385, IEEE, 2019.
[31] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory
recurrent neural network architectures for large scale acoustic modeling,”
2014.
[32] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investi-
gating rnn-based speech enhancement methods for noise-robust text-to-
speech.,” in SSW, pp. 146–152, 2016.
[33] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-
to-speech system based on deep convolutional networks with guided
attention,” in 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 4784–4788, IEEE, 2018.
[34] D. Harwath and J. Glass, “Deep multimodal semantic embeddings for
speech and images,” in 2015 IEEE Workshop on Automatic Speech
Recognition and Understanding (ASRU), pp. 237–244, IEEE, 2015.
[35] E. Rothauser, “IEEE recommended practice for speech quality measurements,” IEEE Trans. on Audio and Electroacoustics, vol. 17, pp. 225–246, 1969.
[36] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with
recurrent neural networks,” in Proceedings of the 28th international
conference on machine learning (ICML-11), pp. 1017–1024, 2011.
[37] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
[38] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang, “Compar-
ing MFCC and mpeg-7 audio features for feature extraction, maximum
likelihood HMM and entropic prior HMM for sports audio classifi-
cation,” in IEEE Int. Conference on Acoustics, Speech, and Signal
Processing (ICASSP)., vol. 5, 2003.
[39] J. J. Bird, E. Wanner, A. Ekárt, and D. R. Faria, “Phoneme aware speech recognition through evolutionary optimisation,” in Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 362–363, 2019.
[40] L. Muda, M. Begam, and I. Elamvazuthi, “Voice recognition algorithms
using mel frequency cepstral coefficient (MFCC) and dynamic time
warping (DTW) techniques,” arXiv preprint arXiv:1003.4083, 2010.
[41] M. Sahidullah and G. Saha, “Design, analysis and experimental eval-
uation of block based transformation in mfcc computation for speaker
recognition,” Speech Communication, vol. 54, no. 4, pp. 543–565, 2012.
[42] S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for the
measurement of the psychological magnitude pitch,” The Journal of the
Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937.
[43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.