Overcoming Data Scarcity in Speaker Identification:
Dataset Augmentation with Synthetic MFCCs via
Character-level RNN
Jordan J. Bird¹, Diego R. Faria², Cristiano Premebida³, Anikó Ekárt⁴, Pedro P. S. Ayrosa⁵
Abstract—Autonomous speaker identification suffers from data scarcity, since it is unrealistic to gather hours of audio from a target speaker to form a dataset; this inevitably leads to class imbalance against the large amount of non-speaker data available in public large-scale speech datasets. In this study, we explore the possibility of improving speaker recognition by augmenting the dataset with synthetic data produced by training a Character-level Recurrent Neural Network on a short clip of five spoken sentences. A deep neural network is trained on a selection of the Flickr8k dataset as well as the real and synthetic speaker data (all in the form of MFCCs) as a binary classification problem, in order to discern the speaker from the Flickr speakers. With augmentation ranging from 2,500 to 10,000 synthetic data objects, the network weights are then transferred to the original dataset of only Flickr8k and the real speaker data, in order to discern whether useful rules can be learnt from the synthetic data. Results for all three subjects show that fine-tune learning from datasets augmented with synthetic speech improves classification accuracy, F1 score, precision, and recall when applied to the scarce real data versus non-speaker data. We conclude that, even with just five spoken short sentences, data augmentation via synthetic speech data generated by a Char-RNN can improve the speaker classification process. Accuracy and related metrics are shown to improve from around 93% to 99% for three subjects classified from thousands of others when fine-tuning after exposure to 2,500 to 10,000 synthetic data points. High F1 scores, precision, and recall also show that issues arising from class imbalance are mitigated.
Index Terms—Data Augmentation, Speaker Identification, Speech Recognition, Generative Models, Human-robot Interaction, Autonomous Systems
I. INTRODUCTION
Although there is a large amount of speech-related audio
data available in the form of public datasets, autonomous
speaker classification suffers from data scarcity because users are understandably unwilling to provide multiple hours of speech to form a dataset, which inevitably leads to a heavy class imbalance. Methods such as weighting of classes during
the learning process often help with the issues posed by
unbalanced datasets, but this can also be detrimental depending
¹,² are with ARVIS Lab, Aston University, Birmingham, UK. Emails: {birdj1, d.faria}@aston.ac.uk
³ is with the Institute of Systems and Robotics, Department of Electrical and Computer Engineering, University of Coimbra, Coimbra, Portugal. Email: cpremebida@isr.uc.pt
⁴ is with the School of Engineering and Applied Science, Aston University, Birmingham, UK. Email: a.ekart@aston.ac.uk
⁵ is with the Department of Computer Science, Universidade Estadual de Londrina, Londrina, Brazil. Email: ayrosa@uel.br
Fig. 1: Pepper (left) and Nao (right) are state-of-the-art robots
that can perform general speech recognition but not speaker
classification.
on the severity of the underrepresented classes. This is one of
the reasons that autonomous machines, for example Pepper and Nao shown in Fig. 1, often have the ability of speech recognition in the form of transcription of words and phrases, but do not possess the ability to classify a specific speaker in the form of a biometric measurement.
A relatively new idea based on the development of
generative models is that of dataset augmentation. That is,
to learn rules within data in order to be able to produce
new data that bears similarity. The most famous example of
this, at the time of writing, is the field of ‘AI Art’ where
models such as the Generative Adversarial Network learn to
generalise a set of artworks in order to produce new images.
Though experiments like this are the most famous, dataset augmentation is a rapidly growing line of thought in multiple fields, where the question asked is "can the synthetic data produced by a generative model aid in the classification process of the original data?". If this is possible, then problems encountered due to class imbalance and under-representation may
be mitigated by exposing algorithms to synthetic data that
has been produced by a generative model, based on learning
from a limited set of scarce data points. In this work, in
terms of contribution, we perform the first benchmarks
of data augmentation fine-tune learning of Mel-Frequency
Cepstral Coefficients (MFCCs) for speaker classification¹.
¹ To the best of our knowledge and based on literature review.
These original findings and results support the hypothesis that
synthetic MFCCs are useful in improving the classification of
an unbalanced speaker classification dataset when knowledge
gained from deep learning is transferred from the augmented
data to the original set in question. This is attempted for
2,500, 5,000, 7,500 and 10,000 synthetic data objects for
three subjects from the United Kingdom, Republic of Ireland,
and the United States of America.
The remainder of this work is as follows. Firstly, Section II
explores the background philosophy of the processes followed
by this work and related experiments as well as discussing the
state-of-the-art in the field. Section III outlines the Method
followed including data collection, synthetic data generation,
feature extraction to form datasets, and the learning processes
to discern whether augmentation of MFCCs improves speaker
recognition. The results of the experiments are then discussed
in Section IV before future work is outlined and conclusions
drawn in Section V.
II. BACKGROUND AND RELATED WORK
A. Speaker Identification
Speaker Identification, also known as recognition or verification, is a pattern recognition classification task in which an individual's voice data is classified on a personal level, i.e., "is person A speaking?" [1]. The task is useful in multiple domains, such as Human-robot Interaction [2], Forensics [3] and Biometrics [4]. Identifying speakers has been shown to be a relatively easy problem when a small sample size is defined; for example, it is possible to perfectly classify a database of 21 speakers' MFCC data extracted from audio [5]. On the other hand, researchers have pointed out an open issue in the state of the art where far more data is present, noting a much more difficult speaker identification problem [6]–[8]. In
this work, we attempt to improve the classification process
of a speaker when many thousands of alternative speakers
are present, through pattern matching against a large-scale
dataset. The idea behind this as well as related work are further
explored in Section II-B.
B. Dataset Augmentation through Synthesis
Many philosophical, psychological, and sociological
studies have explored the idea of learning from imagined
situations and actions [9]–[12]. Researchers argue that the
ability of imagination is paramount in the learning process
by improving abilities and skills through visualisation and
logical dissemination. Research has also shown that imagined
situations are not a perfect reflection of their counterparts in
reality [13]–[15]. The conclusion thus is that humans regularly
learn from imagined data that does not truly reflect reality, and
yet, this process is important for effective learning regardless
of how realistic the imagination is, or most importantly, isn’t.
The idea of data synthesis and dataset augmentation for
fine-tune learning on basis data is generally inspired by
the above psychological phenomenon. This is the process
of generating (or ‘imagining’) new data that is inspired
by the real data. Although the data is not a reflection of
the basis dataset, since it does not technically exist, the
idea is that patterns and rules within the basis data will be
further reflected and explored in a more abstract sense in
the synthetic data, and that this exercise can further improve
learning processes when applied to the original basis dataset
prior to any synthesis or augmentation. This idea in machine
learning is a very young field of thought, becoming prominent only during the latter part of the 2010s, where success has been shown in several preliminary experiments.
Though the state-of-the-art is young, there are multiple
published works that show the philosophy behind this
experiment in action. In Xu et al., researchers found that data
augmentation leads to an overall best F-1 score for relation
classification of the SemEval dataset when implemented
as part of the training process for a Recurrent Neural
Network [16]. Augmentation was performed by leveraging
the direction of relation. A related experiment shows that
NLP-based word augmentation helps to improve classification
of sentences by both CNN and RNN models [17]. Of the
small number of works in the field, many focus on medical
data since many classification experiments suffer from an
extreme lack of data. In Frid-Adar et al., researchers showed
that the classification of liver lesions could be improved
by also generating synthetic images of lesions using a
convolutional generative adversarial network [18]. Following
this, Shin et al., argued the same hypothesis for the image
classification of both Alzheimer’s Disease via neuroimaging
and multimodal brain tumour image segmentation via a set
of differing MRI images [19].
In terms of audio, related works have also argued in
favour of the hypothesis being tested in this experiment. A
closely related work showed that acoustic scene classification
of mel-spectrograms could be improved through synthetic
data augmentation [20]. Models such as Tacotron [21]
learn to produce audio spectrograms from training data in
order to perform realistic text-to-speech from either textual
representation or internationally recognised phonemes. In
terms of the problem faced by this study, speaker recognition,
limited work with i-vector representations of utterances has shown promise in terms of classification after augmentation has been performed via a GAN [22].
In this work, we implement a Character-level RNN in
order to generate synthetic speech data by learning from
the speaker's utterances. Character-level RNNs have been shown to be effective when generating natural written text [23], [24], composing music [25], and creating artwork [26]. More importantly related to this study, RNNs have also been effective in generating accurate timeseries [27] and, moreover, MFCC data [28]. Recurrence in deep learning has proved
to be revolutionary in the field of speech processing [29]–
[31]. Most importantly for the idea behind this work, the
Fig. 2: Overall diagram of the experiment in which a neural
network is compared to another which has been exposed to
synthetic data prior to weight transfer to the clean dataset.
Note that metrics regarding models trained with synthetic data
are not considered.
technology has been shown to be likewise useful for the
synthesis of new speech based on statistical rules within
audio [28], [32], [33]. The idea is that we generate synthetic
speech based on short utterances by a subject, and attempt to
increase the ability of identification by also learning from the
synthetic data generated by an RNN. Should this be possible,
it would reduce data requirements by introducing similar
speech autonomously, without the need for more extensive
audio recordings.
Though some works have considered human speech in the
related state-of-the-art, this work is the first preliminary explo-
ration of synthetic data augmentation in MFCC classification
for speaker identification, to the best of our knowledge.
III. METHOD
In this section, the method of the experiment is described.
Figure 2 shows a diagram of the experimental method. To gain a set of results, three networks are trained: one without synthetic data, one with synthetic data, and a third without synthetic data but with the weights transferred from training when exposed to synthetic data. Thus, the comparison of the first and third models is performed, since they derive directly comparable results.
A. Real and Synthetic Data Collection
The data for each experiment is split into a binary classification problem. Class 0 denotes 'not the speaker' whereas class 1 denotes 'the speaker'.
In order to gather a large corpus of speech for class 0, the Flickr8k dataset is used [34]. The dataset contains 40,000 spoken captions of 8,000 images by many speakers (unspecified by the dataset authors). The process in subsection III-B is followed to generate features, and 100,000 data objects are selected: 50,000 of the data objects are selected in blocks of 1,000 and the remaining 50,000 are selected at random. This produces a set populated by lengthier spoken text as well as short samples from many thousands of speakers.
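As an illustrative sketch of this sampling step (not the authors' code), the selection of 50,000 windows in contiguous blocks of 1,000 plus 50,000 random windows could be realised as follows; the array name flickr_mfccs, the use of NumPy, and the random seed are assumptions, and block starts are drawn at random for simplicity:

```python
import numpy as np


def sample_flickr_rows(flickr_mfccs, n_blocks=50, block_size=1000,
                       n_random=50000, seed=42):
    """Select 100,000 non-speaker MFCC rows: half in contiguous blocks
    (longer stretches of single utterances) and half at random (short
    samples from many thousands of speakers)."""
    rng = np.random.default_rng(seed)
    n_rows = len(flickr_mfccs)

    # 50 contiguous blocks of 1,000 consecutive windows each.
    block_starts = rng.choice(n_rows - block_size, size=n_blocks, replace=False)
    block_rows = np.concatenate(
        [flickr_mfccs[s:s + block_size] for s in block_starts])

    # 50,000 individual windows drawn uniformly at random.
    random_idx = rng.choice(n_rows, size=n_random, replace=False)
    random_rows = flickr_mfccs[random_idx]

    return np.vstack([block_rows, random_rows])  # shape: (100000, 26)
```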
To gather real data for class 1, three subjects, as observed in Table I, were asked to speak five random Harvard Sentences, based on the IEEE recommended practice for speech quality measurements [35]. This short process gathers several seconds of speech in a user-friendly manner. Users were asked to record the data via their smartphone microphone; subjects 1 and 2 used an iPhone 7 whereas subject 3 used a Samsung Galaxy S7. Although the same data is provided by all three subjects, subjects 2 and 3 spoke at a much quicker pace and thus provided far fewer data objects than subject 1.
Synthetic data for class 1 is generated by a Character-level Recurrent Neural Network (Char-RNN) [36], [37]; the topology of the network for subject 1 is shown in Table II. The Char-
RNN learns to model the probability distribution of the next
character in a sequence after observing a sequence of previous
characters, where previous characters are those that the RNN
has also generated. By performing this one character at a time,
the model initially learns the CSV formatting of the dataset
(26 comma-separated numerical values followed by the class label
’1’ and a line break character) and then learns to form MFCC
data based on observing the provided dataset.
An RNN is trained for each individual subject’s MFCC data
(see subsection III-B) for 100 epochs before producing 10,000
synthetic data objects. This is approximately a sequence of
2,000,000 characters for each subject. An example of some generated data can be seen in Fig. 3; what seems like sound wave behaviour can be observed in the synthetic data, but its nature is also noticeably different. Behaviours such as the
peaks observed in the synthetic data may aid in classification
of real data through augmentation, provided there are useful
patterns within the probability distribution observed from the
real data. This argues for the need for fine-tune learning rather than direct transfer, since this allows the neural network to discard the information in the synthetic data that is unnatural, while possibly carrying forward useful rules within the nature of the generative model output.
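As a minimal sketch of such a generator, the Keras model below mirrors the topology of Table II; the vocabulary size (15 characters), sequence length (64) and batch size (16) are read off the table's output shapes, and the TensorFlow 1.x Keras API is assumed so that the CuDNNLSTM layer is available (a standard LSTM would be substituted under TensorFlow 2.x):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, CuDNNLSTM, Dropout,
                                     TimeDistributed, Dense, Activation)

# Values inferred from the output shapes in Table II; the real vocabulary
# (digits, '.', ',', '-', newline, etc.) is learnt from each subject's CSV.
VOCAB_SIZE = 15
SEQ_LEN = 64
BATCH_SIZE = 16


def build_char_rnn():
    """Character-level generator mirroring the topology of Table II."""
    model = Sequential([
        Embedding(VOCAB_SIZE, 512,
                  batch_input_shape=(BATCH_SIZE, SEQ_LEN)),
        CuDNNLSTM(256, return_sequences=True),
        Dropout(0.2),
        CuDNNLSTM(256, return_sequences=True),
        Dropout(0.2),
        CuDNNLSTM(256, return_sequences=True),
        Dropout(0.2),
        TimeDistributed(Dense(VOCAB_SIZE)),
        Activation('softmax'),
    ])
    # Targets are the input character sequence shifted by one position,
    # encoded as integer indices.
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
    return model
```

Sampling one character at a time from the softmax output, and parsing the resulting text back into 26-value CSV rows, yields the synthetic MFCC data objects described above.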
B. Feature Extraction
The non-stationary nature of audio poses a difficult classi-
fication problem when single data points are considered [38],
[39]. To overcome this, temporal statistical features are ex-
tracted from the wave. In this work, we extract the first 26
Mel-Frequency Cepstral Coefficients (MFCC) [40], [41] of the
audio clips through a set of sliding windows of 0.025 seconds
in length at a step of 0.01 seconds.
The MFCC extraction process is as follows:
TABLE I: Information regarding the data collection from the three subjects

Subject   Sex  Age  Nationality  Dialect/Accent   Time Taken (s)  Data Objects Captured
1         M    23   British      Birmingham       24              4978
2         M    24   American     Tampa, Florida   13              2421
3         F    28   Irish        Dublin           12              2542
Flickr8K  -    -    -            -                -               100,000
(a) 2500 Real MFCCs from Subject 1 (b) 2500 Synthetic MFCCs from Subject 1
Fig. 3: Two sets of values from 26 MFCCs for 2500 time windows for subject 1; one is real whereas the second is generated by the Char-RNN. A difference in patterns can be seen between the two since synthetic human speech is imperfect. The X axis is
temporal (each 0.025s window) and the Y axis is the MFCC value.
TABLE II: Topology of the Character-level Recurrent Neural Network

Layer                                  Output         Parameters
Embedding                              (16, 64, 512)  7680
CuDNN LSTM                             (16, 64, 256)  788,480
Dropout (0.2)                          (16, 64, 256)  0
CuDNN LSTM                             (16, 64, 256)  526,336
Dropout (0.2)                          (16, 64, 256)  0
CuDNN LSTM                             (16, 64, 256)  526,336
Dropout (0.2)                          (16, 64, 256)  0
Time Distributed (size of vocabulary)  (16, 64, 15)   3855
Softmax                                (16, 64, 15)   0
1) The Fourier Transform (FT) of the time window data $x(t)$ is calculated:
$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt. \tag{1}$$
2) The powers from the FT are mapped to the Mel scale, the psychological scale of audible pitch [42], via a triangular temporal window.
3) The Mel-Frequency Cepstrum (MFC), or power spectrum of sound, is considered, and the logs of each of its powers are taken.
4) The derived Mel-log powers are treated as a signal, and
a Discrete Cosine Transform (DCT) is measured. This
is given as:
$$X_k = \sum_{n=0}^{N-1} x_n \cos\!\left[\frac{\pi}{N}\left(n + \tfrac{1}{2}\right)k\right], \qquad k = 0, \ldots, N-1, \tag{2}$$
where $x$ is the array of length $N$ and $k$ is the index of the output coefficient being calculated, such that the $N$ real numbers $x_0 \ldots x_{N-1}$ are transformed into the $N$ real numbers $X_0 \ldots X_{N-1}$ by the formula.
The amplitudes of the spectrum are known as the MFCCs.
The resultant data then provides a mathematical description of wave behaviour in terms of sound; each data object, made of the 26 attributes produced from the sliding window, is then treated as the input attributes for the neural networks.
This process is performed for all of the selected Flickr8K
data as well as the real data recorded from the subjects. The
MFCC data from each of the three subjects’ audio recordings
is used as input to the Char-RNN generative model.
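The paper does not state which toolkit performs the extraction; as one possible realisation, the python_speech_features library exposes the window length, step and coefficient count described above directly (the nfft value is an added assumption to accommodate common sample rates):

```python
import scipy.io.wavfile as wav
from python_speech_features import mfcc


def extract_mfcc_rows(wav_path):
    """Return one 26-attribute row per 0.025 s window at a 0.01 s step."""
    rate, signal = wav.read(wav_path)
    # 26 cepstral coefficients computed from 26 Mel filters per window.
    return mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,
                numcep=26, nfilt=26,
                nfft=2048)  # large enough for 0.025 s frames at 44.1 kHz
```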
C. Speaker Classification Learning Process
Datasets are organised into the following for each subject:
1) Flickr data + recorded audio
2) Flickr data + recorded audio + 2,500 synthetic data
3) Flickr data + recorded audio + 5,000 synthetic data
4) Flickr data + recorded audio + 7,500 synthetic data
5) Flickr data + recorded audio + 10,000 synthetic data
A baseline is given through the classification of set 1. Follow-
ing this, models are trained on sets 2-5 in order to produce
models that have been exposed to the base dataset as well as
synthetic data produced by the subject’s RNN model. Finally,
the results are gathered by taking the model weights trained on sets 2-5 and applying each of them individually to set 1 through a method of fine-tune learning. Should the
classification metrics of set 1 be improved by introducing
weights trained on sets 2-5 then this supports the hypothesis
TABLE III: Classification metrics for the three subjects with
regards to fine-tune learning from synthetic data (scores are
given for transfer learning, NOT classification of synthetic
data)
Subject  Synthetic Data  Accuracy  F1    Precision  Recall
1        0               93.57     0.94  0.93       0.93
1        2500            98.31     0.98  0.98       0.98
1        5000            98.56     0.99  0.99       0.98
1        7500            99.03     0.99  0.99       0.99
1        10000           98.33     0.98  0.98       0.98
2        0               95.13     0.95  0.95       0.95
2        2500            98.43     0.98  0.98       0.98
2        5000            99.19     0.99  0.99       0.99
2        7500            99.11     0.99  0.99       0.99
2        10000           97.37     0.97  0.97       0.97
3        0               96.58     0.97  0.97       0.97
3        2500            97.77     0.97  0.97       0.97
3        5000            97.83     0.98  0.98       0.98
3        7500            98.35     0.98  0.98       0.98
3        10000           98.83     0.99  0.99       0.99
that synthetic data allows for better classification of the speaker.
This process is shown as a flow diagram in Fig. 2; note that the synthetic data is not part of the two models that are compared to derive results; rather, it provides weights to be transferred to the third network. Thus, the two networks compared are
trained with identical data, and differ only in terms of starting
weight distribution for fine-tune learning. The hyperparameters and topology of the deep neural network are selected through an evolutionary search approach, which found that three deep layers of 30, 7, and 29 hidden neurons were a strong solution for the classification of MFCC attributes. Activation functions of the layers are ReLU and the Adam optimiser [43] is used. Training is not limited to a set number of epochs; rather, early stopping is introduced with a patience of 25 epochs without improvement before training ceases. This therefore allows all networks to stabilise to an asymptote. Classification
errors are weighted by the prominence of the class in the
dataset.
All of the deep learning experiments performed in this work
were executed on an Nvidia GTX980Ti GPU.
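A hedged sketch of this training and weight-transfer procedure is given below. The 30-7-29 topology, ReLU activations, Adam optimiser, class weighting and 25-epoch early-stopping patience follow the description above; the Keras/scikit-learn APIs, variable names, validation split and the placeholder arrays in the __main__ block are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.utils.class_weight import compute_class_weight


def build_classifier():
    """Deep network of 30, 7 and 29 ReLU neurons over the 26 MFCC inputs."""
    model = Sequential([
        Dense(30, activation='relu', input_shape=(26,)),
        Dense(7, activation='relu'),
        Dense(29, activation='relu'),
        Dense(1, activation='sigmoid'),  # speaker (1) vs. non-speaker (0)
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model


def train(model, X, y):
    """Train with class weighting and early stopping (patience 25)."""
    weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
    stop = EarlyStopping(monitor='val_loss', patience=25,
                         restore_best_weights=True)
    model.fit(X, y, validation_split=0.1, epochs=1000,
              class_weight=dict(enumerate(weights)),
              callbacks=[stop], verbose=0)
    return model


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder arrays standing in for the augmented set (Flickr + real +
    # synthetic MFCCs) and the original set (Flickr + real MFCCs only).
    X_aug, y_aug = rng.normal(size=(2000, 26)), rng.integers(0, 2, 2000)
    X_orig, y_orig = rng.normal(size=(1500, 26)), rng.integers(0, 2, 1500)

    # 1) Train on the augmented set.
    augmented_model = train(build_classifier(), X_aug, y_aug)

    # 2) Transfer the learnt weights and fine-tune on the original set;
    #    this fine-tuned network is the one compared against the baseline.
    fine_tuned = build_classifier()
    fine_tuned.set_weights(augmented_model.get_weights())
    fine_tuned = train(fine_tuned, X_orig, y_orig)
```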
IV. PRELIMINARY RESULTS
The results for the three subjects can be seen in Table III. It
is shown that introducing synthetic data and then transferring
weights to the non-synthetic dataset network improves over
no data augmentations (0 in column 2) in every case for all
subjects. That is, all transfer networks outperform all non-
transfer networks for each of the subjects. In each of the 12
fine-tuning experiments, classification metrics were shown to
improve when the learnt knowledge was applied to the non-synthetic dataset, which argues in favour of the hypothesis that
the false data produced by the RNN helps to improve the
speaker classification process and overcome difficulties faced
by data scarcity and imbalance. Interestingly, the two male
subjects hit peak performance at 5,000 to 7,500 synthetic data
objects, whereas the female subject peaks at 10,000, which suggests the possibility of a difference based on either gender or accent; this must be explored further in order to identify the cause (provided that it is not a fluke occurrence). This
should be performed with a larger range of subjects in order to
show why statistical differences may occur for improvement of
classification ability. The performance for the first two subjects
seemingly begins to decrease at fine-tuning from exposure to
10,000 synthetic data objects, suggesting that the generative
model could be improved to prevent confusion in the model.
In terms of the best improvements to classification accuracy,
Subject 1 increased by 5.46% (93.57% to 99.03%) with the
introduction of transfer learning from 7,500 synthetic data
objects. Subject 2 increased by 4.06% (95.13% to 99.19%)
by transfer learning from 5,000 synthetic data objects, and
classification of Subject 3 was improved by 2.25% (96.58%
to 98.83%) with transfer learning from 10,000 synthetic data
objects. On average, this is a classification accuracy improve-
ment of 3.92%. Also observed in Table III are improvements
to the F1 score, precision and recall metrics for each of the
models when transfer learning from synthetic datasets.
V. FUTURE WORK AND CONCLUSION
Since this work has provided an argument in favour of the
hypothesis that exposing a speech classification network to
synthetic data improves speaker recognition, further work is
enabled to explore this in more detail. A large limitation of the RNN was the time spent simply learning the format of the data, that is, strictly 26 comma-separated numerical values followed by a new line character. Models such as a Generative Adversarial Network (GAN) could
have these rules as standard (i.e., 26 Generator outputs, 26
Discriminator inputs, 1 Discriminator output) which enables
the learning to focus purely on values of attributes and their
relationships with one another as well as the class label. As
an extension, this experiment should be repeated with data
produced by a GAN in order to compare the two methods of
data synthesis.
Additionally, more subjects should be considered in future
for a wider range of languages and locales for further compari-
son. This study covered a somewhat limited range of accents with American, Irish and English subjects, but further exploration should be performed in order to discern whether the effects of augmentation may change for some accents, as well as the effects that are observed should the speakers and comparison dataset speakers use a language other than English.
On a more generalised view of augmentation for learning,
the literature review revealed that the majority of work was
performed only in the latter part of the last decade. There
are many fields of machine learning in which augmentation
has either been explored only slightly or even not at all, and
as such cross-field co-operation is needed to further exploit
the possibilities of generative augmentation processes.
To conclude, all of the experiments in this work that were augmented to any extent by synthetic data achieved a measurably better classification ability on the original dataset when compared to the learning process on said original data alone.
This preliminary work enables much future exploration in
terms of both learning models and application of findings
since the issues arising from dataset imbalance are somewhat
mitigated by exposure to new data.
REFERENCES
[1] A. Poddar, M. Sahidullah, and G. Saha, “Speaker verification with
short utterances: a review of challenges, trends and opportunities," IET Biometrics, vol. 7, no. 2, pp. 91–101, 2017.
[2] E. Mumolo and M. Nolich, “Distant talker identification by nonlinear
programming and beamforming in service robotics,” in IEEE-EURASIP
Workshop on Nonlinear Signal and Image Processing, pp. 8–11, 2003.
[3] P. Rose, Forensic speaker identification. cRc Press, 2002.
[4] N. K. Ratha, A. Senior, and R. M. Bolle, “Automated biometrics,” in
International Conference on Advances in Pattern Recognition, pp. 447–
455, Springer, 2001.
[5] M. R. Hasan, M. Jamil, M. Rahman, et al., “Speaker identification using
mel frequency cepstral coefficients," Variations, vol. 1, no. 4, 2004.
[6] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale
speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[7] S. Yadav and A. Rai, “Learning discriminative features for speaker
identification and verification.,” in Interspeech, pp. 2237–2241, 2018.
[8] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.
[9] K. Egan, “Memory, imagination, and learning: Connected by the story,”
Phi Delta Kappan, vol. 70, no. 6, pp. 455–459, 1989.
[10] G. Heath, “Exploring the imagination to establish frameworks for learn-
ing,” Studies in Philosophy and Education, vol. 27, no. 2-3, pp. 115–123,
2008.
[11] P. MacIntyre and T. Gregersen, “Emotions that facilitate language
learning: The positive-broadening power of the imagination,” 2012.
[12] K. Egan, Imagination in teaching and learning: The middle school years.
University of Chicago Press, 2014.
[13] D. Beres, "Perception, imagination, and reality," International Journal of Psycho-Analysis, vol. 41, pp. 327–334, 1960.
[14] E. F. Loftus, "Creating false memories," Scientific American, vol. 277,
no. 3, pp. 70–75, 1997.
[15] H. L. Roediger III, D. A. Balota, and J. M. Watson, “Spreading activation
and arousal of false memories.,” 2001.
[16] Y. Xu, R. Jia, L. Mou, G. Li, Y. Chen, Y. Lu, and Z. Jin, “Improved
relation classification by deep recurrent neural networks with data
augmentation,” arXiv preprint arXiv:1601.03651, 2016.
[17] S. Kobayashi, “Contextual augmentation: Data augmentation by words
with paradigmatic relations,” arXiv preprint arXiv:1805.06201, 2018.
[18] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan,
“Synthetic data augmentation using gan for improved liver lesion clas-
sification,” in 2018 IEEE 15th international symposium on biomedical
imaging (ISBI 2018), pp. 289–293, IEEE, 2018.
[19] H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L.
Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski, “Medical image
synthesis for data augmentation and anonymization using generative
adversarial networks,” in International workshop on simulation and
synthesis in medical imaging, pp. 1–11, Springer, 2018.
[20] J. H. Yang, N. K. Kim, and H. K. Kim, “Se-resnet with gan-based data
augmentation applied to acoustic scene classification,” in DCASE 2018
workshop, 2018.
[21] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly,
Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-
to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
[22] J.-T. Chien and K.-T. Peng, “Adversarial learning and augmentation for
speaker recognition.,” in Odyssey, pp. 342–348, 2018.
[23] D. Pawade, A. Sakhapara, M. Jain, N. Jain, and K. Gada, “Story
scrambler – automatic text generation using word level RNN-LSTM," International Journal of Information Technology and Computer Science (IJITCS), vol. 10, no. 6, pp. 44–53, 2018.
[24] L. Sha, L. Mou, T. Liu, P. Poupart, S. Li, B. Chang, and Z. Sui, “Order-
planning neural text generation from structured data,” in Thirty-Second
AAAI Conference on Artificial Intelligence, 2018.
[25] D. Eck and J. Schmidhuber, “Finding temporal structure in music: Blues
improvisation with lstm recurrent networks,” in Proceedings of the 12th
IEEE workshop on neural networks for signal processing, pp. 747–756,
IEEE, 2002.
[26] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "Draw: A recurrent neural network for image generation," arXiv preprint arXiv:1502.04623, 2015.
[27] T. Senjyu, A. Yona, N. Urasaki, and T. Funabashi, “Application of recur-
rent neural network to long-term-ahead generating power forecasting for
wind power generator,” in 2006 IEEE PES Power Systems Conference
and Exposition, pp. 1260–1265, IEEE, 2006.
[28] X. Wang, S. Takaki, and J. Yamagishi, “An rnn-based quantized f0
model with multi-tier feedback links for text-to-speech synthesis.,” in
INTERSPEECH, pp. 1059–1063, 2017.
[29] S. Fernández, A. Graves, and J. Schmidhuber, "An application of recurrent neural networks to discriminative keyword spotting," in International Conference on Artificial Neural Networks, pp. 220–229, Springer, 2007.
[30] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao,
D. Rybach, A. Kannan, Y. Wu, R. Pang, et al., “Streaming end-to-end
speech recognition for mobile devices,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 6381–6385, IEEE, 2019.
[31] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory
recurrent neural network architectures for large scale acoustic modeling,”
2014.
[32] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investi-
gating rnn-based speech enhancement methods for noise-robust text-to-
speech.,” in SSW, pp. 146–152, 2016.
[33] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-
to-speech system based on deep convolutional networks with guided
attention,” in 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 4784–4788, IEEE, 2018.
[34] D. Harwath and J. Glass, “Deep multimodal semantic embeddings for
speech and images,” in 2015 IEEE Workshop on Automatic Speech
Recognition and Understanding (ASRU), pp. 237–244, IEEE, 2015.
[35] E. Rothauser, "IEEE recommended practice for speech quality measurements," IEEE Trans. on Audio and Electroacoustics, vol. 17, pp. 225–
246, 1969.
[36] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with
recurrent neural networks,” in Proceedings of the 28th international
conference on machine learning (ICML-11), pp. 1017–1024, 2011.
[37] A. Graves, "Generating sequences with recurrent neural networks," arXiv
preprint arXiv:1308.0850, 2013.
[38] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang, “Compar-
ing MFCC and mpeg-7 audio features for feature extraction, maximum
likelihood HMM and entropic prior HMM for sports audio classifi-
cation,” in IEEE Int. Conference on Acoustics, Speech, and Signal
Processing (ICASSP)., vol. 5, 2003.
[39] J. J. Bird, E. Wanner, A. Ekárt, and D. R. Faria, "Phoneme aware
speech recognition through evolutionary optimisation,” in Proceedings
of the Genetic and Evolutionary Computation Conference Companion,
pp. 362–363, 2019.
[40] L. Muda, M. Begam, and I. Elamvazuthi, “Voice recognition algorithms
using mel frequency cepstral coefficient (MFCC) and dynamic time
warping (DTW) techniques,” arXiv preprint arXiv:1003.4083, 2010.
[41] M. Sahidullah and G. Saha, “Design, analysis and experimental eval-
uation of block based transformation in mfcc computation for speaker
recognition,” Speech Communication, vol. 54, no. 4, pp. 543–565, 2012.
[42] S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for the
measurement of the psychological magnitude pitch,” The Journal of the
Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937.
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization,"
arXiv preprint arXiv:1412.6980, 2014.