Phoneme Aware Speech Synthesis via Fine Tune
Transfer Learning with a Tacotron Spectrogram
Prediction Network
Jordan J. Bird¹, Anikó Ekárt², and Diego R. Faria³
School of Engineering and Applied Science
Aston University, Birmingham, UK
{birdj11, a.ekart2, d.faria3}@aston.ac.uk
Abstract. The implications of realistic human speech imitation are both promising and potentially dangerous. In this work, a pre-trained Tacotron Spectrogram Feature Prediction Network is fine-tuned with two 1.6-hour speech datasets for 100,000 learning iterations, producing two individual models. The two speech datasets are identical in content except for their textual representation: one follows standard written English, whereas the second is an English phonetic representation, in order to study the effects on the learning process. To test imitative ability post-training, thirty lines of speech are recorded from a human to be imitated. The models then attempt to produce these voice lines themselves, and the acoustic fingerprints of the outputs are compared to the real human speech. On average, English notation achieves 27.36%, whereas phonetic English notation achieves 35.31% similarity to a human being. This suggests that representing English through the International Phonetic Alphabet provides more useful data than the written English language. Thus, it is suggested from these experiments that a phoneme-aware paradigm would improve speech synthesis similarly to its effects in the field of speech recognition.
Keywords: Speech Synthesis, Fine Tune Learning, Phonetic Awareness, Fingerprint Analysis, Tacotron
1 Introduction
Are there imaginable digital
computers which would do well in
the imitation game?
Alan M. Turing
1950
Artificial Intelligence researchers often seek the goal of intelligent human imitation. In 1950, Alan Turing proposed the Turing Test, or as he famously called it, the ‘Imitation Game’ [1]. In the seven decades since this paradigm-altering query, computer scientists have continued to seek improved methods of true imitation of multi-faceted human nature. This paper explores a new method for the imitation of human speech in the audial sense. In this competition between two differing data representation methods, rather than a human judge, statistical analyses work to distinguish the differences between real and artificial voices. The ultimate goal of such thinking is to discover new methods of artificial speech synthesis that could fool a judge discerning between synthetic speech and a real human being, and thus to explore new strategies for winning an Imitation Game.
Speech synthesis is a rapidly growing field of artificial data generation, notable not only for its usefulness in modern society but also for its computational complexity; the resource usage for training and synthesising human-like speech is taxing for even the most powerful hardware available to the consumer today. When hyper-realistic human speech synthesis technologies arrive, the implications for current security standards are grave. In a social age where careers and lives can be dramatically changed, or even ruined, by public perception, the ability to synthesise realistic speech could carry world-altering consequences. This report therefore serves not only as an exploration of the effects of phonetic awareness in speech synthesis, as an original scientific contribution, but also as a warning and a suggested line of thought for the information security community. As a far less grave example of the implications of speaker-imitative speech synthesis, there are many diseases and accidents that result in a person losing their voice; Motor Neurone Disease, for example, causes this through weakness in the tongue, lips, and vocal cords [2, 3]. In this study, only 1.6 hours of data are used for fine-tune transfer learning to derive realistic speech synthesis, and performance would likely improve with more data. Should enough data be collected before a person loses the ability to speak, a Text-To-Speech (TTS) system developed following the pipeline in this study could potentially offer a second chance by providing a digital voice which closely resembles the one that was lost.
This project presents a preliminary state-of-the-art contribution to the field of speech synthesis for human-machine interaction through imitation. Two differing methods of data preprocessing are presented before a deep neural network, in the form of Tacotron, learns to synthesise speech from the data. Firstly, the standard English text format is benchmarked, and then compared to a representation via the International Phonetic Alphabet (IPA), in order to explore the effects on the overall data. State-of-the-art implementations of speech synthesis often base learning on datasets of raw dictated text; this study presents preliminary explorations into the suggested new paradigm of phonetic translations of the original English text, rather than raw text.
Demonstrations are available at
http://jordanjamesbird.com/tacotron/tacotrontest.html
The remainder of this paper is structured as follows. Firstly, a background of phonetic English, the Tacotron model, and statistical acoustic fingerprinting is explored in Section 2. Secondly, the method of this experiment is described in Section 3; the method comprises preprocessing, training, and statistical validation. Though secondary scientific research is reviewed within the background section, citations for the practices involved are also given within the method section where appropriate. Preliminary results are presented in Section 4. Finally, Section 5 presents future research projects based on the findings of this work, the implications of said findings, and a final conclusion.
2 Background
2.1 English Language and its Phonetics
The English language in its modern form is an amalgamation of largely Old English and Old French, which were mostly spoken in their respective countries until the Norman invasion and subsequent conquest of England in 1066. The language of the upper classes of post-conquest Britain, and thus of most of the literary works, was a form of Anglo-Norman. The Anglo-Norman language, with influence from surrounding European nations, then began to form into the English language that is spoken today [4]. Research suggests that although the language has changed greatly in the 953 years since, the phonetic structure has undergone relatively slight changes [5], suggesting that phonetics are a more stable representation of language than the written word. This is the subject of the study of phonology [6]. The spoken phonemes were formally presented in 1888 through the IPA, or International Phonetic Alphabet.
The biological limitations of the human voice limit the number of sounds that can be produced. The places of articulation are labial, dental, alveolar, post-alveolar, palatal, velar, and glottal; these can in turn be combined with the nasal, plosive, fricative, and approximant manners of articulation [7]. Such categories contain all of the sounds that have made up all human languages since the advent of spoken language [8]. In the International and English Phonetic Alphabets, sounds are denoted by 44 unique symbols, representing each of the different sounds in the British dialect.
Previous work found success in replacing the nominal outputs of a speech recognition system with a phonetic representation [9]. This research project explores the effects of phonetic representation for the synthesis of speech rather than for its recognition.
2.2 Tacotron
Tacotron is a Spectrogram Feature Prediction deep learning network [10, 11], inspired by the architectures of Recurrent Neural Networks (RNNs) in the form of Long Short-Term Memory (LSTM). The Tacotron model uses character embeddings to represent the text, as well as the spectrogram of the audio wave. Recurrent architectures are utilised for their capacity for temporal awareness, since speech is a temporal activity [12, 13]. That is, where frame n does not occur at the start or end of the wave, it is directly influenced by, and thus has predictive ability both for and from, frames n-1 and n+1. Since audio sequences may be lengthy, a situation in which recurrence tends to fail, ‘attention’ is modelled in order to allow for long sequences in temporal learning and their representation [14].
Actual speech synthesis, the translation of a spectrogram to audio data, is performed via the Griffin-Lim algorithm [15]. This algorithm performs signal estimation from a Short-Time Fourier Transform (STFT) by iteratively minimising the Mean Squared Error (MSE) between the estimated STFT and the modified STFT. The STFT is a Fourier transform in which the sinusoidal frequency content of local sections of a signal is determined [16].
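As a rough illustration of this iterative estimation, the following NumPy sketch imposes a target STFT magnitude and repeatedly re-estimates the phase. The window and hop sizes are illustrative, and this is a minimal sketch of the Griffin-Lim idea rather than the implementation used in the experiments.

```python
import numpy as np

WIN, HOP = 256, 128  # analysis window and hop size (illustrative values)

def stft(x):
    """STFT with a Hann window: one complex spectrum per frame."""
    w = np.hanning(WIN)
    frames = range(0, len(x) - WIN + 1, HOP)
    return np.array([np.fft.rfft(w * x[i:i + WIN]) for i in frames])

def istft(S):
    """Overlap-add inverse STFT with window-square normalisation."""
    w = np.hanning(WIN)
    n = HOP * (len(S) - 1) + WIN
    x, norm = np.zeros(n), np.zeros(n)
    for i, spec in enumerate(S):
        x[i * HOP:i * HOP + WIN] += w * np.fft.irfft(spec, WIN)
        norm[i * HOP:i * HOP + WIN] += w ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=60):
    """Estimate a waveform whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft(mag * angles)                  # impose the target magnitude
        angles = np.exp(1j * np.angle(stft(x)))  # keep only the re-estimated phase
    return istft(mag * angles)
```

Each iteration enforces the target magnitude, inverts to a waveform, and retains only the phase of that waveform's STFT, which is what drives the MSE between the estimated and modified STFTs down.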
Alternate notations of English, through encoding and flagging, have been shown to provide more understanding of various speech artefacts. A recent work by researchers at Google found that spoken prosody could be produced [17]; the work's notation allowed for the patterns of stress and intonation in a language. The implementation of a WaveNet [18] has been shown to produce similarity gains of 50% when used in addition to the Tacotron architecture.
2.3 Acoustic Fingerprint
Acoustic Fingerprinting is the process of producing a summary of an audio signal
in order to identify or locate similar samples of audio data[19]. To produce simi-
larity, alignment of audio is performed and subsequently the two time-frequency
graphs (spectrograms) have the distance between their statistical properties such
as peaks measured. This process is performed in order to produce a percentage
similarity between a pair of audio clips.
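A toy version of peak-based fingerprint comparison might look as follows. This is a simplified sketch for intuition only, not the specific fingerprinting algorithm used in the experiments; the function names and parameters are illustrative, and real systems hash peak constellations rather than comparing frames directly.

```python
import numpy as np

def peak_sets(x, win=256, hop=128, n_peaks=5):
    """Per-frame sets of the n_peaks strongest frequency bins."""
    w = np.hanning(win)
    out = []
    for i in range(0, len(x) - win + 1, hop):
        mag = np.abs(np.fft.rfft(w * x[i:i + win]))
        out.append(set(np.argsort(mag)[-n_peaks:]))  # indices of loudest bins
    return out

def fingerprint_similarity(a, b):
    """Percentage of spectral peaks shared, compared frame by frame."""
    pa, pb = peak_sets(a), peak_sets(b)
    n = min(len(pa), len(pb))
    shared = sum(len(pa[i] & pb[i]) for i in range(n))
    total = sum(len(pa[i]) for i in range(n))
    return 100.0 * shared / total
```

An identical pair of clips scores 100%, while clips with energy at different frequencies share few peaks and score much lower.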
Fingerprint similarity measures allow for the identification of data within a large library; the algorithm operated by the music search engine Shazam allows for the detection of a song from a database of many millions [20]. Detection in many cases was successfully performed with only a few milliseconds of search data. Though such algorithms are often used for plagiarism detection and search engines within the entertainment industries, spoofing a high similarity would suggest that the artificial data closely matches the real data. In this experiment, this is performed by comparing the fingerprint similarities of audio produced by a human against audio produced by the Griffin-Lim algorithm from the spectrographic predictions of the Tacotron networks.
3 Method
3.1 Data Collection and Preprocessing
An original dataset of 950 megabytes (1.6 hours, 902 .wav clips) of audio was collected and preprocessed for the following experiments. This subsection describes the processes involved. Due to security concerns, the dataset is not available and is thus described in greater detail within this section.
The ‘Harvard Sentences’¹ were suggested within the IEEE Recommended Practices for Speech Quality Measurements in 1969 [21]. The set of 720 sentences and their important phonetic structures are derived from the IEEE Recommended Practices and are often used as a measurement of quality for Voice over Internet Protocol (VoIP) services [22, 23]. All 720 sentences were recorded by the subject, as well as tense or subject alternatives where available; e.g., sentence 9, "Four hours of steady work faced us", was also recorded as "We were faced with four hours of steady work".
The aforementioned IEEE best practices were based on ranges of phonetic pangrams. A sentence or phrase that contains all of the letters of the alphabet is known as a pangram; for example, "The quick brown fox jumps over the lazy dog" contains all of the English alphabetical characters at least once. A phonetic pangram, on the other hand, is a sentence or phrase which contains examples of all of the phonetic sounds of a language. For example, the phrase "that quick beige fox jumped in the air over each thin dog. Look out, I shout, for he's foiled you again, creating chaos" requires the pronunciation of every one of the 44 phonetic sounds that make up British English. 100 British-English phonetic pangrams were recorded.
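Checking a candidate sentence against the alphabetic pangram definition above is straightforward; a phonetic pangram check would additionally require a pronunciation lexicon. A minimal sketch:

```python
import string

def is_pangram(sentence: str) -> bool:
    """True if every letter of the English alphabet appears at least once."""
    return set(string.ascii_lowercase) <= set(sentence.lower())
```

For instance, "The quick brown fox jumps over the lazy dog" passes, while an ordinary Harvard sentence such as "Four hours of steady work faced us" does not.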
The final step of data collection extended the approximately 500MB of data closer to the 1GB mark: random articles were chosen from Wikipedia, and random sentences from said articles were recorded. Ultimately, all of the data was transcribed into either raw English text or phonetic structure (where lingual sounds are replaced by IPA symbols), in order to provide a text input for every audio clip. From this, the two datasets were produced, in order to compare the two pre-processing approaches. All of the training takes place on the 2816 CUDA cores of an Nvidia GTX 980Ti Graphics Processing Unit, with the exception of the Griffin-Lim algorithm, which is executed on an AMD FX8320 8-core Central Processing Unit at a clock speed of 3500MHz.
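The phonetic transcription step can be sketched as a lexicon lookup. The mini-lexicon below is hypothetical, since the paper does not publish its full grapheme-to-phoneme mapping over the 44 British English phonemes; the snippet only illustrates the change of representation, not the authors' pipeline.

```python
# Hypothetical mini-lexicon; the paper's full word-to-IPA mapping is not published.
IPA_LEXICON = {
    "hello": "həˈləʊ",
    "how": "haʊ",
    "are": "ɑː",
    "you": "juː",
}

def to_ipa(sentence: str) -> str:
    """Replace each known word with its IPA form; unknown words pass through."""
    words = (w.strip(",.?!\"").lower() for w in sentence.split())
    return " ".join(IPA_LEXICON.get(w, w) for w in words)
```

The same audio clip is thus paired with either the raw text ("Hello, how are you?") or its phonetic rendering, which is the only difference between the two training datasets.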
3.2 Fine Tune Training and Statistical Validation
The initial network is trained on the LJ Speech Dataset² for 700,000 iterations. The dataset contains 13,100 clips of a speaker reading from non-fiction books, along with a transcription. The longest clip is 10.1 seconds, the shortest is 1.1 seconds, and the average clip duration is 6.5 seconds. The speech is made up of 13,821 unique words, with an average of 17 per clip. Following this, the two datasets of English language and English phonetics are introduced, and fine-tune training is performed for two different models, for 100,000 iterations each. Thus, in total, 800,000 learning iterations are performed, where the final 12.5% of the learning uses the two differing representations of English.

¹ https://www.cs.columbia.edu/~hgs/audio/harvard.html
² https://keithito.com/LJ-Speech-Dataset/
Table 1. Ten Strings for Benchmark Testing which are Comprised of all English Sounds

ID  String
1   "Hello, how are you?"
2   "John bought three apples with his own money."
3   "Working at a University is an enlightening experience."
4   "My favourite colour is beige, what's yours?"
5   "The population of Birmingham is over a million people."
6   "Dinosaurs first appeared during the Triassic period."
7   "The sea shore is a relaxing place to spend one's time."
8   "The waters of the Loch impressed the French Queen."
9   "Arthur noticed the bright blue hue of the sky."
10  "Thank you for listening!"
For comparison of the two models, statistical fingerprint similarity is measured. This is because model outputs are of an opinionated quality, i.e. how realistic the speech sounds from a human point of view; this is not captured by the benchmarking of models, and thus comparing the losses of the two model training processes would yield no opinion-based measurement. To perform the comparison, natural human speech is recorded by the subject that the model is trained to imitate. The two models also produce these phrases, and the fingerprint similarities between the models and the real human are compared. A higher similarity suggests a better ability of imitation, and thus better quality speech produced by the model.

The set of 10 strings is presented in Table 1. Overall, this data includes all sounds within the English language at least once. This validation data is recorded by the human subject to be imitated, as well as by the speech synthesis models. Each of the phrases is recorded three times by the subject, and comparisons are made between the model and each of the three recordings, comprising thirty tests per model.
Figures 1-3 show examples of spectrographic representations when both a human being and a Tacotron network speak the sentence "Working at a University is an Enlightening Experience". Though the frequencies are slightly mismatched, in that the network appears to predict higher frequencies than those in the human speech, the peaks within the data that delineate individually-spoken words
Fig. 1. Spectrogram of ”Working at a University is an Enlightening Experience” when
spoken by a human being.
Fig. 2. Spectrogram of ”Working at a University is an Enlightening Experience” when
predicted by the English Written Text Tacotron network.
Fig. 3. Spectrogram of ”Working at a University is an Enlightening Experience” when
predicted by the Phonetically Aware Tacotron network.
are closely matched by the Tacotron prediction. Though the two predictions look similar, the fingerprint similarity of the phonetically aware prediction is far closer to the human's; this is due to the fingerprint's consideration of the most important features, rather than simply the distance between two matrices of values. Additionally, the timings of values are not considered: the algorithm produces a best alignment of the pair of waves before analysing their similarity. For example, the largest peak is the first syllable of the word "University", and thus those two peaks are compared, rather than the differing data that would be compared had alignment not been performed. Therefore, silence before and after a spoken phrase is not considered; rather, only the phrase from its initial inception to its final termination.
4 Preliminary Results
Within this section, the preliminary results are presented. Firstly, the acoustic
fingerprint similarities of the models and human voices are compared. Finally,
the average results for the two models are compared with one another.
Table 2. Thirty Similarity Tests Performed on the Raw English Speech Synthesis Model with Averages of Sentences and Overall Average Scores. Failures are denoted by F. Overall average is given as the average of experiments 1, 2 and 3.

Phrase  Exp. 1  Exp. 2  Exp. 3  Avg.
1       F       F       F       0 (3F)
2       22.22   22.22   66.67   37.02
3       56.6    56.7    75.4    62.9
4       0       51.28   0       17.09 (2F)
5       6       2       4       4
6       20      41.67   62.5    41.39
7       55.56   18.52   22.43   32.17
8       24.39   24.39   48.78   32.55
9       22.72   22.72   22.72   22.72
10      F       71.4    F       23.8 (2F)
Avg.    20.74   31.09   30.25   27.36
Table 2 shows the results of the tests on the raw English speech dataset. Of the thirty experiments, 23% (7/30) were failures with no semblance of similarity to the natural speech. One phrase, phrase 1, was a total failure, with all three experiments scoring zero. Overall, the generated data resembled the human data by an average of 27.36%.
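The per-phrase averages in Table 2 (and Table 3 below) are consistent with failures (F) being counted as zero when averaging; for instance, phrase 10 averages 71.4/3 = 23.8. A small helper reproducing this convention, written only to make the table arithmetic explicit:

```python
def phrase_average(scores):
    """Average of three similarity tests; failures ("F") count as zero."""
    vals = [0.0 if s == "F" else float(s) for s in scores]
    return round(sum(vals) / len(vals), 2)
```

Applying it to phrase 4 of Table 2 ("0, 51.28, 0") gives 17.09, matching the published column.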
Table 3 shows the results of the tests on the phonetic English speech dataset. Of the thirty experiments, 13% (4/30) were failures with no semblance of similarity to the natural speech; this was slightly lower than the raw English
Table 3. Thirty Similarity Tests Performed on the Phonetic English Speech Synthesis Model with Averages of Sentences and Overall Average Scores. Failures are denoted by F. Overall average is given as the average of experiments 1, 2 and 3.

Phrase  Exp. 1  Exp. 2  Exp. 3  Avg.
1       58.8    0       0       19.6 (2F)
2       85.7    28.57   57.14   57.14
3       93.7    78.12   46.88   72.9
4       51.28   25.64   25.64   34.19
5       38.46   38.46   39      38.64
6       35.71   35.71   17.8    29.74
7       34.5    34      17.2    28.57
8       43.4    21.7    43.48   36.19
9       20.4    20      26.4    22.27
10      0       41.6    0       13.89 (2F)
Avg.    46.19   32.38   27.35   35.31
dataset. This said, no experiment suffered complete catastrophic failure in which all three tests scored zero. On average over the three experiments, the human data and the generated data were 35.31% similar. Figure 4 shows the average differences between the acoustic fingerprints of the human and artificial data in each of the ten sets of three experiments.
Comparing head-to-head results, the phonetics dataset produced experiments that on average outperformed the written-language dataset in six out of ten cases. This said, phrase nine was extremely close, with the two models achieving 22.27% and 22.72%, a negligible difference of only 0.45%. In the cases where the language set outperformed the phonetics set, the differences between the two were much smaller than in the converse outcomes. In terms of preliminary results, the phonetic representation of language gained the best results in human speech imitation when comparing the acoustic fingerprint metrics. Often, inconsistencies occur in the similarity of human and synthetic speech (in both approaches); this is likely due either to a lack of data within the training and validation sets, or to insufficient training time to form a stable model that produces consistent output, or, of course, a combination of the two. Further exploration via future experimentation could pinpoint the cause of the inconsistency.
5 Future Work and Concluding Implications
Although the results suggest that the phoneme-aware approach is preliminarily more promising than raw English notation, the phonetic-awareness approach faced a disadvantage in the fine-tuning process. The pre-existing model was trained with raw English language in a US dialect, and fine-tuned for raw
Fig. 4. Comparison of the two Approaches for the Average of Ten Sets of Three Experiments. (Bar chart of average experiment accuracy (%) for English Language vs. English Phonetics across phrases 1-10.)
English language in a GB dialect as well as English phonetics in a GB dialect. Thus, the phonetic model would require more training in order to overcome its disadvantaged starting point. For a fairer comparison, future models should be trained from an initial random distribution of network weights for their respective datasets. In addition, it must be pointed out that the input data from the English written-text dataset had 26 unique alphabetic values, whereas this is extended in the second dataset, since there are 44 unique phonemes that make up the spoken English language in a British dialect. Statistical validation through the comparison of acoustic fingerprints is considered, with similarities to real speech compared on the same input sentence or phrase. Though an acoustic fingerprint does give a concrete comparison between pairs of output
data, human opinion is still not properly reflected. For this, as the Tacotron paper did, a Mean Opinion Score (MOS) should also be computed. MOS is given as MOS = (1/N) Σ_{n=1}^{N} R_n, where R_n is the rating score given by the n-th member of a subject group of N participants; thus, this is simply the average rating given by the audience.
MOS requires a large audience to give their opinions, denoted by nominal scores, in order to rate the networks in terms of human hearing; human hearing and audial understanding is an ability a Turing Machine does not yet have. Such a MOS would allow a second metric, real opinion, to also provide a score. A multi-objective problem is then presented, through the maximisation of acoustic fingerprint similarity as well as of the opinion of the audience. Additionally, other spectrogram prediction paradigms, such as Tacotron 2 and DCTTS, should be studied in terms of the effects of English vs. phonetic English.
As mentioned in the previous section, further work should also be performed to pinpoint the cause of the models' inconsistent output. Exploring the effects of a larger dataset, as well as more training time for the model, could discover the cause of the inconsistency and help to produce a stronger training paradigm for speech synthesis.
To conclude, 100,000 extra iterations of training on top of a model pre-trained on a publicly available dataset, fine-tuned on a human dataset of only 1.6 hours' worth of speech translated to phonetic structure, produced a network with the ability to reproduce new speech at 35.31% similarity. It is not at all out of the question for post-processing to make such data completely realistic, which could then be 'leaked' to the media, the law, or otherwise. Such findings present a dangerous situation, in which a person's speech could be imitated to create the illusion of evidence that they have said things which in reality they have not. Though this paper serves primarily as a method of maximising artificial imitative abilities, it should also serve as a grave warning, to motivate minimising the potential implications for an individual's life. Future information security research should, and arguably must, discover competing methods for the detection of spoofed speech in order to prevent such cases. On the other hand, realistic speech synthesis could be used in real time for more positive means, such as an augmented voice for those suffering an illness that could result in the loss of the ability to speak.
References
1. A. M. Turing, “Computing machinery and intelligence,” Mind, vol. 59, no. 236, pp. 433–460, 1950.
2. L. Locock, S. Ziebland, and C. Dumelow, “Biographical disruption, abruption and
repair in the context of motor neurone disease,” Sociology of health & illness,
vol. 31, no. 7, pp. 1043–1058, 2009.
3. J. Yamagishi, C. Veaux, S. King, and S. Renals, “Speech synthesis technologies for
individuals with vocal disabilities: Voice banking and reconstruction,” Acoustical
Science and Technology, vol. 33, no. 1, pp. 1–5, 2012.
4. A. C. Baugh and T. Cable, A history of the English language. Routledge, 1993.
5. H. R. Loyn, Anglo Saxon England and the Norman Conquest. Routledge, 2014.
6. V. Fromkin, R. Rodman, and N. Hyams, An Introduction to Language. Cengage,
2006.
7. I. R. Titze and D. W. Martin, Principles of voice production. 1994.
8. W. Menzel, E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton, and
C. Souter, “The isle corpus of non-native spoken english,” in Proceedings of LREC
2000: Language Resources and Evaluation Conference, vol. 2, pp. 957–964, Euro-
pean Language Resources Association, 2000.
9. J. J. Bird, E. Wanner, A. Ekárt, and D. R. Faria, “Phoneme aware speech recognition through evolutionary optimisation,” in The Genetic and Evolutionary Computation Conference, GECCO, 2019.
10. Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang,
Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-to-end speech synthe-
sis,” arXiv preprint arXiv:1703.10135, 2017.
11. H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech
system based on deep convolutional networks with guided attention,” in 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 4784–4788, IEEE, 2018.
12. H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural
network architectures for large scale acoustic modeling,” in Fifteenth annual con-
ference of the international speech communication association, 2014.
13. X. Li and X. Wu, “Constructing long short-term memory based deep recurrent neu-
ral networks for large vocabulary speech recognition,” in 2015 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4520–4524,
IEEE, 2015.
14. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly
learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
15. D. Griffin and J. Lim, “Signal estimation from modified short-time fourier trans-
form,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32,
no. 2, pp. 236–243, 1984.
16. E. Sejdić, I. Djurović, and J. Jiang, “Time–frequency feature representation using energy concentration: An overview of recent advances,” Digital Signal Processing, vol. 19, no. 1, pp. 153–183, 2009.
17. R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J.
Weiss, R. Clark, and R. A. Saurous, “Towards end-to-end prosody transfer for
expressive speech synthesis with tacotron,” international conference on machine
learning, pp. 4693–4702, 2018.
18. M. Zhang, X. Wang, F. Fang, H. Li, and J. Yamagishi, “Joint training framework
for text-to-speech and voice conversion using multi-source tacotron and wavenet,”
arXiv preprint arXiv:1903.12389, 2019.
19. J. Bormans, J. Gelissen, and A. Perkis, “Mpeg-21: The 21st century multimedia
framework,” IEEE Signal Processing Magazine, vol. 20, no. 2, pp. 53–62, 2003.
20. A. Wang et al., “An industrial strength audio search algorithm.,” in Ismir,
vol. 2003, pp. 7–13, Washington, DC, 2003.
21. IEEE, IEEE Transactions on Audio and Electroacoustics, vol. 21. IEEE, 1973.
22. K. Yochanang, T. Daengsi, T. Triyason, and P. Wuttidittachotti, “A compara-
tive study of voip quality measurement from g. 711 and g. 729 using pesq and
thai speech,” in International Conference on Advances in Information Technology,
pp. 242–255, Springer, 2013.
23. N. Yankelovich, J. Kaplan, J. Provino, M. Wessler, and J. M. DiMicco, “Improving
audio conferencing: are two ears better than one?,” in Proceedings of the 2006
20th anniversary conference on Computer supported cooperative work, pp. 333–
342, ACM, 2006.
... Information on the subjects can be seen in Table I. Subjects speak five random Harvard Sentences sentences from the IEEE recommended practice for speech quality measurements [56], and so contain most of the spoken phonetic sounds in the English language [57]. Importantly, this is a user-friendly process, because it requires only a few short seconds of audio data. ...
Preprint
Full-text available
In speech recognition problems, data scarcity often poses an issue due to the willingness of humans to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by learning from the data provided on a per-subject basis. A neural network is trained to classify the data against a large dataset of Flickr8k speakers and is then compared to a transfer learning network performing the same task but with an initial weight distribution dictated by learning from the synthetic data generated by the two models. The best result for all of the 7 subjects were networks that had been exposed to synthetic data, the model pre-trained with LSTM-produced data achieved the best result 3 times and the GPT-2 equivalent 5 times (since one subject had their best result from both models at a draw). Through these results, we argue that speaker classification can be improved by utilising a small amount of user data but with exposure to synthetically-generated MFCCs which then allow the networks to achieve near maximum classification scores.
Thesis
Full-text available
In modern Human-Robot Interaction, much thought has been given to accessibility regarding robotic locomotion, specifically the enhancement of awareness and lowering of cognitive load. On the other hand, with social Human-Robot Interaction considered, published research is far sparser, given that the problem is less explored than pathfinding and locomotion. This thesis studies how one can endow a robot with affective perception for social awareness in verbal and non-verbal communication. This is made possible by the creation of a Human-Robot Interaction framework which abstracts machine learning and artificial intelligence technologies, allowing greater accessibility for non-technical users than the current State of the Art in the field. These studies thus initially focus on individual robotic abilities in the verbal, non-verbal and multimodality domains. Multimodality studies show that late data fusion of image and sound can improve environment recognition, and similarly that late fusion of Leap Motion Controller and image data can improve sign language recognition ability. To alleviate several of the open issues currently faced by researchers in the field, guidelines are reviewed from the relevant literature and met by the design and structure of the framework that this thesis ultimately presents. The framework recognises a user's request for a task through a chatbot-like architecture. Through research in this thesis on human data augmentation (paraphrasing) and subsequent classification via language transformers, the robot's more advanced Natural Language Processing abilities allow for a wider range of recognised inputs. That is, as examples show, phrases that could be expected to be uttered during a natural human-human interaction are easily recognised by the robot.
This allows for accessibility to robotics without the need to physically interact with a computer or write any code, with only the ability of natural interaction (an ability which most humans have) required for access to all the modular machine learning and artificial intelligence technologies embedded within the architecture. Following the research on individual abilities, this thesis then unifies all of the technologies into a deliberative interaction framework, wherein abilities are accessed from long-term memory modules and short-term memory information such as the user's tasks, sensor data, retrieved models, and finally output information. In addition, algorithms for model improvement are also explored, such as through transfer learning and synthetic data augmentation, and so the framework performs autonomous learning to these ends to constantly improve its abilities. It is found that transfer learning between electroencephalographic and electromyographic biological signals improves the classification of each, given their slight physical similarities. Transfer learning also aids in environment recognition when transferring knowledge from virtual environments to the real world. In another example of non-verbal communication, it is found that learning from a scarce dataset of American Sign Language for recognition can be improved by multi-modality transfer learning from hand features and images taken from a larger British Sign Language dataset. Data augmentation is shown to aid in electroencephalographic signal classification by learning from synthetic signals generated by a GPT-2 transformer model, and, in addition, augmenting training with synthetic data also shows improvements when performing speaker recognition from human speech.
Given the importance of platform independence due to the growing range of available consumer robots, four use cases are detailed, and examples of behaviour are given by the Pepper, Nao, and Romeo robots as well as a computer terminal. The use cases involve a user requesting their electroencephalographic brainwave data to be classified by simply asking the robot whether or not they are concentrating. In a subsequent use case, the user asks if a given text is positive or negative, to which the robot correctly recognises the natural language processing task at hand and then classifies the text; the result is output and the physical robots react accordingly by showing emotion. The third use case has a request for sign language recognition, which the robot recognises and thus switches from listening to watching the user communicate with them. The final use case focuses on a request for environment recognition, which has the robot perform multimodality recognition of its surroundings and note them accordingly. The results presented by this thesis show that several of the open issues in the field are alleviated through the technologies within, structuring of, and examples of interaction with the framework. The results also show the achievement of the three main goals set out by the research questions: the endowment of a robot with affective perception and social awareness for verbal and non-verbal communication; whether we can create a Human-Robot Interaction framework to abstract machine learning and artificial intelligence technologies and allow accessibility for non-technical users; and, as previously noted, which current issues in the field can be alleviated by the framework presented, and to what extent.
Conference Paper
Full-text available
Phoneme awareness provides the path to high-resolution speech recognition, overcoming the difficulties of classical word recognition. Here we present the results of a preliminary study on Artificial Neural Network (ANN) and Hidden Markov Model (HMM) methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet, with a specific focus on evolutionary optimisation of bio-inspired classification methods. A set of audio clips is recorded by subjects from the United Kingdom and Mexico. For each recording, the data were pre-processed using Mel-Frequency Cepstral Coefficients (MFCC) with a sliding window of 200 ms per data object, as well as a further MFCC time-series format for forecast-based models, to produce the dataset. We found that an evolutionarily optimised deep neural network achieves 90.77% phoneme classification accuracy, as opposed to the best HMM of 150 hidden units achieving 86.23% accuracy. Many of the evolutionary solutions take substantially longer to train than the HMM; however, one solution scoring 87.5% (+1.27%) requires fewer resources than the HMM.
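As a rough illustration of the 200 ms sliding-window MFCC pre-processing mentioned in this abstract, the numpy sketch below frames a signal into 200 ms windows and applies a simplified cepstral transform (log power spectrum followed by a DCT-II). Note the simplifications: it omits the mel filterbank of true MFCCs, and the 100 ms hop and 16 kHz sample rate are assumptions for the example, not details from the paper.

```python
import numpy as np

def frames_200ms(signal, sr, hop_s=0.1):
    """Slice a mono signal into 200 ms windows (100 ms hop is an assumption)."""
    win = int(0.2 * sr)
    hop = int(hop_s * sr)
    n = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop : i * hop + win] for i in range(n)])

def simplified_cepstrum(frame, n_coeffs=13):
    """Log power spectrum then DCT-II -- omits the mel filterbank of real MFCCs."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    log_power = np.log(power + 1e-10)
    k = np.arange(len(log_power))
    # DCT-II basis built explicitly so the sketch needs only numpy.
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1) / (2 * len(k)))
    return basis @ log_power

sr = 16000
t = np.arange(sr)                                # one second of audio
signal = np.sin(2 * np.pi * 440 * t / sr)        # a 440 Hz test tone
frames = frames_200ms(signal, sr)
coeffs = np.stack([simplified_cepstrum(f) for f in frames])
print(coeffs.shape)  # one 13-coefficient vector per 200 ms window
```

One second of audio at a 100 ms hop yields nine overlapping 200 ms windows, so each clip becomes a short sequence of 13-dimensional vectors — the "data objects" the abstract feeds to its classifiers.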
Article
Full-text available
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
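The "(soft-)search" this abstract describes is additive attention: score each encoder annotation against the previous decoder state, softmax the scores into alignment weights, and take the weighted sum of annotations as the context vector. A minimal numpy sketch follows; all dimensions and the random weight matrices are illustrative stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8   # hidden size (illustrative)
T = 5   # source sentence length (illustrative)

h_enc = rng.normal(size=(T, d))   # encoder annotations h_1..h_T
s_prev = rng.normal(size=d)       # previous decoder hidden state

# Additive scoring: e_t = v^T tanh(W_a s_prev + U_a h_t)
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v = rng.normal(size=d)
e = np.tanh(s_prev @ W_a.T + h_enc @ U_a.T) @ v

# Softmax turns scores into alignment weights; context is their weighted sum.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()
context = alpha @ h_enc
print(alpha.round(3), context.shape)
```

Because the weights `alpha` are recomputed for every target word, the decoder is freed from the single fixed-length vector bottleneck the abstract identifies: each prediction can attend to a different part of the source sentence.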
Conference Paper
I do not have permission to share the manuscript here. Please see the following links: Published version @ IEEE Xplore [https://doi.org/10.1109/ICASSP.2018.8461829] Preprint @ arXiv [https://arxiv.org/abs/1710.08969]
Article
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
Article
Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we explore LSTM RNN architectures for large scale acoustic modeling in speech recognition. We recently showed that LSTM RNNs are more effective than DNNs and conventional RNNs for acoustic modeling, considering moderately-sized models trained on a single machine. Here, we introduce the first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines. We show that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance. This architecture makes more effective use of model parameters than the others considered, converges quickly, and outperforms a deep feed forward neural network having an order of magnitude more parameters.
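The architecture this abstract highlights — an LSTM cell whose output passes through a linear recurrent projection layer before being fed back — can be sketched for a single layer in a few lines of numpy. Sizes and random weights are illustrative; in PyTorch, `nn.LSTM`'s `proj_size` argument provides roughly this behaviour out of the box.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_cell, n_proj = 4, 16, 8   # illustrative layer sizes

# Gate weights act on [input; projected recurrent state], so the recurrent
# connection is n_proj wide instead of n_cell wide -- that is the whole trick.
W = rng.normal(scale=0.1, size=(4 * n_cell, n_in + n_proj))
b = np.zeros(4 * n_cell)
W_proj = rng.normal(scale=0.1, size=(n_proj, n_cell))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, r_prev, c_prev):
    """One step of an LSTM with a linear recurrent projection (LSTMP)."""
    z = W @ np.concatenate([x, r_prev]) + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    r = W_proj @ h          # project before recurrence and output
    return r, c

r, c = np.zeros(n_proj), np.zeros(n_cell)
for x in rng.normal(size=(10, n_in)):   # a toy 10-step input sequence
    r, c = lstmp_step(x, r, c)
print(r.shape, c.shape)
```

Projecting the 16-dimensional cell output down to 8 dimensions shrinks every recurrent weight matrix, which is why the abstract reports more effective use of parameters than a plain LSTM of the same cell size.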
Article
An overview of the clinical applications of speech synthesis technologies was presented and a few selected research efforts were explained. The University of Edinburgh's new project, called 'Voice Banking and Reconstruction', for patients suffering from degenerative diseases such as motor neurone disease (MND) and Parkinson's disease, was introduced, and it was shown how speech synthesis technologies improved the quality of life of the patients. It was demonstrated that alternative augmentative communication (AAC) devices were used when individuals lost the ability to produce their own speech for neurological or other reasons. Standard text-to-speech (TTS) synthesizers, such as the Klatt formant synthesizer or unit-selection synthesizers, were embedded as speech output functions in such devices.
Conference Paper
This paper presents a study of VoIP quality measurements from two popular codecs, G.711 and G.729, using the method of Perceptual Evaluation of Speech Quality (PESQ) and Thai speech. In this study, from four lists of Thai speech, it has been found that G.711 provides better voice quality than G.729 in every condition of packet loss. It has also been found that the Objective Listening Quality - Mean Opinion Score (MOS-LQO) of male speech is slightly higher than the MOS-LQO of female speech, whereas the MOS of child speech is the lowest. Then, MOS-LQO values from the four Thai speech lists have been compared. Next, MOS-LQO from PESQ of male and female speech in the best condition has been compared with the Subjective Listening Quality - Mean Opinion Score (MOS-LQS) from ACR listening tests in another laboratory. Lastly, referring to packet loss effects, objective MOS from PESQ has been compared with subjective MOS from conversation tests. It has been found that there is no significant difference among MOS-LQO from the four Thai speech lists, but there is a significant difference between subjective MOS and objective MOS from each codec in each condition. Therefore, one can say that this is evidence that PESQ requires intensive study with Thai speech in order to confidently modify PESQ for VoIP quality measurement in Thai environments.