Phoneme Aware Speech Synthesis via Fine Tune
Transfer Learning with a Tacotron Spectrogram
Prediction Network
Jordan J. Bird1, Anikó Ekárt2, and Diego R. Faria3
School of Engineering and Applied Science
Aston University, Birmingham, UK
{birdj11, a.ekart2, d.faria3}@aston.ac.uk
Abstract. The implications of realistic human speech imitation are
both promising and potentially dangerous. In this work, a pre-trained
Tacotron Spectrogram Feature Prediction Network is fine-tuned with
two 1.6 hour speech datasets for 100,000 learning iterations, producing
two individual models. The two speech datasets are identical in content
except for their textual representation: one follows standard written
English, whereas the second is an English phonetic representation, in
order to study the effects on the learning process. To test imitative
abilities post-training, thirty lines of speech are recorded from the
human to be imitated. The models then attempt to produce these voice
lines themselves, and the acoustic fingerprints of the outputs are
compared to the real human speech. On average, English notation achieves
27.36%, whereas Phonetic English notation achieves 35.31% similarity to
the human being. This suggests that representation of English through the
International Phonetic Alphabet serves as more useful data than written
English. Thus, these experiments suggest that a phoneme-aware paradigm
would improve the abilities of speech synthesis, similarly to its effects
in the field of speech recognition.
Keywords: Speech Synthesis, Fine Tune Learning, Phonetic Aware-
ness, Fingerprint Analysis, Tacotron
1 Introduction
Are there imaginable digital
computers which would do well in
the imitation game?
Alan M. Turing
1950
Artificial Intelligence researchers often seek the goal of intelligent human im-
itation. In 1950, Alan Turing proposed the Turing Test, or as he famously called
it, the ‘Imitation Game’ [1]. In the seven decades since this paradigm-altering
query, Computer Scientists continue to seek improved methods of true imitation
of the multi-faceted human nature. This paper explores a new method for the
imitation of human speech in the audio domain. In this competition between
two differing data representation methods, rather than a human judge,
statistical analyses work to distinguish the differences between real and
artificial voices. The ultimate goal of such thinking is to discover new
methods of artificial speech synthesis that can fool a judge attempting to
discern between it and a real human being, and thus to explore new strategies
for winning an Imitation Game.
Speech synthesis is a rapidly growing field of artificial data generation,
not only because of its usefulness in modern society, but also because it
sits at the forefront of computational complexity. The resources required
for training and synthesising human-like speech tax even the most powerful
hardware available to the consumer today. When hyper-realistic speech
synthesis technologies are achieved, the implications for current security
standards are somewhat grave and dangerous. In a social age where careers
and lives could be dramatically changed, or even ruined, by public
perception, the ability to synthesise realistic speech could carry
world-altering consequences. This report therefore serves not only as an
exploration of the effects of phonetic awareness in speech synthesis, as an
original scientific contribution, but also as a warning and a suggested line
of inquiry for the information security community. To give a far less grave
example of the implications of speaker-imitative speech synthesis, there are
many diseases and accidents that result in a person losing their voice. For
example, Motor Neurone Disease causes this through weakness in the tongue,
lips, and vocal cords [2, 3]. In this study, only 1.6 hours of data are used
for fine-tune transfer learning in order to derive realistic speech
synthesis, and performance would likely improve with more data. Should
enough data be collected before a person loses their ability to speak, a
Text-To-Speech (TTS) system developed following the pipeline in this study
could potentially offer a second chance by providing a digital voice that
closely resembles the one that was unfortunately lost.
This project presents a preliminary state-of-the-art contribution in the field
of speech synthesis for human-machine interaction through imitation. Two
differing methods of data preprocessing are presented before a deep neural
network, in the form of Tacotron, learns to synthesise speech from the data.
Firstly, the standard English text format is benchmarked, and then compared
to a representation via the International Phonetic Alphabet (IPA) in order to
explore the effects on the overall data. State-of-the-art implementations of
speech synthesis often base learning on datasets of raw dictated text; this
study presents preliminary explorations into the suggested new paradigm of
phonetic translations of the original English text, rather than raw text.
Demonstrations are available at
http://jordanjamesbird.com/tacotron/tacotrontest.html
The remainder of this paper is organised as follows. Firstly, a background of
phonetic English, the Tacotron model, and statistical acoustic fingerprinting
is given in Section 2. Secondly, the method of this experiment is described in
Section 3; the method comprises preprocessing, training, and statistical
validation. Though secondary scientific research is reviewed within the
background section, due to the practices involved, citations are also given
within the Method section where appropriate. Section 4 presents the
preliminary results. Finally, Section 5 presents future research directions
based on the findings of this work, the implications of said findings, and a
final conclusion.
2 Background
2.1 English Language and its Phonetics
The English language in its modern form is an amalgamation of largely Old
English and Old French, which were mostly spoken in their respective countries
until the Norman invasion and subsequent conquest of England in 1066. The
language of the upper classes of post-Conquest Britain, and thus of most
literary works, was a form of Anglo-Norman. The Anglo-Norman language, with
influence from surrounding European nations, then began to develop into the
English language that is spoken today [4]. Research suggests that although the
language has changed greatly in the following 953 years, its phonetic
structure has undergone relatively slight change [5], suggesting that
phonetics are a better representation of the language than the written word;
the study of such sound systems is known as phonology [6]. All of the spoken
phonemes were formally presented in 1888 through the IPA, or International
Phonetic Alphabet.
The biological limitations of the human voice limit the number of sounds that
can be produced. Sounds are classified by place of articulation as labial,
dental, alveolar, post-alveolar, palatal, velar, or glottal, and in turn by
manner of articulation as nasal, plosive, fricative, or approximant [7]. Such
categories contain all of the sounds that make up human languages since the
emergence of spoken language [8]. In the International and English Phonetic
Alphabets, sounds are denoted by 44 unique symbols, one for each of the
different sounds in the British dialect.
Previous work found success in replacing the nominal outputs of a speech
recognition system with phonetic representation[9]. This research project ex-
plores the effects of phonetic representation but for the synthesis of speech rather
than its recognition.
2.2 Tacotron
Tacotron is a Spectrogram Feature Prediction Deep Learning Network[10, 11] in-
spired by the architectures of Recurrent Neural Networks (RNN) in the form of
Long Short-Term Memory (LSTM). The Tacotron model uses character embeddings
to represent the text, along with the spectrogram of the audio wave. Recurrent
architectures are utilised for their temporal awareness, since speech is a
temporal activity [12, 13]. That is, where a frame n does not occur at the
start or end of the wave, it is directly influenced by, and thus has predictive
ability for, frames n-1 and n+1. Since audio sequences may be long, a regime in
which plain recurrence tends to fail, an 'attention' mechanism is modelled in
order to allow long sequences to be learned and represented [14].
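As an illustration of the attention idea referenced above, the following is a
minimal NumPy sketch of additive (Bahdanau-style) attention [14], scoring each
encoder time step against the current decoder state. The dimensions and the
random 'learned' parameters are purely hypothetical stand-ins and do not
reflect the actual Tacotron implementation.

```python
import numpy as np

def additive_attention(query, keys, W_q, W_k, v):
    """Score every encoder time step against the current decoder state.

    query : (d_q,)     current decoder hidden state
    keys  : (T, d_k)   encoder hidden states, one per input time step
    W_q   : (d_a, d_q) projection of the query (would be learned)
    W_k   : (d_a, d_k) projection of the keys (would be learned)
    v     : (d_a,)     scoring vector (would be learned)
    """
    # e_t = v^T tanh(W_q q + W_k k_t) for every encoder step t
    energies = np.tanh(keys @ W_k.T + W_q @ query) @ v   # shape (T,)
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()                              # softmax over time steps
    context = weights @ keys                              # weighted sum of encoder states
    return context, weights

# Toy usage with random stand-in parameters.
rng = np.random.default_rng(0)
T, d_k, d_q, d_a = 6, 8, 8, 16
context, weights = additive_attention(
    rng.normal(size=d_q), rng.normal(size=(T, d_k)),
    rng.normal(size=(d_a, d_q)), rng.normal(size=(d_a, d_k)), rng.normal(size=d_a))
print(weights.round(3))  # attention over the 6 encoder steps, sums to 1
```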
Actual speech synthesis, the translation of the spectrogram to audio data, is
performed via the Griffin-Lim algorithm [15]. This algorithm performs signal
estimation from a Short-Time Fourier Transform (STFT) by iteratively
minimising the Mean Squared Error (MSE) between the estimated STFT and the
modified STFT. The STFT is a Fourier-related transform in which the sinusoidal
frequency content of local sections of a signal is determined [16].
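As a concrete, minimal sketch of this inversion step, the snippet below
reconstructs a waveform from a magnitude spectrogram using librosa's
Griffin-Lim implementation rather than the authors' own pipeline; the file
name, FFT size, hop length, and iteration count are illustrative assumptions.

```python
import numpy as np
import librosa
import soundfile as sf

# Load a (hypothetical) clip and keep only the magnitude of its STFT,
# discarding the phase, much as a spectrogram-prediction network outputs
# magnitudes only.
y, sr = librosa.load("speech_clip.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Griffin-Lim: iteratively estimate a phase consistent with the magnitudes,
# reducing the error between successive STFT estimates.
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256, win_length=1024)

sf.write("reconstructed.wav", y_hat, sr)
```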
Alternate notations of English, through encoding and flagging, have been shown
to provide a better understanding of various speech artefacts. A recent work by
researchers at Google found that spoken prosody could be reproduced [17]; the
work's notation captured the patterns of stress and intonation in a language.
The addition of a WaveNet-based network [18] has been shown to produce
similarity gains of 50% when used alongside the Tacotron architecture.
2.3 Acoustic Fingerprint
Acoustic fingerprinting is the process of producing a summary of an audio
signal in order to identify or locate similar samples of audio data [19]. To
measure similarity, the two clips are first aligned, and the distance between
statistical properties of their time-frequency representations (spectrograms),
such as peaks, is then measured. This process produces a percentage similarity
between a pair of audio clips.
Fingerprint similarity measures allow for the identification of data from a
large library; the algorithm operated by the music search engine Shazam allows
for the detection of a song from a database of many millions [20]. Detection in
many cases was successfully performed with only a few milliseconds of search
data. Though such algorithms are often used for plagiarism detection and search
engines within the entertainment industries, achieving a high similarity with
artificial data would suggest that it closely matches real data. This is done
in this experiment by comparing the fingerprint similarities of audio produced
by a human against the audio produced by the Griffin-Lim algorithm on the
spectrographic predictions of the Tacotron networks.
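The exact fingerprinting tool used for the similarity scores is not specified
here, so the following is only a simplified sketch in the spirit of [19, 20]:
spectrogram peaks ('landmarks') are extracted from two clips and the fraction
of coinciding peaks is reported as a percentage. The thresholds, neighbourhood
size, and crude truncation-based alignment are illustrative choices, not the
parameters of the tool used in this work.

```python
import numpy as np
import librosa
from scipy.ndimage import maximum_filter

def peak_mask(path, n_fft=2048, hop=512, neighbourhood=(15, 15), floor_db=-40):
    """Boolean spectrogram mask marking prominent local peaks ('landmarks')."""
    y, sr = librosa.load(path, sr=22050)
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)),
                                   ref=np.max)
    # A bin is a peak if it is the maximum of its local neighbourhood
    # and sits sufficiently far above the noise floor.
    return (maximum_filter(S_db, size=neighbourhood) == S_db) & (S_db > floor_db)

def fingerprint_similarity(path_a, path_b):
    """Percentage of spectrogram peaks shared by the two clips."""
    a, b = peak_mask(path_a), peak_mask(path_b)
    frames = min(a.shape[1], b.shape[1])        # crude truncation-based alignment
    a, b = a[:, :frames], b[:, :frames]
    shared = np.logical_and(a, b).sum()
    total = max(a.sum(), b.sum())
    return 100.0 * shared / total if total else 0.0

# e.g. fingerprint_similarity("human_phrase3.wav", "tacotron_phrase3.wav")
```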
3 Method
3.1 Data Collection and Preprocessing
An original dataset of 950 megabytes (1.6 hours, 902 .wav clips) of audio was col-
lected and preprocessed for the following experiments. This subsection describes
the processes involved. Due to security concerns, the dataset is not available and
is thus described in greater detail within this section.
The ‘Harvard Sentences’1 were suggested within the IEEE Recommended
Practices for Speech Quality Measurements in 1969 [21]. The set of 720
sentences and their important phonetic structures are derived from the IEEE
Recommended Practices and are often used as a measure of quality for Voice
over Internet Protocol (VoIP) services [22, 23]. All 720 sentences are recorded
by the subject, as well as tense or subject alternatives where available, i.e.
sentence 9, "Four hours of steady work faced us", was also recorded as "We
were faced with four hours of steady work".
The aforementioned IEEE best practices were based on ranges of phonetic
pangrams. A sentence or phrase that contains all of the letters of the alphabet
is known as a pangram. For example, ”The quick brown fox jumps over the lazy
dog” contains all of the English alphabetical characters at least once. A phonetic
pangram, on the other hand, is a sentence or phrase which contains examples
of all of the phonetic sounds of the language. For example, the phrase ”that
quick beige fox jumped in the air over each thin dog. Look out, I shout, for he’s
foiled you again, creating chaos” requires the pronunciation of every one of the
44 phonetic sounds that make up British English. 100 British-English phonetic
pangrams are recorded.
The final step of data collection was performed in order to extend the
approximately 500 MB of data closer to the 1 GB mark: random articles were
chosen from Wikipedia, and random sentences from those articles were recorded.
Ultimately, all of the data was transcribed into either raw English text or
phonetic structure (where lingual sounds are replaced by IPA symbols), in
order to provide a text input for every audio clip. From this, the two
datasets are produced, in order to compare the two pre-processing approaches.
All of the training takes place on the 2816 CUDA cores of an Nvidia GTX 980Ti
Graphics Processing Unit, with the exception of the Griffin-Lim algorithm,
which is executed on an AMD FX8320 8-core Central Processing Unit at a clock
speed of 3500 MHz.
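The paper does not specify the tool used for the IPA transcription, so the
following is only a minimal sketch of this preprocessing step, assuming the
third-party phonemizer package with an espeak backend as a convenient
grapheme-to-phoneme converter; the example sentence is taken from the Harvard
set above.

```python
from phonemizer import phonemize

sentence = "Four hours of steady work faced us."

# British English grapheme-to-phoneme conversion producing IPA symbols,
# which then replaces the raw text as the network's input representation.
ipa = phonemize(sentence, language="en-gb", backend="espeak", strip=True)
print(ipa)
```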
3.2 Fine Tune Training and Statistical Validation
The initial network is trained on the LJ Speech Dataset2 for 700,000 iterations.
The dataset contains 13,100 clips of a speaker reading from non-fiction books
1 https://www.cs.columbia.edu/~hgs/audio/harvard.html
2 https://keithito.com/LJ-Speech-Dataset/
along with a transcription. The longest clip is 10.1 seconds, the shortest is
1.1 seconds, and the average clip duration is 6.5 seconds. The speech is made
up of 13,821 unique words, with an average of 17 words per clip. Following
this, the two datasets of English language and English phonetics are
introduced, and fine-tune training of two different models occurs for 100,000
iterations each. Thus, in total, 800,000 learning iterations have been
performed, where the final 12.5% of the learning has been with the two
differing representations of English.
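A schematic, self-contained sketch of this fine-tune transfer learning pattern
is given below in PyTorch. A tiny stand-in network and random tensors replace
the real Tacotron and the speech data; only the training pattern (restore a
pre-trained checkpoint, then keep training on the new speaker's data) mirrors
the procedure described above, and the checkpoint file name is hypothetical.

```python
import torch
import torch.nn as nn

class TinySpectrogramPredictor(nn.Module):
    """Toy character-to-spectrogram model standing in for Tacotron."""
    def __init__(self, vocab=70, emb=64, hidden=128, mel_bins=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, mel_bins)

    def forward(self, chars):
        h, _ = self.rnn(self.embed(chars))
        return self.proj(h)                    # (batch, time, mel_bins)

model = TinySpectrogramPredictor()
# model.load_state_dict(torch.load("pretrained_700k.pt"))  # hypothetical checkpoint
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in fine-tuning batch: integer character IDs and target mel frames.
chars = torch.randint(0, 70, (4, 50))
target_mels = torch.randn(4, 50, 80)

model.train()
for step in range(100):                        # 100,000 iterations in the paper
    optimiser.zero_grad()
    loss = nn.functional.mse_loss(model(chars), target_mels)
    loss.backward()
    optimiser.step()
```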
Table 1. Ten Strings for Benchmark Testing which are Comprised of all English Sounds

ID   String
1    "Hello, how are you?"
2    "John bought three apples with his own money."
3    "Working at a University is an enlightening experience."
4    "My favourite colour is beige, what's yours?"
5    "The population of Birmingham is over a million people."
6    "Dinosaurs first appeared during the Triassic period."
7    "The sea shore is a relaxing place to spend One's time."
8    "The waters of the Loch impressed the French Queen"
9    "Arthur noticed the bright blue hue of the sky."
10   "Thank you for listening!"
For comparison of the two models, statistical fingerprint similarity is used.
This is because the quality of model outputs is a matter of opinion, i.e. how
realistic the speech sounds from a human point of view; this is not captured by
model benchmarking, and thus comparing the losses of the two model training
processes would yield no opinion-based measurement. To perform the comparison,
natural human speech is recorded by the subject that the models are trained to
imitate. The two models also both produce these phrases, and the fingerprint
similarities between the models and the real human are compared. A higher
similarity suggests a better ability of imitation, and thus better quality
speech produced by the model.

A set of 10 strings is presented in Table 1. Overall, this data includes all
sounds within the English language at least once. This validation data is
recorded by the human subject to be imitated, as well as by the speech
synthesis models. Each of the phrases is recorded three times by the subject,
and comparisons are made between the model and each of the three recordings,
comprising thirty tests per model.
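A minimal sketch of this validation protocol is shown below: the model's
rendering of each phrase is compared against each of the three human takes and
the thirty scores are averaged per model. The file naming scheme and the
similarity callable are placeholders, with the peak-based comparison sketched
in Section 2.3 as one possible choice.

```python
def phrase_scores(similarity, model_wav, human_takes):
    """Similarity of one synthesised phrase against each human take."""
    return [similarity(model_wav, take) for take in human_takes]

def evaluate(similarity, n_phrases=10, n_takes=3):
    """Average fingerprint similarity over all phrases and takes (30 tests)."""
    scores = []
    for p in range(1, n_phrases + 1):
        model_wav = f"model/phrase{p}.wav"                    # synthesised output
        human_takes = [f"human/phrase{p}_take{t}.wav" for t in range(1, n_takes + 1)]
        scores.extend(phrase_scores(similarity, model_wav, human_takes))
    return sum(scores) / len(scores)

# e.g. evaluate(fingerprint_similarity)
```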
Figures 1-3 show examples of spectrographic representations when both a human
being and a Tacotron network speak the sentence "Working at a University is an
Enlightening Experience". Though frequencies are slightly mismatched, in that
the network seems to predict higher frequencies than those in the human
speech, the peaks within the data that delineate individually-spoken words
Fig. 1. Spectrogram of ”Working at a University is an Enlightening Experience” when
spoken by a human being.
Fig. 2. Spectrogram of ”Working at a University is an Enlightening Experience” when
predicted by the English Written Text Tacotron network.
Fig. 3. Spectrogram of ”Working at a University is an Enlightening Experience” when
predicted by the Phonetically Aware Tacotron network.
are closely matched by the Tacotron prediction. Though the two predictions look
similar, the fingerprint similarity of the phonetically aware prediction is far
closer to the human than otherwise; this is because the fingerprint considers
the most important features rather than simply the distance between two
matrices of values. Additionally, the absolute timings of values are not
considered: the algorithm produces a best alignment of the pair of waves before
analysing their similarity. For example, the largest peak is the first syllable
of the word "University", and thus those two peaks would be compared, rather
than the differing data that would be compared if alignment had not been
performed. Therefore, silence before and after a spoken phrase is not
considered; rather, only the phrase from its initial inception to its final
termination.
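To make the alignment idea concrete, the following is a naive (and slow) sketch
that trims the leading offset between two clips by cross-correlating their
rectified waveforms; the actual fingerprinting tool performs its own internal
alignment, so this is only an illustration of the principle.

```python
import numpy as np
import librosa

def align(path_a, path_b, sr=22050):
    """Trim leading offset so the two clips start at the same acoustic event."""
    a, _ = librosa.load(path_a, sr=sr)
    b, _ = librosa.load(path_b, sr=sr)
    # Cross-correlate the rectified signals to estimate the relative time offset.
    corr = np.correlate(np.abs(a), np.abs(b), mode="full")
    lag = int(corr.argmax()) - (len(b) - 1)   # positive lag: a's content starts later
    if lag > 0:
        a = a[lag:]
    else:
        b = b[-lag:]
    n = min(len(a), len(b))
    return a[:n], b[:n]
```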
4 Preliminary Results
Within this section, the preliminary results are presented. Firstly, the acoustic
fingerprint similarities of the models and human voices are compared. Finally,
the average results for the two models are compared with one another.
Table 2. Thirty Similarity Tests Performed on the Raw English Speech Synthesis
Model with Averages of Sentences and Overall Average Scores. Failures are denoted
by F. Overall average is given as the average of experiments 1, 2 and 3.
Phrase   Exp. 1   Exp. 2   Exp. 3   Avg.
1        F        F        F        0 (3F)
2        22.22    22.22    66.67    37.02
3        56.6     56.7     75.4     62.9
4        0        51.28    0        17.09 (2F)
5        6        2        4        4
6        20       41.67    62.5     41.39
7        55.56    18.52    22.43    32.17
8        24.39    24.39    48.78    32.55
9        22.72    22.72    22.72    22.72
10       F        71.4     F        23.8 (2F)
Avg.     20.74    31.09    30.25    27.36
Table 2 shows the results for the tests on the raw English speech dataset.
Of the thirty experiments, 23% (7/30) were failures and had no semblance of
similarity to the natural speech. One phrase, phrase 1, was a total failure,
with all three experiments scoring zero. Overall, the generated data resembled
the human data by an average of 27.36%.
Table 3 shows the results for the tests on the phonetic English speech dataset.
Of the thirty experiments, 13% (4/30) were failures and had no semblance of
similarity to the natural speech; this was slightly lower than for the raw
English dataset.
Table 3. Thirty Similarity Tests Performed on the Phonetic English Speech Synthesis
Model with Averages of Sentences and Overall Average Scores. Failures are denoted
by F. Overall average is given as the average of experiments 1, 2 and 3.
Phrase   Exp. 1   Exp. 2   Exp. 3   Avg.
1        58.8     0        0        19.6 (2F)
2        85.7     28.57    57.14    57.14
3        93.7     78.12    46.88    72.9
4        51.28    25.64    25.64    34.19
5        38.46    38.46    39       38.64
6        35.71    35.71    17.8     29.74
7        34.5     34       17.2     28.57
8        43.4     21.7     43.48    36.19
9        20.4     20       26.4     22.27
10       0        41.6     0        13.89 (2F)
Avg.     46.19    32.38    27.35    35.31
This said, no phrase suffered a complete catastrophic failure in which all
three tests scored zero. On an average of the three experiments, the human data
and the generated data were 35.31% similar. Figure 4 shows the average
differences between the acoustic fingerprints of human and artificial data in
each of the ten sets of three experiments.
Comparing the head-to-head results, the phonetics dataset produced experiments
that on average outperformed the written language dataset in six out of ten
cases. This said, phrase nine was extremely close, with the two models
achieving 22.27% and 22.72%, a negligible difference of only 0.45%. In the
cases where the language set outperformed the phonetics set, the difference
between the two was much smaller than in the opposite outcomes. In terms of
preliminary results, the phonetic representation of language gained the best
results in human speech imitation when comparing the acoustic fingerprint
metrics. Often, inconsistencies occur in the similarity of human and
synthesised speech (in both approaches); this is likely due either to a lack of
data within the training and validation sets, or to there not being enough
training time to form a stable model that produces consistent output, or, of
course, a combination of the two. Further exploration via future
experimentation could pinpoint the cause of this inconsistency.
5 Future Work and Concluding Implications
Although the results suggest that the phoneme-aware approach is preliminarily
more promising than raw English notation, the phonetic approach faced a
disadvantage arising from the fine-tuning process. The pre-existing model was
trained on raw English language in a US dialect, and fine-tuned on raw
English language in a GB dialect, as well as on English phonetics in a GB
dialect.

Fig. 4. Comparison of the two approaches: average acoustic fingerprint
similarity (%) for each of the ten sets of three experiments, English Language
versus English Phonetics.

Thus,
the phonetic model would require more training in order to overcome the
disadvantaged starting point it faced. For a fairer comparison, future models
should be trained from an initial random distribution of network weights for
their respective datasets. In addition to this, it must be pointed out that
input data from the English written-text dataset had 26 unique alphabetic
values, whereas this is extended in the second dataset, since there are 44
unique phonemes that make up the spoken English language in a British dialect.
Statistical validation
through the comparison of acoustic fingerprints is considered, with
similarities to real speech compared on the same input sentence or phrase.
Though an acoustic fingerprint does give a concrete comparison between pairs of
output data, human opinion is still not properly reflected. For this, as the
Tacotron paper did, a Mean Opinion Score (MOS) test should also be performed.
MOS is given as $MOS = \frac{1}{N}\sum_{n=1}^{N} R_n$, where $R_n$ are the
rating scores given by a subject group of $N$ participants. Thus, this is
simply the average rating given by the audience.
MOS requires a large audience to give their opinions, denoted by a nominal
score, in order to rate the networks in terms of human hearing; human hearing
and auditory understanding are abilities a Turing Machine does not yet have.
Such an MOS test would allow a second metric, real opinion, to also provide a
score. A multi-objective problem is then presented through the maximisation of
acoustic fingerprint similarity as well as the opinion of the audience.
Additionally,
other spectrogram prediction paradigms such as Tacotron2 and DCTTS should
be studied in terms of the effects of English vs. Phonetic English.
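As a small worked example of the MOS formula above, with ratings invented
purely to illustrate the calculation:

```python
# N = 8 listeners each give a nominal score from 1 (bad) to 5 (excellent).
ratings = [4, 5, 3, 4, 4, 5, 2, 4]
mos = sum(ratings) / len(ratings)   # average rating over the audience
print(f"MOS = {mos:.2f}")           # 31 / 8 = 3.88
```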
As mentioned in the previous section, further work should also be performed
into pinpointing the cause of inconsistent output from the models. Explorations
into the effects of there being a larger dataset as well as more training time
for the model could discover the cause of inconsistency and help to produce a
stronger training paradigm for speech synthesis.
To conclude, 100,000 extra iterations of fine-tune training, on top of a model
pre-trained on a publicly available dataset, using a human dataset of only 1.6
hours' worth of speech translated to phonetic structure, have produced a
network with the ability to reproduce new speech at 35.31% similarity. It is
not out of the question whatsoever for post-processing to render such data
completely realistic, which could then be 'leaked' to the media, to the law,
or otherwise. Such findings present a dangerous situation, in which a person's
speech could be imitated to create the illusion of evidence that they have
said things that in reality they have not. Though this paper serves primarily
as a method of maximising artificial imitative abilities, it should also serve
as a grave warning in order to minimise the potential implications for an
individual's life. Future information security research should, and arguably
must, discover competing methods for the detection of spoofed speech in order
to prevent such cases. On the other hand, realistic speech synthesis could be
used in real time for more positive ends, such as an augmented voice for those
suffering from an illness that could result in the loss of the ability to
speak.
References
1. A. M. Turing, Computing Machinery and Intelligence. 1950.
2. L. Locock, S. Ziebland, and C. Dumelow, “Biographical disruption, abruption and
repair in the context of motor neurone disease,” Sociology of health & illness,
vol. 31, no. 7, pp. 1043–1058, 2009.
3. J. Yamagishi, C. Veaux, S. King, and S. Renals, “Speech synthesis technologies for
individuals with vocal disabilities: Voice banking and reconstruction,” Acoustical
Science and Technology, vol. 33, no. 1, pp. 1–5, 2012.
4. A. C. Baugh and T. Cable, A history of the English language. Routledge, 1993.
5. H. R. Loyn, Anglo Saxon England and the Norman Conquest. Routledge, 2014.
6. V. Fromkin, R. Rodman, and N. Hyams, An Introduction to Language. Cengage,
2006.
7. I. R. Titze and D. W. Martin, Principles of voice production. 1994.
8. W. Menzel, E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton, and
C. Souter, “The isle corpus of non-native spoken english,” in Proceedings of LREC
2000: Language Resources and Evaluation Conference, vol. 2, pp. 957–964, Euro-
pean Language Resources Association, 2000.
9. J. J. Bird, E. Wanner, A. Ekart, and D. R. Faria, “Phoneme aware speech recog-
nition through evolutionary optimisation,” in The Genetic and Evolutionary Com-
putation Conference, GECCO, 2019.
10. Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang,
Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-to-end speech synthe-
sis,” arXiv preprint arXiv:1703.10135, 2017.
11. H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech
system based on deep convolutional networks with guided attention,” in 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 4784–4788, IEEE, 2018.
12. H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural
network architectures for large scale acoustic modeling,” in Fifteenth annual con-
ference of the international speech communication association, 2014.
13. X. Li and X. Wu, “Constructing long short-term memory based deep recurrent neu-
ral networks for large vocabulary speech recognition,” in 2015 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4520–4524,
IEEE, 2015.
14. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly
learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
15. D. Griffin and J. Lim, “Signal estimation from modified short-time fourier trans-
form,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32,
no. 2, pp. 236–243, 1984.
16. E. Sejdić, I. Djurović, and J. Jiang, “Time–frequency feature representation using
energy concentration: An overview of recent advances,” Digital Signal Processing,
vol. 19, no. 1, pp. 153–183, 2009.
17. R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J.
Weiss, R. Clark, and R. A. Saurous, “Towards end-to-end prosody transfer for
expressive speech synthesis with tacotron,” international conference on machine
learning, pp. 4693–4702, 2018.
18. M. Zhang, X. Wang, F. Fang, H. Li, and J. Yamagishi, “Joint training framework
for text-to-speech and voice conversion using multi-source tacotron and wavenet,”
arXiv preprint arXiv:1903.12389, 2019.
19. J. Bormans, J. Gelissen, and A. Perkis, “Mpeg-21: The 21st century multimedia
framework,” IEEE Signal Processing Magazine, vol. 20, no. 2, pp. 53–62, 2003.
20. A. Wang et al., “An industrial strength audio search algorithm,” in ISMIR,
vol. 2003, pp. 7–13, Washington, DC, 2003.
21. IEEE, IEEE Transactions on Audio and Electroacoustics, vol. 21. IEEE, 1973.
22. K. Yochanang, T. Daengsi, T. Triyason, and P. Wuttidittachotti, “A compara-
tive study of voip quality measurement from g. 711 and g. 729 using pesq and
thai speech,” in International Conference on Advances in Information Technology,
pp. 242–255, Springer, 2013.
23. N. Yankelovich, J. Kaplan, J. Provino, M. Wessler, and J. M. DiMicco, “Improving
audio conferencing: are two ears better than one?,” in Proceedings of the 2006
20th anniversary conference on Computer supported cooperative work, pp. 333–
342, ACM, 2006.