BUDAPEST UNIVERSITY OF TECHNOLOGY AND ECONOMICS
DEPARTMENT OF TELECOMMUNICATIONS AND MEDIA INFORMATICS
Novel NLP Methods
for Improved Text-To-Speech Synthesis
Sevinj Yolchuyeva, MSc
Ph.D. Dissertation
Doctoral School of Informatics
Supervisors
Bálint Gyires-Tóth, Ph.D.
Géza Németh, Ph.D.
Budapest, Hungary
February 2021
Declaration
I, Sevinj Yolchuyeva, hereby declare that this dissertation and all results claimed
therein are my own work and rely solely on the references given. All segments taken
word-by-word, or with the same meaning, from others have been clearly marked as
citations and included in the references.
Name: Sevinj Yolchuyeva
Date: February 7th, 2021
Abstract
TTS (Text-to-Speech) is one of the main elements of human-machine interaction
systems. As the name suggests, a text-to-speech system converts text into spoken audio; thus, a machine (such as a robot) can interact with its environment using speech. Generally, there are two phases in a TTS system. The first phase is the
text-processing phase, where the input text is transcribed into a phonetic
representation with optional meta-data (e.g. stress labels). This process is based on
natural language processing (NLP) methodology. The other phase is the generation of
audio waveform from the phonetic representations. Some essential steps of the first
phase are preprocessing, morphological analysis, contextual analysis, syntactic
analysis, phonetization and prosody generation.
The goal of my dissertation is to introduce novel NLP methods that directly or indirectly serve to improve TTS synthesis. These methods are also
useful for automatic speech recognition (ASR) and dialogue systems. In my
dissertation, I cover three different tasks: Grapheme-to-phoneme Conversion (G2P),
Text Normalization and Intent Detection. These tasks are important for any TTS
system explicitly or implicitly.
As the first approach, I investigate convolutional neural networks (CNN) for G2P
conversion. I propose a novel CNN-based sequence-to-sequence (seq2seq)
architecture. My approach includes an end-to-end CNN-based G2P converter with residual connections and, furthermore, a model that utilizes a convolutional neural network (with and without residual connections) as encoder and a Bi-LSTM as decoder. As
the second approach, I investigate the application of the transformer architecture to
G2P conversion and compare its performance with recurrent and convolutional neural
network-based state-of-the-art approaches. Besides TTS systems, G2P conversion has also been widely adopted for other systems such as computer-assisted language learning, automatic speech recognition, speech-to-speech machine translation, spoken term detection and spoken document retrieval.
When using a standard TTS system to read messages, many problems arise due to
phenomena in messages, e.g., usage of abbreviations, emoticons, informal
capitalization and punctuation. These problems also exist in other domains, such as
blogs, forums, social network websites, chat rooms, message boards, and
communication between players in online video game chat systems. Normalization of
the text addresses this challenge. I developed a novel CNN-based model and evaluated
this model on an open dataset. The performance of CNNs is compared with a variety
of different Long Short-Term Memory (LSTM) and bi-directional LSTM (Bi-LSTM)
architectures on the same dataset.
The number of human-bot systems driven by either voice or text has increased
exponentially in recent years. Intent detection forms an integral component of such
dialogue systems. For intent detection, I developed novel models that utilize an end-to-end CNN architecture with residual connections and the combination of a Bi-LSTM and a Self-attention Network (SAN). I also evaluated these models on various datasets.
Table of Contents
Declaration ............................................................................................................. - 3 -
Abstract .................................................................................................................. - 3 -
Table of Contents .................................................................................................. - 5 -
Chapter 1 Introduction ...................................................................................... - 7 -
1.1 Overview .................................................................................................. - 7 -
1.2 Thesis Structure ........................................................................................ - 8 -
Chapter 2 Deep Learning Background ............................................................ - 9 -
2.1 Introduction .............................................................................................. - 9 -
2.2 Deep Learning Methods ......................................................................... - 10 -
2.3 Loss Functions ........................................................................................ - 11 -
2.4 Recurrent Neural Networks (RNNs) ...................................................... - 13 -
2.5 Long-Short Term Memory (LSTM) ....................................................... - 14 -
2.6 Bidirectional Long-Short Term Memory (Bi-LSTM) ............................ - 15 -
2.7 Convolutional Neural Networks ............................................................. - 16 -
2.8 Vector-based Word Representations ...................................................... - 18 -
2.9 Sequence-to-sequence Learning ............................................................. - 20 -
2.10 End-to-end training ................................................................................ - 20 -
2.11 Attention Mechanism ............................................................................. - 21 -
2.12 Self-Attention Networks ........................................................................ - 22 -
2.13 Transformer Neural Network ................................................................. - 23 -
Chapter 3 Grapheme-to-Phoneme Conversion ............................................. - 24 -
3.1 Introduction ............................................................................................ - 24 -
3.2 Related Works ........................................................................................ - 25 -
3.3 Research Methodology ........................................................................... - 27 -
3.4 CNNs for Grapheme-to-Phoneme Conversion ...................................... - 28 -
3.5 Transformer Neural Network for Grapheme-to-Phoneme Conversion .. - 42 -
3.6 Conclusions ............................................................................................ - 48 -
Chapter 4 Text Normalization ........................................................................ - 49 -
4.1 Introduction ............................................................................................ - 49 -
4.2 Related Works ........................................................................................ - 50 -
4.3 Research Methodology ........................................................................... - 52 -
4.4 LSTM-, Bi-LSTM- and CNN-based model for Text Normalization ..... - 55 -
4.5 Conclusions ............................................................................................ - 63 -
Chapter 5 Intent Detection .............................................................................. - 64 -
5.1 Introduction ............................................................................................ - 64 -
5.2 Related Works ........................................................................................ - 65 -
5.3 Research Methodology ........................................................................... - 66 -
5.4 Self-Attention Networks for Intent Detection ........................................ - 69 -
5.5 Conclusions ............................................................................................ - 74 -
Chapter 6 Applicability of the results ............................................................ - 75 -
Chapter 7 Summary of the theses ................................................................... - 77 -
Acknowledgements ............................................................................................. - 80 -
List of Figures ...................................................................................................... - 81 -
List of Tables ....................................................................................................... - 82 -
List of Abbreviations .......................................................................................... - 83 -
References ............................................................................................................ - 85 -
List of publications by the author ...................................................................... - 98 -
Citations ............................................................................................................. - 100 -
Chapter 1
Introduction
1.1 Overview
Text-to-Speech (TTS) technology generates synthetic voice using textual information
only. Thus, it may serve as a more natural interface in human-machine interaction.
TTS is a useful tool in many application areas such as digital personal assistants,
dialogue systems, talking solutions for blind people, people who have difficulties in
spelling (dyslexics), teaching aids, text reading, talking audiobooks and toys. Over the
last years, significant research progress was achieved in this field. Generally, state-of-
the-art TTS is either based on unit selection or statistical parametric methods.
Particular attention has been paid to Deep Neural Network (DNN)-based TTS lately,
due to its advantages in flexibility, robustness and small footprint. Among the essential
properties of a speech synthesis system are naturalness and intelligibility. Naturalness
expresses to what extent the output approaches human speech, whereas intelligibility
is the easiness with which the information content can be understood. Text-to-Speech
systems may be divided into two subsystems: natural language processing-based text
processing and speech generation. Natural Language Processing (NLP) derives from
the combination of linguistic and computer sciences. It mainly contains three steps for
TTS systems: text analysis, phonetic analysis and prosodic analysis. Text analysis
includes segmentation, text normalization and Part-of-Speech (POS) tagging.
Phonetic conversion assigns a phonetic transcription to each word. There are several approaches to phonetic conversion; the two main directions are rule- and dictionary-based approaches, and data-driven statistical and machine learning approaches. Prosodic analysis
performs intonation, amplitude, and duration modelling of speech. The NLP
subsystem has a great influence on the achievable performance of the whole TTS
system. The communicative context of the system is typically either determined a priori (domain-specific TTS synthesis) or ignored.
In this dissertation, I consider three areas related to TTS: Text Normalization, Grapheme-to-Phoneme Conversion and Intent Detection. Grapheme-to-Phoneme (G2P) conversion
is the task of predicting the pronunciation of a word given its graphemic or written
form. It is a highly important part of both automatic speech recognition (ASR) and
text-to-speech (TTS) systems. The G2P model’s quality has a great influence on the
overall quality of speech. Inaccurate G2P conversion results in unnatural
pronunciation, or even incomprehensible synthetic speech. TTS systems need to work
with texts that contain non-standard words, including numbers, dates, currency
amounts, abbreviations and acronyms. For that reason, text normalization is an
essential task for a TTS system to convert written-form texts to spoken-form strings.
Furthermore, Intent Detection is a prominent task for conversational assistants like Amazon Alexa, Google Now, etc., and for dialogue systems. Significant
improvements in TTS and intent detection may improve the performance of
conversational assistant devices.
1.2 Thesis Structure
In the following, I present my results in three thesis groups as separate chapters of
the dissertation. At the end of each chapter, the summary of the results is formed into
thesis statements. The dissertation is organized as follows:
Chapter 2 describes background material and literature review for deep
learning.
Chapter 3 presents several models for grapheme-to-phoneme (G2P)
conversion. This part of my research introduces and evaluates novel convolutional
neural network (CNN) based and Transformer architecture based G2P approaches.
The suggested methods approach the accuracy of previous state-of-the-art results in
terms of phoneme error rate.
Chapter 4 presents the investigated models for text normalization. I developed
CNN-based text normalization, and the training, inference times, accuracy, precision,
recall, and F1-score were evaluated on an open dataset. The performance of CNNs is
evaluated and compared with a variety of different Long Short-Term Memory (LSTM)
and Bi-LSTM architectures on the same dataset.
Chapter 5 presents various models for intent detection. I developed novel models, which utilize the combination of Bi-LSTM and Self-attention Network (SAN) for this task. The models were evaluated in experiments on different datasets.
Chapter 6 describes the applicability of my results.
Chapter 7 provides a short overview of my theses, emphasizing the most important conclusions, and raises some possible future directions.
Chapter 2
Deep Learning Background
2.1 Introduction
Machine learning methods can work surprisingly well with adequate human-designed
representations and input features. Deep learning has become one of the main research
directions in the machine learning area in recent years. It can effectively capture the
hidden internal structures of data and use more powerful modelling capabilities to
characterize the data. Deep learning attempts to model data abstraction using multiple
hidden layers of the neural network. Deep learning has fundamentally changed the
landscape of many areas in artificial intelligence, including speech processing, image
processing, text processing, and dialogue systems. For example, with large-scale
training data, deep neural networks achieved significantly lower recognition errors
than the traditional approaches in speech recognition systems. Many areas of NLP,
including language understanding and dialogue, information retrieval, question
answering from the text, language generation, lexical analysis and parsing, and text
sentiment analysis, have also seen significant progress using deep learning.
This chapter provides the necessary background and literature review of neural
networks for the thesis. Section 2.2 and Section 2.3 present deep learning techniques
and loss functions, respectively. Section 2.4 describes recurrent neural networks. In
Section 2.5 and Section 2.6, Long-Short Term Memory (LSTM) and Bidirectional
Long-Short Term Memory (Bi-LSTM) are presented, respectively. Section 2.7
describes Convolutional neural networks. Section 2.8, Section 2.9 and Section 2.10
introduce the overview of word embedding, sequence-to-sequence learning and end-
to-end learning, respectively. The attention mechanism and one of its variants, the self-attention network, are introduced in Section 2.11 and Section 2.12. Finally, Section 2.13 presents the Transformer neural network.
2.2 Deep Learning Methods
Various methods are used throughout my work to create robust deep learning models,
including adaptive learning rate, dropout, batch normalization, residual connections,
transfer learning, and max-pooling. The main aspects of these methods are as follows.
Adaptive learning rate: The adaptive learning rate method is the process of changing the learning rate during training to increase performance and reduce training time. The most
common adaptations of learning rate during training include techniques to reduce the
learning rate over time [145], referred to as learning rate decay or annealing.
Dropout: An option to address overfitting in deep neural networks is the dropout
technique. This method is applied by randomly dropping units and the corresponding
parameters in deep neural networks during training [143]. Training a network with dropout is similar to training an ensemble of networks, which enables dropout to achieve better generalization.
Batch normalization: Batch normalization is a technique that normalizes activations
in intermediate layers of deep neural networks [50]. It serves to speed up training and
to make learning easier. For instance, applied to a state-of-the-art image classification model, batch normalization achieves the same accuracy with 14 times fewer training steps and beats the original model by a significant margin [50].
Residual connections: Residual connections, blocks or units are made of a set of
stacked layers, where the inputs are added to their outputs with the aim of creating
identity mappings. These connections facilitate the training of very deep neural
networks, as the gradient is able to flow through these connections without vanishing
[48, 79].
Transfer learning: In transfer learning, a model trained on a particular task is
exploited on another related task. The knowledge obtained while solving a particular
problem can be transferred to another network, which is to be trained further on a
related problem. This allows for rapid progress and enhanced performance while
solving the second problem [142]. It can be used to accelerate the training of neural
networks as either a weight initialization scheme or feature extraction method. In some
cases, there is not enough data to train the models. Training a model from scratch with an insufficient amount of data would result in lower performance; starting from a pre-trained model helps to obtain better results.
Max-Pooling: In max-pooling, a filter of predefined size is applied across sub-regions of the input, taking the maximum value of each region. Dimensions and computational costs can be reduced by max-pooling [144, 145].
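To illustrate how several of these methods are combined in practice, the following minimal Keras sketch builds a one-dimensional residual block with batch normalization, dropout and max-pooling, trained with an adaptive learning rate optimizer (Adam). The layer sizes and the framework choice are illustrative assumptions only, not the exact configurations used in the later chapters.

```python
import tensorflow as tf

def residual_block(x, filters=64, kernel_size=3, dropout_rate=0.3):
    """A 1D residual block: two convolutions with batch normalization and
    dropout, whose output is added back to a (projected) copy of the input."""
    shortcut = tf.keras.layers.Conv1D(filters, 1, padding="same")(x)  # match channel count
    y = tf.keras.layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Dropout(dropout_rate)(y)
    y = tf.keras.layers.Conv1D(filters, kernel_size, padding="same")(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Add()([shortcut, y])      # residual (identity) connection
    return tf.keras.layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(20, 32))           # toy sequence: length 20, 32 features
h = residual_block(inputs)
h = tf.keras.layers.MaxPooling1D(pool_size=2)(h)  # max-pooling halves the time axis
outputs = tf.keras.layers.Dense(10, activation="softmax")(tf.keras.layers.Flatten()(h))
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # Adam adapts the learning rate
              loss="categorical_crossentropy")
```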
2.3 Loss Functions
Loss functions define the overall error of machine learning algorithms and thereby make it possible to improve their performance. They can be grouped into two major categories according to the types of problems we come across in the real world: classification and regression. In classification, the task is to predict the respective probabilities of all classes that the problem is dealing with. In regression, by contrast, the task is to predict a continuous value from a given set of independent features supplied to the learning algorithm [152].
The most commonly used loss functions in regression modelling are:
Mean Square Loss
It is the most frequently used regression loss and is computed by taking the average squared difference between the actual and predicted observations. It mainly takes into consideration the average magnitude of the error, ignoring its direction.

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.1)
Mean Absolute Error
It is computed by taking the average of the absolute differences between the true and predicted values. Similarly to MSE, it considers only the magnitude of the error, ignoring its direction.

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (2.2)
Huber Loss
It is also called Smooth Mean Absolute Error. This loss function is defined as the combination of MSE and MAE and is controlled by a hyperparameter δ. The parameter δ defines a threshold (based on the distance between target and prediction), making the loss function switch from a squared error to an absolute one.

L_{\delta}(y_i, \hat{y}_i) = \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2 & \text{if } |y_i - \hat{y}_i| \le \delta \\ \delta\,|y_i - \hat{y}_i| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases} \qquad (2.3)
Log-Cosh Loss
It is defined as the logarithm of the hyperbolic cosine of the prediction error. It is another function used in regression tasks, which is much smoother than the MSE loss.

L(y, \hat{y}) = \sum_{i=1}^{n} \log\left(\cosh(\hat{y}_i - y_i)\right) \qquad (2.4)
In classification modelling, the most commonly used loss functions are the following:
Hinge Loss
One of the loss functions for binary classification tasks is the hinge loss, which was initially developed for use with support vector machine models [153]. It is recommended for binary classification tasks where the target labels are in {-1, 1}.

L(y_i, \hat{y}_i) = \max(0,\; 1 - y_i \cdot \hat{y}_i) \qquad (2.5)
Cross-Entropy Loss / Log Loss
This is the most common loss function used in classification problems. The
cross-entropy loss decreases as the predicted probability converges to the
actual label. It measures the performance of a classification model whose
predicted output is a probability value between 0 and 1. If there are two classes,
the cross-entropy loss is calculated by equation (2.6).

L = -\frac{1}{n}\sum_{i=1}^{n} \left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right] \qquad (2.6)

If the number of classes is larger than two, a separate loss for each class label per observation must be calculated and the results must be summed up, as in equation (2.7):

L = -\frac{1}{n}\sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c}\, \log(\hat{y}_{i,c}) \qquad (2.7)

In equations (2.1)-(2.7), y_i is the ground-truth label indicator (or target value) and \hat{y}_i is the predicted output of the i-th sample.
Moreover, cross-entropy loss together with softmax is arguably one of the most common components for classification with neural networks. The softmax is used to calculate the probability distribution of a particular class over C different classes. It returns outputs in the range of 0 to 1, with all probabilities summing to 1 in multi-class classification tasks. The softmax is frequently appended to the last layer of the classification model.
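The losses and the softmax described above can be expressed compactly; the following numpy sketch implements equations (2.1)-(2.7) as reconstructed here, with illustrative variable names.

```python
import numpy as np

def mse(y, y_hat):                      # equation (2.1)
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):                      # equation (2.2)
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):         # equation (2.3)
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err ** 2
    linear = delta * err - 0.5 * delta ** 2
    return np.mean(np.where(err <= delta, quadratic, linear))

def log_cosh(y, y_hat):                 # equation (2.4)
    return np.sum(np.log(np.cosh(y_hat - y)))

def hinge(y, y_hat):                    # equation (2.5), labels y in {-1, +1}
    return np.mean(np.maximum(0.0, 1.0 - y * y_hat))

def binary_cross_entropy(y, p):         # equation (2.6), p = predicted probability
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax(z):                         # probability distribution over the classes
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def categorical_cross_entropy(y_onehot, z):   # equation (2.7), z = raw class scores
    p = softmax(z)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=-1))
```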
2.4 Recurrent Neural Networks (RNNs)
Recurrent neural networks (RNNs) have shown promising results in various NLP
tasks. They are capable of learning features and long-term dependencies from
sequential and time-series data. RNNs and their variants such as Long Short-Term
Memory (LSTM) and Gated Recurrent Unit (GRU) have presented success in various
NLP tasks, such as language modelling, sentiment analysis, relation extraction, slot
filling, semantic textual similarity and machine translation [100, 101, 102].
Furthermore, various versions of RNNs are used for speech processing and image generation [103, 104].
In this section, I will concentrate on simple RNN models for the brevity of notation.
Given an input sequence x = (x_1, ..., x_T) of length T, a simple RNN is formed by a repeated application of a function f. This generates a hidden state h_t from the current input x_t and the previous h_{t-1} for time step t:

h_t = f(W_h h_{t-1} + W_x x_t + b_h) \qquad (2.8)

for some non-linearity f (e.g. tanh). The model output can be defined as

y_t = W_y h_t + b_y \qquad (2.9)

Here W_h, W_x, W_y and b_h, b_y are weight matrices and bias vectors (offsets) shared throughout the sequence, respectively.
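A minimal numpy sketch of equations (2.8)-(2.9), with tanh as the non-linearity and toy dimensions chosen only for illustration:

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, W_y, b_h, b_y):
    """Unrolls h_t = tanh(W_h h_{t-1} + W_x x_t + b_h) and y_t = W_y h_t + b_y.
    The weights and biases are shared across all time steps."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in x_seq:                              # iterate over the time steps
        h = np.tanh(W_h @ h + W_x @ x_t + b_h)
        outputs.append(W_y @ h + b_y)
    return np.stack(outputs), h

# toy sizes: input 4, hidden 8, output 3, sequence length 5
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(8, 8)), rng.normal(size=(8, 4))
W_y, b_h, b_y = rng.normal(size=(3, 8)), np.zeros(8), np.zeros(3)
y_seq, h_last = rnn_forward(rng.normal(size=(5, 4)), W_h, W_x, W_y, b_h, b_y)
```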
There are two widely known issues with properly training RNNs, the vanishing and
the exploding gradient problems. The consequence of these problems is that it is
difficult to capture long term dependencies during training. The exploding gradient
problem refers to a large increase in the norm of the gradient during training. The
vanishing gradient problem refers to the opposite behaviour, when long-term components go exponentially fast to norm 0. In this case, it is impossible for the model to learn the correlation between temporally distant dependencies [105].
2.5 Long-Short Term Memory (LSTM)
Long Short-Term Memory networks (LSTM) are a special kind of RNN, capable of
learning long-term dependencies [3,4]. This module has a memory cell that can store
past information. An LSTM unit takes as input its previous cell and hidden states and
outputs its new cell and hidden states. More formally, the LSTM unit is composed of
four gates, interacting in a special way. The gates of an LSTM unit are computed as
follows [4]:
• input gate layer: i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (2.10)
• forget gate layer: f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (2.11)
• output gate layer: o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (2.12)
• cell state candidate layer: \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (2.13)
Figure 2.1. A basic representation of the LSTM cell [3].
The input gate i_t defines the degree to which the current input information is added to the memory cell. The forget gate f_t determines the extent to which the existing memory is forgotten. The output gate o_t of each LSTM unit at time t is computed to get the output memory. Next, the information in the memory cell is updated through partial forgetting of the information stored in the previous memory cell c_{t-1} via the following processing step:

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (2.14)

where \odot denotes the element-wise product of two vectors. Lastly, the output hidden state h_t is updated based on the computed cell state c_t:

h_t = o_t \odot \tanh(c_t) \qquad (2.15)

Network input weights W_i, W_f, W_o, W_c, recurrent weights U_i, U_f, U_o, U_c and biases b_i, b_f, b_o, b_c are learnable parameters.
Compared with the standard RNN, LSTM effectively avoids the vanishing gradient
problem by introducing the gate mechanism, which is advantageous in dealing with
long-term dependencies. In other words, LSTM has a mechanism consisting of gating
units to control how to manage the flow of information.
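A minimal numpy sketch of a single LSTM step following equations (2.10)-(2.15); the gate weights are random toy parameters used only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the input weights, recurrent weights and
    biases of the input (i), forget (f), output (o) and candidate (c) gates."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate  (2.10)
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate (2.11)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate (2.12)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate   (2.13)
    c = f * c_prev + i * c_tilde                                # cell update (2.14)
    h = o * np.tanh(c)                                          # hidden state (2.15)
    return h, c

# toy sizes: input 4, hidden 6
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(6, 4)) for k in "ifoc"}
U = {k: rng.normal(size=(6, 6)) for k in "ifoc"}
b = {k: np.zeros(6) for k in "ifoc"}
h, c = lstm_step(rng.normal(size=4), np.zeros(6), np.zeros(6), W, U, b)
```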
2.6 Bidirectional Long-Short Term Memory (Bi-LSTM)
Bidirectional LSTM (Bi-LSTM) [5, 6] processes input sequences in both directions
with two sub-layers to account for the full input context. For each time step t, these two sub-layers compute the forward hidden sequence \overrightarrow{h}_t and the backward hidden sequence \overleftarrow{h}_t according to the following equations [6]:

\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}}\, x_t + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}) \qquad (2.16)

\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}}\, x_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}) \qquad (2.17)

In Equation (2.16) the forward layer iterates from t = 1 to T; in Equation (2.17) the backward layer iterates from t = T to 1; \mathcal{H} is an element-wise sigmoid function.
As the next step, the hidden states of these two LSTMs are concatenated to form an annotation sequence h = (h_1, ..., h_T), where h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t] encodes information about the t-th element of the sequence with respect to all the other surrounding elements in the input. W_{x\overrightarrow{h}}, W_{\overrightarrow{h}\overrightarrow{h}}, W_{x\overleftarrow{h}} and W_{\overleftarrow{h}\overleftarrow{h}} are weight matrices; b_{\overrightarrow{h}} and b_{\overleftarrow{h}} denote the bias vectors. Generally, in all parameters, arrows pointing left-to-right and right-to-left indicate the forward and backward layers, respectively.
One drawback of Bi-LSTM is that the entire sequence must be available before it can
make predictions. For some applications such as real-time speech recognition, the
entire utterance may not be available, and thus Bi-LSTM is not adequate. But for
several NLP applications where the entire sentence is available at the same time, the
standard Bi-LSTM algorithm is effective. Moreover, Bi-LSTM is slower than LSTM
since the results of the forward pass must be available for the backward pass to
proceed.
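A minimal sketch of a Bi-LSTM layer in Keras, assuming illustrative vocabulary and layer sizes; the forward and backward hidden states are concatenated at each time step:

```python
import tensorflow as tf

# A Bi-LSTM encoder over an embedded symbol sequence (placeholder sizes).
inputs = tf.keras.Input(shape=(30,), dtype="int32")
emb = tf.keras.layers.Embedding(input_dim=50, output_dim=64, mask_zero=True)(inputs)
# return_sequences=True keeps one concatenated [forward; backward] state per step
states = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(emb)
model = tf.keras.Model(inputs, states)
print(model.output_shape)   # (None, 30, 256): forward and backward states concatenated
```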
2.7 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a special kind of neural networks for
processing data that has a temporal or spatial correlation. CNNs are used in various
fields, including image [24, 25], object [9, 10, 11] and handwriting recognition [11,
12], face verification [13], machine translation [69], speech synthesis [125].
The architecture of vanilla CNN is composed of many layer types (such as
convolutional layers, pooling layers, fully connected layers, etc.) where each layer
carries out a specific function (shown in Figure 2.2.).
Figure 2.2. General architecture of CNN.
A convolution layer is a fundamental component of the CNN architecture that slides
through the input using at least one filter (also called the kernel), performing a
convolution operation between each input area and the filter. The results will be stored
in activation maps (also called feature maps), which are the convolution layer output.
Importantly, the activation maps contain the features that the various kernels extracted.
Some hyperparameters have to be specified in order to generate the activation maps
of a certain size. Main attributes include [115]:
1. Size of filters (F). The filter will perform a convolution operation with a region
matching its size from the input and produce results in its activation map.
2. Stride (S). This parameter defines the distance between two successive filter
positions on the input of the convolutional layer. The common choice of a stride is 1;
however, a stride larger than 1 is sometimes used to achieve downsampling of the
activation maps.
3. Zero-padding (P). This parameter is used to specify how many zeros one wants to
pad around the border of the input. This is usually done to match the output dimension
with the input dimension of the convolutional layer.
These three parameters are the most common hyperparameters used for controlling
the output dimension of a convolutional layer. For an input with dimensions W_1 \times H_1 \times D_1, the dimensions W_2 \times H_2 \times D_2 of the output feature map are given by the following equations:

W_2 = (W_1 - F + 2P)/S + 1 \qquad (2.18)

H_2 = (H_1 - F + 2P)/S + 1 \qquad (2.19)

D_2 = K \qquad (2.20)

In (2.18) and (2.19), F is the size of the filters, and in (2.20) K is the number of filters.
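A short sketch that evaluates equations (2.18)-(2.20) for a given filter size, stride, zero-padding and number of filters (the example values are illustrative):

```python
def conv_output_shape(w1, h1, d1, f, s, p, k):
    """Feature map dimensions from equations (2.18)-(2.20) for filter size f,
    stride s, zero-padding p and k filters."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return w2, h2, k

# e.g. a 32x32x3 input with 3x3 filters, stride 1, padding 1 and 16 filters
print(conv_output_shape(32, 32, 3, f=3, s=1, p=1, k=16))   # (32, 32, 16)
```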
A pooling layer is usually applied after a convolutional layer. The major advantage of
using the pooling technique is that it remarkably reduces the number of trainable
parameters and introduces translation invariance [115, 116]. The most common way
to do pooling is to apply a max operation to the result of each filter. However, various
types of pooling methods exist, e.g., averaging pooling, fractional max-pooling and
stochastic pooling.
The output feature maps of the final convolution or pooling layer are typically
flattened, i.e., transformed into a one-dimensional (1D) array of numbers (or vector),
and connected to fully connected layers, also known as dense layers [115]. The final
fully connected layer typically has the same number of output nodes as the number of
classes or the target dimension of a regression. In other words, the purpose of the fully
connected layer is to match the output to the modeling purpose.
Deep learning frameworks provide different variants of convolutional neural networks, i.e. 1D-CNNs, 2D-CNNs and 3D-CNNs. Text has patterns along a single spatial dimension, and 1D-CNNs are therefore well suited to tasks that use text as input. Another domain that benefits from 1D-CNNs is time series modelling. For tasks that use images or videos as inputs, it is more common to apply 2D-CNNs than 1D-CNNs or 3D-CNNs.
One of the main reasons that make convolutional neural networks superior to previous methods is that CNNs perform very effective representation learning, which considers spatial and temporal relations, jointly with modelling. Thus, a quasi-optimal representation is extracted from the input data for the machine learning model. Weight
sharing in the convolutional layers is also a key element. Thus, the model becomes
spatially tolerant: similar representations are learned in different regions of the input,
and the total number of parameters can be significantly reduced.
In recent years, CNNs have developed rapidly in the design and calculation of natural
language processing (NLP) and achieved state-of-the-art results on various NLP tasks,
such as machine translation [14], sentence classification [15, 16], and question
answering [17]. In an NLP system, a convolution operation is typically a sliding
window function that applies a convolution filter to every possible window of words
in a sentence. Hence, the critical components of CNNs are a set of convolution filters
that compose low-level word features into higher-level representations.
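A minimal sketch of such a 1D-CNN over word embeddings in Keras, with a convolution filter sliding over word windows followed by max-pooling; the vocabulary size, sequence length and number of classes are placeholder assumptions:

```python
import tensorflow as tf

# Token indices pass through an embedding layer; Conv1D slides filters over the
# word windows and global max-pooling keeps the strongest feature per filter.
vocab_size, seq_len, num_classes = 10_000, 50, 5    # placeholder sizes
inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, 100)(inputs)
x = tf.keras.layers.Conv1D(filters=128, kernel_size=3, activation="relu")(x)
x = tf.keras.layers.GlobalMaxPooling1D()(x)
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```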
2.8 Vector-based Word Representations
Word embedding is a collection of NLP methods for creating vector representations of words; the idea is to map words into a vector space in which similar words are grouped together.
A brief introduction to the word embedding methods is as follows:
Term Frequency-Inverse Document Frequency (TF-IDF): It is one of the common
methods in NLP for converting text documents into matrix representation of vectors,
where TF denotes word frequency, that is, the frequency of a word appearing in the
document, and IDF denotes the inverse document frequency. The main idea is that if
a word or phrase appears more frequently in one document and less frequently in the
complete corpus, it is considered to have good representation ability for the document.
One-hot encoding: One of the simplest methods for word embedding is the one-hot
encoding scheme where each word is represented with a vector of the same length as
the total number of unique words in the corpus. The vector is then filled with zeros
except for one position which corresponds to the position of the word in an ordered
list of all unique words. The zero at this position is changed to one, hence the name
"one-hot". This results in a sparse vector with possibly thousands of zeros for a decent
sized corpus.
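A minimal numpy sketch of this one-hot scheme over a toy corpus (the helper name is illustrative):

```python
import numpy as np

def one_hot_encode(tokens):
    """Maps each token to a sparse vector with a single 1 at its vocabulary index."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    vectors = np.zeros((len(tokens), len(vocab)), dtype=int)
    for row, word in enumerate(tokens):
        vectors[row, index[word]] = 1
    return vectors, vocab

vectors, vocab = one_hot_encode(["the", "cat", "sat", "on", "the", "mat"])
print(vocab)       # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # the first "the" -> [0 0 0 0 1]
```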
GloVe: It is a count-based model which constructs a global co-occurrence matrix
where each row of the matrix is a word while each column represents the contexts in
which the word can appear. The GloVe scores represent the frequency of co-
occurrence of a word with other words. GloVe learns its vectors after calculating the
co-occurrences using dimensionality reduction. Other benefits of GloVe are its
parallelizable implementation and ease of training over the large corpus [117].
Word2Vec: It is a word vector learning algorithm developed by Mikolov et al. [22] and is composed of two algorithms (Continuous Bag-of-Words, CBOW, and Skip-gram, SG). CBOW and SG models are basic, yet powerful
techniques for learning word vectors [22]. CBOW computes the conditional
probability of a target word given the context words surrounding it across a determined
window size, and the SG model does the exact opposite of the CBOW model, by
predicting the surrounding context words given the central target word [22, 23]. The
context words are assumed to be located symmetrically to the target words within a
distance equal to the window size in both directions.
FastText: It is one of the most recent major advances in word embedding algorithms. Like Word2Vec, it was published by a group supervised by Tomas Mikolov, but this time at Facebook AI Research [126]. The main contribution of FastText is to introduce the idea of modular embeddings and to compute vectors for sub-word components (usually n-grams) instead of computing an embedded vector per word. These n-grams are later combined by a simple composition function to calculate the
final word embeddings. FastText has multiple advantages. One advantage is that the
vocabulary tends to be considerably smaller when working with large corpora, which
makes the algorithm more computationally efficient compared to the alternatives.
In pre-trained word embedding models, the word embedding tool is trained on large
corpora of texts in the given language, and it is highly useful on various NLP tasks.
Universal Sentence Encoder: One of the latest embedding methods is the Universal Sentence Encoder [24], which is a form of transfer learning [129]. In [24], two
encoding models were introduced. One of them is based on a Transformer model
(TM), and the other one is based on Deep Averaging Network (DAN). They are pre-
trained on a large corpus and can be used in a variety of tasks (sentiment analysis,
classification, etc.). Both models take a word, sentence or a paragraph as input and
output a fixed-dimensional (e.g. 512) vector. The Transformer-based encoder model
targets high accuracy at the cost of greater model complexity and resource
consumption [24]. DAN targets efficient inference with slightly reduced accuracy.
ELMo: Embedding from Language Model (ELMo) [118] is a bidirectional Language
Model whose vectors are pretrained using a large corpus to extract multi-layered word
embeddings. ELMo learns contextualized word representations that capture syntax, semantics and word sense disambiguation (WSD). ELMo can be coupled with existing deep learning approaches for building supervised models for a diverse
range of complex NLP tasks to improve their performance significantly [118].
BERT: Bidirectional Encoder Representations from Transformers (BERT) is based
on the bidirectional idea of ELMo but uses a Transformer architecture [119, 52].
BERT is pre-trained to learn bidirectional representations by jointly conditioning on both left and right context in all layers. The pre-trained vectors can be used in complex NLP tasks and can achieve state-of-the-art results with only one additional layer at the output [117].
2.9 Sequence-to-sequence Learning
Sequence-to-sequence (seq2seq) learning has gained enormous attention both academically and commercially. It has been successfully used to develop various practical and
powerful applications, such as machine translation [44, 45], speech recognition [120],
TTS [121, 122] and dialogue systems. This has been greatly advanced with the
increasing power of RNN, especially the LSTM for sequential processing.
A vanilla seq2seq framework is composed of an encoder and a decoder [43, 44]. The encoder first maps an input sequence x = (x_1, ..., x_T) into hidden states (h_1, ..., h_T), and then the decoder takes these state representations as input and generates the output sequence y = (y_1, ..., y_{T'}) token by token. The last hidden state representation is called the context vector c:

c = h_T \qquad (2.21)

During inference, by passing the context vector c and all the previously predicted tokens (y_1, ..., y_{t-1}) to the decoder, the decoding process predicts the next token y_t. In other words, the decoder defines a probability over the output y by decomposing the joint probability as follows [43]:

p(y) = \prod_{t=1}^{T'} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) \qquad (2.22)

p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \qquad (2.23)

where g is a non-linear function and s_t is the hidden state of the decoder.
2.10 End-to-end training
End-to-end training of deep learning models with large datasets helps to achieve high
accuracy in various application domains, including natural language processing. The
purpose of end-to-end training is to combine different components in the
computational graph of the neural network and optimize it as a whole. There are
several major advantages for end-to-end training [147, 148]:
• The whole model is closely related to the target since it has an overall objective function.
• It is more efficient because large computational graphs can be optimized together by simple backpropagation in the training process.
• The whole system is quite simple since there is only one input and one output, and features are automatically learned in the end-to-end network.
• Representations are learned, and modeling is performed jointly with representation learning in the same computational graph.
End-to-end solutions have achieved promising results in various tasks [148-151]. In
[148], an end-to-end adversarial Text-to-Speech method was proposed. This end-to-
end adversarial TTS operates on either pure text or raw, i.e. temporally unaligned
phoneme input sequences and produces raw speech waveforms as output. These
models eliminate the typical intermediate bottlenecks present in most state-of-the-art
TTS engines by maintaining learnt intermediate feature representations throughout the
network. In [150], a novel TTS model was presented, called Tacotron, an end-to-end
generative text-to-speech model that synthesizes speech directly from characters. In
[151], the architecture of Tacotron is extended by incorporating a normalizing flow
into the autoregressive decoder. Namely, they used a text normalization pipeline and
pronunciation lexicon to map input text into a sequence of phones.
Furthermore, the end-to-end approach would also be more straightforward to integrate into a general-purpose dialogue agent than one that relies on annotated dialogue states
[149].
In this dissertation, I present novel end-to-end models for G2P. These models consist
of combining the CNN and residual connections (Section 3.4).
2.11 Attention Mechanism
The attention mechanism has achieved great success and is commonly used in seq2seq
models for various NLP tasks [44, 25]. It addresses the limitation of modelling long
dependencies and the efficient usage of memory for computation. The vanilla attention
mechanism intervenes as an intermediate layer between the encoder and the decoder,
having the objective of capturing the information from the sequence of tokens that are
relevant to the contents of the sentence [45].
In an attention-based model, a set of attention weights is first calculated. These are
multiplied by the encoder output vectors to create a weighted combination. The result
should contain information about that specific part of the input sequence, and thus,
help the decoder select the target output symbol. Therefore, the decoder network can
use different portions of the encoder sequence as context. It can be defined as [45]:

c_t = \sum_{i=1}^{T} a_{t,i}\, h_i

where a_{t,i} is called the attention vector, which is generally calculated with a softmax function:

a_{t,i} = \frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_{j=1}^{T} \exp(\mathrm{score}(s_{t-1}, h_j))}

in which \mathrm{score}(\cdot) is an arbitrary function that scores how well the input around position i and the output at position t match. This is frequently realized with learnable weight matrices.
2.12 Self-Attention Networks
Recently, as a variant of the attention model, self-attention networks (SAN) have
attracted a lot of interest due to their flexibility in parallel computation and modelling
both long-term and short-term dependencies [52, 53]. SANs have been successfully
applied to many tasks, including reading comprehension, abstractive summarization,
textual entailment, learning task-independent sentence representations, machine
translation and language understanding. SANs calculate attention weights between
each pair of tokens in a single sequence, thus can capture long-range dependency more
directly than their RNN counterpart [123].
Formally, given an input layer the hidden states in the output layer
are constructed by attending to the states of the input layer. Specifically, the input
layer  is first transformed into queries , keys , and
values :


Where , ,  are trainable parameter matrices with d being the
dimensionality of input states [123]. The output layer  is constructed by

Where  is an attention model, which can be implemented as an additive,
multiplicative, or dot-product attention [123].
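A minimal numpy sketch of single-head dot-product self-attention with the Q, K, V projections described above (the scaling by the square root of d and the toy dimensions are illustrative choices):

```python
import numpy as np

def self_attention(H, W_Q, W_K, W_V):
    """Each output state is a weighted sum of the values, with weights given by a
    softmax over the (scaled) query-key scores of every pair of positions."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V

# toy sequence of 5 states with dimensionality d = 8
rng = np.random.default_rng(2)
d = 8
H = rng.normal(size=(5, d))
O = self_attention(H, rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(O.shape)   # (5, 8)
```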
2.13 Transformer Neural Network
The Transformer networks, shown in Figure 3.8, are based solely on attention mechanisms and compute the representations of their input and output without using recurrent or convolutional neural networks (CNN) [52, 53]. Transformer networks were first applied to neural machine translation, where they achieved state-of-the-art performance on various datasets. The results of [52] show that transformers can be trained significantly faster than recurrent or convolutional architectures for
machine translation tasks. The remarkable performance achieved by such models
largely comes from their ability to capture long-term dependencies in sequences
[127]. However, in the absence of recurrence, positional encoding is added to the input and output embeddings. Similarly to the time-step in a recurrent network, the positional information provides the Transformer network with the order of input and
output sequences. In particular, the multi-head attention mechanism in Transformer
allows every position to be directly connected to any other positions in a sequence.
Thus, the information can flow across positions without any intermediate loss.
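A minimal numpy sketch of the sinusoidal positional encoding used in [52], which is added to the input and output embeddings (the sequence length and model dimension are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=64)   # added to the embedded sequence
```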
Chapter 3
Grapheme-to-Phoneme Conversion
3.1 Introduction
The process of grapheme-to-phoneme (G2P) conversion generates the phonetic
transcription from the written form of words. The spelling of the word is called
grapheme sequence (or graphemes), the phonetic form is called phoneme sequence (or
phonemes). It is essential to develop a phonemic representation in text-to-speech
(TTS) and automatic speech recognition (ASR) systems. For this purpose, G2P
techniques are used, and getting state-of-the-art performance in these systems depends
on the accuracy of G2P conversion. For instance, in ASR, the acoustic models, pronunciation lexicons and language models are critical components. Acoustic and language models are built automatically from large corpora. Pronunciation lexicons are the middle layer between acoustic and language models. For a new speech recognition task, the performance of the overall system depends on the quality of the pronunciation component. In other words, the system's performance depends on G2P accuracy. For example, the G2P conversion of the word 'speaker' is 'S P IY K ER'.
In this chapter, I will present a novel CNN-based sequence-to-sequence (seq2seq)
architecture for G2P conversion. My approach includes an end-to-end CNN G2P
conversion with residual connections, furthermore, a model, which utilizes a
convolutional neural network (with and without residual connections) as encoder and
Bi-LSTM as a decoder. I compare the proposed approach with existing state-of-the-
art methods, including Encoder-Decoder LSTM and Encoder-Decoder Bi-LSTM.
Training and inference times, phoneme and word error rates are evaluated on the
public CMUDict dataset for US English, and the best performing convolutional neural
network-based architecture is also examined with the NetTalk dataset [130].
Furthermore, I implemented the transformer network [19] for G2P conversion. This
architecture is based on attention mechanisms. Additionally, I compare the
Transformer and CNN-based G2P methods.
This chapter is structured as follows: Section 3.2 describes previous works about G2P
conversion; Section 3.3 presents the datasets and metrics; Section 3.4 and Section 3.5
present CNNs- and Transformer-based models and experiments for G2P conversion;
Finally, conclusions are drawn in Section 3.6.
3.2 Related Works
G2P conversion has been studied for a long time. Rule-based G2P systems use a broad
set of grapheme-to-phoneme rules [25, 26]. Developing such a G2P system requires
linguistic expertise. Additionally, some languages (such as Chinese and Japanese)
have complex writing systems, and building the rules is labour-intensive, and it is
extremely difficult to cover most possible situations. Furthermore, these systems are
sensitive to out of vocabulary (OOV) events. Other previous solutions used joint
sequence models [27, 28]. These models create an initial grapheme-phoneme sequence alignment and, by using this alignment, calculate a joint n-gram language model over the sequences. The method proposed by [27] is implemented in the publicly available tool Sequitur¹. In one-to-one alignment, each grapheme corresponds only to one phoneme and vice versa. An empty symbol is introduced to match grapheme and phoneme sequences. For example, the grapheme sequence ‘CAKE’ matches the phoneme sequence K EY K, and the one-to-one alignment of these sequences is C→K, A→EY, K→K, while the last grapheme ‘E’ matches the empty symbol.
Conditional and joint maximum entropy models use this approach [29]. Later, Hidden
Conditional Random Field (HCRF) models were introduced in which the alignment
between grapheme and phoneme sequence is modelled with hidden variables [30, 31].
The HCRF models usually lead to very competitive results; however, the training of such models is very memory- and computation-intensive. A further approach utilizes conditional random fields (CRF) and Segmentation/Tagging models (such as linear finite-state automata or transducers, FSTs), then uses them in two different compositions [32]. The first composition is a joint-multigram combined with CRF;
the second one is a joint-multigram combined with Segmentation/Tagging. The first
approach achieved a 5.5% phoneme error rate (PER) on CMUDict.
Neural networks have also been applied for G2P conversion. They are robust
against spelling mistakes and OOV words and generalize well. Also, they can be
seamlessly integrated into end-to-end TTS/ASR systems (that are constructed entirely
of deep neural networks) [33]. In [33], a TTS system (Deep Voice) is presented, which
was constructed entirely from deep neural networks. Deep Voice lays the groundwork
for genuinely end-to-end neural speech synthesis. Thus, the G2P model is jointly trained with further essential parts of the speech synthesizer and recognizer, which increases the overall quality of the system.
¹ https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html, Access date: January 2021
Alignment based models of unidirectional LSTM with one layer and bi-
directional LSTM (Bi-LSTM) with one, two and three layers were also previously
investigated in [36]. In this work, alignment was explicitly modelled in the G2P
conversion process by the context of the grapheme. A further work, which applies
deep bi-directional LSTM with hyperparameter optimization (including the number of
hidden layers, optional linear projection layers, optional splicing window at the input)
considers various alignment schemes [37]. The best model with hyperparameter
optimization achieved 5.37% phoneme (PER) and 23.23% word error rate (WER) on
an independent test set. Multi-layer bidirectional encoder with gated recurrent units
(GRU) and deep unidirectional GRU as a decoder achieved 5.8% PER and 28.7%
WER on CMUDict [33].
Sequence-to-sequence learning or encoder-decoder type neural networks have
achieved remarkable success in various tasks, such as speech recognition, text-to-
speech synthesis, machine translation [38, 39, 40]. The encoder-decoder structure was
studied for the G2P task [36, 38] too. One of the best results for G2P conversion was
introduced by [38], which applied an attention-enabled encoder-decoder model and
achieved 4.69% PER and 20.24% WER on CMUDict. Furthermore, G2P-seq2seq², which is based on neural networks implemented in the TensorFlow framework, achieves 20.6% WER.
RNN-based models are slower to train, in general, since they are less suited for parallel computation. To overcome this problem, several studies proposed the utilization of CNNs instead of RNNs, e.g. [14, 150, 154]. Some studies have shown that
CNN-based alternative networks can be trained significantly faster and sometimes can
outperform RNN-based techniques. In [14], an idea on how to use attention
mechanism in a CNN-based seq2seq learning model was proposed, and it was shown
that the method is effective for machine translation. Furthermore, a fully CNN-based
TTS system which can be trained much faster than an RNN-based state-of-the-art
neural TTS system was presented in [150].
In sequence-to-sequence learning, the decoding stage is usually carried out
sequentially, one step at a time from left to right and the outputs from the previous
steps are used as decoder inputs. Sequential decoding can negatively influence the
results, depending on the task and the model. The non-sequential greedy decoding
(NSGD) method for G2P was studied in [154], and it was combined with a fully
convolutional encoder-decoder architecture. That model achieved 5.58% phoneme and 24.10% word error rates on the latest released version of the CMUDict US English dataset (0.7b, released on November 19, 2014), which includes multiple pronunciations without stress labels.
² https://github.com/cmusphinx/g2p-seq2seq, Access date: February 2021
Recently, a token-level ensemble distillation for G2P conversion was proposed in [107], which can boost the accuracy by distilling knowledge from additional unlabeled data and reduce the model size while maintaining high accuracy. A Transformer model was used to further boost the accuracy of G2P conversion as well. Moreover, [108] presented a DNN-based G2P converter that is able to perform well both on languages with irregular pronunciation and on languages with regular pronunciation that are easily describable by a set of transcription rules. The evaluation of this model was carried out in three different languages: English, Czech and Russian.
3.3 Research Methodology
3.3.1 Datasets
I used the CMU pronunciation dictionary³ and NetTalk datasets [130], which have been
frequently chosen in various papers [27, 36, 38]. The training and testing splits are the
same as found in [27, 36, 38], thus, the results are comparable. CMUDict contains a
106,837-word training set and a 12,000-word test set (reference data). 2,670 words are
used as development set. There are 27 graphemes (uppercase alphabet symbols plus
the apostrophe) and 41 phonemes (AA, AE, AH, AO, AW, AY, B, CH, D, DH, EH,
ER, EY, F, G, HH, IH, IY, JH, K, L, M, N, NG, OW, OY, P, R, S, SH, T, TH, UH,
UW, V, W, Y, Z, ZH, <EP>, </EP>) in this dataset. NetTalk contains 14,851 words
for training, 4,951 words for testing and does not have a predefined validation set.
There are 26 graphemes (lowercase alphabet symbols) and 52 phonemes ('!', '#', '*',
'+', '@', 'A', 'C', 'D', 'E', 'G', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'R', 'S', 'T', 'U', 'W', 'X', 'Y', 'Z',
'^', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',<EP>,
</EP>) in this dataset.
3.3.2 Metrics
For evaluation, measurements of phoneme error rate (PER) (Equation (2.11)), and
word error rate (WER) (Equation (2.12).) were performed. PER was used to measure
the distance between the predicted phoneme sequence and reference pronunciation
divided by the number of phonemes in the reference pronunciation. Edit distance (also
known as Levenshtein distance [41]) is the minimum number of insert (I), delete (D)
and substitute (S) operations required to transform one sequence into the other. If there
are multiple pronunciation variants for a word in the reference data, the variant that
has the smallest Levenshtein distance [41] to the candidate is used. For WER
computation, the number of word errors is divided by the total number of unique
words in the reference, and a word error is counted only if the predicted
pronunciation does not match any reference pronunciation. These metrics are reported
as percentages and are calculated as follows:

$\mathrm{PER} = \frac{D(r, h)}{N_P} \cdot 100\%$   (3.1)

$\mathrm{WER} = \frac{N_E}{N_W} \cdot 100\%$   (3.2)

In Equation (3.1), $D(r, h)$ is the Levenshtein distance between the reference
pronunciation $r$ and the hypothesis $h$, and $N_P$ is the number of phonemes in the
reference pronunciation; in Equation (3.2), $N_W$ is the number of unique words in
the reference and $N_E$ is the number of word errors.
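As a small illustration of these metrics (a sketch only, not the evaluation script used in this work), the following Python functions compute PER and WER from reference and predicted phoneme sequences; the function and variable names are illustrative.

import numpy as np  # not required below, shown only to match the common environment

def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)]

def per_wer(references, hypotheses):
    """references: one list of reference pronunciations (phoneme lists) per word;
    hypotheses: one predicted phoneme list per word."""
    total_distance, total_phonemes, word_errors = 0, 0, 0
    for refs, hyp in zip(references, hypotheses):
        best = min(refs, key=lambda r: levenshtein(r, hyp))   # closest reference variant
        dist = levenshtein(best, hyp)
        total_distance += dist
        total_phonemes += len(best)
        if dist > 0:            # a word error is counted only if no variant matches exactly
            word_errors += 1
    return (100.0 * total_distance / total_phonemes,          # PER, Equation (3.1)
            100.0 * word_errors / len(references))            # WER, Equation (3.2)

# Example: the second word has one substituted phoneme -> PER about 14.3%, WER 50%
print(per_wer([[["K", "EY", "K"]], [["ER", "EH", "S", "T"]]],
              [["K", "EY", "K"], ["ER", "EH", "Z", "T"]]))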
3.4 CNNs for Grapheme-to-Phoneme Conversion
Convolutional neural networks were successfully applied to various NLP tasks [14,
15, 42]. Some studies have shown that CNN-based alternative networks can be trained
much faster, and sometimes can even outperform the RNN-based techniques. In [14]
an idea was proposed on how to use attention mechanism in a CNN-based seq2seq
learning model and showed that the method is quite effective for machine translation.
These results suggest investigating the possibility of applying CNN-based sequence-
to-sequence models for G2P. I expected that the advantages of convolutional neural
networks would enhance the performance of G2P conversion. LSTMs read the input
sequentially, and the output at each step depends on the previous steps, so these
networks cannot be parallelized over the sequence. CNNs, in contrast, process all
positions in parallel and capture context through large receptive fields.
First, I implemented LSTM-based baseline models (Section 3.4.1), and then I developed
novel CNN-based models for G2P conversion (Section 3.4.2).
3.4.1 LSTM-based Encoder-Decoder for G2P conversion
The encoder-decoder structures have shown state-of-the-art results in different NLP
tasks [36, 39]. The main idea of these approaches has two steps: the first step is
mapping the input sequence to a vector; the second step is to generate the output
sequence based on the learned vector representation. Encoder-decoder models
generate an output after the complete input sequence is processed by the encoder,
which enables the decoder to learn from any part of the input without being limited to
fixed context windows. Figure 3.1. shows an example of an encoder-decoder
architecture [43]: the input of the encoder is the grapheme sequence “CAKE”, and
the decoder produces “K EY K” as the phoneme sequence. The left side is the encoder;
the right side is a decoder. The model stops making predictions after generating the
end-of-phonemes tag. In contrast to [36, 43], the input data for the encoder is not
reversed in any of the proposed models.
Figure 3.1. Encoder-decoder architecture.
In my experiments, I used encoder-decoder architectures. Several models with
different hyperparameters were developed and tested. From a large number of
experiments, five models with the highest accuracy and diverse architectures were
selected. The first two models are based on existing solutions for comparison
purposes. I used these models as baselines. The main properties of the two models are:
1: The first model uses LSTMs for both the encoder and the decoder (called
LSTM_LSTM). The LSTM encoder reads the input sequence and creates a fixed-
dimensional vector representation. The second LSTM is the decoder, and it generates
the output. Figure 3.2.(a) shows the structure of the first model. It can be seen that
both LSTMs have 1024 units; softmax activation function is used to obtain model
predictions. This architecture is the same as a previous solution [36], while the
parameters of training (optimization method, regularization, etc.) are identical to the
settings used in case of the other four models. This way I try to ensure a fair
comparison among the models.
2: In the second model, both the encoder and the decoder are Bi-LSTMs [46, 47]
(called BI-LSTM_BI-LSTM). The structure of this model is presented in Figure
3.2.(b). The input is fed to the first Bi-LSTM (encoder), which combines two
unidirectional LSTM layers that process the input from left-to-right and right-to-left.
The output of the encoder is given as input for the second Bi-LSTM (decoder). Finally,
the softmax function is applied to generate the output of one-hot vectors (phonemes).
During inference, the complete input sequence is processed by the encoder, and after
that, the decoder generates the output. For predicting a phoneme, both the left and the
right contexts are considered.
Although the encoder-decoder architecture achieves competitive results on a wide
range of problems, it suffers from the constraint that all input sequences are forced to
be encoded to a fixed size latent space. To overcome this limitation, I investigated the
effects of the attention mechanism proposed by [44, 45] in LSTM_LSTM and BI-
LSTM_BI-LSTM. I applied an attention layer between the encoder and decoder
LSTMs in case of LSTM_LSTM and Bi-LSTMs for BI-LSTM_BI-LSTM. The
introduced attention layers are based on global attention [45].
Figure 3.2. G2P conversion model based on encoder-decoder: (a) LSTMs
(LSTM_LSTM); (b) Bi-LSTMs (BI-LSTM_BI-LSTM); (c) encoder CNN, decoder
Bi-LSTM (CNN_BI-LSTM). f, d and s are the number of filters, the length of the
filters and the stride of the convolutional layer, respectively.
3.4.2 CNN-based models for G2P conversion
I designed and developed 3 CNN-based models for G2P conversion:
1. In the first model, a convolutional neural network is introduced as encoder and
a Bi-LSTM as the decoder (CNN_BI-LSTM). This architecture is presented in Figure
3.2(c). As this figure shows, the number of filters is 524, the length of the filters is 23,
the stride is 1, and the number of cells in the Bi-LSTM is 1024. In this model, the
CNN layer takes graphemes as input and performs convolution operations. For
regularization purposes, I also introduced batch normalization in this model.
2. The second model contains convolutional layers only with residual connections
(blocks) [48]. These residual connections have two rules [49]:
(1) if feature maps have the same size, then the blocks share the same
hyperparameters.
(2) each time when the feature map is halved, the number of filters is doubled.
Figure 3.3. G2P conversion based on (a) a convolutional neural network with
residual connections (CNN+RES) and (b) an encoder convolutional neural
network with residual connections and a decoder Bi-LSTM (CNN+RES_BI-LSTM).
f, d and s are the number of filters, the length of the filters and the stride,
respectively.
First, I applied one convolutional layer with 64 filters to the input layer, followed
by a stack of residual blocks. Through hyperparameter optimization, the best result
was achieved by 4 residual blocks, as shown in Figure 3.3. (a) and the number of filters
in each residual block is 64, 128, 256, 512, respectively. Each residual block contains
a sequence of two convolutional layers followed by a batch normalization [50] layer
and ReLU activation. The filter size of all convolutional layers is three. After these
blocks, one more batch normalization layer and ReLU activation are applied. The
architecture ends with a fully connected layer, which uses the softmax activation
function.
I also carried out experiments with the same fully convolutional models without
residual connections; however, the phoneme and word error rates were worse than
with residual connections, as expected.
3. The third model combines CNN_BI-LSTM and CNN+RES: the encoder has
the same convolutional neural network architecture with residual connections and
batch normalization, which was introduced in CNN+RES, and the decoder is a Bi-
LSTM, as in model CNN_BI-LSTM (this model called CNN+RES_BI-LSTM). The
structure of this model is presented in Figure 3.3. (b).
In all models except CNN+RES, I used stateless LSTM (or Bi-LSTM)
configurations; the internal state is reset after each batch for predictions.
3.4.3 Details of the Bidirectional Decoder
The details of the bidirectional decoder, which was used in BI-LSTM_BI-LSTM
and CNN+RES_BI-LSTM (Section 3.4.1. and 3.4.2.), are presented in this section.
Given an input sequence $x = (x_1, \dots, x_T)$, the LSTM network computes the
hidden vector sequence $h = (h_1, \dots, h_T)$ and the output vector sequence
$y = (y_1, \dots, y_T)$.
Initially, one-hot character vectors for grapheme and phoneme sequences were
created. Character vocabularies, which contain all the elements that are present in the
input and output data, are calculated separately. In other words, no grapheme vector
appears in the output vocabulary and no phoneme vector appears in the input
vocabulary. These one-hot vectors were the inputs to the encoder and the decoder. Padding was applied to
make all input and output sequences to have the same length, which was set to 22.
This number (22) was chosen based on the maximum length in the training database.
For G2P, $x = (x_1, \dots, x_T)$ is the one-hot character vector sequence of the
graphemes, and $y = (y_1, \dots, y_T)$ is the one-hot character vector sequence of the phonemes.
In BI-LSTM_BI-LSTM, a Bi-LSTM was used as the encoder; it consists of two LSTMs:
one that processes the sequence from left to right (forward encoder) and one that
processes it in the reverse order (backward encoder). It was applied to learn the
semantic representation of the input sequences in both directions. In each time step,
the forward hidden sequence $\overrightarrow{h}_t$ and the backward hidden sequence
$\overleftarrow{h}_t$ are iterated by Equations (2.16) and (2.17) in Section 2.6.
For the decoder, I also used a bidirectional LSTM, whose two directions can be called
the forward and the backward decoder. After concatenating the outputs of the forward
and backward encoder LSTMs, the backward decoder performs decoding in a right-to-left
manner; it was initialized with the final encoder state and the reversed output
(phoneme) sequence. The forward decoder is trained to sequentially predict the next
phoneme given the preceding phonemes; it was initialized with the final state of the
encoder and the phoneme sequences.
Each decoder output is passed through a softmax layer that learns to classify the
correct phonemes.
For training, given the previous phonemes, the model factorizes the conditional
probability into a summation of individual log conditional probabilities from both
directions:

$\log P(y \mid x) = \sum_{t=1}^{T} \left( \log \overrightarrow{P}(y_t \mid y_{<t}, x) + \log \overleftarrow{P}(y_t \mid y_{>t}, x) \right)$   (3.3)

where $\overrightarrow{P}(y_t \mid y_{<t}, x)$ and $\overleftarrow{P}(y_t \mid y_{>t}, x)$
are the left-to-right (forward) and the right-to-left (backward) conditional
probabilities in Equation (3.3), calculated from the forward and backward decoder
states $\overrightarrow{s}_t$ and $\overleftarrow{s}_t$ as:

$\overrightarrow{P}(y_t \mid y_{<t}, x) = \mathrm{softmax}(W\overrightarrow{s}_t + b)$   (3.4)

$\overleftarrow{P}(y_t \mid y_{>t}, x) = \mathrm{softmax}(W\overleftarrow{s}_t + b)$   (3.5)

The prediction is performed on the test data as follows:

$\hat{y}_t = \arg\max_{y_t} \overrightarrow{P}(y_t \mid \hat{y}_{<t}, x)$   (3.6)
According to Equation (3.6), future output is not used during inference. The
architecture is shown in Figure 3.4.
Figure 3.4. Architecture of the proposed bidirectional decoder model for G2P
conversion.
3.4.4 Experiments
For evaluation, the CMU pronunciation (CMUDict) and NetTalk datasets were used.
I also used <EP> and </EP> tokens as beginning-of-graphemes and end-of-
graphemes tokens in both datasets.
For the CMUDict experiments, in all models, the size of the input layer is
input: {length of the longest input (22)} × {number of graphemes (27)}
and the size of the output layer is
output: {length of the longest output (22)} × {number of phonemes (41)}.
In order to transform graphemes and phonemes for neural networks, I converted
inputs to 27-dimensional and outputs to 41-dimensional one-hot vector
representations. For example, the phoneme sequence of the word 'ARREST' is 'ER
EH S T'; the input and output vectors of the grapheme and phoneme sequences are as
below:
Input vector of 'ARREST': a 22 × 27 matrix in which each of the first six rows is the
one-hot vector of the corresponding grapheme (A, R, R, E, S, T) and the remaining
rows are padding.

Output vector of 'ER EH S T': a 22 × 41 matrix in which each of the first rows is the
one-hot vector of the corresponding phoneme (ER, EH, S, T) and the remaining rows
are padding.
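A minimal sketch of this one-hot encoding step, assuming the 27-grapheme inventory and the padded length of 22 described above (the helper below is illustrative, not the dissertation's preprocessing code):

import numpy as np

GRAPHEMES = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ'")     # 27 input symbols
MAX_LEN = 22

def one_hot(sequence, vocabulary, max_len=MAX_LEN):
    index = {symbol: i for i, symbol in enumerate(vocabulary)}
    matrix = np.zeros((max_len, len(vocabulary)), dtype=np.float32)
    for pos, symbol in enumerate(sequence[:max_len]):
        matrix[pos, index[symbol]] = 1.0    # rows beyond the sequence stay all-zero (padding)
    return matrix

x = one_hot(list("ARREST"), GRAPHEMES)      # 22 x 27 input matrix
print(x.shape, x[:6].argmax(axis=1))        # indices of A, R, R, E, S, T
# the 22 x 41 output matrix is built the same way from the 41-phoneme vocabulary
# (including <EP> and </EP>)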
For the NetTalk experiments, the sizes of the input and output layers are
input: {length of the longest input (19)} × {number of graphemes (26)}
output: {length of the longest output (19)} × {number of phonemes (52)}.
I converted inputs to 26-dimensional and outputs to 52-dimensional one-hot
vector representations as in case of CMUDict. The same model structure was used as
with the CMUDict experiments.
Moreover, the application of a single convolutional layer to the input data is
presented in Figure 3.5. The input is the one-hot representation of ‘ARREST’; 64
filters of size (input length) × 3 are applied to the input. In other words, the input
is convolved with 64 feature maps, which produce the output of the convolutional
layer. Zero padding was used to ensure that the output of the convolutional layer has
the same dimension as the input.
Figure 3.5. Application of a single convolutional layer with 64 filters of size
(input length) × 3 to the input data.
Furthermore, I applied the Adam optimization algorithm [51] with a starting learning
rate of 0.001 and the default values of β1, β2 and ε (0.9, 0.999 and 1e-08, respectively)
in the case of the LSTM-based models. A batch size of 128 was chosen. Weights were
saved whenever the PER on the validation dataset reached a new minimum. When the PER
did not decrease further within 100 epochs, the best model was selected and trained
further with stochastic gradient descent (SGD). For SGD, in the case of the
LSTM_LSTM, BI-LSTM_BI-LSTM and CNN_BI-LSTM models, I used a learning rate of 0.005
and a momentum of 0.8. For the CNN+RES (convolutional with residual connections)
model, a learning rate of 0.05 and a momentum of 0.8 were applied, and training
stopped after 142 epochs when early stopping was triggered. In the CNN+RES_BI-LSTM
model, an SGD learning rate of 0.5 and a momentum of 0.8 were set, and when the PER
stopped improving for about 50 epochs, the learning rate was multiplied by 4/5. The
number of epochs for this model reached 147 and 135 for CMUDict and NetTalk,
respectively. In all proposed models, the early-stopping patience was set to 50 with
the Adam optimizer and 30 with the SGD optimizer.
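A hedged Keras sketch of this two-stage schedule (train with Adam, keep the best validation weights, then continue with SGD); the monitored quantity, file name and callback settings below are illustrative placeholders, not the exact scripts used for the experiments.

from tensorflow.keras import callbacks, optimizers

def train_two_stage(model, x_train, y_train, x_val, y_val):
    # Stage 1: Adam with checkpointing of the best validation weights and early stopping
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy")
    ckpt = callbacks.ModelCheckpoint("best_adam.h5", monitor="val_loss", save_best_only=True)
    model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=128,
              epochs=1000,
              callbacks=[ckpt, callbacks.EarlyStopping(monitor="val_loss", patience=50)])

    # Stage 2: reload the best weights and fine-tune with SGD
    model.load_weights("best_adam.h5")
    model.compile(optimizer=optimizers.SGD(learning_rate=0.005, momentum=0.8),
                  loss="categorical_crossentropy")
    model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=128,
              epochs=1000,
              callbacks=[callbacks.EarlyStopping(monitor="val_loss", patience=30)])
    return model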
As for software and hardware, NVidia Titan Xp (12 GB) and NVidia Titan X (12 GB)
GPU cards hosted in two i7 workstations with 32 GB RAM served for training and
inference. Ubuntu 14.04 with CUDA 8.0 and cuDNN 5.0 was used as the general software
environment. For training and evaluation, the Keras (https://keras.io/) deep learning
framework with a TensorFlow (https://www.tensorflow.org/) backend was my environment
(both accessed in February 2021). This environment was common for all experiments in
this dissertation.
3.4.5 Evaluation and Results
After training the models, predictions were run on the test dataset. The results of
the evaluation on the CMUDict dataset are shown in Table 3.1. The first and second
columns show the model number and the applied architecture, respectively. The third
and fourth columns show the PER and WER values. The fifth column contains the
average sum of the training and validation times of one epoch. The last two columns
present the number of epochs needed to reach the minimum validation loss and the
size of the models as the number of parameters (weights).
The best model proposed in [99] is a combination of sequitur G2P (model order 8),
seq2seq-attention (Bi-LSTM 512x3) and multitask learning (ARPAbet/IPA); although its
WER is better, CNN+RES_BI-LSTM has a smaller PER.
Although the encoder-decoder LSTM of [36] is similar to my LSTM_LSTM, the PER is
better in my case, and the WER of the two models is almost the same. My
BI-LSTM_BI-LSTM is comparable with the Bi-LSTM method of [36], in which alignment
was also applied.
Table 3.1. Results on the CMUDict dataset.

Model               | Method                                             | PER  | WER   | Time [s] | Number of epochs | Model size
LSTM_LSTM           | Encoder-Decoder LSTM                               | 5.68 | 28.44 | 467.7    | 185              | 12.7M
LSTM_LSTM+ATT       | Encoder-Decoder LSTM with attention layer          | 5.23 | 28.36 | 688.9    | 136              | 13.9M
BILSTM_BILSTM       | Encoder-Decoder Bi-LSTM                            | 5.26 | 27.07 | 858.93   | 177              | 33.8M
BI-LSTM_BI-LSTM+ATT | Encoder-Decoder Bi-LSTM with attention layer       | 4.86 | 25.67 | 1045.5   | 114              | 35.3M
CNN_BI-LSTM         | Encoder CNN, decoder Bi-LSTM                       | 5.17 | 26.82 | 518.3    | 115              | 13.1M
CNN+RES             | End-to-end CNN (with res. connections)             | 5.84 | 29.74 | 176.1    | 142              | 7.62M
CNN+RES_BI-LSTM     | Encoder CNN with res. connections, decoder Bi-LSTM | 4.81 | 25.13 | 573.5    | 147              | 14.5M
Table 3.2. Comparison of the best previous results of G2P models with CNN+RES_BI-LSTM
(encoder CNN with residual connections, decoder Bi-LSTM) on CMUDict and NetTalk.

Data    | Method                                                                      | PER (%) | WER (%)
NetTalk | Joint sequence model [27]                                                   | 8.26    | 33.67
NetTalk | Bi-LSTM [36]                                                                | 7.38    | 30.77
NetTalk | Encoder-decoder with global attention [38]                                  | 7.14    | 29.20
NetTalk | Encoder CNN with residual connections, decoder Bi-LSTM (CNN+RES_BI-LSTM)    | 5.69    | 30.10
CMUDict | LSTM with Full-delay [35]                                                   | 9.11    | 30.1
CMUDict | Joint sequence model [27]                                                   | 5.88    | 24.53
CMUDict | Encoder-decoder LSTM [36]                                                   | 7.63    | 28.61
CMUDict | Bi-LSTM + Alignment [36]                                                    | 5.45    | 23.55
CMUDict | Combination of sequitur G2P, seq2seq-attention and multitask learning [99]  | 5.76    | 24.88
CMUDict | Ensemble of 5 [Encoder-decoder + global attention] models [38]               | 4.69    | 20.24
CMUDict | Encoder-decoder with global attention [38]                                  | 5.04    | 21.69
CMUDict | Joint multi-gram + CRF [32]                                                 | 5.5     | 23.4
CMUDict | Joint n-gram model [28]                                                     | 7.0     | 28.5
CMUDict | Joint maximum entropy (ME) n-gram model [29]                                | 5.9     | 24.7
CMUDict | Encoder-Decoder GRU [33]                                                    | 5.8     | 28.7
CMUDict | Encoder CNN with residual connections, decoder Bi-LSTM (CNN+RES_BI-LSTM)    | 4.81    | 25.13
3.4.6 Discussions
In order to analyze the connection between PER values and word length, I
categorize the word length into three classes: short (shorter than 6 characters), medium
(between 6 and 10 characters), long (more than 10 characters). According to this
categorization, there were 4306 short, 5993 medium and 1028 long words in the
CMUDict dataset. In this analysis, I ignored approximately 600 words that have
multiple pronunciation variants in the reference data.
The result of this comparison is presented in Figure 3.6. For short words, all
models show similar PERs; for medium length words, except the end-to-end CNN
model (CNN+RES), the other models resulted in similar error; for long words, encoder
CNN with residual connection, decoder Bi-LSTM (CNN+RES_BI-LSTM ) and
encoder CNN, decoder Bi-LSTM (CNN_BI-LSTM) got similar minimum errors. The
model CNN+RES showed the highest error in both medium and long length words.
According to Figure 3.6, the advantage of Bi-LSTM based models for learning long
sequences is clearly visible.
Moreover, errors occurring in the first half of the reference pronunciation increase
the probability of predicting incorrect phonemes in the second half; still, a
correctly predicted first half does not guarantee a correctly predicted second half.
In my experiments, convolutional architectures also performed well on both short-
and long-range dependencies. My intuition is that the residual connections enable the
network to combine features learned by lower and higher layers, which represent
shorter and longer dependencies.
I also analysed the position of the errors in the reference pronunciation: I
investigated if the error occurred in the first or in the second half of the word. The type
of error can be insertion (I), deletion (D) and substitution (S). By using this position
information, I can analyse the distribution of these errors across the first or second half
of the word. The position of error was calculated by enumerating graphemes in the
reference. For error insertion (I), the position of the previous grapheme was taken into
account. The example below describes the process in detail:

word: ACKNOWLEDGEMENT
Enumeration: 0 1 2 3 4 5 6 7 8 9 10 11 12
Reference:  [EP AE K N AA L IH JH M AH N T /EP]
Prediction: [EP IH K N AA L IH JH IH JH AH N T /EP]
Type of errors: S S I
Position: [1, 8, 8]

As the example shows, two substitutions (S) and one insertion (I) occurred in the
CNN+RES_BI-LSTM model output. One error (S) is located in the first half of the
reference pronunciation (EP AE K N AA L), and the other errors (S and I) are in the
second half (IH JH M AH N T /EP).
Figure 3.7 shows the position errors calculated for all models on the reference
dataset. In all models, the first half of the words contains more errors. Regarding
the second half, all models show a similar number of position errors, except the
end-to-end CNN model (CNN+RES). CNN+RES_BI-LSTM resulted in the lowest number of
position errors.
Furthermore, in all models presented here, PER is better than the previous results
on CMUDict while WER is still reasonable. This means that most of the incorrect
predictions are very close to the reference and therefore have a small PER.
Accordingly, I analyzed the incorrect predictions (outputs) of each model to see how
many phonemes match the reference. In CNN+RES_BI-LSTM, 25.3% of the test samples are
incorrect (about 3000 test samples). The analysis of these predictions shows that
more than half of them contain only one incorrect phoneme. In particular, the PER of
59 test samples is higher than 50% (for 11 test samples it is greater than 60%, and
for only 1 test sample it is more than 70%).
Figure 3.6. PER depending on the length of the words.
Figure 3.7. Position of errors for all models.
These percentages are more or less the same for the other presented models.
Generally, the same 1000 words are incorrectly predicted by all presented models.
Different types of error can be observed in the generated phoneme sequences. One of
them is that some phonemes are unnecessarily generated multiple times. For example,
for the word YELLOWKNIFE, the reference is [Y EH L OW N AY F], while the prediction
of CNN+RES_BI-LSTM for this word is [Y EH L OW K N N F], where the phoneme N was
generated twice. Another error type concerns sequences of graphemes that are rarely
represented in the training data. For example, for the word ZANGHI, the
CNN+RES_BI-LSTM output is [Z AE N G], while the reference is [Z AA N G IY].
Table 3.3. Examples of errors predicted by CNN+RES_BI-LSTM.

Word from test data | Reference           | Prediction of CNN+RES_BI-LSTM
YELLOWKNIFE         | Y EH L OW N AY F    | Y EH L OW K N N F
ZANGHI              | Z AA N G IY         | Z AE N G
GDANSK              | G AH D AE N S K     | D AE N AE K EH K
SCICCHITANO         | S IH K AH T AA N OW | S CH CH Y K IY IY
KOVACIK             | K AA V AH CH IH K   | K AH V AA CH IH K
LPN                 | EH L P IY EH N      | L L N N P IY E
INES                | IH N IH S           | AY N Z
The graphemes ‘NGHI’ appeared only 7 times in the training data. Furthermore,
many words are of foreign origin, for example, GDANSK is a Polish city,
SCICCHITANO is an Italian name, KOVACIK is a Slovac name. Generating
phoneme sequences of abbreviations is one of the hard challenges. For example, LPN,
INES are shown with their references and the prediction form of CNN+RES_BI-
LSTM in Table 3.3:
In the proposed models, I was able to achieve smaller PERs with different
hyperparameter settings, but the WERs showed different behaviour. For calculating
WER, the number of word errors is divided by the total number of unique words in the
reference, and a word error is counted only if the predicted pronunciation does not
match any reference pronunciation. Thus, every generated phoneme sequence counted as
a word error contains at least one phoneme error.
3.5 Transformer Neural Network for Grapheme-to-
Phoneme Conversion
3.5.1 Architecture of the Transformer
Transformer networks have brought significant improvements to many areas of deep
learning, including machine translation, text understanding, and speech and image
processing. This novel architecture avoids the recurrence equation and maps the input
sequences into hidden states solely using attention. Using positional encodings in
conjunction with a multi-head attention mechanism allows increased parallel
computation and reduces time to convergence.
3.5.2 Transformer-based models for G2P Conversion
Encoder-decoder based sequence to sequence learning has made remarkable progress
in recent years. The main idea of these approaches is described in Section 3.4.1. For
both the encoder and the decoder, different network architectures have been
investigated [36, 43, 44].
Figure 3.8. The framework of the proposed model.
Transformer is organized by stacked self-attention and fully connected layers for both
the encoder and the decoder [52], as shown in the left and right halves of Figure 3.8,
respectively. Self-attention, sometimes called intra-attention, is an attention
mechanism relating different positions of a single sequence to compute its internal
representation.
Without using any recurrent layer, positional encoding is added to the input and output
embeddings [53]. The positional information provides the transformer network with
the order of input and output sequences.
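For reference, the sinusoidal positional encoding of [53] can be computed as in the short NumPy sketch below (d_model = 128 matches the embedding size used in the experiments later in this chapter); it is added element-wise to the embeddings. This is an illustration, not the code of the proposed models.

import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                         # (max_len, 1)
    i = np.arange(d_model)[None, :]                           # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                      # sine on even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])                      # cosine on odd dimensions
    return pe

pe = positional_encoding(22, 128)    # added to the (22, 128) grapheme/phoneme embeddings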
The encoder is composed of a stack of identical blocks, and each block has two
layers. The first is the multi-head attention layer, in which several attention heads
operate in parallel. The second is a fully connected, position-wise feed-forward
layer. These layers are followed by dropout and normalization layers [54]. The
decoder is composed of a stack of identical blocks, and each block has three layers.
The first layer is the masked multi-head attention layer [55]; the masking ensures
that the model generates the current phoneme using only the previous phonemes. The
second layer is a multi-head attention layer, which attends over the encoder output
using the output of the first layer. The third layer is fully connected. These layers
are followed by normalization [54] and dropout layers [56]. At the top, there is a
final fully connected layer with linear activation, followed by a softmax output.
An attention function can be described as mapping a query and a set of key-value
pairs to an output, where the query (Q), keys (K), values (V), and output are all
vectors [52]. A multi-head attention mechanism builds upon scaled dot-product
attention, which operates on a query Q, a key K and a value V (the dimension of the
queries and keys is $d_k$, and the dimension of the values is $d_v$):

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$   (3.7)

where the scaling factor $1/\sqrt{d_k}$ is used to prevent the softmax function from
getting into regions that have very small gradients.
Instead of performing a single attention function, multi-head attention uses h
parallel attention layers (heads) to learn different representations, computes scaled
dot-product attention for each representation, concatenates the results, and projects
the concatenation with a feed-forward layer, so that $d_{model}$-dimensional outputs
are obtained. The multi-head attention is defined as follows [52]:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O}$

$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$

where the projections are trainable parameter matrices $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$,
$W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$ and
$W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$.
Each head in multi-head attention learns an individual sequence dependency, and this
allows the model to attend to information from different representation subspaces.
Thus, it increases the power of the attention without additional computational
overhead.
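A NumPy sketch of Equation (3.7) and of the head-wise projection-and-concatenation scheme above (the projection matrices are passed in explicitly here; this is an illustration, not the implementation used in the experiments):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaling keeps the softmax out of tiny-gradient regions
    return softmax(scores) @ V        # weighted sum of the values

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv: one projection matrix per head; Wo: output projection
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo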
3.5.3 Experiments
For evaluation, the CMU pronunciation and NetTalk datasets were used. I used
<START>, <END> tokens as beginning-of-graphemes (beginning-of-phonemes),
end-of-graphemes (end-of-phonemes) tokens in both datasets. Shorter input and output
sequences were padded with the <PAD> token to make their lengths equal in both the
training and development sets. For the test set, padding was not applied.
I applied two embeddings which represent the encoder (grapheme) and decoder
(phoneme) sides, respectively. The encoder and decoder embeddings had a great
influence on the results. The size of the embeddings is 128, and the dimension of the
inner-layer is 512. I used Adam as optimizer [51]. The initial learning rate was set to
0.0002. If the performance (PER for G2P conversion) on the validation set has not
improved for 50 epochs, the learning rate was multiplied by 0.2. I applied layer
normalization and dropout in all models. The dropout rate of encoder and decoder is
set to 0.1. Batch size is 128 for CMUDict, 64 for NetTalk. I have investigated three
transformer architectures, with 3 encoder and decoder layers (it is called Transformer
3x3 in Table 3.5), 4 encoder and decoder layers (it is called Transformer 4x4 in Table
3.5) and 5 encoder and decoder layers (it is called Transformer 5x5 in Table 3.5).
I employed h = 4 parallel attention layers in all proposed models, and Q, K and V
have the same dimension, so that $d_q = d_k = d_v = 128$. Due to the reduced dimension
of each head, the total computational cost is similar to that of single-head
attention with full dimensionality. Other parameters used in training are listed in
Table 3.4.

Table 3.4. Training parameters.

Parameter                              | Value
Encoder layers                         | 3/4/5
Decoder layers                         | 3/4/5
Params in one encoder                  | 256
Params in one decoder                  | 256
Dropout                                | 0.1
Batch size                             | 128/64
Adam optimizer (initial learning rate) | 0.0002
3.5.4 Inference
During inference, the phoneme sequence (the written pronunciation of a given grapheme
sequence) is generated one phoneme at a time.
The sequence begins with the start token <START>, and the phoneme with the highest
probability is generated first. Then, this phoneme is fed back into the network to
generate the next one. This process continues until the end token <END> is produced
or the maximal length terminates the procedure. Beam search was not applied in this
work [124].
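A hedged sketch of this greedy loop; `model`, its two-input `predict` interface and the token ids are placeholders for whatever the trained network exposes, not the actual API of the models above.

import numpy as np

def greedy_decode(model, grapheme_ids, start_id, end_id, max_len=22):
    output = [start_id]
    for _ in range(max_len):
        # the network returns a probability distribution over phonemes for every decoder position
        probs = model.predict([np.array([grapheme_ids]), np.array([output])], verbose=0)
        next_id = int(np.argmax(probs[0, len(output) - 1]))
        if next_id == end_id:      # stop at <END> or when max_len is reached
            break
        output.append(next_id)
    return output[1:]              # drop the <START> token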
3.5.5 Evaluation and Results
After training the models, predictions were carried out on the test dataset. The
results of the evaluation on CMUDict and NetTalk are shown in Table 3.5. The first
and second columns show the dataset and the applied architecture, respectively. The
third and fourth columns show the PER and WER values. The fifth column of Table 3.5
contains the average sum of the training and validation times of one epoch. The last
column presents the number of parameters (weights). According to the results,
Transformer 4x4 (4-layer encoder and 4-layer decoder) outperforms Transformer 3x3
(3-layer encoder and 3-layer decoder). Contrary to expectations, Transformer 5x5
(5-layer encoder and 5-layer decoder) did not outperform Transformer 4x4. Increasing
the number of encoder-decoder layers leads to considerably more trainable parameters.
Table 3.5. Transformer results on the CMUDict and NetTalk datasets.

Dataset | Model           | PER  | WER   | Time [s] | Model size
CMUDict | Transformer 3x3 | 6.56 | 23.9  | 76       | 1.49M
CMUDict | Transformer 4x4 | 5.23 | 22.1  | 98       | 1.95M
CMUDict | Transformer 5x5 | 5.97 | 24.6  | 126      | 2.4M
NetTalk | Transformer 3x3 | 7.01 | 30.67 | 33       | 1.50M
NetTalk | Transformer 4x4 | 6.87 | 29.82 | 39       | 1.96M
NetTalk | Transformer 5x5 | 7.72 | 31.16 | 48       | 2.4M
In the G2P task, a complexity similar to that of NMT (neural machine translation)
models can rarely be afforded, and a high number of parameters does not necessarily
result in better performance. In terms of PER, Transformer 5x5 is better than
Transformer 3x3 on CMUDict, but it did not exceed Transformer 4x4 or Transformer 3x3
in WER on either CMUDict or NetTalk.
During the experiments, I did not observe significant performance improvements when
the number of encoder-decoder layers was increased further.
In Table 3.6, the performance of the Transformer 4x4 model is compared with previous
state-of-the-art results on both the CMUDict and NetTalk databases. The first column
shows the dataset, the second column presents the method used in previous solutions
with references, the PER and WER columns show the results of the referred models, and
the last column presents the number of parameters (weights). According to Table 3.6,
my proposed model reached competitive results for both PER and WER. For NetTalk, it
exceeds previous results significantly.
Table 3.6. Comparison of the best previous results of G2P models with Transformer 4x4
on CMUDict and NetTalk.

Data    | Method                                                                      | PER (%) | WER (%) | Model size
NetTalk | Joint sequence model [27]                                                   | 8.26    | 33.67   | N/A
NetTalk | Encoder-decoder with global attention [38]                                  | 7.14    | 29.20   | N/A
NetTalk | CNN+RES_BI-LSTM                                                             | 5.69    | 30.10   | 14.5M
NetTalk | Transformer 4x4                                                             | 6.87    | 29.82   | 1.95M
CMUDict | Encoder-decoder LSTM [36]                                                   | 7.63    | 28.61   | N/A
CMUDict | Joint sequence model [27]                                                   | 5.88    | 24.53   | N/A
CMUDict | Combination of sequitur G2P, seq2seq-attention and multitask learning [99]  | 5.76    | 24.88   | N/A
CMUDict | Deep Bi-LSTM with many-to-many alignment [37]                               | 5.37    | 23.23   | N/A
CMUDict | Joint maximum entropy (ME) n-gram model [28]                                | 5.9     | 24.7    | N/A
CMUDict | CNN+RES_BI-LSTM                                                             | 4.81    | 25.13   | 14.5M
CMUDict | CNN+RES                                                                     | 5.84    | 29.74   | 7.62M
CMUDict | LSTM_LSTM                                                                   | 5.68    | 28.44   | 12.7M
CMUDict | Transformer 4x4                                                             | 5.23    | 22.1    | 2.4M
It should be pointed out that the results of the Transformer 4x4 model are close to
those of the CNN encoder with residual connections and Bi-LSTM decoder
(CNN+RES_BI-LSTM) in terms of PER, but the WER of Transformer 4x4 is better.
Moreover, according to Table 3.1, CNN+RES_BI-LSTM has 14.5M parameters, while the
encoder-decoder LSTM and the encoder-decoder Bi-LSTM have 12.7M and 33.8M parameters,
respectively. Both Transformer 4x4 and Transformer 3x3 have fewer parameters than the
previously mentioned models.
When comparing Transformer 4x4 and the CNN+RES_BI-LSTM model (encoder CNN, decoder
Bi-LSTM), there is an interesting discrepancy between PER and WER: although the PER
of CNN+RES_BI-LSTM is smaller, its WER is higher than that of Transformer 4x4. As
mentioned in Section 3.4.6, in the results of the CNN encoder / Bi-LSTM decoder model
there were twice as many words with only one phoneme error as words with two phoneme
errors, and this increased the WER.
In contrast, Transformer 4x4 produces fewer words with only one phoneme error.
Regarding the types of error in the generated phoneme sequences, in the
CNN+RES_BI-LSTM model some phonemes are unnecessarily generated multiple times. For
example, for the word KORZENIEWSKI, the reference is [K AO R Z AH N UW F S K IY], and
the prediction of the CNN encoder / Bi-LSTM decoder for this word is
[K AO R Z N N N UW S K IY], where the phoneme N is generated three times. The
prediction of Transformer 4x4 for this word is [K ER Z AH N UW S K IY], with one
substituted phoneme (ER) and two deleted phonemes (R, F). Example 1 in Table 3.7 also
shows the error types for Transformer 4x4 and the CNN encoder / Bi-LSTM decoder.

Table 3.7. Examples of errors predicted by Transformer 4x4 and CNN+RES_BI-LSTM.

                              | Example 1                          | Example 2
Original word                 | NATIONALIZATION                    | KORZENIEWSKI
Reference                     | N AE SH AH N AH L AH Z EY SH AH N  | K AO R Z AH N UW F S K IY
Transformer 4x4               | N AE SH N AH L AH Z EY SH AH N     | K ER Z AH N UW S K IY
Prediction of CNN+RES_BI-LSTM | N AE SH AH N AH L AH EY EY SH AH N | K AO R Z N N N UW S K IY
3.6 Conclusions
In this chapter, the various seq2seq-based models for the G2P task are described, and
the results are compared to previously reported state-of-the-art research. In CNN+RES
and CNN+RES_BI-LSTM, I applied CNNs with residual connections. The
CNN+RES_BI-LSTM model, which uses convolutional layers with residual
connections as encoder and Bi-LSTM as decoder outperformed most of the previous
solutions on the CMUDict and NetTalk datasets in terms of PER. Furthermore,
CNN+RES, which contains convolutional layers only, is significantly faster than other
models and still has competitive accuracy. My solution achieved these results without
explicit alignments. The experiments are conducted on a test set, which is 9.8% and
24.9% of the whole CMUDict and NetTalk databases, respectively.
I also investigated a novel transformer architecture for the G2P task. The
Transformer 3x3 (3-layer encoder and 3-layer decoder), Transformer 4x4 (4-layer
encoder and 4-layer decoder) and Transformer 5x5 (5-layer encoder and 5-layer
decoder) architectures were presented, including experiments on CMUDict and NetTalk.
I evaluated PER and WER, and the results of the proposed models are very competitive
with previous state-of-the-art results. The number of parameters (weights) of all
proposed transformer models is lower than that of the CNN-based and recurrent models;
as a result, the training time decreased.
The same test set is used in all cases, so I consider the results comparable. To draw
conclusions on which model is the best, the goal must be defined. If inference time is
crucial, then smaller model sizes are favourable (e.g. CNN+RES), but if lower WER
and PER are the main factors, then CNN+RES_BI-LSTM and Transformer 4x4 are
superior.
Chapter 4
Text Normalization
4.1 Introduction
Text normalization is a critical step in a variety of tasks involving speech and language
technologies. It is one of the vital components of Natural Language Processing (NLP),
Text-To-Speech synthesis (TTS) and Automatic Speech Recognition (ASR).
Text normalization systems convert a written representation of a text into a
representation of how that text is to be read aloud. This task, while often
considered trivial, is vital, and its errors are one of the main reasons for the
degradation of perceived quality in, e.g., TTS systems.
Personal assistant applications, which have become popular recently, enable users to
operate mobile phones and tablets via natural language. They provide relevant
responses to users' queries by using NLP techniques. There are numerous applications
in this domain (e.g. SIRI, GoogleNow, Cortana, Robin). Normalization of the query is
an essential stage for these applications as well. For example, the processing of the
sentence “In 2008, Bloomberg L.P. was valued at approximately $22.4 billion.” is
shown in Table 4.1.
Table 4.1. Example input-output pairs for text normalization.

Input         | Output
In 2008       | In two thousand eight
,             | ,
Bloomberg     | Bloomberg
L.P.          | l p
was           | was
valued        | valued
at            | at
approximately | approximately
$22.4 billion | twenty-two point four billion dollars
.             | .
In the original written form, there are three nonstandard words, namely the date
expression “2008”, the letter sequence “L.P.” and the money expression
“$22.4 billion”. Problems with text normalization are one of the causes of impaired
perceived quality in TTS systems: if, for example, “$22.4 billion” is normalized as
“dollars twenty two point four billion”, then listeners will immediately notice the
imperfection of the machine.
A proper text normalization system must also be able to resolve ambiguities: the same
text may be pronounced in multiple ways depending on the context. For example:
- “2011-11-11” can be both a date and a digit sequence (such as a telephone number),
depending on the context. As a date it should be read as “the eleventh of November
two thousand eleven”, while as a number it should be read as “two o one one sil one
one sil one one” (where sil refers to silence).
- “727 schools”, “727 Andrey St” and “Press 727” all include “727”. In the first case
it should be normalized as “seven hundred twenty-seven”, in the second case as
“seven twenty-seven”, and in the third case as “seven two seven”.
- “5m” can be normalized as “five million”, “five minutes” or “five meters”,
depending on the context.
In this chapter, I present three sequence classification models for text normalization,
which are LSTM, Bi-LSTM and the proposed end-to-end CNN with residual
connections. For vector representations of the data, I used existing, commonly used
word embedding algorithms: the continuous bag-of-words model (CBOW) and the
skip-gram model (SG). I compared the performance of end-to-end CNN with CBOW
model and end-to-end CNN with SG in terms of macro precision, macro recall, macro
F1-score.
This chapter is structured as follows: Section 4.2 describes previous works about text
normalization; Section 4.3 presents important datasets and metrics for text
normalization; The details of the proposed models are described in Section 4.4.
Finally, conclusions are drawn in Section 4.5.
4.2 Related Works
As briefly discussed above, text normalization is an important preprocessing step for
NLP, TTS and ASR [58, 59, 60]. It is one of the essential parts of TTS synthesis
systems [61]. In numerous prior works, mostly hand-constructed rules are used, and
these rules are adjusted to specific domains of these systems [62]. Weighted finite-
state transducers (WFSTs) were also used [63, 64]. A maximum entropy-based ranking
model was applied to related problems of TTS and ASR systems, such as stress
prediction for Russian [64]. A generic approach for text normalization was presented
and applied to French, English, Spanish, Vietnamese, Khmer and Chinese in [138]; in
this work, the text normalization problem was split into a set of smaller
sub-problems that are as language-independent as possible.
The informal writing style employed by authors of social media data is problematic
for many NLP tools, which generally require clean data for training. One possible
solution to this issue is normalization, in which the informal text is converted into
a more standard, formal form. For this reason, the rise of social media data has
coincided with a rise in interest in text normalization [70]. Recently, there has
been a lot of work focusing on social media, such as Twitter [71], blog posts and SMS
message normalization [72, 73]. Research in these fields is inspired by other areas
of NLP such as morphology, spell checking, machine translation and automatic speech
recognition [72]. A hybrid approach to SMS normalization was also presented, inspired
by both the machine translation and speech recognition fields [73].
Social media is highly attractive for information extraction, text mining and opinion
mining purposes, as large volumes of data are available online and created daily. The
language used on these platforms differs significantly from formally written text in
that people do not feel obliged to write grammatically correct sentences. Generally,
they write as they talk or try to express their thoughts within a limited number of
characters.
Furthermore, other sources used an unsupervised noisy channel model, considering
different word formation processes [74]. The effect of normalization for the social
media domain is explored by [58], too. They test the effect of automatic normalization
on dependency parsing by using automatically derived parse trees of the normalized
sentences as reference. It is shown that the performance of the normalizer is directly
tied to the performance of a downstream dependency parser. Other work that uses
automatic normalization is [75], which compares the effect of lexical normalization
with machine translation on a manually annotated dependency treebank.
Language data consists of sequential information, such as streams of characters and
sequences of words, and this property shapes the computational models that NLP
approaches can effectively use. In recent years, deep learning has achieved
state-of-the-art performance in many NLP-related fields [43, 67]. Significantly more
data might be required to exploit deep learning methods; however, less linguistic
expertise is needed to train and operate such models.
In order to foster more research in this direction, a challenge with an open dataset was
published [66]. Regardless of promising results, the authors observed that recurrent
neural networks (RNNs) tend to fail in some cases in quite a weird manner - such as
translating abbreviated hours as gigabytes, or pounds as euros. To address this
problem, two different models were introduced: a bidirectional sequence-to-sequence
model and an attention-based RNN sequence-to-sequence model. For the first model, a
bidirectional Long Short-Term Memory (LSTM) network is used, and two configurations
were created: one with two forward and two backward hidden layers (shallow model) and
one with three forward and three backward hidden layers. The number of nodes in each
hidden layer was 256, and the output layer utilizes connectionist temporal
classification (CTC) [76] with a softmax function. Experiments were conducted on
English and Russian datasets.
Recently, dual encoder classifiers were investigated for text normalization in [109].
They are Siamese models that consist of a pair of encoders that encode pairs of inputs
into vectors, and a model to compute the similarity between the two inputs [110].
Training such a model requires both positive and negative examples, which were
derived from the training and test data of the 2017 Kaggle competition on text
normalization [111].
Moreover, a contextual seq2seq model, which uses a sliding window and an RNN with
attention, was proposed in [112]. In this model, a bi-directional GRU is used in both
the encoder and the decoder, and the context words are labelled with “<self>”,
helping the model to distinguish the nonstandard words (NSW) from the context. In
[113], a transformer-based spelling correction model is proposed to correct the
outputs of a connectionist temporal classification model in Mandarin ASR.
4.3 Research Methodology
4.3.1 Datasets
The data consists of 9,918,441 (9.9 million) words of English text, and it is part of
the open Kaggle dataset
(https://www.kaggle.com/c/text-normalization-challenge-english-language/data,
accessed: February 2021), which is a subset of the data used by [66]. The data is
derived from Wikipedia pages and divided into sentences; the format of the annotated
data is shown in Table 4.2: the first column is ‘sentence_id’, and each sentence has
a sentence_id; the second column is ‘token_id’, and each token within a sentence has
a token_id; the third column, ‘class’, shows the class of the given token; the
columns ‘before’ and ‘after’ give the raw token and its normalized form,
respectively.
I use <SE> and </SE> tokens as beginning-of-sentence and end-of-sentence tokens.
Moreover, each token is attributed to one of the classes and generally, there are 16
different classes. The meanings of the classes are as follows: PLAIN = ordinary word;
PUNCT = punctuation; DATE = date expression; LETTERS = letter sequence;
CARDINAL = cardinal number; VERBATIM = verbatim reading of a character sequence;
MEASURE = measure expression; ORDINAL = ordinal number; DECIMAL = decimal fraction;
MONEY = currency amount; DIGIT = digit sequence; ELECTRONIC = electronic address;
TELEPHONE = telephone number; TIME = time expression; FRACTION = non-decimal
fraction; ADDRESS = street address [66].
The data size of each class is shown separately in Table 4.3, and it is calculated as
follows:

$\mathrm{Percentage}_i = \frac{N_i}{9918441} \cdot 100\%$   (4.1)

In Equation (4.1), $N_i$ is the number of tokens in class i, shown in the second
column of Table 4.3; the sum of these numbers is 9,918,441. In the third column of
Table 4.3, the data size of each class is reported as a percentage.
Table 4.2. The format of the annotated data.

sentence_id | token_id | class | before  | after
<SE>        |          |       |         |
1           | 0        | PLAIN | I       | I
1           | 1        | PLAIN | wake    | wake
1           | 2        | PLAIN | up      | up
1           | 3        | PLAIN | at      | at
1           | 4        | TIME  | 9:00 AM | nine a m
1           | 5        | PUNCT | .       | .
</SE>       |          |       |         |
The number of tokens in PLAIN class dominates the other classes, and this class is
followed by PUNCT and DATE. TIME, FRACTION, and ADDRESS have the lowest
number of occurrences. Furthermore, if a word belongs to any of the following classes:
ADDRESS, ELECTRONIC, LETTERS, DATE, CARDINAL, VERBATIM,
MEASURE, ORDINAL, DECIMAL, MONEY, DIGIT, TELEPHONE, TIME,
FRACTION, then this word is non-standard, and it needs to be normalized. The words
in classes PLAIN and PUNCT are standard words; they do not need to be normalized.
I split the data set into 3 parts: training, development, and test. The number of
sentences, number of unnormalized tokens and number of normalized tokens for
training, development and test datasets are shown in Table 4.4.
Table 4.3. Number of tokens for each class.

Name of class | Number of tokens | Percentage of class
PLAIN         | 7353693          | 74.141
PUNCT         | 1880507          | 18.959
DATE          | 258348           | 2.6047
LETTERS       | 152795           | 1.5405
CARDINAL      | 133744           | 1.3484
VERBATIM      | 78108            | 0.7875
MEASURE       | 14783            | 0.1490
ORDINAL       | 12703            | 0.1280
DECIMAL       | 9821             | 0.0990
MONEY         | 6128             | 0.0617
DIGIT         | 5442             | 0.0548
ELECTRONIC    | 5162             | 0.0520
TELEPHONE     | 4024             | 0.0405
TIME          | 1465             | 0.0147
FRACTION      | 1196             | 0.0120
ADDRESS       | 522              | 0.0052

An unnormalized token refers to the original form of the text to be normalized (in
other words, the tokens in the "Before" column of Table 4.2); a normalized token
refers to the words that are already normalized and can be found in the "After"
column of Table 4.2.
Table 4.4. Corpus size statistics for the open dataset.

Data        | Number of sentences | Number of unnormalized tokens | Number of normalized tokens
Training    | 600040              | 8075998                       | 9053849
Development | 70560               | 970972                        | 1086032
Test        | 77468               | 1063403                       | 1188811
4.3.2 Metrics
I used accuracy, precision, recall, and F-measure as an evaluation metric and plotted
confusion matrices as heat maps. The accuracy measures the ratio of correct
predictions by each classifier over the total number of the class instances. Precision
calculates the proportion of correctly normalized words among the words for which I
applied normalization. Recall shows the number of correct normalizations over the
words that require normalization. The general metric that was considered throughout
the evaluation of the systems is F1-score, which is the harmonic mean of precision
and recall [77, 78]. In addition, macro precision (also macro recall) was calculated,
which is the average of precision (recall) of each individual class. These metrics are
reported as percentages and are calculated using the confusion matrix of each model,
as follows:

$P_i = \frac{M_{ii}}{\sum_{j=1}^{16} M_{ji}}$   (4.2)

$R_i = \frac{M_{ii}}{\sum_{j=1}^{16} M_{ij}}$   (4.3)

$F1_i = \frac{2 \cdot P_i \cdot R_i}{P_i + R_i}$   (4.4)

$P_{macro} = \frac{1}{16}\sum_{i=1}^{16} P_i$   (4.5)

$R_{macro} = \frac{1}{16}\sum_{i=1}^{16} R_i$   (4.6)

$F1_{macro} = \frac{2 \cdot P_{macro} \cdot R_{macro}}{P_{macro} + R_{macro}}$   (4.7)

where $M_{ij}$ is the (i, j)-th element of the confusion matrix of each model, i.e.
the number of tokens of class i predicted as class j; as there are 16 classes,
$1 \le i, j \le 16$. In Equation (4.2), $P_i$ is the precision of the i-th class; in
Equation (4.3), $R_i$ is the recall of the i-th class; in (4.4), $F1_i$ is the
F1-score of the i-th class; in (4.5), $P_{macro}$ is the macro precision; in (4.6),
$R_{macro}$ is the macro recall over all classes; and in (4.7), $F1_{macro}$ is the
macro F1-score.
4.4 LSTM-, Bi-LSTM- and CNN-based model for Text
Normalization
The proposed models for text normalization have two steps:
1. Identifying the class of each word (token);
2. Generating the output of each token in its fully expanded form, depending on its class.
In the first stage, I classified tokens into 16 classes. For this purpose, I created
three different models: two of them are RNN-based, and one is a novel convolutional
neural network-based model for text normalization. As input for the presented models,
I created token sequences that combine three consecutive tokens. For instance, the
inputs and outputs for “<SE> I wake up at 9:00 AM . </SE>” are shown in Table 4.5.
The class of the middle token is the output for each token sequence. All three models
were used according to the sequence classification architecture presented in Figure
4.1.
Table 4.5. Creating the inputs and outputs of “<SE> I wake up at 9:00 AM . </SE>”.

Input           | Output
<SE> I wake     | PLAIN
I wake up       | PLAIN
wake up at      | PLAIN
up at 9:00 AM   | PLAIN
at 9:00 AM .    | TIME
9:00 AM . </SE> | PUNCT
Figure 4.1. The sequence classification architecture for the text normalization task.
The sequence “at 9:00 AM .” is fed to the model. The output is “TIME”, which is the
class of the middle token, “9:00 AM”.
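A small sketch of this windowing step (illustrative helper names, not the dissertation's preprocessing code): each sentence is wrapped with <SE>/</SE>, and every window of three consecutive tokens is paired with the class of its middle token.

def make_windows(tokens, classes):
    padded = ["<SE>"] + tokens + ["</SE>"]
    windows, targets = [], []
    for i in range(1, len(padded) - 1):
        windows.append((padded[i - 1], padded[i], padded[i + 1]))
        targets.append(classes[i - 1])       # class of the middle (original) token
    return windows, targets

sentence = ["I", "wake", "up", "at", "9:00 AM", "."]
labels = ["PLAIN", "PLAIN", "PLAIN", "PLAIN", "TIME", "PUNCT"]
for window, target in zip(*make_windows(sentence, labels)):
    print(window, "->", target)              # reproduces the rows of Table 4.5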
From a large number of initial experiments, three models with the highest accuracy
and with diverse architectures were selected. A manual random search on
hyperparameters (hyperparameter optimization) was also performed. In the following
paragraphs, the three models are introduced:
LSTM based: In the first model, a unidirectional LSTM is used for the text
normalization task, and it is referred to as LSTM_TN. The LSTM reads the input one
token sequence at a time and predicts the class of the middle token. During the
training process, the correct output is given to the model, and the model is trained
by minimizing the cross-entropy loss of the correct class, given its context. During
inference, it predicts the class of the actual word, taking its context into account.
The LSTM has 1024 units, the number of time steps is 3, and a softmax activation
function is used to obtain the model predictions. The parameters of training
(optimization method, regularization, etc.) are identical to the settings used in the
case of the other models; this way, I try to ensure a fair comparison among the
models.
BI-LSTM based: In the second model a Bi-LSTM (1024 units) is introduced for the
text normalization task, and it is called BI-LSTM_TN. Bi-LSTM reads input on both
directions with two sub-layers. These sublayers compute both forward and backward
hidden sequences, which are combined to compute the output sequence.
Figure 4.2. The architecture of the end-to-end CNN with residual connections. f, d
and s are the number of filters, the length of the filters and the stride, respectively.
CNN based: The third model contains only convolutional layers with residual
connections [48, 49, 79]. I first apply one convolutional layer with 512 filters to
the input layer, followed by a stack of residual blocks. Through hyperparameter
optimization, the best result was achieved with 3 residual blocks, as shown in Figure
4.2, and the number of filters in the residual blocks is 512, 256 and 128,
respectively. Each residual block consists of two convolutional layers, each followed
by a batch normalization [50] layer and ReLU activation. The filter size of all
convolutional layers is 3. After these blocks, one more batch normalization layer and
ReLU activation follow, and the architecture ends with a fully connected layer
coupled with a softmax activation function. For this model, I analysed both the CBOW
and the SG methods for creating the vector representations of words, and the
resulting models are called CBOW_CNN and SG_CNN, respectively.
Moreover, I carried out experiments with the same models without residual
connections, but the accuracy became worse than with residual connections, as
expected.
After the sequence classification stage, in all models I apply rule-based methods
depending on each class, which help the verbalization of the word (token).
Verbalization, or standard word generation, is the process of converting
non-natural-language text into standard words or natural language. If the predicted
class of a token is PUNCT or PLAIN, verbalization is unnecessary. As a special case,
if the token “-” has the meaning “to”, then although its class is PLAIN, its
verbalization is “to”. In the other cases, I applied rules to transform the input
into its fully expanded form.
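The sketch below only illustrates the shape of this second stage; the class-specific expanders (the digit-by-digit rule, the letter-sequence rule and the simplified handling of "-") are hypothetical placeholders, not the rules actually used in this work.

def expand_digits(token):
    words = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
    return " ".join(words[int(ch)] for ch in token if ch.isdigit())   # toy digit-by-digit rule

def verbalize(token, predicted_class):
    if predicted_class in ("PLAIN", "PUNCT"):
        # simplified: the real system also checks whether "-" is used in the sense of "to"
        return "to" if token == "-" else token
    if predicted_class == "DIGIT":
        return expand_digits(token)
    if predicted_class == "LETTERS":
        return " ".join(ch.lower() for ch in token if ch.isalpha())
    # ... one expander per remaining class (DATE, CARDINAL, MONEY, MEASURE, ...)
    return token

print(verbalize("L.P.", "LETTERS"))   # -> "l p"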
4.4.1 Experiments
I used the continuous bag-of-words (CBOW) method for inducing the distributed
contextual representations for all models and also investigated the SG method for the
convolutional model. I used a contextual window size of 5 and 100 dimensions for the
embedding vectors in both methods. In all models, I applied Adam optimization [51]
with a starting learning rate of 0.001 and the default values of β1, β2 and ε (0.9,
0.999 and 1e-08, respectively). For the batch size, I chose 1000. I saved the weights
whenever the accuracy on the validation dataset reached a new peak. When the accuracy
did not improve further within 100 epochs, the best model was chosen, and it was
trained further with the stochastic gradient descent (SGD) optimizer. In the case of
the first and second models, I used a learning rate of 0.005 and a momentum of 0.8
for SGD. For the third (convolutional only) model, I chose 0.005 and 0.6 for the
learning rate and momentum, respectively.
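The CBOW and skip-gram embeddings could be trained, for example, with gensim as in the sketch below; gensim is an assumed tool here (the dissertation does not name the implementation), while window = 5 and the 100-dimensional vectors follow the settings above.

from gensim.models import Word2Vec

# `corpus` is a list of tokenized sentences from the training data
corpus = [["<SE>", "I", "wake", "up", "at", "9:00 AM", ".", "</SE>"]]
cbow_model = Word2Vec(corpus, vector_size=100, window=5, sg=0, min_count=1)   # CBOW
sg_model = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)     # skip-gram
embedding = cbow_model.wv["wake"]   # 100-dimensional vector used as model input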
4.4.2 Evaluation and Results
After training the models, I evaluated them on the test dataset in terms of accuracy,
precision, recall and F-measure, and plotted the confusion matrices as heat maps.
In Table 4.6, the overall accuracy is presented for each model; in Table 4.7, the
accuracy is calculated for each class; and in Table 4.8, precision, recall, F1-score,
macro precision, macro recall and macro F1-score are presented for each model
(LSTM_TN, BI-LSTM_TN, CBOW_CNN, SG_CNN). In Table 4.6, the first and second columns
show the model number and the applied architecture; the third and fourth columns show
the overall accuracy of each model and the average training time of one epoch; the
last column presents the models' size (with 32-bit weights).
Table 4.6. Results on the open data
The first column of Table 4.7 shows the number of tokens in each class; the other
columns show the class-based accuracy for each model (LSTM_TN, BI-LSTM_TN, CBOW_CNN,
SG_CNN) and the precision, recall and F1-score for the SG + end-to-end CNN model
(SG_CNN). According to the results, all three alternatives show similar performance,
but the third model has the fewest parameters. There are also slight differences in
the class-based accuracy in Table 4.7.
Table 4.7. Results of the open dataset for each class.

Class    | Number of tokens | LSTM (%) | BILSTM (%) | CBOW_CNN (%) | SG_CNN (%) | SG_CNN Prec. | SG_CNN Rec. | SG_CNN F1-sc.
PLAIN    | 776586           | 99.7     | 99.7       | 99.7         | 99.8       | 0.99         | 0.99        | 0.99
PUNCT    | 197716           | 99.9     | 99.9       | 99.9         | 99.9       | 0.99         | 0.99        | 0.99
DATE     | 26646            | 98.72    | 98.76      | 98.90        | 98.99      | 0.99         | 0.98        | 0.98
LETTERS  | 15885            | 80.35    | 80.50      | 79.80        | 81.22      | 0.93         | 0.81        | 0.86
CARDINAL | 13618            | 98.63    | 95.74      | 98.76        | 98.89      | 0.96         | 0.99        | 0.97
VERBATIM | 8225             | 96.53    | 96.76      | 96.89        | 97.22      | 0.98         | 0.97        | 0.97
MEASURE  | 1386             | 93.14    | 88.60      | 93.01        | 91.34      | 0.97         | 0.91        | 0.94
ORDINAL  | 1566             | 92.46    | 91.76      |