Speech Recognition
with Quaternion Neural Networks
Titouan Parcollet (1,4), Mirco Ravanelli (2), Mohamed Morchid (1),
Georges Linarès (1), Renato De Mori (1,3)

(1) LIA, Université d'Avignon, France
(2) MILA, Université de Montréal, Québec, Canada
(3) McGill University, Québec, Canada
(4) Orkis, Aix-en-Provence, France

titouan.parcollet@alumni.univ-avignon.fr
mirco.ravanelli@gmail.com
firstname.lastname@univ-avignon.fr
rdemori@cs.mcgill.ca
Abstract
Neural network architectures are at the core of powerful automatic speech recognition systems (ASR). However, while recent research focuses on novel model architectures, the acoustic input features remain almost unchanged. Traditional ASR systems rely on multidimensional acoustic features, such as the Mel filter bank energies along with their first- and second-order derivatives, to characterize the time-frames that compose the signal sequence. Considering that these components describe three different views of the same element, neural networks have to learn both the internal relations that exist within these features and the external or global dependencies that exist between the time-frames. Quaternion-valued neural networks (QNNs) have recently received significant interest from researchers for processing and learning such relations in multidimensional spaces. Indeed, quaternion numbers and QNNs have shown their efficiency at processing multidimensional inputs as entities, at encoding internal dependencies, and at solving many tasks with up to four times fewer learning parameters than real-valued models. We propose to investigate modern quaternion-valued models, such as convolutional and recurrent quaternion neural networks, in the context of speech recognition with the TIMIT dataset. The experiments show that QNNs always outperform equivalent real-valued models with far fewer free parameters, leading to a more efficient, compact, and expressive representation of the relevant information.
1 Introduction
During the last decade, deep neural networks (DNNs) have achieved great success in automatic speech recognition. Many architectures, such as recurrent (RNN) [34, 15, 1, 31, 13], time-delay (TDNN) [39, 28], or convolutional neural networks (CNN) [42], have been proposed and achieve better performance than traditional hidden Markov models (HMM) combined with Gaussian mixture models (GMM) on different speech recognition tasks. However, despite this evolution of models and paradigms, the acoustic feature representation remains almost the same. The acoustic signal is commonly split into time-frames, for which Mel-filter-bank energies or Mel-frequency cepstral coefficients (MFCC) [8] are extracted, along with their first- and second-order derivatives. In fact, time-frames are characterized by 3-dimensional features whose components represent three different views of the same basic element. Consequently, an efficient neural network-based model has to learn both the external dependencies between time-frames and the internal relations within the features. Traditional real-valued architectures deal with both dependencies at the same level, due to the lack of a dedicated mechanism for learning the internal and external relations separately.
Quaternions are hypercomplex numbers that contain a real and three separate imaginary components, perfectly fitting three- and four-dimensional feature vectors, such as those used in image processing and robot kinematics [35, 29, 4]. The idea of bundling groups of numbers into separate entities is also exploited by the recent capsule networks [33]. Contrary to traditional homogeneous representations, capsule and quaternion neural networks bundle sets of features together. Thereby, quaternion numbers allow neural models to code latent inter-dependencies between groups of input features during the learning process, with up to four times fewer parameters than real-valued neural networks, by taking advantage of the Hamilton product as the equivalent of the dot product between quaternions. Early applications of quaternion-valued backpropagation algorithms [3, 2] efficiently solved quaternion function approximation tasks. More recently, neural networks of complex and hypercomplex numbers have received increasing attention [16, 38, 7, 40], and some efforts have shown promising results in different applications. In particular, deep quaternion networks [24, 25, 37] and deep quaternion convolutional networks [5, 27] have been successfully employed for challenging tasks such as image and language processing.
Contributions: This paper proposes to evaluate previously investigated quaternion-valued models in two different realistic speech recognition settings, to see whether the quaternion encoding of the signal, along with the quaternion algebra and the large reduction in parameters, helps to better capture the nature of the acoustic signal, leading to a more expressive representation of the information. Based on the TIMIT [10] phoneme recognition task, a quaternion convolutional neural network (QCNN) is compared to a real-valued CNN in an end-to-end framework, and a quaternion recurrent neural network (QRNN) is compared to an RNN within a more traditional HMM-based system. In the end-to-end approach, the experiments show that the QCNN outperforms the CNN with a phoneme error rate (PER) of 19.5%, against 20.6% for the CNN. Moreover, the QRNN outperforms the RNN with a PER of 18.5%, against 19.0% for the RNN. Furthermore, these results are obtained with up to 3.96 times fewer neural network parameters.
2 Motivations
A major challenge for current machine learning models is to obtain efficient representations of the information relevant to solving a specific task. Consequently, a good model has to efficiently code both the relations that occur at the feature level, such as those between the Mel filter energies and the first- and second-order derivative values of a single time-frame, and those at a global level, such as phonemes or words described by a group of time-frames. Moreover, to avoid overfitting, generalize better, and be more efficient, such models also have to be as small as possible. Nonetheless, real-valued neural networks usually require a huge set of parameters to perform well on speech recognition tasks, and hardly code internal dependencies within the features, since these are considered at the same level as global dependencies during learning. In the following, we detail the motivations for employing quaternion-valued neural networks instead of real-valued ones to code inter- and intra-feature dependencies with fewer parameters.
First, a better representation of multidimensional data has to be explored to naturally capture the internal relations within the input features. For example, an efficient way to represent the information composing an acoustic signal sequence is to consider each time-frame as a whole entity of three strongly related elements, instead of a group of unidimensional elements that could be related to each other, as in traditional real-valued neural networks. Indeed, with a real-valued NN, the latent relations between the Mel-filter-bank energies and the first- and second-order derivatives of a given time-frame are hardly coded in the latent space, since the weights have to discover these relations among all the time-frames composing the sequence. Quaternions are four-dimensional entities and allow one to build and process elements made of up to 4 components, mitigating the problem described above. Indeed, the quaternion algebra, and more precisely the Hamilton product, allows quaternion neural networks to capture these internal latent relations within the features of a quaternion. It has been shown that QNNs are able to restore the spatial relations within 3D coordinates [20] and within color pixels [17], where real-valued NNs failed. In fact, the quaternion weight components are shared across multiple quaternion input parts during the Hamilton product, creating relations within the elements. Indeed, Figure 1 shows that, in a real-valued layer (left), the multiple weights required to code latent relations within a feature are considered at the same level as those learning global relations between different features, whereas the quaternion weight w codes these internal relations within a unique quaternion Qout during the Hamilton product (right).

Figure 1: Illustration of the ability of a quaternion-valued layer (right) to learn latent relations within the input features (Qin), thanks to the quaternion weight sharing of the Hamilton product (Eq. 5), compared to a standard real-valued layer (left).
Second, quaternion neural networks make it possible to deal with the same signal dimension as real-valued NNs, but with 4 times fewer neural parameters. Indeed, a 4-number quaternion weight linking two 4-number quaternion units has only 4 degrees of freedom, whereas a standard neural network parametrization has 4 x 4 = 16, i.e., a 4-fold saving in memory. Therefore, the natural multidimensional representation of quaternions, along with their ability to drastically reduce the number of parameters, indicates that hypercomplex numbers are a better fit than real numbers for building more efficient models for multidimensional tasks such as speech recognition.
Indeed, modern automatic speech recognition systems usually employ input sequences composed of multidimensional acoustic features, such as log Mel features, that are often enriched with their first, second and third time derivatives [8, 9] to integrate contextual information. In standard NNs, static features are simply concatenated with their derivatives to form a large input vector, without effectively considering that the signal derivatives represent different views of the same input. Nonetheless, it is crucial to consider that these descriptors represent particular states of a time-frame and are thus correlated. Following the above motivations and the results observed in previous works on quaternion neural networks, we hypothesize that for acoustic data, quaternion NNs naturally provide a more suitable representation of the input sequence, since these multiple views can be directly embedded in the dimensions of a quaternion, leading to smaller and more accurate models.
3 Quaternion Neural Networks
Real-valued neural network architectures are extended to the quaternion domain to benefit from its capacities. Therefore, this section introduces the quaternion algebra (Section 3.1), the quaternion internal representation (Section 3.2), quaternion convolutional neural networks (QCNN, Section 3.3) and quaternion recurrent neural networks (QRNN, Section 3.4).
3.1 Quaternion Algebra
The quaternion algebra H defines operations between quaternion numbers. A quaternion Q is an extension of a complex number defined in a four-dimensional space as:

    Q = r1 + xi + yj + zk,    (1)

where r, x, y, and z are real numbers, and 1, i, j, and k are the quaternion unit basis. In a quaternion, r is the real part, while xi + yj + zk, with i^2 = j^2 = k^2 = ijk = -1, is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. The information embedded in the quaternion Q can be summarized into the following matrix of real numbers, which turns out to be more suitable for computations:

    Q_mat = [ r  -x  -y  -z ]
            [ x   r  -z   y ]
            [ y   z   r  -x ]
            [ z  -y   x   r ].    (2)

The conjugate Q* of Q is defined as:

    Q* = r1 - xi - yj - zk.    (3)

Then, a normalized or unit quaternion Q^◁ is expressed as:

    Q^◁ = Q / sqrt(r^2 + x^2 + y^2 + z^2).    (4)

Finally, the Hamilton product ⊗ between two quaternions Q1 and Q2 is computed as follows:

    Q1 ⊗ Q2 = (r1 r2 - x1 x2 - y1 y2 - z1 z2)
            + (r1 x2 + x1 r2 + y1 z2 - z1 y2) i
            + (r1 y2 - x1 z2 + y1 r2 + z1 x2) j
            + (r1 z2 + x1 y2 - y1 x2 + z1 r2) k.    (5)

The Hamilton product is used in QRNNs to perform transformations of vectors representing quaternions, as well as scaling and interpolation between two rotations following a geodesic over a sphere in the R^3 space, as shown in [21].
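For illustration, the Hamilton product of Eq. 5 can be written in a few lines of Python. This is a minimal sketch using numpy with our own naming, not the authors' released implementation:

    import numpy as np

    def hamilton_product(q1, q2):
        # Hamilton product (Eq. 5) of two quaternions stored as [r, x, y, z].
        r1, x1, y1, z1 = q1
        r2, x2, y2, z2 = q2
        return np.array([
            r1*r2 - x1*x2 - y1*y2 - z1*z2,  # real part
            r1*x2 + x1*r2 + y1*z2 - z1*y2,  # i component
            r1*y2 - x1*z2 + y1*r2 + z1*x2,  # j component
            r1*z2 + x1*y2 - y1*x2 + z1*r2,  # k component
        ])

    # The product is not commutative, which is what lets a quaternion weight
    # mix the four components of its input in a structured way:
    q1, q2 = np.array([1., 2., 3., 4.]), np.array([5., 6., 7., 8.])
    assert not np.allclose(hamilton_product(q1, q2), hamilton_product(q2, q1))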
3.2 Quaternion internal representation
In a quaternion layer, all parameters are quaternions, including inputs, outputs, weights, and biases. The quaternion algebra is ensured by manipulating matrices of real numbers [27]. Consequently, for each input vector of size N and output vector of size M, the dimensions are split into four parts: the first one equals r, the second is xi, the third equals yj, and the last one equals zk, so as to compose a quaternion Q = r1 + xi + yj + zk. In the real-valued space, the inference process is based on the dot product between the input features and the weight matrices. In any quaternion-valued NN, this operation is replaced with the Hamilton product (Eq. 5) with quaternion-valued matrices (i.e., each entry in the weight matrix is a quaternion).
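As a concrete, unofficial sketch of this internal representation, the function below applies a quaternion-valued affine layer by operating on the four real blocks of the input vector; it amounts to one Hamilton product (Eq. 5) per weight entry, expressed with the real representation of Eq. 2. The names and storage layout are our assumptions; the authors' reference implementation is in the repository linked in Section 3.3.

    import numpy as np

    def quaternion_linear(inputs, W, bias):
        # `inputs` stores N quaternions as four stacked real blocks [r|x|y|z];
        # `W` packs the four real component matrices (M x N each) of the
        # quaternion weight matrix. The four output blocks follow Eq. 5.
        r, x, y, z = np.split(inputs, 4)
        Wr, Wx, Wy, Wz = W
        out_r = Wr @ r - Wx @ x - Wy @ y - Wz @ z
        out_x = Wr @ x + Wx @ r + Wy @ z - Wz @ y
        out_y = Wr @ y - Wx @ z + Wy @ r + Wz @ x
        out_z = Wr @ z + Wx @ y - Wy @ x + Wz @ r
        return np.concatenate([out_r, out_x, out_y, out_z]) + bias

Note that only four M x N real matrices are stored instead of an unconstrained (4M) x (4N) one, which is exactly the 4-fold parameter saving discussed in Section 2.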
3.3 Quaternion convolutional neural networks
Convolutional neural networks (CNNs) [19] have been proposed to capture the high-level relations that occur between neighbouring features, such as shapes and edges in an image. However, real-valued CNNs consider the internal dependencies within the features at the same level as these high-level relations, and it is thus not guaranteed that they are well captured. To this end, a quaternion convolutional neural network (QCNN) has been proposed by [5, 27]^1. Let γ^l_ab and S^l_ab be the quaternion output and the pre-activation quaternion output at layer l and at the indexes (a, b) of the new feature map, and w the quaternion-valued weight filter map of size K x K. A formal definition of the convolution process is:

    γ^l_ab = α(S^l_ab),    (6)

with

    S^l_ab = Σ_{c=0}^{K-1} Σ_{d=0}^{K-1} w^l ⊗ γ^{l-1}_{(a+c)(b+d)},    (7)
where α is a quaternion split activation function [41] defined as:

    α(Q) = f(r) + f(x)i + f(y)j + f(z)k,    (8)

with f corresponding to any standard activation function. The output layer of a quaternion neural network is commonly either quaternion-valued, as for quaternion approximation [2], or real-valued, to obtain a posterior distribution based on a softmax function following the split approach of Eq. 8. Indeed, target classes are often expressed as real numbers. Finally, the full derivation of the backpropagation algorithm for quaternion-valued neural networks can be found in [23].

^1 https://github.com/Orkis-Research/Pytorch-Quaternion-Neural-Networks
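A possible sketch of the quaternion convolution of Eqs. 6-8, written with scipy for readability rather than speed; the dict-of-components layout is our own convention, and each output component is a sum of four real 2D correlations arranged according to the Hamilton product:

    import numpy as np
    from scipy.signal import correlate2d

    def quaternion_conv2d(fmap, w, f=np.tanh):
        # fmap and w hold the 'r', 'x', 'y', 'z' real parts of a quaternion
        # feature map and of one K x K quaternion filter (Eq. 7).
        c = lambda a, k: correlate2d(a, k, mode='same')
        s = {  # pre-activation S^l_ab, one Hamilton product per position
            'r': c(fmap['r'], w['r']) - c(fmap['x'], w['x'])
               - c(fmap['y'], w['y']) - c(fmap['z'], w['z']),
            'x': c(fmap['x'], w['r']) + c(fmap['r'], w['x'])
               + c(fmap['z'], w['y']) - c(fmap['y'], w['z']),
            'y': c(fmap['y'], w['r']) - c(fmap['z'], w['x'])
               + c(fmap['r'], w['y']) + c(fmap['x'], w['z']),
            'z': c(fmap['z'], w['r']) + c(fmap['y'], w['x'])
               - c(fmap['x'], w['y']) + c(fmap['r'], w['z']),
        }
        # Split activation of Eq. 8: apply f independently to each component.
        return {k: f(v) for k, v in s.items()}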
3.4 Quaternion recurrent neural networks
Despite the fact that CNNs are efficient at detecting and learning patterns in an input volume, recurrent neural networks (RNNs) are better suited to representing sequential data. Indeed, recurrent neural networks have obtained state-of-the-art results on many tasks related to speech recognition [32, 12]. Therefore, a quaternion version of the RNN, called QRNN, has been proposed by [26]^2. Let us define a QRNN with a hidden state composed of H neurons. Then, let w_hh, w_hγ, and w_γh be the hidden-to-hidden, input-to-hidden, and hidden-to-output weight matrices respectively, and b^l_n be the bias at neuron n and layer l. Therefore, and with the same parameters as for the QCNN, the hidden state h^{t,l}_n of the neuron n at timestep t and layer l can be computed as:

    h^{t,l}_n = α( Σ_{m=0}^{H} w^{t,l}_{nm,hh} ⊗ h^{t-1,l}_m
                 + Σ_{m=0}^{N^{l-1}} w^{t,l}_{nm,hγ} ⊗ γ^{t,l-1}_m + b^l_n ),    (9)

with α any split activation function. Finally, the output of the neuron n is computed following:

    γ^{t,l}_n = β( Σ_{m=0}^{N^{l-1}} w^{t,l}_{nm,γh} ⊗ h^{t,l-1}_m + b^l_n ),    (10)

with β any split activation function. The full derivation of the backpropagation through time of the QRNN can be found in [26].

^2 https://github.com/Orkis-Research/Pytorch-Quaternion-Neural-Networks
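To make the recurrence concrete, here is a minimal, unofficial sketch of one QRNN timestep in the spirit of Eqs. 9 and 10, reusing the quaternion_linear function sketched in Section 3.2; computing the output from the current hidden state and using tanh for both split activations α and β are our simplifying assumptions:

    import numpy as np

    def qrnn_step(x_t, h_prev, W_hx, W_hh, W_yh, b_h, b_y):
        # Eq. 9: hidden state from the current input and the previous hidden
        # state, each transformed by a quaternion-valued affine map.
        h_t = np.tanh(quaternion_linear(x_t, W_hx, b_h)
                      + quaternion_linear(h_prev, W_hh, 0.0))
        # Eq. 10: output of the layer at timestep t (split activation beta).
        y_t = np.tanh(quaternion_linear(h_t, W_yh, b_y))
        return h_t, y_t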
4 Experiments on TIMIT
Quaternion-valued models are compared to their real-valued equivalents on two different benchmarks based on the TIMIT phoneme recognition task [10]. First, an end-to-end approach based on QCNNs compared to CNNs is investigated in Section 4.2. Then, a more traditional and powerful method, based on QRNNs compared to RNNs along with HMM decoding, is explored in Section 4.3. In both experiments, training is performed on the standard 3,696 sentences uttered by 462 speakers, while testing is conducted on 192 sentences uttered by 24 speakers of the TIMIT dataset. A validation set composed of 400 sentences uttered by 50 speakers is used for hyper-parameter tuning. All results are averaged over 3 runs (3 folds) to alleviate variations due to the random initialization.
4.1 Acoustic quaternions
The end-to-end and HMM-based experiments share the same quaternion input vectors, extracted from the acoustic signal. The raw audio is first transformed into 40-dimensional log Mel-filter-bank coefficients using the pytorch-kaldi^3 toolkit and the Kaldi s5 recipes [30]. Then, the first-, second-, and third-order derivatives are extracted. Consequently, an acoustic quaternion Q(f, t) associated with a frequency f and a time-frame t is formed as:

    Q(f, t) = e(f, t) + (∂e(f, t)/∂t) i + (∂^2 e(f, t)/∂t^2) j + (∂^3 e(f, t)/∂t^3) k.    (11)

Q(f, t) represents multiple views of a frequency f at time-frame t, consisting of the energy e(f, t) in the filter band at frequency f, its first time derivative describing a slope view, its second time derivative describing a concavity view, and the third derivative describing the rate of change of the second derivative. Quaternions are used to learn the spatial relations that exist between the different views that characterize a same frequency. Thus, the quaternion input vector length is 160/4 = 40.

^3 https://github.com/mravanelli/pytorch-kaldi
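The construction of Eq. 11 can be sketched as follows; np.gradient is used here as a simple stand-in for the Kaldi delta features actually used in the experiments:

    import numpy as np

    def acoustic_quaternions(log_mel):
        # log_mel: (T, 40) log Mel-filter-bank energies, one row per frame.
        d1 = np.gradient(log_mel, axis=0)  # slope view
        d2 = np.gradient(d1, axis=0)       # concavity view
        d3 = np.gradient(d2, axis=0)       # rate of change of the concavity
        # 40 quaternions per frame, stored as four stacked real blocks,
        # i.e. 160 real values that feed 40 quaternion input neurons.
        return np.concatenate([log_mel, d1, d2, d3], axis=1)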
4.2 Toward end-to-end phoneme recognition
End-to-end systems are at the heart of modern research in the speech recognition domain [42]. The task is particularly difficult due to the mismatch between the raw or pre-processed acoustic signal used as input features and the words or phonemes expected at the output. Indeed, the two are not defined on the same time scale, and an automatic alignment method has to be defined. This section evaluates the QCNN against a traditional CNN in an end-to-end model based on the connectionist temporal classification (CTC) method, to see whether the quaternion encoding of the signal, along with the quaternion algebra, helps to better capture the nature of the acoustic signal and therefore to generalize better.
4.2.1 Connectionist Temporal Classification
In the acoustic modeling part of ASR systems, the task of sequence-to-sequence mapping from an input acoustic signal X = [x_1, ..., x_n] to a sequence of symbols T = [t_1, ..., t_m] is complex because:

- X and T can have arbitrary lengths;
- the alignment between X and T is unknown in most cases.

In particular, T is usually shorter than X in terms of phoneme symbols. To alleviate these problems, connectionist temporal classification (CTC) has been proposed [11]. First, a softmax is applied at each timestep, or frame, providing a probability of emitting each symbol at that timestep. This results in a symbol sequence representation P(O|X), with O = [o_1, ..., o_n] in the latent space O. A blank symbol '-' is introduced as an extra label to allow the classifier to deal with the unknown alignment. Then, O is transformed into the final output sequence with a many-to-one function g(O) defined as follows:

    g(z_1, z_2, -, z_3, -) = g(z_1, z_2, z_3, z_3, -) = g(z_1, -, z_2, z_3, z_3) = (z_1, z_2, z_3).    (12)
Consequently, the output sequence is a summation over the probability of all possible alignments between X and T after applying the function g(O). Following [11], the parameters of the model are learned by minimizing the loss function:

    - Σ_{(X,T) ∈ train} log(P(O|X)).    (13)

During inference, a best-path decoding algorithm is performed: the latent sequence with the highest probability is obtained by taking the argmax of the softmax output at each timestep, and the final sequence is obtained by applying the function g(.) to this latent sequence.
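A minimal sketch of this decoding step, with the many-to-one map g(.) of Eq. 12; taking index 0 as the blank symbol is our assumption:

    import numpy as np

    BLANK = 0  # assumed index of the blank symbol '-'

    def ctc_collapse(path):
        # g(.) of Eq. 12: merge consecutive repeats, then drop blanks.
        out, prev = [], None
        for s in path:
            if s != prev and s != BLANK:
                out.append(s)
            prev = s
        return out

    def best_path_decode(posteriors):
        # posteriors: (T, num_symbols) softmax outputs; take the argmax at
        # each timestep, then apply g(.) to the latent sequence.
        return ctc_collapse(np.argmax(posteriors, axis=1).tolist())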
4.2.2 Model Architectures
A first 2D convolutional layer is followed by a max-pooling layer along the frequency axis to reduce the internal dimension. Then, n = 10 2D convolutional layers are included, together with 3 dense layers of sizes 1,024 and 256 for the real- and quaternion-valued models respectively. Indeed, the output of a dense quaternion-valued layer has 256 x 4 = 1,024 nodes and is therefore 4 times larger than its number of quaternion units. The filters are rectangular of size (3, 5), and padding is applied to keep the sequence and signal sizes unaltered. The number of feature maps varies from 32 to 256 for the real-valued models and from 8 to 64 for the quaternion-valued models. Indeed, the number of output feature maps is 4 times larger in the QCNN due to the quaternion convolution, meaning that 32 quaternion-valued feature maps (FM) correspond to 128 real-valued ones. Therefore, for a fair comparison, the number of feature maps is reported in the real-valued space (e.g., 256 real-valued FM correspond to 256/4 = 64 quaternion-valued neurons). The PReLU activation function is employed for both models [14]. A dropout of 0.2 and an L2 regularization of 1e-5 are used across all layers, except the input and output ones. CNNs and QCNNs are trained with the RMSprop optimizer with vanilla hyperparameters [18] for 100 epochs. The learning rate starts at 8·10^-4 and is decayed by a factor of 0.5 every time the results observed on the validation set do not improve. Quaternion parameters, including weights and biases, are initialized following the adapted quaternion initialization scheme provided in [26]. Finally, the standard CTC loss function defined in [11] and implemented in [6] is applied.
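For reference, here is a hedged PyTorch sketch of the real-valued convolutional stack described above. The dense classifier and CTC head are omitted, the exact configuration is our reading of the text rather than the authors' released code, and the quaternion variant would replace nn.Conv2d with the quaternion convolution of the repository cited in Section 3.3:

    import torch.nn as nn

    def build_cnn_stack(fm=256, n_conv=10):
        # First 2D convolution, then max-pooling along the frequency axis.
        layers = [nn.Conv2d(1, fm, kernel_size=(3, 5), padding=(1, 2)),
                  nn.PReLU(),
                  nn.MaxPool2d(kernel_size=(3, 1))]  # assumes dim -2 = frequency
        # n = 10 further convolutional layers with (3, 5) filters; the
        # padding keeps the time and frequency sizes unaltered.
        for _ in range(n_conv):
            layers += [nn.Conv2d(fm, fm, kernel_size=(3, 5), padding=(1, 2)),
                       nn.PReLU(),
                       nn.Dropout(0.2)]
        return nn.Sequential(*layers)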
4.2.3 Results and discussions
End-to-end results for the QCNN and CNN are reported in Table 1. In agreement with our hypothesis, one may notice an important difference in the number of learning parameters between the real- and quaternion-valued CNNs. The explanation comes from the quaternion algebra. A dense layer with 1,024 input values and 1,024 hidden units contains 1,024^2 ≈ 1M parameters, while the quaternion equivalent needs 256^2 x 4 ≈ 0.26M parameters to deal with the same signal dimension. Such a reduction in the number of parameters has multiple positive impacts on the model: first, a smaller memory footprint for embedded and limited devices; second, and as demonstrated in Table 1, a better generalization ability leading to better performance. Indeed, the best PER observed in realistic conditions (w.r.t. the development PER) is 19.5% for the QCNN compared to 20.6% for the CNN, an absolute improvement of 1.1% for the QCNN. These results are obtained with 32.1M parameters for the CNN and only 8.1M for the QCNN, a 3.96x reduction in the number of parameters.
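The arithmetic behind this comparison can be checked directly:

    # Dense layer parameter count, biases ignored: a real 1,024 -> 1,024 layer
    # versus its quaternion equivalent (256 quaternion units, i.e. the same
    # 1,024 real values, with 4 real numbers per quaternion weight).
    real_params = 1024 * 1024         # 1,048,576  (~1.0M)
    quat_params = 4 * 256 * 256       # 262,144    (~0.26M)
    print(real_params / quat_params)  # -> 4.0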
Table 1: Phoneme error rate (PER%) of the CNN and QCNN models on the development and test sets of the TIMIT dataset. "Params" stands for the total number of trainable parameters, and "FM" for the feature-map size.

    Models   FM    Dev.   Test   Params
    CNN      32    22.0   23.1    3.4M
             64    19.6   20.7    5.4M
             128   19.6   20.8   11.5M
             256   19.0   20.6   32.1M
    QCNN     32    22.3   23.3    0.9M
             64    19.9   20.5    1.4M
             128   18.9   19.9    2.9M
             256   18.2   19.5    8.1M
It is worth noticing that, with far fewer learning parameters for a given architecture, the QCNN always performs better than the real-valued one. Consequently, the quaternion-valued convolutional approach offers an alternative to traditional real-valued end-to-end models that is both more efficient and more accurate. However, due to the higher number of computations involved in the Hamilton product and to the lack of properly engineered implementations, the QCNN takes about twice as long as the CNN to train. Nonetheless, this behavior can be alleviated with a dedicated CUDA implementation of the Hamilton product. Indeed, this operation is a matrix product and can thus benefit from the parallel computation of GPUs; a proper implementation of the Hamilton product would lead to a higher and more efficient usage of GPUs.
4.3 HMM-based phoneme recognition
A conventional ASR pipeline based on an HMM decoding process along with recurrent neural networks is also investigated, to reach state-of-the-art results on the TIMIT task. While the input features remain the same as in the end-to-end experiments, RNNs and QRNNs are trained to predict HMM states that are then decoded with the standard Kaldi recipes [30]. As hypothesized for the end-to-end solution, QRNN models are expected to generalize better than RNNs due to their specific algebra.
4.3.1 Model Architectures
RNN and QRNN models are compared with a fixed number of layers M = 4 and by varying the number of neurons N from 256 to 2,048 for the RNN, and from 64 to 512 for the QRNN. Indeed, as demonstrated in the previous experiments, hidden neurons in the quaternion and real spaces do not carry the same number of real values. Tanh activations are used across all layers, except for the output layer, which is based on a softmax function. Models are optimized with RMSprop [18] with vanilla hyper-parameters and an initial learning rate of 8·10^-4. The learning rate is progressively annealed using a halving factor of 0.5, applied whenever no performance improvement on the validation set is observed. The models are trained for 25 epochs. A dropout rate of 0.2 is applied over all hidden layers [36] except the output one. The negative log-likelihood loss function is used as the objective function. As for the QCNNs, the quaternion parameters are initialized following [26]. Finally, decoding is based on Kaldi [30] and weighted finite-state transducers (WFST) [22], which integrate acoustic, lexicon, and language model probabilities into a single HMM-based search graph.
4.3.2 Results and discussions
The results of the QRNNs and RNNs, along with the HMM decoding phase, are presented in Table 2. A best test PER of 18.5% is reported for the QRNN, compared to 19.0% for the RNN, with respect to the best development PER. These results are obtained with 3.8M and 9.4M parameters for the QRNN and RNN respectively, at an equal hidden dimension of 1,024, i.e., a reduction of the number of parameters by a factor of 2.47. As in the previous experiments, the QRNN always outperforms the equivalent architecture in terms of PER, with significantly fewer learning parameters. It is also important to notice that both models tend to overfit with larger architectures. However, this phenomenon is attenuated by the small number of free parameters of the QRNN: a QRNN with a hidden dimension of 2,048 has only 11.2M parameters, compared to 33.4M for an equivalently sized RNN, leading to fewer degrees of freedom and therefore less overfitting.
Table 2: Phoneme error rate (PER%) of the QRNN and RNN models on the development and test sets of the TIMIT dataset. "Params" stands for the total number of trainable parameters, and "Hidden dim." for the dimension of the hidden layers. As an example, a QRNN layer with "Hidden dim." = 512 is equivalent to 128 hidden quaternion neurons.

    Models   Hidden dim.   Dev.   Test   Params
    RNN      256           22.4   23.4    1.0M
             512           19.6   20.4    2.8M
             1,024         17.9   19.0    9.4M
             2,048         20.0   20.7   33.4M
    QRNN     256           23.6   23.9    0.6M
             512           19.2   20.1    1.4M
             1,024         17.4   18.5    3.8M
             2,048         17.5   18.7   11.2M
The reported results show that the QRNN is a better framework than the real-valued RNN for ASR systems dealing with conventional multidimensional acoustic features. Indeed, the QRNN performs better and with fewer parameters, leading to a more efficient representation of the information.
5 Conclusion
Summary. This paper investigates novel quaternion-valued architectures in two different speech recognition settings on the TIMIT phoneme recognition task. The experiments show that the quaternion approaches always outperform their real-valued equivalents in both benchmarks, with up to 3.96 times fewer learning parameters. It has been shown that the appropriate multidimensional quaternion representation of the acoustic features, together with the Hamilton product, helps QCNNs and QRNNs to learn both the internal and external relations that exist within the features, leading to a better generalization capability and to a more efficient representation of the relevant information, with significantly fewer free parameters than traditional real-valued neural networks.

Future Work. Future investigations will develop multi-view features that help to decrease ambiguities in representing phonemes in the quaternion space. To this end, a recent approach based on a quaternion Fourier transform to create quaternion-valued signals has to be investigated.
References

[1] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4277-4280. IEEE, 2012.

[2] Paolo Arena, Luigi Fortuna, Giovanni Muscato, and Maria Gabriella Xibilia. Multilayer perceptrons to approximate quaternion valued functions. Neural Networks, 10(2):335-342, 1997.

[3] Paolo Arena, Luigi Fortuna, Luigi Occhipinti, and Maria Gabriella Xibilia. Neural networks for quaternion-valued function approximation. In Circuits and Systems, ISCAS'94, IEEE International Symposium on, volume 6, pages 307-310. IEEE, 1994.

[4] Nicholas A. Aspragathos and John K. Dimitros. A comparative study of three methods for robot kinematics. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 28(2):135-145, 1998.

[5] Chase Gaudet and Anthony Maida. Deep quaternion networks. arXiv preprint arXiv:1712.04604v2, 2017.

[6] François Chollet et al. Keras. https://github.com/keras-team/keras, 2015.

[7] Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short-term memory. arXiv preprint arXiv:1602.03032, 2016.

[8] Steven B. Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In Readings in Speech Recognition, pages 65-74. Elsevier, 1990.

[9] Sadaoki Furui. Speaker-independent isolated word recognition based on emphasized spectral dynamics. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'86, volume 11, pages 1991-1994. IEEE, 1986.

[10] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 1993.

[11] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369-376. ACM, 2006.

[12] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645-6649. IEEE, 2013.

[13] Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222-2232, 2017.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.

[15] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82-97, 2012.

[16] Akira Hirose and Shotaro Yoshida. Generalization characteristics of complex-valued feedforward neural networks in relation to signal coherence. IEEE Transactions on Neural Networks and Learning Systems, 23(4):541-551, 2012.

[17] Teijiro Isokawa, Tomoaki Kusakabe, Nobuyuki Matsui, and Ferdinand Peper. Quaternion neural network and its application. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 318-324. Springer, 2003.

[18] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319-345. Springer, 1999.

[20] Nobuyuki Matsui, Teijiro Isokawa, Hiromi Kusamichi, Ferdinand Peper, and Haruhiko Nishimura. Quaternion neural network with geometrical operators. Journal of Intelligent & Fuzzy Systems, 15(3-4):149-164, 2004.

[21] Toshifumi Minemoto, Teijiro Isokawa, Haruhiko Nishimura, and Nobuyuki Matsui. Feed forward neural network with random quaternionic neurons. Signal Processing, 136:59-68, 2017.

[22] Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69-88, 2002.

[23] Tohru Nitta. A quaternary version of the back-propagation algorithm. In Neural Networks, 1995. Proceedings, IEEE International Conference on, volume 5, pages 2753-2756. IEEE, 1995.

[24] Titouan Parcollet, Mohamed Morchid, Pierre-Michel Bousquet, Richard Dufour, Georges Linarès, and Renato De Mori. Quaternion neural networks for spoken language understanding. In Spoken Language Technology Workshop (SLT), 2016 IEEE, pages 362-368. IEEE, 2016.

[25] Titouan Parcollet, Mohamed Morchid, and Georges Linarès. Deep quaternion neural networks for spoken language understanding. In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE, pages 504-511. IEEE, 2017.

[26] Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Chiheb Trabelsi, Renato De Mori, and Yoshua Bengio. Quaternion recurrent neural networks, 2018.

[27] Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato De Mori, and Yoshua Bengio. Quaternion convolutional neural networks for end-to-end automatic speech recognition. arXiv preprint arXiv:1806.07789, 2018.

[28] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[29] Soo-Chang Pei and Ching-Min Cheng. Color image processing by using binary quaternion-moment-preserving thresholding technique. IEEE Transactions on Image Processing, 8(5):614-628, 1999.

[30] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011.

[31] Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, and Yoshua Bengio. Improving speech recognition by revising gated recurrent units. Proc. Interspeech 2017, 2017.

[32] Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, and Yoshua Bengio. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92-102, 2018.

[33] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. arXiv preprint arXiv:1710.09829v2, 2017.

[34] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[35] Stephen John Sangwine. Fourier transforms of colour images using quaternion or hypercomplex numbers. Electronics Letters, 32(21):1979-1980, 1996.

[36] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

[37] Titouan Parcollet, Mohamed Morchid, and Georges Linarès. Quaternion denoising encoder-decoder for theme identification of telephone conversations. Proc. Interspeech 2017, pages 3325-3328, 2017.

[38] Mark Tygert, Joan Bruna, Soumith Chintala, Yann LeCun, Serkan Piantino, and Arthur Szlam. A mathematical motivation for complex-valued convolutional networks. Neural Computation, 28(5):815-825, 2016.

[39] Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. Phoneme recognition using time-delay neural networks. In Readings in Speech Recognition, pages 393-404. Elsevier, 1990.

[40] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880-4888, 2016.

[41] D. Xu, L. Zhang, and H. Zhang. Learning algorithms in quaternion neural networks using GHR calculus. Neural Network World, 27(3):271, 2017.

[42] Ying Zhang, Mohammad Pezeshki, Philémon Brakel, Saizheng Zhang, César Laurent, Yoshua Bengio, and Aaron Courville. Towards end-to-end speech recognition with deep convolutional neural networks. arXiv preprint arXiv:1701.02720, 2017.