
BIDIRECTIONAL QUATERNION LONG-SHORT TERM MEMORY RECURRENT NEURAL NETWORKS FOR SPEECH RECOGNITION

Titouan Parcollet1,3, Mohamed Morchid1, Georges Linarès1, and Renato De Mori1,2

1Université d'Avignon, LIA, France
2McGill University, Montréal, Canada
3Orkis, Aix-en-Provence, France

ABSTRACT

Recurrent neural networks (RNN) are at the core of modern automatic speech recognition (ASR) systems. In particular, long-short term memory (LSTM) recurrent neural networks have achieved state-of-the-art results in many speech recognition tasks, due to their efficient representation of long and short term dependencies in sequences of inter-dependent features. Nonetheless, internal dependencies within the elements composing multidimensional features are weakly captured by traditional real-valued representations. We propose a novel quaternion long-short term memory (QLSTM) recurrent neural network that takes into account both the external relations between the features composing a sequence and these internal latent structural dependencies, by means of the quaternion algebra. QLSTMs are compared to LSTMs on a memory copy-task and on a realistic speech recognition application with the Wall Street Journal (WSJ) dataset. The QLSTM reaches better performance in both experiments with up to 2.8 times fewer learning parameters, leading to a more expressive representation of the information.

Index Terms—Quaternion long-short term memory, recurrent neural networks, speech recognition

1. INTRODUCTION

During the last decade, deep neural networks (DNN) have achieved wide success in numerous application domains. In particular, the performance of automatic speech recognition (ASR) systems has improved remarkably with the emergence of DNNs. Among them, recurrent neural networks (RNN) [1] have been shown to effectively encode input sequences, increasing the accuracy of neural-network-based ASR systems [2]. Nonetheless, vanilla RNNs suffer from vanishing/exploding gradient issues [3] and lack a memory mechanism to remember patterns in very long or short sequences. These problems have been alleviated by the introduction of the long-short term memory (LSTM) RNN [4], whose gate mechanism allows the model to update or forget information in memory cells, and to select the cell-state content to expose in the network hidden state. LSTMs have reached state-of-the-art performance in many benchmarks [4, 5], and are widely employed in recent ASR models, with almost the same acoustic input features as previous systems.

Traditional ASR systems rely on multidimensional acoustic features, such as Mel filter bank energies alongside their first and second order time derivatives, to characterize the time-frames that compose the signal sequence. Considering that these components describe three different views of the same element, neural networks have to learn both the internal relations that exist within these views and the external or global dependencies that exist between time-frames. Such concerns are partially addressed by increasing the learning capacity of neural network architectures. Nonetheless, even with a huge set of free parameters, it is not certain that both local and global dependencies are properly represented. To address this problem, new quaternion-valued neural networks, based on a higher-dimensional algebra, are proposed in this paper.

Quaternions are hyper-complex numbers that contain a real and three separate imaginary components, fitting perfectly three- and four-dimensional feature vectors, such as those used in image processing and robot kinematics [6, 7]. The idea of bundling groups of numbers into separate entities is also exploited by the recent capsule networks [8]. With quaternion numbers, LSTMs are conceived to encode latent inter-dependencies between groups of input features during the learning process with fewer parameters than real-valued LSTMs, by taking advantage of the quaternion Hamilton product as the counterpart of the dot product. Early applications of quaternion-valued backpropagation algorithms [9, 10] have shown that quaternion neural networks can efficiently approximate quaternion-valued functions. More recently, neural networks of hyper-complex numbers have received increasing attention, and some efforts have shown promising results in different applications. In particular, deep quaternion networks [11, 12], deep quaternion convolutional networks [13, 14], and a quaternion recurrent neural network [15] have been successfully employed for challenging tasks such as image, speech and language processing. For speech recognition, quaternions with only three internal features have been used in [14] to encode the input speech. An additional internal feature is proposed in this paper to obtain a richer representation with the same number of model parameters.

Based on all the above considerations, the contributions of this paper can be summarized as follows: 1) the introduction of a novel model, called bidirectional quaternion long-short term memory neural network (QLSTM)1, that avoids known RNN problems also present in quaternion RNNs, and the demonstration that QLSTMs achieve top-of-the-line results on speech recognition; 2) the introduction of a novel input quaternion that integrates four views of speech time-frames. The model is first evaluated on a synthetic memory copy-task to ensure that the introduction of quaternions into the LSTM model does not alter the basic properties of RNNs. Then, QLSTMs are compared to real-valued LSTMs on a realistic speech recognition task with the Wall Street Journal (WSJ) dataset. The reported results show that the QLSTM outperforms the LSTM on both tasks, with a higher long-memory capability on the memory task, a better generalization performance with better word error rates (WER), and up to 2.8 times fewer neural parameters than the real-valued LSTM.

1Code is available at https://github.com/Orkis-Research/Pytorch-Quaternion-Neural-Networks

2. QUATERNION ALGEBRA

The quaternion algebra H defines operations between quaternion numbers. A quaternion Q is an extension of a complex number defined in a four dimensional space as:

$$Q = r1 + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}, \qquad (1)$$

where r, x, y, and z are real numbers, and 1, i, j, and k are the quaternion unit basis. In a quaternion, r is the real part, while xi + yj + zk, with i² = j² = k² = ijk = −1, is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. The Hamilton product ⊗ between two quaternions Q1 and Q2 is computed as follows:

$$
\begin{aligned}
Q_1 \otimes Q_2 ={} & (r_1 r_2 - x_1 x_2 - y_1 y_2 - z_1 z_2) \\
 & + (r_1 x_2 + x_1 r_2 + y_1 z_2 - z_1 y_2)\,\mathbf{i} \\
 & + (r_1 y_2 - x_1 z_2 + y_1 r_2 + z_1 x_2)\,\mathbf{j} \\
 & + (r_1 z_2 + x_1 y_2 - y_1 x_2 + z_1 r_2)\,\mathbf{k}. \qquad (2)
\end{aligned}
$$

The Hamilton product is used in QLSTMs to perform transformations of vectors representing quaternions, as well as scaling and interpolation between two rotations following a geodesic over a sphere in the R³ space, as shown in [16].
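For concreteness, a minimal NumPy sketch of the Hamilton product of Eq. (2) is given below; the function name and the (r, x, y, z) array layout are illustrative choices, not part of the original paper.

```python
import numpy as np

def hamilton_product(q1, q2):
    """Hamilton product of two quaternions given as (r, x, y, z) arrays (Eq. 2)."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return np.array([
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,   # real part
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,   # i component
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,   # j component
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,   # k component
    ])

# The product is non-commutative, e.g. i * j = k but j * i = -k.
i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
print(hamilton_product(i, j))  # -> [0. 0. 0.  1.]  (= k)
print(hamilton_product(j, i))  # -> [0. 0. 0. -1.]  (= -k)
```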

3. QUATERNION LONG-SHORT TERM MEMORY NEURAL NETWORKS

Based on the quaternion algebra and on the previously described motivations, we introduce the quaternion long-short term memory (QLSTM) recurrent neural network. In a quaternion dense layer, all parameters are quaternions, including inputs, outputs, weights and biases. The quaternion algebra is ensured by manipulating matrices of real numbers [14] that reconstruct the Hamilton product. Consequently, for each input vector of size N and output vector of size M, the dimensions are split into four parts: the first one corresponds to r, the second to xi, the third to yj, and the last one to zk. The inference process of a fully-connected layer is defined in the real-valued space by the dot product between an input vector and a real-valued M×N weight matrix. In a QLSTM, this operation is replaced with the Hamilton product '⊗' (Eq. 2) with quaternion-valued matrices (i.e. each entry of the weight matrix is a quaternion).
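A rough sketch of this real-matrix construction follows (our own illustrative names, shapes and product ordering, not the authors' released code): the input tensor is split into its four r/x/y/z blocks and recombined according to Eq. (2) using four real sub-matrices of the quaternion weight matrix.

```python
import torch

def quaternion_linear(x, Wr, Wx, Wy, Wz):
    """Apply a quaternion fully-connected layer to x of shape (batch, 4*n_in).

    Wr, Wx, Wy, Wz are the (n_in, n_out) real sub-matrices of the quaternion
    weight matrix; the output (batch, 4*n_out) reproduces the Hamilton
    product of Eq. (2) component-wise (bias omitted for brevity).
    """
    r, xi, yj, zk = torch.chunk(x, 4, dim=-1)
    out_r = r @ Wr - xi @ Wx - yj @ Wy - zk @ Wz
    out_x = r @ Wx + xi @ Wr + yj @ Wz - zk @ Wy
    out_y = r @ Wy - xi @ Wz + yj @ Wr + zk @ Wx
    out_z = r @ Wz + xi @ Wy - yj @ Wx + zk @ Wr
    return torch.cat([out_r, out_x, out_y, out_z], dim=-1)

# Toy usage: batch of 3 inputs with 8 quaternions in, 5 quaternions out.
x = torch.randn(3, 4 * 8)
Ws = [torch.randn(8, 5) for _ in range(4)]
print(quaternion_linear(x, *Ws).shape)  # torch.Size([3, 20])
```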

Gates are core components of the memory of LSTMs. Based on [17], we propose to extend this mechanism to quaternion numbers. Therefore, the gate action is characterized by an independent modification of each component of the quaternion-valued signal, following a component-wise product (i.e. in a split fashion [18]) with the quaternion-valued gate potential. Let f_t, i_t, o_t, c_t, and h_t be the forget gate, input gate, output gate, cell state and hidden state of an LSTM cell at time-step t. The QLSTM equations are derived as:

$$f_t = \sigma(W_f \otimes x_t + R_f \otimes h_{t-1} + b_f), \qquad (3)$$
$$i_t = \sigma(W_i \otimes x_t + R_i \otimes h_{t-1} + b_i), \qquad (4)$$
$$c_t = f_t \times c_{t-1} + i_t \times \alpha(W_c \otimes x_t + R_c \otimes h_{t-1} + b_c), \qquad (5)$$
$$o_t = \sigma(W_o \otimes x_t + R_o \otimes h_{t-1} + b_o), \qquad (6)$$
$$h_t = o_t \times \alpha(c_t), \qquad (7)$$

with σ and α the sigmoid and tanh quaternion split activations [18, 11, 19, 10]. The quaternion weight and bias matrices are initialized following the proposal of [15]. Quaternion bidirectional connections are equivalent to real-valued ones [20]; consequently, past and future contexts are added together component-wise at each time-step. The full backpropagation of the quaternion-valued recurrent neural network can be found in [15].
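The sketch below assembles Eqs. (3)-(7) into a single forward step, reusing the illustrative quaternion_linear helper defined above; split activations simply apply the real sigmoid/tanh to every component. This is our own minimal reading of the equations, not the authors' released implementation.

```python
import torch

def qlstm_step(x_t, h_prev, c_prev, W, R, b):
    """One QLSTM time-step following Eqs. (3)-(7).

    W, R, b are dicts with keys 'f', 'i', 'o', 'c'; W[g] and R[g] hold the
    four real sub-matrices (Wr, Wx, Wy, Wz) of a quaternion weight matrix.
    """
    def gate(g, act):
        pre = (quaternion_linear(x_t, *W[g])
               + quaternion_linear(h_prev, *R[g])
               + b[g])
        return act(pre)  # split activation: applied to every component

    f_t = gate('f', torch.sigmoid)                    # Eq. (3)
    i_t = gate('i', torch.sigmoid)                    # Eq. (4)
    c_t = f_t * c_prev + i_t * gate('c', torch.tanh)  # Eq. (5)
    o_t = gate('o', torch.sigmoid)                    # Eq. (6)
    h_t = o_t * torch.tanh(c_t)                       # Eq. (7)
    return h_t, c_t

# Toy usage: 8 quaternion inputs, 8 quaternion hidden units, batch of 2.
n_in, n_h = 8, 8
W = {g: tuple(torch.randn(n_in, n_h) * 0.1 for _ in range(4)) for g in 'fioc'}
R = {g: tuple(torch.randn(n_h, n_h) * 0.1 for _ in range(4)) for g in 'fioc'}
b = {g: torch.zeros(4 * n_h) for g in 'fioc'}
h, c = torch.zeros(2, 4 * n_h), torch.zeros(2, 4 * n_h)
h, c = qlstm_step(torch.randn(2, 4 * n_in), h, c, W, R, b)
print(h.shape)  # torch.Size([2, 32])
```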

4. EXPERIMENTS

This section provides the results of QLSTMs and LSTMs on the synthetic memory copy-task (Section 4.1), and a description of the quaternion acoustic features (Section 4.2) that are used as inputs during the realistic speech recognition experiment with the Wall Street Journal (WSJ) corpus (Section 4.3).

4.1. Synthetic memory copy-task as a sanity check

The copy-task, originally introduced by [21], is a synthetic test that highlights how RNN-based models manage long-term memory. This characteristic makes the copy-task a powerful benchmark to demonstrate that a recurrent model can learn long-term dependencies. It consists of an input sequence of length L, composed of S different symbols, followed by a sequence of time-lags or blanks of size T, and ended by a delimiter that announces the beginning of the copy operation (after which the initial input sequence should be progressively reconstructed at the output). In this paper, the copy-task is used as a sanity check to ensure that the introduction of quaternions into LSTM models does not harm the basic memorization abilities of the LSTM. The QLSTM is composed of 8K parameters with one hidden layer of size 20, while the LSTM is made of 8.2K parameters with a hidden dimension of 40 neurons. It is worth underlining that, due to the nature of the task, the output layer of the QLSTM is real-valued: 9 symbols are one-hot encoded (S = 0, ..., 7 for the sequence and 8 for the blank) and cannot be split into four components. Different values of T = 10, 50, 100 are investigated alongside a fixed sequence size of L = 10. Models are trained with the Adam optimizer, with an initial learning rate λ = 5·10⁻³, and without any regularization method. The training is performed over 2,000 epochs with the cross-entropy loss. At each epoch, models are fed with a batch of 10 randomly generated sequences.
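As an illustration of the task layout described above, a batch of copy-task sequences can be generated as follows. This is our own sketch: in particular, how the delimiter is encoded is not specified in the paper, so a separate delimiter index is used here purely for illustration.

```python
import numpy as np

def copy_task_batch(batch=10, L=10, T=100, n_symbols=8):
    """Generate a copy-task batch: L random symbols, T blanks, a delimiter,
    then L more blanks. The target is blank everywhere except the last L
    steps, where the initial L symbols must be reproduced."""
    blank, delim = n_symbols, n_symbols + 1
    seq = np.random.randint(0, n_symbols, size=(batch, L))
    inputs = np.full((batch, L + T + 1 + L), blank, dtype=np.int64)
    inputs[:, :L] = seq          # the sequence to memorize
    inputs[:, L + T] = delim     # delimiter announcing the copy operation
    targets = np.full_like(inputs, blank)
    targets[:, -L:] = seq        # reconstruct the sequence at the end
    return inputs, targets

x, y = copy_task_batch()
print(x.shape, y.shape)  # (10, 121) (10, 121)
```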

Fig. 1. Evolution of the cross-entropy loss and of the accuracy of both the QLSTM (blue curves) and the LSTM (orange curves) during the synthetic memory copy-task, for time-lags or blanks T of 10, 50 and 100.

The results reported in Fig. 1 highlight a slightly faster convergence of the QLSTM over the LSTM for all sizes T. It is also worth noticing that the real-valued LSTM failed the copy-task with T = 100 while the QLSTM succeeded. This is easily explained by the impact of quaternion numbers during the learning of inter-dependencies between input features. Indeed, the QLSTM is a smaller (fewer parameters) but more efficient (dealing with higher dimensions) model than the real-valued LSTM, resulting in a higher generalization capability: 20 quaternion neurons are equivalent to 20 × 4 = 80 real-valued ones. Overall, the introduction of quaternions in LSTMs does not alter their basic properties, but it provides a higher capability to learn long-term dependencies. We hypothesize that such efficiency improvements, alongside a dedicated input representation, will help QLSTMs to outperform LSTMs on more realistic tasks, such as speech recognition.

4.2. Quaternion acoustic features

Unlike in [14], this paper proposes to use four internal features in an input quaternion. The raw audio is first split every 10 ms with a window of 25 ms. Then, 40-dimensional log Mel-filter-bank coefficients with first, second, and third order derivatives are extracted using the pytorch-kaldi2 toolkit and the Kaldi s5 recipes [2]. An acoustic quaternion Q(f, t) associated with a frequency band f and a time-frame t is formed as follows:

$$Q(f,t) = e(f,t) + \frac{\partial e(f,t)}{\partial t}\mathbf{i} + \frac{\partial^2 e(f,t)}{\partial t^2}\mathbf{j} + \frac{\partial^3 e(f,t)}{\partial t^3}\mathbf{k}. \qquad (8)$$

Q(f, t) represents multiple views of a frequency band f at time-frame t, consisting of the energy e(f, t) in the filter band at frequency f, its first time derivative describing a slope view, its second time derivative describing a concavity view, and its third derivative describing the rate of change of the second derivative. Quaternions are used to construct latent representations of the external relations between the views characterizing the contents of frequency bands at given time intervals. Thus, the quaternion input vector length is 160/4 = 40. Decoding is based on Kaldi [2] and weighted finite state transducers (WFST) that integrate acoustic, lexicon and language model probabilities into a single HMM-based search graph.
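For intuition, a minimal sketch of assembling such quaternion features from a precomputed log Mel-filter-bank matrix is shown below, using simple frame-difference derivatives and a [r | i | j | k] block layout; the actual pipeline in the paper relies on pytorch-kaldi and the Kaldi recipes, so this helper is only illustrative.

```python
import numpy as np

def quaternion_acoustic_features(logmel):
    """Build quaternion features from a (frames, 40) log Mel matrix (Eq. 8).

    Each frequency band contributes (energy, d1, d2, d3): the energy and its
    first three time derivatives, approximated here by frame differences.
    Output shape: (frames, 160), grouped as [all r | all i | all j | all k]
    so that a quaternion layer can split it into four parts.
    """
    d1 = np.gradient(logmel, axis=0)   # slope view
    d2 = np.gradient(d1, axis=0)       # concavity view
    d3 = np.gradient(d2, axis=0)       # rate of change of the concavity
    return np.concatenate([logmel, d1, d2, d3], axis=1)

feats = quaternion_acoustic_features(np.random.randn(100, 40))
print(feats.shape)  # (100, 160) -> 40 quaternions per frame
```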

4.3. Speech recognition with the Wall Street Journal

QLSTMs and LSTMs are trained on both the 14-hour subset 'train-si84' and the full 81-hour set 'train-si284' of the Wall Street Journal (WSJ) corpus. The 'test-dev93' development set is employed for validation, while 'test-eval92' composes the testing set. It is important to note that the evaluated LSTMs and QLSTMs are bidirectional. Model architectures vary in both number of layers and neurons: the number of recurrent layers L varies from three to four, while the number of neurons N ranges from 256 to 1,024. Then, one dense layer is stacked alongside an output dense layer. It is also worth noticing that the number of quaternion units of a QLSTM layer is N/4. Indeed, QLSTM neurons are four-dimensional (i.e. a QLSTM layer that deals with a dimension of size 1,024 has 1,024/4 = 256 effective quaternion neurons).

2pytorch-kaldi is available at https://github.com/mravanelli/pytorch-kaldi

Table 1. Word error rates (WER %) obtained with both training sets (WSJ14h and WSJ81h) of the Wall Street Journal corpus. 'test-dev93' and 'test-eval92' are used as the validation and testing sets respectively. L denotes the number of recurrent layers. Models are bidirectional. Results are an average of three runs.

Models            WSJ14 Dev.  WSJ14 Test  WSJ81 Dev.  WSJ81 Test  Params
R-LSTM-3L-256     12.7        8.6         9.5         6.5         4.0M
H-QLSTM-3L-256    12.8        8.5         9.4         6.5         2.3M
R-LSTM-4L-256     12.1        8.3         9.3         6.4         4.8M
H-QLSTM-4L-256    11.9        8.0         9.1         6.2         2.5M
R-LSTM-3L-512     11.1        7.1         8.2         5.2         12.2M
H-QLSTM-3L-512    10.9        6.9         8.1         5.1         5.6M
R-LSTM-4L-512     11.3        7.0         8.1         5.0         15.5M
H-QLSTM-4L-512    11.1        6.8         8.0         4.9         6.5M
R-LSTM-3L-1024    11.4        7.3         7.6         4.8         41.2M
H-QLSTM-3L-1024   11.0        6.9         7.4         4.6         15.5M
R-LSTM-4L-1024    11.2        7.2         7.4         4.5         53.7M
H-QLSTM-4L-1024   10.9        6.9         7.2         4.3         18.7M

Models are optimized with Adam, with vanilla hyper-parameters and an initial learning rate of 5·10⁻⁴. The learning rate is progressively annealed with a halving factor of 0.5, applied when no performance improvement is observed on the validation set. The models are trained for 15 epochs. All the models converged to a minimum loss, thanks to the annealed learning rate. Results are averaged over three runs.
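This annealing schedule corresponds to a standard reduce-on-plateau policy; a minimal PyTorch sketch is shown below. The dummy model and the placeholder validation metric are our own stand-ins, not the paper's pytorch-kaldi training loop.

```python
import torch

# Dummy model stands in for the bidirectional QLSTM acoustic model.
model = torch.nn.Linear(160, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

# Halve the learning rate whenever the monitored validation metric stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5)

for epoch in range(15):
    # ... training pass over the corpus would go here ...
    dev_metric = 1.0 / (epoch + 1)        # placeholder for the dev-set loss / WER
    scheduler.step(dev_metric)            # anneal based on validation performance
    print(epoch, optimizer.param_groups[0]['lr'])
```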

First, it is important to notice that the results reported in Table 1 compare favorably with equivalent architectures [5] (WER of 11.7% on 'test-dev93'), and are competitive with state-of-the-art and much more complex models based on better engineered features [22] (WER of 3.8% with the 81 hours of training data, on 'test-eval92'). Table 1 shows that the proposed QLSTM always outperforms the real-valued LSTM on the test dataset with fewer neural parameters. On the smallest 14-hour subset, a best WER of 6.9% is reported in realistic conditions (i.e. w.r.t. the best validation-set results) with a three-layered QLSTM of size 512, compared to 7.1% for an LSTM of the same size. It is worth mentioning that a best WER of 6.8% is obtained with a four-layered QLSTM of size 512, but without consideration for the validation results. Such performances are obtained with a 2.2-times reduction of the number of parameters: 5.6M parameters for the QLSTM compared to 12.2M for the real-valued equivalent.

This is easily explained by considering the quaternion algebra. Indeed, for a fully-connected layer with 2,048 input values and 2,048 hidden units, a real-valued RNN has 2,048² ≈ 4.2M parameters, while, to maintain equal input and output dimensions, the quaternion equivalent has 512 quaternion inputs and 512 quaternion hidden units. Therefore, the number of parameters of the quaternion-valued model is 512² × 4 ≈ 1M. Such a complexity reduction turns out to produce better results, and has other advantages such as a smaller memory footprint when saving models on memory-constrained systems. This reduction allows the QLSTM to make the memory more "compact", and therefore the relations between quaternion components are more robust to unseen documents from both the validation and testing datasets.
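The counting argument above can be checked with a couple of lines (weights only, biases ignored):

```python
n = 2048                    # real-valued layer: n inputs, n hidden units
real_params = n * n         # 2048 * 2048 = 4,194,304  (~4.2M)

q = n // 4                  # quaternion layer: 512 quaternion inputs / units
quat_params = 4 * q * q     # four real sub-matrices of size 512 x 512 (~1M)

print(real_params, quat_params, real_params / quat_params)  # ratio = 4.0
```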

This compactness makes the QLSTM model particularly suitable for speech recognition conducted on devices with low computational power, such as smartphones. Both QLSTMs and LSTMs produce better results with the 81 hours of training data. As with the smaller subset, QLSTMs always outperform LSTMs during both the validation and testing phases. Indeed, a best WER of 4.3% is reported for a four-layered QLSTM of dimension 1,024, while the best LSTM reaches 4.5% with 2.9 times more parameters and an equivalently sized architecture.

5. CONCLUSION

This paper proposes to process sequences of traditional multidimensional acoustic features with a novel quaternion long-short term memory neural network (QLSTM). The paper first introduces a novel quaternion-valued representation of the speech signal to better handle dependencies within signal sequences, and then an LSTM composed of quaternions to represent, in the hidden latent space, the inter-dependencies between quaternion features. The proposed model has been evaluated on a synthetic memory copy-task and on a more realistic speech recognition task with the large Wall Street Journal (WSJ) dataset. The reported results support the initial intuitions by showing that QLSTMs are more effective at learning both longer dependencies and a compact representation of multidimensional acoustic speech features, outperforming standard real-valued LSTMs in both experiments with up to 2.8 times fewer neural parameters. Therefore, as for other quaternion-valued architectures, the intuition that the quaternion algebra of the QLSTM offers a better and more compact representation of multidimensional features, alongside a better capability to learn internal dependencies of features through the Hamilton product, has been validated.

6. REFERENCES

[1] Larry R. Medsker and Lakhmi J. Jain, "Recurrent neural networks," Design and Applications, vol. 5, 2001.

[2] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Dec. 2011, IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRW-USB.

[3] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "On the difficulty of training recurrent neural networks," in International Conference on Machine Learning, 2013, pp. 1310–1318.

[4] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2017.

[5] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.

[6] Stephen John Sangwine, "Fourier transforms of colour images using quaternion or hypercomplex numbers," Electronics Letters, vol. 32, no. 21, pp. 1979–1980, 1996.

[7] Nicholas A Aspragathos and John K Dimitros, "A comparative study of three methods for robot kinematics," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 28, no. 2, pp. 135–145, 1998.

[8] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton, "Dynamic routing between capsules," arXiv preprint arXiv:1710.09829v2, 2017.

[9] Paolo Arena, Luigi Fortuna, Luigi Occhipinti, and Maria Gabriella Xibilia, "Neural networks for quaternion-valued function approximation," in Circuits and Systems, ISCAS'94, IEEE International Symposium on. IEEE, 1994, vol. 6, pp. 307–310.

[10] Paolo Arena, Luigi Fortuna, Giovanni Muscato, and Maria Gabriella Xibilia, "Multilayer perceptrons to approximate quaternion valued functions," Neural Networks, vol. 10, no. 2, pp. 335–342, 1997.

[11] Titouan Parcollet, Mohamed Morchid, Pierre-Michel Bousquet, Richard Dufour, Georges Linarès, and Renato De Mori, "Quaternion neural networks for spoken language understanding," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 362–368.

[12] Titouan Parcollet, Mohamed Morchid, and Georges Linarès, "Quaternion denoising encoder-decoder for theme identification of telephone conversations," Proc. Interspeech 2017, pp. 3325–3328, 2017.

[13] Chase J Gaudet and Anthony S Maida, "Deep quaternion networks," in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–8.

[14] Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato De Mori, and Yoshua Bengio, "Quaternion convolutional neural networks for end-to-end automatic speech recognition," in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, 2018, pp. 22–26.

[15] Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Renato De Mori, and Yoshua Bengio, "Quaternion recurrent neural networks," 2018.

[16] Toshifumi Minemoto, Teijiro Isokawa, Haruhiko Nishimura, and Nobuyuki Matsui, "Feed forward neural network with random quaternionic neurons," Signal Processing, vol. 136, pp. 59–68, 2017.

[17] Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves, "Associative long short-term memory," arXiv preprint arXiv:1602.03032, 2016.

[18] D Xu, L Zhang, and H Zhang, "Learning algorithms in quaternion neural networks using GHR calculus," Neural Network World, vol. 27, no. 3, pp. 271, 2017.

[19] Titouan Parcollet, Mohamed Morchid, and Georges Linarès, "Deep quaternion neural networks for spoken language understanding," in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017, pp. 504–511.

[20] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764–1772.

[21] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[22] William Chan and Ian Lane, "Deep recurrent neural networks for acoustic modelling," arXiv preprint arXiv:1504.01482, 2015.