
Real to H-space Encoder for Speech Recognition

Titouan Parcollet¹,², Mohamed Morchid¹, Georges Linarès¹, Renato De Mori³

¹Avignon Université, LIA, France
²ORKIS, Aix-en-Provence, France
³McGill University, Montréal, QC, Canada

titouan.parcollet@alumni.univ-avignon.fr, {firstname.lastname}@univ-avignon.fr

Abstract

Deep neural networks (DNNs), and more precisely recurrent neural networks (RNNs), are at the core of modern automatic speech recognition systems, due to their efficiency in processing input sequences. Recently, it has been shown that input representations based on multidimensional algebras, such as complex and quaternion numbers, provide neural networks with a more natural, compact, and powerful representation of the input signal, outperforming common real-valued NNs. Indeed, quaternion-valued neural networks (QNNs) better learn both internal dependencies, such as the relation between the Mel-filter-bank value of a specific time frame and its time derivatives, and global dependencies describing the relations that exist between time frames. Nonetheless, QNNs are limited to quaternion-valued input signals, and it is difficult to benefit from this powerful representation with real-valued input data. This paper tackles this weakness by introducing a real-to-quaternion encoder that allows QNNs to process any one-dimensional input features, such as traditional Mel-filter-banks for automatic speech recognition.

Index Terms: quaternion neural networks, recurrent neural networks, speech recognition

1. Introduction

Automatic speech recognition (ASR) systems have been widely impacted by machine learning, and more precisely by the resurgence of deep neural networks (DNNs). In particular, recurrent neural networks (RNNs) have been designed to learn parameters of sequence-to-sequence mappings, and various models have been successfully applied to ASR with a remarkable increase in ASR system performance. In order to avoid parameter estimation problems, RNNs with long short-term memory [1, 2] and gated recurrent units (GRU) [3] have been proposed to mitigate vanishing and exploding gradients when learning long input sequences. Nevertheless, less attention has been paid to modeling input features with multiple views of speech spectral tokens.

A noticeable exception is the use of complex-valued numbers in neural networks (CVNNs) to jointly represent amplitude and phase of spectral samples [4]. More recently, quaternion-valued neural networks (QNNs) have been investigated to process the traditional Mel-frequency cepstral coefficients (MFCCs), or Mel-filter-banks plus time derivatives [5, 6, 7], as composed entities. Superior accuracy, with up to four times fewer model parameters, has been observed with quaternion-valued models compared to results obtained with equivalent real-valued models. In fact, common real-valued neural networks process energies and time derivatives independently, learning both global dependencies between multiple time frames and local values in a specific time frame, without considering the relations between a value and its derivatives. Instead, the quaternion algebra allows QNNs [5, 6, 8, 9, 10, 11] to process time frames as composed entities, with internal relations learned within the specific algebra and global dependencies learned by the neural network architecture, while reducing the number of neural parameters by an important factor. Nonetheless, QNN input features must be encoded as quaternion numbers, requiring a preliminary definition of input views that cannot be modified by the learning process. In many cases, it may be advantageous to have multiple input feature views, but there may be different choices of them and it is not clear how to make a selection; examples could be views based on temporal or spectral relations. In this paper, a real-to-quaternion encoder (R2H) is proposed to let a quaternion-valued neural architecture learn hidden representations of input feature views. The R2H layer acts as an encoder to train QNNs with any real-valued input vector. Indeed, this encoder allows the model to learn, in an end-to-end architecture, a latent quaternion-valued representation of the input data. This representation is then used as an input to a quaternion-valued classifier, exploiting the capabilities of quaternion neural networks. To achieve this objective, the contributions of this paper are:

• Investigate different real-to-quaternion (R2H) encoders to learn an internal representation of any real-valued input data (Section 4).

• Merge the R2H encoder with the previously introduced quaternion long short-term memory neural network (QLSTM, Section 3) [6]¹.

• Evaluate this approach on the TIMIT and Librispeech speech recognition tasks (Section 5).

¹Code is available at: https://github.com/Orkis-Research/Pytorch-Quaternion-Neural-Networks

Improvements on both the TIMIT and Librispeech speech recognition tasks are reported with the introduction of an R2H encoder in a QLSTM architecture, with inputs made of 40 Mel-filter-bank coefficients and with more than three times fewer neural parameters than real-valued LSTMs.

2. Quaternion algebra

The quaternion algebra ℍ defines operations between quaternion numbers. A quaternion Q is an extension of a complex number to the hyper-complex plane, defined in a four-dimensional space as:

Q = r1 + xi + yj + zk,  (1)

where r, x, y, and z are real numbers, and 1, i, j, and k are the quaternion unit basis. In a quaternion, r is the real part, while xi + yj + zk, with i² = j² = k² = ijk = −1, is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. A quaternion Q can also be summarized into the following matrix of real numbers, which turns out to be more suitable for computations:

Q_mat = [  r  −x  −y  −z
           x   r  −z   y
           y   z   r  −x
           z  −y   x   r ].  (2)

The conjugate Q* of Q is defined as:

Q* = r1 − xi − yj − zk.  (3)

Then, a normalized or unit quaternion Q^◁ is expressed as:

Q^◁ = Q / |Q|,  (4)

with |Q| the norm of Q defined as:

|Q| = √(r² + x² + y² + z²).  (5)

Finally, the Hamilton product ⊗ between two quaternions Q1 and Q2 is computed as follows:

Q1 ⊗ Q2 = (r1r2 − x1x2 − y1y2 − z1z2)
        + (r1x2 + x1r2 + y1z2 − z1y2) i
        + (r1y2 − x1z2 + y1r2 + z1x2) j
        + (r1z2 + x1y2 − y1x2 + z1r2) k.  (6)

The Hamilton product is used in QNNs to perform transformations of vectors representing quaternions, as well as scaling and interpolation between two rotations following a geodesic over a sphere in the ℝ³ space, as shown in [12].
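As an illustration, the following minimal NumPy sketch implements Eqs. (4)–(6); the function names are ours and not taken from the paper's released code:

```python
import numpy as np

def hamilton_product(q1, q2):
    """Hamilton product (Eq. 6) of two quaternions stored as [r, x, y, z]."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return np.array([
        r1*r2 - x1*x2 - y1*y2 - z1*z2,
        r1*x2 + x1*r2 + y1*z2 - z1*y2,
        r1*y2 - x1*z2 + y1*r2 + z1*x2,
        r1*z2 + x1*y2 - y1*x2 + z1*r2,
    ])

def normalize(q, eps=1e-12):
    """Unit quaternion Q^◁ = Q / |Q| (Eqs. 4-5)."""
    return q / (np.sqrt(np.sum(q**2)) + eps)

# Example: the product of two unit quaternions is again a unit quaternion.
q = normalize(np.array([1.0, 2.0, 3.0, 4.0]))
p = normalize(np.array([0.5, -1.0, 0.0, 2.0]))
print(np.linalg.norm(hamilton_product(q, p)))  # ~1.0
```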

3. Quaternion long short-term memory neural networks

Long short-term memory neural networks (LSTMs) are a well-known and well-investigated extension of recurrent neural networks [1, 13]. LSTMs offer an elegant solution to the vanishing and exploding gradient problems, alongside a stronger capability to learn long and short-term dependencies within sequences. Building on these strengths, a quaternion long short-term memory neural network (QLSTM) has been proposed [5].

In a quaternion-valued layer, all parameters are quaternions, including inputs, outputs, weights and biases. The quaternion algebra is ensured by manipulating matrices of real numbers [7, 11] to reconstruct the Hamilton product. Consequently, for each input vector of size N and output vector of size M, the dimensions are split into four parts: the first corresponds to r, the second to xi, the third to yj, and the last to zk. The inference process of a fully-connected layer is defined in the real-valued space by the dot product between an input vector and a real-valued M×N weight matrix. In a QLSTM, this operation is replaced with the Hamilton product '⊗' (Eq. 6) with quaternion-valued matrices (i.e. each entry of the weight matrix is a quaternion).
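As a minimal sketch of this real-matrix view, the quaternion fully-connected operation can be written as one real matrix product built from the four weight components, following Eq. (2); the layout and names below are illustrative, not the released implementation:

```python
import torch

def quaternion_linear(x, w_r, w_x, w_y, w_z):
    """Hamilton product of a quaternion weight matrix with a quaternion
    input vector, realized as a single real matrix product.

    x: real tensor of size 4*N laid out as [r | x | y | z] parts.
    w_r, w_x, w_y, w_z: the four real (M x N) weight components.
    """
    # Block matrix encoding the Hamilton product, mirroring Eq. (2).
    W = torch.cat([
        torch.cat([w_r, -w_x, -w_y, -w_z], dim=1),
        torch.cat([w_x,  w_r, -w_z,  w_y], dim=1),
        torch.cat([w_y,  w_z,  w_r, -w_x], dim=1),
        torch.cat([w_z, -w_y,  w_x,  w_r], dim=1),
    ], dim=0)          # (4M x 4N)
    return W @ x       # real tensor of size 4*M
```

Note that the four components are shared across the blocks, which is why a quaternion layer needs roughly four times fewer free parameters than a real layer of the same real-valued size.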

Both LSTM and QLSTM networks rely on a gate mechanism [14] that allows the cell state to retain or discard information from the past, and from the future in the case of a bidirectional (Q)LSTM. Gates are defined in the quaternion space following [5]: the gate mechanism implies a component-wise product of the quaternion-valued signal with the gate potential, in a split manner [15]. Let f_t, i_t, o_t, c_t, and h_t be the forget, input and output gates, the cell state and the hidden state of a QLSTM cell at time step t. The QLSTM equations are defined as:

f_t = σ(W_f ⊗ x_t + R_f ⊗ h_{t−1} + b_f),  (7)
i_t = σ(W_i ⊗ x_t + R_i ⊗ h_{t−1} + b_i),  (8)
c_t = f_t × c_{t−1} + i_t × α(W_c ⊗ x_t + R_c ⊗ h_{t−1} + b_c),  (9)
o_t = σ(W_o ⊗ x_t + R_o ⊗ h_{t−1} + b_o),  (10)
h_t = o_t × α(c_t),  (11)

with σ and α the quaternion split Sigmoid and Tanh activations [15, 9].
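A sketch of one QLSTM step following Eqs. (7)–(11), reusing the `quaternion_linear` helper above; the parameter container is an assumption of ours:

```python
import torch

def split_sigmoid(q):
    # Split activation: the real-valued function is applied to each
    # quaternion component independently [15, 9].
    return torch.sigmoid(q)

def split_tanh(q):
    return torch.tanh(q)

def qlstm_step(x_t, h_prev, c_prev, params):
    """One QLSTM time step. `params[g]` is assumed to hold the four
    components of W_g, the four components of R_g, and a bias b_g,
    for each gate g in {'f', 'i', 'c', 'o'}."""
    def preact(g):
        Wr, Wx, Wy, Wz, Rr, Rx, Ry, Rz, b = params[g]
        return (quaternion_linear(x_t, Wr, Wx, Wy, Wz)
                + quaternion_linear(h_prev, Rr, Rx, Ry, Rz) + b)

    f_t = split_sigmoid(preact('f'))                     # Eq. (7)
    i_t = split_sigmoid(preact('i'))                     # Eq. (8)
    c_t = f_t * c_prev + i_t * split_tanh(preact('c'))   # Eq. (9)
    o_t = split_sigmoid(preact('o'))                     # Eq. (10)
    h_t = o_t * split_tanh(c_t)                          # Eq. (11)
    return h_t, c_t
```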

Bidirectional connections allow (Q)LSTM networks to consider past and future information at a specific time step, enabling the model to capture a more global context [2]. Quaternion bidirectional connections are identical to real-valued ones: past and future contexts are added together component-wise at each time step.

An adapted initialization scheme for quaternion neural network parameters has been proposed in [6]. In practice, biases are set to zero while weights are sampled following a Chi distribution with four degrees of freedom. Finally, QRNNs require a specific backpropagation algorithm, detailed in [6].
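A simplified sketch of such an initialization is given below; the exact variance criterion and scaling are specified in [6], so the `sigma` used here is only a Glorot-like placeholder:

```python
import numpy as np

def quaternion_init(n_in, n_out, rng=np.random.default_rng()):
    """Sample quaternion weights: magnitudes follow a Chi distribution
    with 4 degrees of freedom, combined with a random unit imaginary
    axis and a random phase (simplified from [6])."""
    sigma = 1.0 / np.sqrt(2.0 * (n_in + n_out))  # placeholder scale
    shape = (n_out, n_in)
    magnitude = np.sqrt(rng.chisquare(df=4, size=shape)) * sigma
    # Random purely-imaginary unit quaternion acting as a rotation axis.
    axis = rng.normal(size=(3,) + shape)
    axis /= np.linalg.norm(axis, axis=0, keepdims=True)
    theta = rng.uniform(-np.pi, np.pi, size=shape)
    w_r = magnitude * np.cos(theta)
    w_x, w_y, w_z = magnitude * np.sin(theta) * axis
    return w_r, w_x, w_y, w_z
```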

4. R2H encoder

As mentioned in the introduction, having input features represented by quaternions requires predefining a number of views for the same input token. This prevents the use of quaternion networks when prior knowledge suggests using multiple views whose number and type cannot be exactly defined. For example, it is known that time relations between a speech spectrum and its neighbor spectra may improve the classification of the phoneme whose utterance produced the spectrum. Nevertheless, these relations may not be limited to time derivatives of all spectral samples in a spoken sentence. To overcome this limitation, a new method is proposed. It consists in introducing a real-valued encoder directly connected to the real-valued input signal. The real-to-quaternion (R2H) encoder is trained jointly with the rest of the model in an end-to-end manner, like any other layer. After training, the encoder is expected to provide a mapping from the real space of the input features to a latent internal representation meaningful for the following quaternion layers. The trained model is thus able to deal directly with real-valued input features, while internally processing quaternion numbers.

The R2H encoder is a traditional dense layer followed by a quaternion activation function and a normalization. The number of neurons contained in the layer must be a multiple of four for the quaternion representation. Let W, X and B be the weight matrix, the real-valued input and the bias vector respectively. Q^◁_out is the unit quaternion vector obtained at the output of the projection layer and is expressed as:

Q^◁_out = Q_out / |Q_out|,  (12)

with

Q_out = α(W·X + B),  (13)

and α any quaternion split activation function. In practice, Q_out and Q^◁_out follow the quaternion internal representation defined in Section 3. Consequently, the dense layer output is split into four feature groups from a latent sub-space, interpreted as quaternion components: the first corresponds to r, the second to xi, the third to yj, and the last to zk, making it possible to apply the quaternion normalization and the activation function. At the end of training, Q^◁_out captures an internal latent mapping of the real-valued input signal X through a vector of unit quaternions. Adding the R2H encoder as an input layer to QLSTMs, or to any other QNN, allows the model to deal with real-valued inputs while retaining the strengths of QNNs (Figure 1).

Figure 1: Illustration of the R2H encoder, used as an input layer to a QLSTM. Inputs are real, before being turned into quaternions, and finally unit quaternions within the R2H encoder.
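A minimal PyTorch sketch of the R2H encoder described above (Eqs. 12–13); the class name and layout are ours, with the four quaternion components stored as contiguous quarters of the output vector:

```python
import torch
import torch.nn as nn

class R2HEncoder(nn.Module):
    """Real-valued dense layer, split activation, then quaternion
    normalization (Eq. 12). `out_features` must be a multiple of four,
    with layout [r | x | y | z]."""
    def __init__(self, in_features, out_features, activation=torch.tanh):
        super().__init__()
        assert out_features % 4 == 0
        self.linear = nn.Linear(in_features, out_features)
        self.activation = activation

    def forward(self, x, eps=1e-12):
        q = self.activation(self.linear(x))                 # Eq. (13)
        r, i, j, k = torch.chunk(q, 4, dim=-1)              # quaternion parts
        norm = torch.sqrt(r**2 + i**2 + j**2 + k**2) + eps  # per-quaternion |Q|
        return torch.cat([r / norm, i / norm, j / norm, k / norm], dim=-1)

# e.g. R2HEncoder(40, 1024) maps 40 Mel-filter-bank coefficients
# to 256 unit quaternions feeding the first QLSTM layer.
```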

5. Experiments

The model architectures used for the experiments are presented in Section 5.1. Then, the R2H encoder is compared to the traditional, naive quaternion representation on the TIMIT and Librispeech speech recognition tasks (Sections 5.2 and 5.3).

5.1. Model architectures

QLSTMs have already been investigated for speech recognition in [5] and [6]. Consequently, and based on these previous studies, the QLSTMs are composed of four bidirectional QLSTM layers with an internal real-valued size of 1,024, equivalent to 256 quaternion neurons (256 × 4 = 1,024 real numbers). The R2H encoder size varies from 256 to 1,024 to explore the best latent quaternion representation. Tanh, HardTanh and ReLU activation functions are investigated to compare the impact of bounded (Tanh, HardTanh) and unbounded (ReLU) R2H encoders. In fact, the quaternion normalization bounds the numerical range of the internal representation, but the ReLU counteracts this effect by injecting large positive real values into the encoding. The final layer is real-valued and corresponds to the HMM states obtained with the Kaldi toolkit [16]. A dropout of 0.2 is applied across all layers except the output. The Adam optimizer [17] is used to train the models with vanilla hyperparameters. The learning rate is halved each time the improvement of the loss on the validation set falls below a threshold fixed at 0.001, to avoid overfitting. Finally, the models are implemented with the PyTorch-Kaldi toolkit [18]. While the effectiveness of QLSTMs over LSTMs has already been demonstrated, an LSTM network trained under the same conditions and based on [5] is considered as a baseline. All models are trained for 30 epochs, and the results on both the validation and test sets are saved at this point.
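The halving rule can be sketched as follows; this is our reading of the schedule, and the exact criterion is the one implemented in the PyTorch-Kaldi toolkit [18]:

```python
def maybe_halve_lr(lr, prev_val_loss, curr_val_loss, threshold=0.001):
    """Halve the learning rate when the relative improvement of the
    validation loss falls below `threshold` (assumed rule)."""
    improvement = (prev_val_loss - curr_val_loss) / max(prev_val_loss, 1e-12)
    return lr / 2.0 if improvement < threshold else lr
```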

5.2. Phoneme recognition with the TIMIT corpus

The training process is based on the standard 3,696 sentences uttered by 462 speakers, while testing is conducted on 192 sentences uttered by 24 speakers of the TIMIT dataset [19]. A validation set composed of 400 sentences uttered by 50 speakers is used for hyper-parameter tuning. The raw audio is processed with a window of 25 ms and an overlap of 10 ms. Then, 40-dimensional log Mel-filter-bank coefficients are extracted with the Kaldi toolkit. In previous work with QLSTMs [6, 5], first, second and third order time derivatives were composed with the spectral energies to build a multidimensional quaternion input representation. In this paper, the time derivatives are no longer used. Instead, latent representations are directly learned by the R2H encoder, fed with the 40 log Mel-filter-bank coefficients. For the sake of comparison, an input quaternion is naively composed of four consecutive Mel-filter-bank coefficients before being fed to a standard QLSTM.
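For illustration, this naive baseline can be sketched as a simple reshaping; the exact grouping layout used in the paper is not specified, so the one below is an assumption:

```python
import numpy as np

def naive_quaternion_features(fbank):
    """Group four consecutive Mel-filter-bank coefficients into one
    quaternion per group: a (T, 40) frame matrix becomes (T, 10, 4),
    i.e. 10 quaternions per frame (assumed layout)."""
    frames, dim = fbank.shape
    assert dim % 4 == 0
    return fbank.reshape(frames, dim // 4, 4)
```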

Figure 2 reports the results obtained when investigating the R2H encoder size and the impact of the activation function. Results are averaged over three runs and are not selected with respect to the validation set: performances on the test set are evaluated only once, at the end of the training phase. It is first interesting to note that a layer of 1,024 neurons always gives better results than a layer of size 256 or 512, regardless of the activation function. In the same manner, the Tanh activation outperforms both the ReLU and HardTanh activation functions at all layer sizes, with an average phoneme error rate (PER) on the TIMIT test set of 15.6%, compared to 16.7% and 16.0% for the ReLU and HardTanh activations. It is important to note that the ReLU activation gives the worst results. An explanation of this phenomenon is the definition interval of the ReLU function: with ReLU, outputs of the R2H layer are not bounded in the positive domain before being normalized. Therefore, the dense layer can output large values that are then squashed by the quaternion normalization, and it can be hard for the neural network to learn such a mapping. Conversely, both HardTanh and Tanh functions are bounded by −1 and 1, making the mapping easier to learn, since values of the R2H layer before and after normalization vary over the same range. The HardTanh function also saturates hard at −1 and 1, in the same manner as the ReLU activation for negative numbers, while the Tanh tends smoothly to these bounds. Consequently, the HardTanh gives slightly worse results than the Tanh. Finally, a best PER of 15.4% is obtained with a normalized R2H encoder of size 1,024 based on the Tanh activation function, compared to 16.5% and 15.9% with the ReLU and HardTanh functions.

Figure 2: Phoneme Error Rate (PER %) obtained on the test set of the TIMIT corpus with different activation functions and different R2H encoder sizes for a QLSTM. Results are averaged over three runs.

R2H size    ReLU    HardTanh    Tanh
256         16.9    16.2        15.9
512         16.7    15.9        15.5
1,024       16.5    15.9        15.4

Table 1 presents a summary of the results observed on the TIMIT phoneme recognition task with a QLSTM and basic quaternion features, compared to the proposed QLSTM coupled with the best R2H encoder from Figure 2. For a fair comparison, a real-valued LSTM is also tested. As highlighted in [6], QLSTM models require fewer neural parameters than LSTMs due to their internal quaternion algebra: an LSTM with 1,024 neurons per layer is composed of 46.0 million parameters, while the corresponding QLSTMs only need 15.5M parameters. It is first interesting to note that the R2H encoder helps the QLSTM to obtain the same PER as the real-valued LSTM while dealing with a real-valued input signal. Indeed, both models perform at 15.4% on the test set, while the QLSTM still requires more than three times fewer neural parameters.

Table 1: Phoneme error rate (PER %) of the models on the development and test sets of the TIMIT dataset. "Params" stands for the total number of trainable parameters. "R2H-Norm" and "R2H" correspond to R2H encoders with and without normalization. Results are averaged over 3 runs.

Models            Dev.   Test   Params
LSTM              14.5   15.4   46.0M
QLSTM             14.9   15.9   15.5M
R2H-QLSTM         14.7   15.7   15.5M
R2H-Norm-QLSTM    14.4   15.4   15.5M

It is then worth underlining that the basic QLSTM without an R2H layer obtains the worst PER of all models, with 15.9% on the test set, due to the inappropriate input representation. The impact of the quaternion normalization process is also investigated by comparing an R2H encoder without normalization to a normalized one. As expected, the quaternion normalization helps the input fit the quaternion representation, and thus gives better results, with a PER of 15.4% compared to 15.7% for the non-normalized R2H encoder. It is important to mention that these results are obtained without batch normalization, speaker adaptation or rescoring methods.

5.3. Speech recognition with the Librispeech corpus

The experiments are extended to the larger Librispeech dataset [20]. Librispeech is composed of three distinct training subsets of 100, 360 and 500 hours of speech respectively, representing a total training set of 960 hours of read English speech. In our experiments, the models are trained following the setup described in Section 5.1, on the train_clean_100 subset containing 100 hours. Results are reported on the test_clean set. Input features are the same as for the TIMIT experiments, and the best activation function reported in Figure 2 (Tanh) is used. No regularization techniques such as batch normalization are used, and no rescoring methods are applied at testing time.

Table 2: Word error rate (WER %) of the models on the test_clean set of the Librispeech dataset, with training on the train_clean_100 subset. "Params" stands for the total number of trainable parameters. "R2H-Norm" and "R2H" correspond to R2H encoders with and without normalization. No rescoring technique is applied.

Models            Test   Params
LSTM              8.1    49.0M
QLSTM             8.5    17.7M
R2H-QLSTM         8.3    17.7M
R2H-Norm-QLSTM    8.0    17.7M

The total number of neural parameters differs slightly from the TIMIT experiments due to the increased number of HMM states, and therefore of output-layer neurons, for the Librispeech task. Nonetheless, the number of parameters is still lowered by a factor of 3 when using QLSTM networks, compared to the real-valued LSTM. Similarly to the TIMIT experiments, the QLSTM with a normalized R2H layer reaches slightly better performance in terms of word error rate (WER), with 8.0% compared to 8.1% for the LSTM. Moreover, the R2H encoder allows the QLSTM WER to decrease from 8.5% to 8.0%, an absolute gain of 0.5%. The reported results on the larger Librispeech dataset demonstrate that the R2H encoder scales well to more realistic speech recognition tasks.

6. Conclusions

Summary. This paper addresses one of the major weaknesses of quaternion-valued neural networks: their inability to process non-quaternion-valued input signals. A new real-to-quaternion (R2H) encoder is introduced, making it possible to learn, in an end-to-end manner, a latent quaternion representation from any real-valued input data. This representation is then processed with QNNs such as a quaternion LSTM. The experiments conducted on the TIMIT phoneme recognition task demonstrate that this new approach outperforms a naive quaternion representation of the input signal, enabling the use of QNNs with any type of input.

Future work. Split activation functions and current quaternion gate mechanisms do not fully respect the quaternion algebra, since they treat each element as an uncorrelated component. Future work will investigate purely quaternion recurrent neural networks, involving well-adapted activation functions and proper quaternion gates.

7. References

[1] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[2] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[3] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.

[4] C. Trabelsi, O. Bilaniuk, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal, "Deep complex networks," arXiv preprint arXiv:1705.09792, 2017.

[5] T. Parcollet, M. Morchid, G. Linarès, and R. De Mori, "Bidirectional quaternion long short-term memory recurrent neural networks for speech recognition," arXiv preprint arXiv:1811.02566, 2018.

[6] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, and Y. Bengio, "Quaternion recurrent neural networks," arXiv preprint arXiv:1806.04418v2, 2018.

[7] T. Parcollet, Y. Zhang, M. Morchid, C. Trabelsi, G. Linarès, R. De Mori, and Y. Bengio, "Quaternion convolutional neural networks for end-to-end automatic speech recognition," in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, September 2–6, 2018, pp. 22–26.

[8] T. Nitta, "A quaternary version of the back-propagation algorithm," in Proceedings of the IEEE International Conference on Neural Networks, vol. 5. IEEE, 1995, pp. 2753–2756.

[9] P. Arena, L. Fortuna, G. Muscato, and M. G. Xibilia, "Multilayer perceptrons to approximate quaternion valued functions," Neural Networks, vol. 10, no. 2, pp. 335–342, 1997.

[10] T. Isokawa, N. Matsui, and H. Nishimura, "Quaternionic neural networks: Fundamental properties and applications," in Complex-Valued Neural Networks: Utilizing High-Dimensional Parameters, 2009, pp. 411–439.

[11] C. J. Gaudet and A. S. Maida, "Deep quaternion networks," in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–8.

[12] T. Minemoto, T. Isokawa, H. Nishimura, and N. Matsui, "Feed forward neural network with random quaternionic neurons," Signal Processing, vol. 136, pp. 59–68, 2017.

[13] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2017.

[14] I. Danihelka, G. Wayne, B. Uria, N. Kalchbrenner, and A. Graves, "Associative long short-term memory," arXiv preprint arXiv:1602.03032, 2016.

[15] D. Xu, L. Zhang, and H. Zhang, "Learning algorithms in quaternion neural networks using GHR calculus," Neural Network World, vol. 27, no. 3, p. 271, 2017.

[16] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB.

[17] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[18] M. Ravanelli, T. Parcollet, and Y. Bengio, "The PyTorch-Kaldi speech recognition toolkit," in Proc. of ICASSP, 2019.

[19] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.

[20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.