
Quaternion Neural Networks for Multi-channel Distant Speech Recognition

Xinchi Qiu1, Titouan Parcollet1, Mirco Ravanelli3, Nicholas Lane1,2, Mohamed Morchid4

1University of Oxford, United Kingdom
2Samsung AI, Cambridge, United Kingdom
3Mila, Université de Montréal, Canada
4LIA, Avignon University, France
xinchi.qiu@wolfson.ox.ac.uk

Abstract

Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between each signal. In this paper, we propose to capture these inter- and intra-structural dependencies with quaternion neural networks, which can jointly process multiple signals as whole quaternion entities. The quaternion algebra replaces the standard dot product with the Hamilton one, thus offering a simple and elegant way to model dependencies between elements. The quaternion layers are then coupled with a recurrent neural network, which can learn long-term dependencies in the time domain. We show that a quaternion long short-term memory neural network (QLSTM), trained on the concatenated multi-channel speech signals, outperforms an equivalent real-valued LSTM on two different tasks of multi-channel distant speech recognition.

Index Terms: distant speech recognition, quaternion neural networks, multi-microphone speech recognition

1. Introduction

State-of-the-art speech recognition systems perform reasonably well in close-talking conditions. However, their performance degrades significantly in more realistic distant-talking scenarios, since the signals are corrupted by noise and reverberation [1–3]. A common approach to improve the robustness of distant speech recognizers relies on the adoption of multiple microphones [4, 5]. Multiple microphones, either in the form of arrays or distributed networks, capture different views of an acoustic scene that are combined to improve robustness.

A common practice is to combine the microphones using signal processing techniques such as beamforming [6]. The goal of beamforming is to achieve spatial selectivity (i.e., to privilege the area where the target speaker is located), limiting the effects of both noise and reverberation. One way to perform spatial filtering is delay-and-sum beamforming, which simply performs a time alignment followed by a sum of the recorded signals [7]. More sophisticated techniques are filter-and-sum beamforming [8], which filters the signals before summing them up, and super-directive beamforming [9], which further enhances the target speech by suppressing the contributions of noise sources from other directions.
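As context for the delay-and-sum baseline used later in our experiments, the following minimal PyTorch sketch illustrates the idea; it assumes the integer per-channel sample delays are already known (estimating them, e.g. with the generalized cross-correlation of [7], is elided), and it uses a circular shift as a simplification of a true delay line:

    import torch

    def delay_and_sum(signals, delays):
        # signals: (n_mics, n_samples) tensor; delays: integer sample
        # delays per microphone. torch.roll applies a circular shift,
        # a simplification of a proper delay line.
        aligned = [torch.roll(s, -d) for s, d in zip(signals, delays)]
        return torch.stack(aligned).mean(dim=0)  # time-align, then average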

An alternative that is gaining significant popularity is end-to-end (E2E) multi-channel ASR [10–15]. Here, the core idea is to replace the signal processing part with an end-to-end differentiable neural network that is jointly trained with the speech recognizer. This makes the speech processing pipeline significantly simpler and helps the different modules composing the whole system match better with each other. The most straightforward approach is to concatenate the speech features of the different microphones and feed them to a neural network [16]. However, this approach forces the network to deal with very high-dimensional data and might thus make learning the complex relationships between microphones difficult, due to the numerous independent neural parameters. To mitigate this issue, it is common to inject prior knowledge or inductive biases into the model. For instance, [12] suggested an adaptive neural beamformer that performs filter-and-sum beamforming using learned filters. Similar techniques have been proposed in [13, 14]. In all the aforementioned works, the microphone combination is implemented not with an arbitrary function, but with a restricted pool of functions such as beamforming ones. This introduces a regularization effect that helps the convergence of the speech recognizer.

In this paper, we propose a novel approach to model the complex inter- and intra-microphone dependencies that occur in multi-microphone ASR. Our inductive bias relies on the use of quaternion algebra. Quaternions extend complex numbers and define four-dimensional vectors composed of a real part and three imaginary components. The standard dot product is replaced with the Hamilton product, which offers a simple and elegant way to learn dependencies across input channels by sharing weights across them. More precisely, quaternion neural networks (QNNs) have recently been the object of several research efforts focusing on image processing [17–19], 3D sound event detection [20], and single-channel speech recognition [21]. To the best of our knowledge, our work is the first that proposes the use of quaternions in a multi-microphone speech processing scenario, which is a particularly suitable application. Our approach combines the speech features extracted from the different channels into the four dimensions of a set of quaternions (Section 2.3). We then employ a quaternion long short-term memory (QLSTM) neural network [21]. This way, our architecture not only models the latent intra- and inter-microphone correlations with the quaternion algebra, but also jointly learns time dependencies with recurrent connections.

Our QLSTM achieves promising results on both a simulated version of TIMIT and the DIRHA corpus [22], which are characterized by the presence of significant levels of non-stationary noise and reverberation. In particular, we outperform both a beamforming baseline (15% relative improvement) and a real-valued model with the same number of parameters (8% relative improvement). In the interest of reproducibility, we release the code within PyTorch-Kaldi [23]¹.

¹https://github.com/mravanelli/pytorch-kaldi/

2. Methodology

This section first describes the quaternion algebra (Section 2.1) and quaternion long short-term memory neural networks (Section 2.2). Finally, the quaternion representation of multi-channel signals is introduced in Section 2.3.

2.1. Quaternion Algebra

A quaternion is an extension of a complex number to the four-dimensional space [24]. A quaternion Q is written as:

    Q = a + bi + cj + dk,    (1)

with a, b, c, and d four real numbers, and 1, i, j, and k the quaternion unit basis. In a quaternion, a is the real part, while bi + cj + dk, with i² = j² = k² = ijk = −1, is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. In the same manner as for complex numbers, the conjugate Q* of Q is defined as:

    Q* = a − bi − cj − dk,    (2)

and a unitary quaternion (i.e., whose norm is equal to 1) is defined as:

    Q⊲ = Q / √(a² + b² + c² + d²).    (3)

The Hamilton product between Q1 = a1 + b1i + c1j + d1k and Q2 = a2 + b2i + c2j + d2k is determined by the products of the basis elements and the distributive law:

    Q1 ⊗ Q2 = (a1a2 − b1b2 − c1c2 − d1d2)
            + (a1b2 + b1a2 + c1d2 − d1c2) i
            + (a1c2 − b1d2 + c1a2 + d1b2) j
            + (a1d2 + b1c2 − c1b2 + d1a2) k.    (4)

Analogously to complex numbers, quaternions also have a matrix representation, defined so that quaternion addition and multiplication correspond to matrix addition and matrix multiplication. An example of such a matrix is:

    Q_mat = | a  −b  −c  −d |
            | b   a  −d   c |
            | c   d   a  −b |
            | d  −c   b   a | .    (5)

Following this representation, the Hamilton product can be written as a matrix multiplication as follows:

    Q1 ⊗ Q2 = | a1  −b1  −c1  −d1 | | a2 |
              | b1   a1  −d1   c1 | | b2 |
              | c1   d1   a1  −b1 | | c2 |
              | d1  −c1   b1   a1 | | d2 | .    (6)

The matrix representation of quaternions turns out to be particularly suitable for computations on modern GPUs, compared to less efficient object-oriented programming.
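To make the matrix form of Eq. (6) concrete, the following minimal PyTorch sketch (an illustration under our own naming, not code from the released toolkit) computes the Hamilton product of batched quaternions; each stacked output component follows one row of the matrix-vector product:

    import torch

    def hamilton_product(q1, q2):
        # Hamilton product Q1 ⊗ Q2 of Eq. (4)/(6).
        # q1, q2: tensors of shape (..., 4) holding (a, b, c, d).
        a1, b1, c1, d1 = q1.unbind(-1)
        a2, b2, c2, d2 = q2.unbind(-1)
        return torch.stack([
            a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
        ], dim=-1)

Because the four component maps reuse the same weights in different sign patterns, a whole quaternion layer reduces to a single real-valued matrix multiply, which is what makes this representation GPU-friendly.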

2.2. Quaternion Long Short-Term Memory Networks

Figure 1: Illustration of the integration of multiple microphones with a quaternion dense layer. Each microphone is encapsulated by one component of a set of quaternions. All the neural parameters are quaternion numbers.

Equivalently to standard LSTM models, a QLSTM consists of a forget gate f_t, an input gate i_t, a cell input activation vector C̃_t, a cell state C_t, and an output gate o_t. In a QLSTM layer, however, the inputs x_t, hidden states h_t, cell states C_t, biases b, and weight parameters W are quaternion numbers, and all multiplications are thus replaced with the Hamilton product. Different activation functions defined in the quaternion domain can be used [17, 25]. In this work, we follow the split approach defined as:

    α(Q) = α(a) + α(b)i + α(c)j + α(d)k,    (7)

where α is any real-valued activation function (e.g., ReLU or sigmoid). Indeed, fully quaternion-valued activation functions have been demonstrated to be hard to train due to numerous singularities [17]. The output layer is commonly defined in the real-valued space so that it can be combined with traditional loss functions (e.g., cross-entropy) [26], due to the real-valued nature of the labels in the considered speech recognition task. Therefore, a QLSTM layer can be summarised with the following equations:

    f_t = σ(W_fh ⊗ h_{t−1} + W_fx ⊗ x_t + b_f),
    i_t = σ(W_ih ⊗ h_{t−1} + W_ix ⊗ x_t + b_i),
    C̃_t = tanh(W_Ch ⊗ h_{t−1} + W_Cx ⊗ x_t + b_C),
    C_t = f_t ⊗ C_{t−1} + i_t ⊗ C̃_t,
    o_t = σ(W_oh ⊗ h_{t−1} + W_ox ⊗ x_t + b_o),
    h_t = o_t ⊗ tanh(C_t),    (8)

with the two split activations σ and tanh as described in Eq. (7).
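As an illustration of Eq. (8), the sketch below (our own naming, not the released implementation) performs one QLSTM time step. `qlinear` realises W ⊗ x with the shared-weight matrix form of Eq. (6); the gate-state products are taken componentwise, which is a common implementation choice we state here as an assumption:

    import torch

    def qlinear(w, x):
        # W ⊗ x for a layer of quaternions: w = (wa, wb, wc, wd), four
        # real (in_q, out_q) matrices shared across components (Eq. 6);
        # x has shape (batch, 4 * in_q), laid out as [a | b | c | d].
        wa, wb, wc, wd = w
        xa, xb, xc, xd = x.chunk(4, dim=-1)
        return torch.cat([
            xa @ wa - xb @ wb - xc @ wc - xd @ wd,
            xb @ wa + xa @ wb + xd @ wc - xc @ wd,
            xc @ wa - xd @ wb + xa @ wc + xb @ wd,
            xd @ wa + xc @ wb - xb @ wc + xa @ wd,
        ], dim=-1)

    def qlstm_step(x_t, h_prev, c_prev, p):
        # One recurrent step following Eq. (8) with split σ/tanh (Eq. 7).
        # p maps names such as 'Wfh', 'Wfx', 'bf' to quaternion parameters.
        f = torch.sigmoid(qlinear(p['Wfh'], h_prev) + qlinear(p['Wfx'], x_t) + p['bf'])
        i = torch.sigmoid(qlinear(p['Wih'], h_prev) + qlinear(p['Wix'], x_t) + p['bi'])
        c_tilde = torch.tanh(qlinear(p['WCh'], h_prev) + qlinear(p['WCx'], x_t) + p['bC'])
        o = torch.sigmoid(qlinear(p['Woh'], h_prev) + qlinear(p['Wox'], x_t) + p['bo'])
        c = f * c_prev + i * c_tilde   # componentwise gating (assumption)
        h = o * torch.tanh(c)
        return h, c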

As shown in [21], QLSTM models can be trained following the quaternion-valued backpropagation through time. Finally, weight initialisation is crucial to train deep neural networks effectively [27]. Hence, a well-adapted quaternion weight initialisation process [21, 28] is applied. Quaternion neural parameters are sampled with respect to their polar form and a random distribution following common initialization criteria [27, 29].
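For completeness, a hedged sketch of this polar-form initialisation is given below: each weight is built as w = |w|(cos θ + sin θ · u), with u a random unit pure quaternion and the magnitude scaled by a Glorot-style [27] or He-style [29] criterion. The exact magnitude distribution of [21, 28] is simplified here to a uniform draw:

    import math
    import torch

    def quaternion_init(in_q, out_q, criterion='glorot'):
        # Variance criterion adapted to the four components, as in [21, 28].
        if criterion == 'glorot':
            sigma = 1.0 / math.sqrt(2 * (in_q + out_q))
        else:  # 'he'
            sigma = 1.0 / math.sqrt(2 * in_q)
        magnitude = torch.empty(in_q, out_q).uniform_(0, sigma)  # simplified draw
        theta = torch.empty(in_q, out_q).uniform_(-math.pi, math.pi)
        u = torch.randn(3, in_q, out_q)
        u = u / u.norm(dim=0, keepdim=True)       # random unit pure quaternion
        wa = magnitude * torch.cos(theta)          # real component
        wb = magnitude * torch.sin(theta) * u[0]   # i component
        wc = magnitude * torch.sin(theta) * u[1]   # j component
        wd = magnitude * torch.sin(theta) * u[2]   # k component
        return wa, wb, wc, wd

The four returned matrices can directly parameterise the `qlinear` sketch above.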

2.3. Quaternion Representation of Multi-channel Signals

We propose to use quaternion numbers in a multi-microphone speech processing scenario. More precisely, quaternion numbers offer the possibility to encode up to four microphones (Fig. 1). Therefore, common acoustic features (e.g., MFCCs, FBANKs) are computed from each microphone signal M_{1,2,3,4} and then concatenated to compose a quaternion as follows:

    Q = M_{1,a} + M_{2,b} i + M_{3,c} j + M_{4,d} k.    (9)
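As a sketch of this composition (the helper name is ours, and the per-microphone feature layout is an assumption), each acoustic feature index becomes one quaternion whose four components come from the four microphones:

    import torch

    def build_quaternion_features(m1, m2, m3, m4):
        # m1..m4: (frames, n_feats) features from microphones M1..M4,
        # e.g. 40 FBANK energies per frame. The output layout [a | b | c | d]
        # matches the qlinear sketch of Section 2.2, so feature index f forms
        # the quaternion Q_f = m1[:, f] + m2[:, f] i + m3[:, f] j + m4[:, f] k.
        return torch.cat([m1, m2, m3, m4], dim=-1)

With 40 FBANK features, this yields 160-dimensional real input vectors interpreted as 40 quaternions per frame.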

Internal relations are captured with the specific weight-sharing property of the Hamilton product. By using Hamilton products, quaternion weight components are shared across the quaternion inputs, creating relations within the elements, as demonstrated in [19]. More precisely, real-valued networks treat their inputs as a group of uni-dimensional elements that could be related to each other, potentially decorrelating the four microphone signals. Conversely, quaternion networks consider each time frame as an entity of four related elements. Hence, internal relations are naturally captured and learned through the process. Indeed, a small variation in one of the microphones results in an important change in the internal representation, affecting the encoding of the three other microphones.

It is worth noticing that four microphones may be limiting for realistic applications. For instance, the latest CHiME-6 challenge [30] proposes various recordings obtained from six microphones in different scenarios. This difficulty could be easily avoided by considering these tasks as a special case of higher algebras, such as octonions (eight dimensions) or sedenions (sixteen dimensions). Nevertheless, this paper first considers four dimensions to evaluate the viability of high-dimensional neural networks for distant and multi-microphone ASR. Finally, quaternion neural networks are known to be more computationally intensive than real-valued neural networks. Indeed, the Hamilton product involves 28 basic operations, compared to 1 for a standard product. Nonetheless, the training time can be reduced with the matrix representation defined in Eq. (6), and can be drastically improved with simple linear algebra properties [31].

3. Experimental Protocol

A perturbed, multi-channel version of TIMIT [32], presented hereafter, is first used as a preliminary task to investigate the impact of the Hamilton product. Then, the DIRHA dataset [33] is used to verify the scalability of the proposed approach to more realistic conditions.

3.1. TIMIT Dataset

The TIMIT corpus contains broadband recordings of 630 speakers of eight main dialects of American English, each reading ten phonetically rich sentences. The training set consists of the standard 3696 sentences uttered by 462 speakers, while the test set consists of 192 sentences uttered by 24 speakers. A validation set composed of 400 sentences uttered by 50 speakers is used for hyper-parameter tuning.

In our experiments, we created a multi-channel simulated version of TIMIT using the impulse responses measured in [34, 35]². The reference environment is the living room of a real apartment with an average reverberation time T60 of 0.7 seconds. The four considered microphones (i.e., LA2, LA3, LA4, LA5) are placed on the ceiling of the room. Data are created considering all the available positions, with different positions used for training and testing. We also integrate a single-channel signal obtained with delay-and-sum beamforming as a baseline comparison [7]. Input features consist of 40 Mel filter bank energies (FBANKs), with no deltas, extracted with Kaldi [36]. To show that the obtained gain in performance is independent of the input features, we also propose 13 MFCC coefficients as an alternative feature set.

²Perturbation can be re-created following: https://github.com/SHINE-FBK/DIRHA_English_wsj

3.2. DIRHA Dataset

To validate our model in a more realistic scenario, a set of experiments is also conducted on the larger DIRHA-English corpus [22]. As with the generated TIMIT dataset, the reference context is a domestic environment characterized by the presence of non-stationary noise and acoustic reverberation. Training is based on the original Wall Street Journal 5k (WSJ) corpus (i.e., 7138 sentences uttered by 83 speakers) contaminated with a set of impulse responses measured in a real apartment [37, 38]. Both a real and a simulated dataset are used for testing, each consisting of 409 WSJ sentences uttered by six native American speakers. Note that a validation set of 310 WSJ sentences is used for hyper-parameter tuning. Only the first four microphones of the circular array are used in our experiments, to fit the quaternion representation. A single-channel signal obtained with delay-and-sum beamforming is also proposed as a baseline comparison [7]. It is worth noting that we also used 13 MFCC coefficients as features, in addition to FBANKs, to evaluate the robustness of the model to the input representation.

3.3. Neural Network Architectures

We fixed the number of neural parameters to 5M for both the LSTM and the QLSTM, following the models studied in [21]. The QLSTM model is composed of 4 bidirectional QLSTM layers followed by a linear layer with a softmax activation function for classification. Output labels are the different HMM states of the Kaldi decoder. Each QLSTM layer consists of 128 quaternion nodes. Although this corresponds to 128 × 4 = 512 real-valued nodes in total, there are only 128 × 128 × 4 real-valued weight parameters, due to the weight-sharing property of quaternion neural networks. The LSTM model is composed of 4 bidirectional LSTM layers of size 290 (i.e., ensuring the same number of neural parameters as the QLSTM), followed by the same linear layer to obtain posterior probabilities. A dropout rate of 0.2 is applied across all (Q)LSTM layers. Quaternion parameters are initialised with the specific initialisation defined in [21], while LSTM parameters are initialised with the Glorot criterion [27].
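The weight-sharing arithmetic can be checked directly; this small sketch compares one 128-to-128 quaternion transform with an unshared real-valued layer of the same width:

    n = 128                           # quaternion nodes per QLSTM layer
    quaternion_weights = 4 * n * n    # four shared component matrices: 65,536
    real_weights = (4 * n) * (4 * n)  # dense real layer of width 512: 262,144
    print(quaternion_weights, real_weights)  # 4x fewer parameters per transform

This 4x saving per transform is why a real-valued LSTM needs 290 units to match the 128-quaternion-node QLSTM at roughly 5M parameters.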

Training is performed with the RMSprop optimizer with vanilla hyper-parameters and an initial learning rate of 1.6e−3 over 24 epochs. The learning rate is halved every time the loss on the validation set increases, ensuring an optimal convergence. Finally, both the LSTM and the QLSTM are implemented manually in PyTorch to alleviate any variation due to different implementations.
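The optimisation recipe can be summarised with the following hedged sketch (not the released PyTorch-Kaldi configuration; the epoch and validation routines are stubbed out for brevity):

    import torch

    model = torch.nn.LSTM(input_size=160, hidden_size=290,
                          num_layers=4, bidirectional=True)  # real-valued baseline
    optimizer = torch.optim.RMSprop(model.parameters(), lr=1.6e-3)

    def run_validation(net):
        # Placeholder: should return the held-out loss; elided here.
        return 0.0

    best_val = float('inf')
    for epoch in range(24):
        # ... one training epoch over the contaminated corpus (elided) ...
        val_loss = run_validation(model)
        if val_loss > best_val:
            for group in optimizer.param_groups:
                group['lr'] *= 0.5  # halve the rate when validation regresses
        best_val = min(best_val, val_loss)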

4. Results and Discussions

The results on the distant multi-channel TIMIT dataset are reported in Table 1. From this comparison, it emerges that the QLSTM with four microphones outperforms the other approaches. Our best QLSTM model, in fact, obtains a PER of 28.7%, against a PER of 30.2% achieved with a standard real-valued LSTM. In both cases, the best performance is obtained with FBANK features. Interestingly, Table 1 shows that the concatenation of the four input signals with a real-valued LSTM outperforms the delay-and-sum beamforming approach. Similar findings have already emerged in previous works on multi-channel ASR [16] and can be attributed to the ability of modern neural networks to obtain disentangled and informative representations from noisy inputs.

Table 1: Results expressed in terms of Phoneme Error Rate (PER, %; lower is better) of both QLSTM and LSTM models on the TIMIT distant phoneme recognition task with different acoustic features. Results are averaged over 5 runs.

Models   Signals               Test (FBANK)   Test (MFCC)
QLSTM    1 microphone copied   32.1 ± 0.02    34.2 ± 0.13
LSTM     1 microphone          32.3 ± 0.14    35.0 ± 0.23
LSTM     beamforming           31.1 ± 0.11    33.4 ± 0.07
LSTM     4 microphones         30.2 ± 0.16    32.8 ± 0.09
QLSTM    4 microphones         28.7 ± 0.06    30.4 ± 0.11

Table 2: Results expressed in terms of Word Error Rate (WER, %; lower is better) of both QLSTM and LSTM based models on the DIRHA dataset with different acoustic features. "Test Sim." corresponds to the simulated test set of the corpus, while "Test Real" is the set composed of real recordings.

Models   Signals         Test Real (MFCC)   Test Sim. (MFCC)   Test Real (FBANK)   Test Sim. (FBANK)
LSTM     beamforming     35.1               33.7               35.0                33.0
LSTM     4 microphones   32.7               26.4               31.6                26.3
QLSTM    4 microphones   29.8               23.8               29.7                23.4

We can now investigate in more detail the role played by the quaternion algebra in learning cross-microphone dependencies. One way to do so is to overwrite the quaternion dimensions with the features extracted from the same microphone (see the first row of Table 1). In this case, we expect the QLSTM to fail to learn cross-microphone dependencies, simply because a single feature vector is replicated across all dimensions. For a fair comparison, this experiment is conducted by selecting the best microphone of the array (i.e., LA4). From the first and second rows of Table 1, one can note that the single-channel QLSTM and LSTM perform roughly the same. As expected, in fact, the single-channel QLSTM is not able to model useful dependencies when the quaternion dimensions are filled with the same feature vector. Nonetheless, switching to the four-channel signal brings an average PER improvement of 3.6% for the QLSTM, compared to 2.1% for the LSTM, showing a higher gain obtained on multiple channels with the QLSTM. This illustrates the ability of the QLSTM to better capture latent relations across the different microphones.

To provide some experimental evidence on a more realistic task, we evaluate our model on the DIRHA dataset. The results in Table 2 confirm the trend observed with TIMIT. Indeed, Word Error Rates (WER) of 29.8% and 23.8% are obtained for the QLSTM on the real and simulated test sets respectively, compared to 32.7% and 26.4% for the equivalent real-valued LSTM. The same remark holds when feeding our models with FBANK features, with a best WER of 29.7% obtained with the QLSTM compared to 31.6%. As a side note, the accuracies reported in Table 2 are slightly worse than the ones given in [23]. Indeed, the latter work includes a specific batch normalisation that is not applied in our experiments due to the very high complexity of the Quaternion Batch-Normalisation (QBN) introduced in [39]. As a matter of fact, the current equations of the QBN increase VRAM consumption by a factor of 4. As expected, the WERs observed on the real test set are also higher than those on the simulated one, due to more complex and realistic perturbations.

As shown in both the TIMIT and DIRHA experiments, the performance improvement observed with the QLSTM is independent of the initial acoustic representation, implying that a similar increase in accuracy may be expected with other acoustic features such as fMLLR or PLP. Interestingly, the single-channel beamforming approach gives the worst results among all the investigated methods on both TIMIT and DIRHA.

5. Conclusion

Summary. This paper proposed to perform multi-channel speech recognition with an LSTM based on quaternion numbers. Our experiments, conducted on multi-channel TIMIT and DIRHA, have shown that: 1) given the same number of parameters, our multi-channel QLSTM significantly outperforms an equivalent LSTM network; 2) the performance improvement is observed with different features, implying that a similar increase in accuracy may be expected with other acoustic representations such as fMLLR or PLP; and 3) our QLSTM learns internal latent relations across microphones. Therefore, the initial intuition that quaternion neural networks are suitable for multi-channel distant automatic speech recognition has been verified.

Perspectives. One limitation of the current approach is that quaternion neural networks can only deal with four-dimensional input signals. Even though popular devices such as the Microsoft Kinect or the ReSpeaker are based on 4-microphone arrays, future efforts will focus on generalising this paradigm to an arbitrary number of microphones by considering, for instance, higher-dimensional algebras such as octonions and sedenions, or by investigating other methods of weight sharing for multi-channel ASR. Finally, despite recent work on efficient quaternion computations, the current training and inference processes of the QLSTM remain slower than those of an LSTM. Therefore, efforts should be put into developing and implementing faster training procedures.

6. Acknowledgements

This work was supported by the EPSRC through MOA (EP/S001530/) and Samsung AI. We would also like to thank Elena Rastorgueva and Renato De Mori for their helpful comments and discussions.

7. References

[1] M. Wölfel and J. W. McDonough, Distant Speech Recognition. Wiley, 2009.

[2] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications, 1st ed., 2015.

[3] M. Ravanelli, Deep Learning for Distant Speech Recognition. Ph.D. thesis, University of Trento, 2017.

[4] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications. Springer Science & Business Media, 2013.

[5] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Springer Science & Business Media, 2008, vol. 1.

[6] W. Kellermann, Beamforming for Speech and Audio Signals, 2008.

[7] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976.

[8] M. Kajala and M. Hamalainen, "Filter-and-sum beamformer with adjustable filter characteristics," in Proc. of ICASSP, 2001, pp. 2917–2920.

[9] J. Bitzer and K. Simmer, "Superdirective microphone arrays," in Microphone Arrays. Springer Berlin Heidelberg, 2001, pp. 19–38.

[10] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, "BeamNet: End-to-end training of a beamformer-supported multi-channel ASR system," in Proc. of ICASSP, 2017, pp. 5325–5329.

[11] S. Braun, D. Neil, J. Anumula, E. Ceolini, and S.-C. Liu, "Multi-channel attention for end-to-end speech recognition," in Proc. of Interspeech, 2018.

[12] B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani, "Neural network adaptive beamforming for robust multichannel speech recognition," 2016.

[13] T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao, "Unified architecture for multichannel end-to-end speech recognition with neural beamforming," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1274–1288, 2017.

[14] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, "Deep beamforming networks for multi-channel speech recognition," in Proc. of ICASSP, 2016, pp. 5745–5749.

[15] S. Kim and I. Lane, "End-to-end speech recognition with auditory attention for multi-microphone distance speech recognition," in Proc. of Interspeech, 2017.

[16] Y. Liu, P. Zhang, and T. Hain, "Using neural network front-ends on far field multiple microphones based speech recognition," in Proc. of ICASSP, 2014, pp. 5542–5546.

[17] T. Parcollet, M. Morchid, and G. Linarès, "A survey of quaternion neural networks," Artificial Intelligence Review, pp. 1–26, 2019.

[18] T. Isokawa, T. Kusakabe, N. Matsui, and F. Peper, "Quaternion neural network and its application," in International Conference on Knowledge-Based and Intelligent Information and Engineering Systems. Springer, 2003, pp. 318–324.

[19] T. Parcollet, M. Morchid, and G. Linarès, "Quaternion convolutional neural networks for heterogeneous image processing," in Proc. of ICASSP, 2019, pp. 8514–8518.

[20] D. Comminiello, M. Lella, S. Scardapane, and A. Uncini, "Quaternion convolutional neural networks for detection and localization of 3D sound events," in Proc. of ICASSP, 2019, pp. 8533–8537.

[21] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, and Y. Bengio, "Quaternion recurrent neural networks," arXiv preprint arXiv:1806.04418, 2018.

[22] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments," in Proc. of ASRU, 2015, pp. 275–282.

[23] M. Ravanelli, T. Parcollet, and Y. Bengio, "The PyTorch-Kaldi speech recognition toolkit," in Proc. of ICASSP, 2019, pp. 6465–6469.

[24] W. R. Hamilton and C. J. Joly, Elements of Quaternions. Longmans, Green, and Company, 1899, vol. 1.

[25] P. Arena, L. Fortuna, L. Occhipinti, and M. G. Xibilia, "Neural networks for quaternion-valued function approximation," in Proc. of ISCAS, vol. 6, 1994, pp. 307–310.

[26] P. Arena, L. Fortuna, G. Muscato, and M. G. Xibilia, "Multilayer perceptrons to approximate quaternion valued functions," Neural Networks, vol. 10, no. 2, pp. 335–342, 1997.

[27] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. of AISTATS, 2010, pp. 249–256.

[28] T. Parcollet, Y. Zhang, M. Morchid, C. Trabelsi, G. Linarès, R. De Mori, and Y. Bengio, "Quaternion convolutional neural networks for end-to-end automatic speech recognition," arXiv preprint arXiv:1806.07789, 2018.

[29] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. of ICCV, 2015, pp. 1026–1034.

[30] S. Watanabe, M. Mandel, J. Barker, and E. Vincent, "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings," arXiv preprint arXiv:2004.09249, 2020.

[31] A. Cariow and G. Cariowa, "Fast algorithms for quaternion-valued convolutional neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.

[32] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report, vol. 93, 1993.

[33] L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad, M. Hagmüller, and P. Maragos, "The DIRHA simulated corpus," in Proc. of LREC, 2014, pp. 2629–2634.

[34] M. Ravanelli, A. Sosi, P. Svaizer, and M. Omologo, "Impulse response estimation for robust speech recognition in a reverberant environment," in Proc. of EUSIPCO, 2012, pp. 1668–1672.

[35] M. Ravanelli and M. Omologo, "On the selection of the impulse responses for distant-speech recognition based on contaminated speech training," in Proc. of Interspeech, 2014.

[36] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. of ASRU, 2011.

[37] M. Ravanelli, P. Svaizer, and M. Omologo, "Realistic multi-microphone data simulation for distant speech recognition," arXiv preprint arXiv:1711.09470, 2017.

[38] M. Ravanelli and M. Omologo, "Contaminated speech training methods for robust DNN-HMM distant speech recognition," arXiv preprint arXiv:1710.03538, 2017.

[39] C. J. Gaudet and A. S. Maida, "Deep quaternion networks," in Proc. of IJCNN, 2018, pp. 1–8.