Quaternion Neural Networks for Multi-channel Distant Speech Recognition
Xinchi Qiu1, Titouan Parcollet1, Mirco Ravanelli3, Nicholas Lane1,2, Mohamed Morchid4
1University of Oxford, United-Kingdom
2Samsung AI, Cambridge, United-Kingdom
3Université de Montréal, Canada
4LIA, Avignon University, France
Abstract
Despite the signiﬁcant progress in automatic speech recogni-
tion (ASR), distant ASR remains challenging due to noise and
reverberation. A common approach to mitigate this issue con-
sists of equipping the recording devices with multiple micro-
phones that capture the acoustic scene from different perspec-
tives. These multi-channel audio recordings contain speciﬁc in-
ternal relations between each signal. In this paper, we propose
to capture these inter- and intra- structural dependencies with
quaternion neural networks, which can jointly process multiple
signals as whole quaternion entities. The quaternion algebra
replaces the standard dot product with the Hamilton one, thus
offering a simple and elegant way to model dependencies be-
tween elements. The quaternion layers are then coupled with
a recurrent neural network, which can learn long-term depen-
dencies in the time domain. We show that a quaternion long-
short term memory neural network (QLSTM), trained on the
concatenated multi-channel speech signals, outperforms equiv-
alent real-valued LSTM on two different tasks of multi-channel
distant speech recognition.
Index Terms: distant speech recognition, quaternion neural
networks, multi-microphone speech recognition.
1. Introduction
State-of-the-art speech recognition systems perform reasonably
well in close-talking conditions. However, their performance
degrades signiﬁcantly in more realistic distant-talking scenar-
ios, since the signals are corrupted with noise and reverbera-
tion [1–3]. A common approach to improve the robustness of
distant speech recognizers relies on the adoption of multiple
microphones [4, 5]. Multiple microphones, either in the form
of arrays or distributed networks, capture different views of an
acoustic scene that are combined to improve robustness.
A common practice is to combine the microphones using signal processing
techniques such as beamforming [6]. The goal of beamforming is to achieve
spatial selectivity (i.e., to favour the areas where a target speaker is
speaking), limiting the effects of both noise and reverberation. One way to
perform spatial filtering is delay-and-sum beamforming, which simply performs
a time alignment followed by a sum of the recorded signals [7]. More
sophisticated techniques are filter-and-sum beamforming [8], which filters the
signals before summing them, and super-directive beamforming [9], which
further enhances the target speech by suppressing the contributions of noise
sources from other directions.
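To make the delay-and-sum idea concrete, the following minimal sketch aligns each channel by an integer delay and averages the result. The function name `delay_and_sum`, the toy signals, and the assumption that the per-microphone delays are already known (in practice they must be estimated) are illustrative choices, not part of the systems evaluated in this paper.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer delay (in samples) and average.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   delays[m] is how many samples later the source reaches
              microphone m (assumed known here).
    """
    n_out = len(channels[0]) - max(delays)
    output = []
    for t in range(n_out):
        # Reading channel m at t + delays[m] undoes its arrival delay.
        total = sum(ch[t + d] for ch, d in zip(channels, delays))
        output.append(total / len(channels))
    return output


# Toy example: the same pulse reaches three microphones 0, 1 and 2 samples late.
pulse = [0.0, 1.0, 0.0, 0.0]
mics = [
    pulse + [0.0, 0.0],
    [0.0] + pulse + [0.0],
    [0.0, 0.0] + pulse,
]
aligned = delay_and_sum(mics, delays=[0, 1, 2])
```

After alignment the three copies of the pulse add coherently, whereas uncorrelated noise on each channel would be attenuated by the averaging.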
An alternative that is gaining significant popularity is end-to-end (E2E)
multi-channel ASR [10–15]. Here, the core idea is to replace the signal
processing part with an end-to-end differentiable neural network that is
jointly trained with the speech recognizer. This makes the speech processing
pipeline significantly simpler and lets the modules composing the whole system
match each other better. The most straightforward approach is to concatenate
the speech features of the different microphones and feed them to a neural
network [16]. However, this approach forces the network to deal with very
high-dimensional data, and might thus make learning the complex relationships
between microphones difficult due to the numerous independent neural
parameters. To mitigate this issue, it is common to inject prior knowledge or
inductive biases into the model. For instance, [12] suggested an adaptive
neural beamformer that performs filter-and-sum beamforming using learned
filters. Similar techniques have been proposed in [13, 14]. In all the
aforementioned works, the microphone combination is not implemented with an
arbitrary function, but with a restricted pool of functions such as
beamforming ones. This introduces a regularization effect that helps the
convergence of the speech recognizer.
In this paper, we propose a novel approach to model the complex inter- and
intra-microphone dependencies that occur in multi-microphone ASR. Our
inductive bias relies on the use of quaternion algebra. Quaternions extend
complex numbers and define four-dimensional vectors composed of a real part
and three imaginary components. The standard dot product is replaced with the
Hamilton product, which offers a simple and elegant way to learn dependencies
across input channels by sharing weights across them. Quaternion Neural
Networks (QNN) have recently been the object of several research efforts
focusing on image processing [17–19], 3D sound event detection [20] and
single-channel speech recognition [21]. To the best of our knowledge, our work
is the first that proposes the use of quaternions in a multi-microphone speech
processing scenario, which is a particularly suitable application. Our
approach combines the speech features extracted from different channels into
the four dimensions of a set of quaternions (Section 2.3). We then employ a
Quaternion Long-Short Term Memory (QLSTM) neural network [21]. This way, our
architecture not only models the latent intra- and inter-microphone
correlations with the quaternion algebra, but also jointly learns time
dependencies with recurrent connections.
Our QLSTM achieves promising results on both a simulated version of TIMIT and
the DIRHA corpus [22], which are characterized by the presence of significant
levels of non-stationary noise and reverberation. In particular, we outperform
both a beamforming baseline (15% relative improvement) and a real-valued model
with the same number of parameters (8% relative improvement). In the interest
of reproducibility, we release the code under PyTorch-Kaldi [23].
2. Methodology
This section ﬁrst describes the quaternion algebra (Section
2.1) and quaternion long short-term memory neural networks
(Section 2.2). Finally, the quaternion representation of multi-
channel signals is introduced in Section 2.3.
2.1. Quaternion Algebra
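A quaternion is an extension of a complex number to the four-dimensional
space [24]. A quaternion $Q$ is written as:
$$Q = a\,1 + b\,\mathbf{i} + c\,\mathbf{j} + d\,\mathbf{k}, \tag{1}$$
with $a$, $b$, $c$, and $d$ four real numbers, and $1$, $\mathbf{i}$,
$\mathbf{j}$, and $\mathbf{k}$ the quaternion unit basis. In a quaternion, $a$
is the real part, while $b\mathbf{i} + c\mathbf{j} + d\mathbf{k}$ with
$\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1$
is the imaginary part, or the vector part. Such a definition can be used to
describe spatial rotations. In the same manner as complex numbers, the
conjugate $Q^{*}$ of $Q$ is defined as:
$$Q^{*} = a\,1 - b\,\mathbf{i} - c\,\mathbf{j} - d\,\mathbf{k}, \tag{2}$$
and a unitary quaternion (i.e. whose norm is equal to 1) is defined as:
$$Q^{\triangleleft} = \frac{Q}{\sqrt{a^2 + b^2 + c^2 + d^2}}. \tag{3}$$
The Hamilton product between $Q_1 = a_1 + b_1\mathbf{i} + c_1\mathbf{j} + d_1\mathbf{k}$
and $Q_2 = a_2 + b_2\mathbf{i} + c_2\mathbf{j} + d_2\mathbf{k}$ is determined
by the products of the basis elements and the distributive law:
$$\begin{aligned}
Q_1 \otimes Q_2 ={}& (a_1a_2 - b_1b_2 - c_1c_2 - d_1d_2) \\
&+ (a_1b_2 + b_1a_2 + c_1d_2 - d_1c_2)\,\mathbf{i} \\
&+ (a_1c_2 - b_1d_2 + c_1a_2 + d_1b_2)\,\mathbf{j} \\
&+ (a_1d_2 + b_1c_2 - c_1b_2 + d_1a_2)\,\mathbf{k}.
\end{aligned} \tag{4}$$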
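The component-wise expansion of the Hamilton product can be checked numerically. The sketch below stores a quaternion as a plain `(a, b, c, d)` tuple; the function names are illustrative and do not come from the released code.

```python
import math


def hamilton(q1, q2):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


def conjugate(q):
    """Negate the three imaginary components."""
    a, b, c, d = q
    return (a, -b, -c, -d)


def norm(q):
    """Euclidean norm of the four components."""
    return math.sqrt(sum(x * x for x in q))


i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
k = (0, 0, 0, 1)
q = (1, 2, 3, 4)

ij = hamilton(i, j)                    # equals k
ji = hamilton(j, i)                    # equals -k: the product is not commutative
q_qconj = hamilton(q, conjugate(q))    # real part is norm(q) ** 2, imaginary part is zero
```

The last line illustrates the usual identity linking the conjugate and the norm, which is what makes normalising to a unitary quaternion well defined for any non-zero quaternion.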
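Analogously to complex numbers, quaternions also have a matrix representation
defined in a way that quaternion addition and multiplication correspond to a
matrix addition and a matrix multiplication. An example of such a matrix is:
$$Q_{mat} = \begin{bmatrix}
a & -b & -c & -d \\
b & a & -d & c \\
c & d & a & -b \\
d & -c & b & a
\end{bmatrix}. \tag{5}$$
Following this representation, the Hamilton product can be written as a matrix
multiplication as follows:
$$Q_1 \otimes Q_2 = \begin{bmatrix}
a_1 & -b_1 & -c_1 & -d_1 \\
b_1 & a_1 & -d_1 & c_1 \\
c_1 & d_1 & a_1 & -b_1 \\
d_1 & -c_1 & b_1 & a_1
\end{bmatrix}
\begin{bmatrix} a_2 \\ b_2 \\ c_2 \\ d_2 \end{bmatrix}. \tag{6}$$
Using the matrix representation of quaternions turns out to be particularly
suitable for computations on modern GPUs, compared with less efficient
object-oriented implementations.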
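As a small sanity check of the matrix form, the sketch below builds the 4x4 real matrix associated with a quaternion and verifies that a matrix-vector product reproduces the Hamilton product. The first row `(a, -b, -c, -d)` follows from the real part of the product; the helper names `left_matrix` and `matvec` are hypothetical, and pure-Python lists are used only to keep the example dependency-free.

```python
def hamilton(q1, q2):
    """Reference Hamilton product on (a, b, c, d) tuples."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def left_matrix(q):
    """Real 4x4 matrix M such that M @ q2 equals q (x) q2."""
    a, b, c, d = q
    return [
        [a, -b, -c, -d],
        [b, a, -d, c],
        [c, d, a, -b],
        [d, -c, b, a],
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return tuple(sum(m * v for m, v in zip(row, vector)) for row in matrix)


q1 = (1, 2, 3, 4)
q2 = (5, 6, 7, 8)
via_matrix = matvec(left_matrix(q1), q2)
via_formula = hamilton(q1, q2)
```

In a tensor library the same structure lets a whole quaternion layer be expressed as one real matrix multiplication, which is the property that makes the representation GPU-friendly.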
2.2. Quaternion Long Short-Term Memory Networks
Equivalently to standard LSTM models, a QLSTM consists of a forget gate $f_t$,
an input gate $i_t$, a cell input activation vector $\tilde{C}_t$, a cell
state $C_t$ and an output gate $o_t$. In a QLSTM layer, however, inputs $x_t$,
hidden states $h_t$, cell states $C_t$, biases $b$, and weight parameters $W$
are quaternion numbers. All multiplications are thus replaced with the
Hamilton product. Different activation functions defined in the quaternion
domain can be used [17, 25]. In this work, we follow the split approach
defined as:
$$\alpha(Q) = \alpha(a) + \alpha(b)\,\mathbf{i} + \alpha(c)\,\mathbf{j} + \alpha(d)\,\mathbf{k}, \tag{7}$$
where $\alpha$ is any real-valued activation function (e.g. ReLU, Sigmoid,
...). Indeed, fully quaternion-valued activation functions have been
demonstrated to be hard to train due to numerous singularities. Then, the
output layer is commonly defined in the real-valued space to be combined with
traditional loss functions (e.g. cross-entropy), due to the real-valued nature
of the labels implied by the considered speech recognition task.

Figure 1: Illustration of the integration of multiple microphones with a
quaternion dense layer. Each microphone is encapsulated by one component of a
set of quaternions. All the neural parameters are quaternion numbers.
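Therefore, a QLSTM layer can be summarised with the following equations:
$$\begin{aligned}
f_t &= \sigma(W_{fh} \otimes h_{t-1} + W_{fx} \otimes x_t + b_f), \\
i_t &= \sigma(W_{ih} \otimes h_{t-1} + W_{ix} \otimes x_t + b_i), \\
\tilde{C}_t &= \tanh(W_{Ch} \otimes h_{t-1} + W_{Cx} \otimes x_t + b_C), \\
C_t &= f_t \times C_{t-1} + i_t \times \tilde{C}_t, \\
o_t &= \sigma(W_{oh} \otimes h_{t-1} + W_{ox} \otimes x_t + b_o), \\
h_t &= o_t \times \tanh(C_t),
\end{aligned} \tag{8}$$
with two split activations $\sigma$ and $\tanh$ as described in Eq. 7, and
$\times$ the component-wise product.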
As shown in [21], QLSTM models can be trained following the quaternion-valued
backpropagation through time. Finally, weight initialisation is crucial to
train deep neural networks effectively [27]. Hence, a well-adapted quaternion
weight initialisation process [21, 28] is applied. Quaternion neural
parameters are sampled with respect to their polar form and a random
distribution following common initialisation criteria [27, 29].
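The forward pass of one QLSTM time step can be sketched for a single quaternion unit as follows. This is an illustration under stated assumptions rather than the released implementation: each weight is a single quaternion instead of a matrix, the cell-state and hidden-state updates are taken from the standard LSTM recurrences with a component-wise product, and all names (`qlstm_step`, `params`, and so on) are hypothetical.

```python
import math


def hamilton(q1, q2):
    """Hamilton product on (a, b, c, d) tuples."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def q_add(*qs):
    """Component-wise sum of several quaternions."""
    return tuple(sum(parts) for parts in zip(*qs))


def q_times(q1, q2):
    """Component-wise (not Hamilton) product, used for gating."""
    return tuple(x * y for x, y in zip(q1, q2))


def split(fn, q):
    """Split activation: apply a real function to each component."""
    return tuple(fn(x) for x in q)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def qlstm_step(x_t, h_prev, c_prev, p):
    """One time step for a single quaternion unit; p maps names to quaternions."""
    def pre(gate):
        # W_{gate,h} (x) h_{t-1} + W_{gate,x} (x) x_t + b_gate
        return q_add(hamilton(p["W" + gate + "h"], h_prev),
                     hamilton(p["W" + gate + "x"], x_t),
                     p["b" + gate])

    f_t = split(sigmoid, pre("f"))
    i_t = split(sigmoid, pre("i"))
    c_tilde = split(math.tanh, pre("C"))
    o_t = split(sigmoid, pre("o"))
    c_t = q_add(q_times(f_t, c_prev), q_times(i_t, c_tilde))  # assumed standard LSTM update
    h_t = q_times(o_t, split(math.tanh, c_t))                 # assumed standard LSTM update
    return h_t, c_t


ZERO = (0.0, 0.0, 0.0, 0.0)
ONE = (1.0, 0.0, 0.0, 0.0)
params = {name: ZERO for name in
          ("Wfh", "Wfx", "bf", "Wih", "Wix", "bi",
           "WCh", "WCx", "bC", "Woh", "Wox", "bo")}
params["WCx"] = ONE  # identity weight: the cell input sees x_t unchanged
h1, c1 = qlstm_step((1.0, -1.0, 0.5, 0.0), ZERO, ZERO, params)
```

With all gate weights at zero every gate evaluates to 0.5, so the toy call simply halves the `tanh` of the input, which makes the data flow easy to follow by hand.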
2.3. Quaternion Representation of Multi-channel Signals
We propose to use quaternion numbers in a multi-microphone speech processing
scenario. More precisely, quaternion numbers offer the possibility to encode
up to four microphones (Fig. 1). Therefore, common acoustic features (e.g.
MFCCs, FBANKs, ...) are computed from each microphone signal $M_1$, $M_2$,
$M_3$, $M_4$, and then concatenated to compose a quaternion as follows:
$$Q = M_1 + M_2\,\mathbf{i} + M_3\,\mathbf{j} + M_4\,\mathbf{k}. \tag{9}$$
Internal relations are captured with the specific weight sharing property of
the Hamilton product. By using Hamilton products, quaternion-weight components
are shared across multiple quaternion-input components, creating relations
within the elements, as demonstrated in previous work. More precisely,
real-valued networks treat their inputs as a group of uni-dimensional elements
that could be related to each other, potentially decorrelating the four
microphone signals. Conversely, quaternion networks consider each time frame
as an entity of four related elements. Hence, internal relations are naturally
captured and learned through the process. Indeed, a small variation in one of
the microphones would result in an important change in the internal
representation, affecting the encoding of the three other microphones.
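One way to picture this encoding is to pair up the features of the four microphones index by index, so that each feature coefficient yields one quaternion. The sketch below assumes one feature vector per microphone for a given time frame; the function name `pack_quaternions` and the toy values are illustrative only.

```python
def pack_quaternions(m1, m2, m3, m4):
    """Turn four per-microphone feature vectors into a list of quaternions.

    Feature n of microphone 1 becomes the real part of quaternion n, and
    feature n of microphones 2, 3 and 4 becomes its i, j and k components.
    """
    if not (len(m1) == len(m2) == len(m3) == len(m4)):
        raise ValueError("all microphones must provide the same number of features")
    return [tuple(components) for components in zip(m1, m2, m3, m4)]


# Toy frame with two features per microphone (real systems would use, e.g., 40 FBANKs).
frame = pack_quaternions([0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8])
```

Each resulting tuple can then be multiplied by a quaternion weight with the Hamilton product, so that every weight component touches all four microphones at once.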
It is worth noting that four microphones may be limiting for realistic
applications. For instance, the latest CHIME-6 challenge [30] proposes various
recordings obtained from six microphones in different scenarios. This
difficulty could be avoided by considering these tasks as a special case of
higher algebras, such as octonions (eight dimensions) or sedenions (sixteen
dimensions). Nevertheless, this paper proposes to first consider four
dimensions to evaluate the viability of the application of high-dimensional
neural networks for distant and multi-microphone ASR. Finally, quaternion
neural networks are known to be more computationally intensive than
real-valued neural networks. Indeed, the Hamilton product involves 28 basic
operations compared to 1 for a standard product. Nonetheless, the training
time can be reduced with the matrix representation defined in Eq. (6), and can
be drastically improved with simple linear algebra properties [31].
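The figure of 28 can be recovered by direct counting, assuming that "basic operations" means real multiplications plus real additions or subtractions. Each of the four output components of the Hamilton product is a signed sum of four real products, hence four multiplications and three additions per component:

```latex
\underbrace{4 \times 4}_{\text{multiplications}}
+ \underbrace{4 \times 3}_{\text{additions/subtractions}}
= 16 + 12 = 28
```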
3. Experimental Protocol
A perturbed, multi-channel version of TIMIT [32], presented hereafter, is
first used as a preliminary task to investigate the impact of the Hamilton
product. Then, the DIRHA dataset [33] is used to verify the scalability of
the proposed approach to more realistic conditions.
3.1. TIMIT Dataset
The TIMIT corpus contains broadband recordings of 630 speak-
ers of eight main dialects of American English, each reading
ten phonetically rich sentences. The training dataset consists of
the standard 3696 sentences uttered by 462 speakers, while the
testing one consists of 192 sentences uttered by 24 speakers.
A validation dataset composed of 400 sentences uttered by 50
speakers is used for hyper-parameter tuning.
In our experiments, we created a multi-channel simulated
version of TIMIT using the impulse responses measured
in [34, 35]2. The reference environment is a living room of a
real apartment with an average reverberation time T60 of 0.7
seconds. The considered four microphones (i.e. LA2,LA3,
LA4,LA5) are placed on the ceiling of the room. Data are
created considering all the different positions, and different
positions are used for training and testing data. We also
integrate a single-channel signal obtained with delay-and-sum
beamforming as a baseline comparison. Input features
consist of 40 Mel filter bank energies (FBANK) with no deltas,
extracted with Kaldi [36]. To show that the obtained gain
in performance is independent of the input features, we also
evaluate 13 MFCC coefficients as an alternative set of features.
2Perturbation can be re-created following: https://github.
3.2. DIRHA Dataset
To validate our model in a more realistic scenario, a set of experiments is
also conducted with the larger DIRHA-English corpus [22]. As with the
generated TIMIT dataset, the reference context is a domestic environment
characterized by the presence of non-stationary noise and acoustic
reverberation. Training is based on the original Wall-Street-Journal-5k (WSJ)
corpus (i.e. consisting of 7138 sentences uttered by 83 speakers) contaminated
with a set of impulse responses measured in a real apartment [37, 38]. Both a
real and a simulated dataset are used for testing, each consisting of 409 WSJ
sentences uttered by six native American speakers. Note that a validation set
of 310 WSJ sentences is used for hyper-parameter tuning. Only the first four
microphones of the circular array are used in our experiments to fit the
quaternion representation. A single-channel signal obtained with delay-and-sum
beamforming is also used as a baseline comparison. It is worth noting that we
also used 13 MFCC coefficients as features, in addition to FBANKs, to evaluate
the robustness of the model to the input representation.
3.3. Neural Network Architectures
We decided to fix the number of neural parameters to 5M for both LSTM and
QLSTM, following the models studied in previous work. Therefore, the QLSTM
model is composed of 4 bidirectional QLSTM layers followed by a linear layer
with a softmax activation function for classification. Output labels are the
different HMM states of the Kaldi decoder. Each of the QLSTM layers consists
of 128 quaternion nodes. Although there are 128 × 4 = 512 real-valued nodes in
total, there are only 128 × 128 × 4 real-valued weight parameters in each
128 × 128 quaternion weight matrix, due to the weight sharing property of
quaternion neural networks. The LSTM model is composed of 4 bidirectional LSTM
layers of size 290 (i.e. ensuring the same number of neural parameters as the
QLSTM) followed by the same linear layer to obtain posterior probabilities. A
dropout rate of 0.2 is applied across all (Q)LSTM layers. Quaternion
parameters are initialised with the specific initialisation defined in
[21, 28], while LSTM parameters are initialised with the Glorot criterion [27].
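The effect of weight sharing on the parameter count can be illustrated with a short calculation. The sketch compares one quaternion weight matrix connecting 128 quaternion nodes with a real-valued matrix connecting the same 512 real-valued nodes; biases are ignored and the function names are hypothetical.

```python
def real_weights(n_in, n_out):
    """Number of weights in a real-valued n_in x n_out matrix."""
    return n_in * n_out


def quaternion_weights(q_in, q_out):
    """Real parameters in a q_in x q_out quaternion matrix (4 per quaternion)."""
    return 4 * q_in * q_out


q_nodes = 128
quat_count = quaternion_weights(q_nodes, q_nodes)    # 128 * 128 * 4
real_count = real_weights(4 * q_nodes, 4 * q_nodes)  # 512 * 512
ratio = real_count // quat_count
```

For the same number of real-valued nodes, the quaternion matrix therefore stores four times fewer free parameters, which is why the real-valued LSTM needs narrower layers to match the parameter budget.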
Training is performed with the RMSProp optimizer with default hyper-parameters
and an initial learning rate of 1.6e−3 over 24 epochs. The learning rate is
halved every time the loss on the validation set increases, which helps
convergence. Finally, both LSTM and QLSTM are manually implemented in PyTorch
to alleviate any variation due to different implementations.
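The learning-rate rule described above can be sketched independently of any training framework. The function below is a minimal illustration, assuming that "increases" means the validation loss of an epoch is higher than that of the previous epoch; the function name and the loss values are made up for the example.

```python
def halve_on_increase(initial_lr, val_losses):
    """Return the learning rate in force after each epoch's validation loss.

    The rate is halved whenever the validation loss is higher than the
    one observed at the previous epoch (assumed comparison rule).
    """
    lr = initial_lr
    previous = None
    schedule = []
    for loss in val_losses:
        if previous is not None and loss > previous:
            lr /= 2
        schedule.append(lr)
        previous = loss
    return schedule


lr0 = 1.6e-3
schedule = halve_on_increase(lr0, [2.0, 1.5, 1.6, 1.4, 1.45])
```

In this toy run the validation loss rises after the third and fifth epochs, so the rate is halved twice.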
4. Results and Discussions
The results on the distant multi-channel TIMIT dataset are reported in
Table 1. From this comparison, it emerges that the QLSTM with four microphones
outperforms the other approaches. Our best QLSTM model, in fact, obtains a PER
of 28.7% against a PER of 30.2% achieved with a standard real-valued LSTM. In
both cases, the best performance is obtained with FBANK features.
Interestingly, Table 1 shows that the concatenation of the four input signals
with a real-valued LSTM outperforms the delay-and-sum beamforming approach.
Similar findings have already emerged in previous works on multi-channel ASR
and may be due to the ability of modern neural networks to obtain disentangled
and informative representations from noisy inputs.
Table 1: Results expressed in terms of Phoneme Error Rate (PER) percentage (i.e. lower is better) of both QLSTM and LSTM models
on the TIMIT distant phoneme recognition task with different acoustic features. Results are from an average of 5 runs.

Models   Signals                Test (FBANK)    Test (MFCC)
QLSTM    1 microphone copied    32.1 ± 0.02     34.2 ± 0.13
LSTM     1 microphone           32.3 ± 0.14     35.0 ± 0.23
LSTM     beamforming            31.1 ± 0.11     33.4 ± 0.07
LSTM     4 microphones          30.2 ± 0.16     32.8 ± 0.09
QLSTM    4 microphones          28.7 ± 0.06     30.4 ± 0.11
Table 2: Results expressed in terms of Word Error Rate (WER) percentage (i.e. lower is better) of both QLSTM and LSTM based models on the
DIRHA dataset with different acoustic features. "Test Sim." corresponds to the simulated test set of the corpus, while "Test Real" is the
set composed of real recordings.

                             MFCC                     FBANK
Models   Signals          Test Real   Test Sim.    Test Real   Test Sim.
LSTM     beamforming      35.1        33.7         35.0        33.0
LSTM     4 microphones    32.7        26.4         31.6        26.3
QLSTM    4 microphones    29.8        23.8         29.7        23.4
We can now investigate in more detail the role played by the quaternion
algebra in learning cross-microphone dependencies. One way to do so is to fill
all four quaternion dimensions with the features extracted from the same
microphone (see the first row of Table 1). In this case, we expect that our
QLSTM will fail to learn cross-microphone dependencies, simply because we have
a single feature vector replicated multiple times. For a fair comparison, the
aforementioned experiment is conducted by selecting the best microphone of the
array (i.e. LA4). From the first and second rows of Table 1, one can note that
the single-channel QLSTM and LSTM perform roughly the same. As expected, the
single-channel QLSTM is not able to model useful dependencies when the
quaternion dimensions are filled with the same feature vector. Nonetheless,
switching to four-channel signals brings an average absolute PER improvement
of 3.6% for the QLSTM compared to 2.1% for the LSTM, showing a higher gain
obtained on multiple channels with the QLSTM. This illustrates the ability of
the QLSTM to better capture latent relations across the different microphones.
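The average gains quoted above can be re-derived from the mean PER values of Table 1. The sketch below averages the absolute PER reduction over the two feature sets for each model; the dictionary layout and function name are illustrative. Note that the LSTM average comes out at 2.15, which the text reports as 2.1.

```python
# Mean PER (%) from Table 1, one value per feature set, for each model.
per = {
    "QLSTM": {"single": (32.1, 34.2), "four": (28.7, 30.4)},
    "LSTM": {"single": (32.3, 35.0), "four": (30.2, 32.8)},
}


def average_gain(single, four):
    """Mean absolute PER reduction when moving from one to four microphones."""
    gains = [s - f for s, f in zip(single, four)]
    return sum(gains) / len(gains)


qlstm_gain = average_gain(per["QLSTM"]["single"], per["QLSTM"]["four"])
lstm_gain = average_gain(per["LSTM"]["single"], per["LSTM"]["four"])
```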
To provide some experimental evidence on a more realistic task, we evaluate
our model with the DIRHA dataset. The results reported in Table 2 confirm the
trend observed with TIMIT. Indeed, Word Error Rates (WER) of 29.8% and 23.8%
are obtained for the QLSTM on the real and simulated test sets respectively,
compared to 32.7% and 26.4% for the equivalent real-valued LSTM. The same
remark holds when feeding our models with FBANK features, with a best WER of
29.7% obtained with the QLSTM compared to 31.6%. As a side note, the error
rates reported in Table 2 are slightly worse than those given in previous
work. Indeed, that work includes a specific batch-normalisation that is not
applied in our experiments due to the very high complexity of the Quaternion
Batch-Normalisation (QBN) introduced in [39]. As a matter of fact, the current
equations of the QBN induce an increase of the VRAM consumption by a factor
of 4. As expected, the WERs observed on the real test set are also higher than
those on the simulated one, due to more complex and realistic perturbations.
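For readers who want to relate these absolute WERs to relative improvements, the sketch below computes the relative WER reduction of the QLSTM with respect to the two baselines, using the first column of Table 2 (real test set). The function name is illustrative, and the choice of column is an assumption made for the example.

```python
def relative_reduction(baseline, system):
    """Relative error-rate reduction of `system` with respect to `baseline`."""
    return (baseline - system) / baseline


# WER (%) on the real test set, first column of Table 2.
wer_beamforming = 35.1
wer_lstm = 32.7
wer_qlstm = 29.8

vs_beamforming = relative_reduction(wer_beamforming, wer_qlstm)
vs_lstm = relative_reduction(wer_lstm, wer_qlstm)
```

On this column the reduction is about 15% with respect to the beamforming baseline and a little under 9% with respect to the real-valued LSTM; the other columns give somewhat different figures.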
As shown in both the TIMIT and DIRHA experiments, the performance improvement
observed with the QLSTM holds for both acoustic representations tested,
suggesting that a similar increase of accuracy may be expected with other
acoustic features such as fMLLR or PLP. Interestingly, the single-channel
beamforming approach gives the worst results among the methods that exploit
all four microphones, on both TIMIT and DIRHA.
5. Conclusion
Summary. This paper proposed to perform multi-channel speech recognition with
an LSTM based on quaternion numbers. Our experiments, conducted on
multi-channel TIMIT and DIRHA, have shown that: 1) given the same number of
parameters, our multi-channel QLSTM outperforms an equivalent LSTM network;
2) the performance improvement is observed with different features, suggesting
that a similar increase of accuracy may be expected with other acoustic
representations such as fMLLR or PLP; 3) our QLSTM learns internal latent
relations across microphones. Therefore, the initial intuition that quaternion
neural networks are suitable for multi-channel distant automatic speech
recognition is supported by these results.
Perspectives. One limitation of the current approach is that quaternion neural
networks can only deal with four-dimensional input signals. Even though
popular devices such as the Microsoft Kinect or the ReSpeaker are based on
4-microphone arrays, future efforts will focus on generalising this paradigm
to an arbitrary number of microphones by considering, for instance,
higher-dimensional algebras such as octonions and sedenions, or by
investigating other methods of weight sharing for multi-channel ASR. Finally,
despite recent works investigating efficient quaternion computations, the
current training and inference processes of the QLSTM remain slower than those
of an LSTM. Therefore, efforts should be put into developing and implementing
faster training procedures.
6. Acknowledgements
This work was supported by the EPSRC through MOA
(EP/S001530/) and Samsung AI. We would also like to thank
Elena Rastorgueva and Renato De Mori for the helpful com-
ments and discussions.
7. References
[1] M. Wölfel and J. W. McDonough, Distant speech recognition.
Wiley Online Library, 2009.
[2] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic
Speech Recognition - A Bridge to Practical Applications (1st Edition),
October 2015.
[3] M. Ravanelli, Deep learning for Distant Speech Recognition.
PhD Thesis, Unitn, 2017.
[4] M. Brandstein and D. Ward, Microphone arrays: signal processing
techniques and applications. Springer Science & Business
[5] J. Benesty, J. Chen, and Y. Huang, Microphone array signal
processing. Springer Science & Business Media, 2008, vol. 1.
[6] W. Kellermann, Beamforming for Speech and Audio Signals,
[7] C. H. Knapp and G. C. Carter, “The generalized correlation
method for estimation of time delay,” IEEE Transactions on
Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–
[8] M. Kajala and M. Hamalainen, “Filter-and-sum beamformer with
adjustable filter characteristics,” in Proc. of ICASSP, 2001, pp.
[9] J. Bitzer and K. Simmer, “Superdirective microphone arrays,” in
Microphone Arrays. Springer Berlin Heidelberg, 2001, pp. 19–
[10] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and
R. Haeb-Umbach, “Beamnet: End-to-end training of a
beamformer-supported multi-channel ASR system,” in 2017 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2017, pp. 5325–5329.
[11] S. Braun, D. Neil, J. Anumula, E. Ceolini, and S.-C. Liu,
“Multi-channel attention for end-to-end speech recognition,” in
Proc. of Interspeech, 2018.
[12] B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani,
“Neural network adaptive beamforming for robust multichannel
speech recognition,” 2016.
[13] T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao,
“Unified architecture for multichannel end-to-end speech recognition
with neural beamforming,” IEEE Journal of Selected Topics in
Signal Processing, vol. 11, no. 8, pp. 1274–1288, 2017.
[14] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L.
Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, “Deep
beamforming networks for multi-channel speech recognition,” in Proc.
of ICASSP, 2016, pp. 5745–5749.
[15] S. Kim and I. Lane, “End-to-end speech recognition with auditory
attention for multi-microphone distance speech recognition,” in
Proc. Interspeech 2017, 2017.
[16] Y. Liu, P. Zhang, and T. Hain, “Using neural network front-ends
on far field multiple microphones based speech recognition,” in
Proc. of ICASSP, 2014, pp. 5542–5546.
[17] T. Parcollet, M. Morchid, and G. Linarès, “A survey of quaternion
neural networks,” Artificial Intelligence Review, pp. 1–26, 2019.
[18] T. Isokawa, T. Kusakabe, N. Matsui, and F. Peper, “Quaternion
neural network and its application,” in International conference on
knowledge-based and intelligent information and engineering systems.
Springer, 2003, pp.
[19] T. Parcollet, M. Morchid, and G. Linarès, “Quaternion
convolutional neural networks for heterogeneous image processing,” in
ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8514–
[20] D. Comminiello, M. Lella, S. Scardapane, and A. Uncini,
“Quaternion convolutional neural networks for detection and
localization of 3D sound events,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2019, pp. 8533–8537.
[21] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi,
R. De Mori, and Y. Bengio, “Quaternion recurrent neural networks,”
arXiv preprint arXiv:1806.04418, 2018.
[22] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and
M. Omologo, “The DIRHA-English corpus and related tasks for
distant-speech recognition in domestic environments,” in 2015
IEEE Workshop on Automatic Speech Recognition and Understanding
(ASRU), 2015, pp. 275–282.
[23] M. Ravanelli, T. Parcollet, and Y. Bengio, “The PyTorch-Kaldi
speech recognition toolkit,” in ICASSP 2019-2019 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2019, pp. 6465–6469.
[24] W. R. Hamilton and C. J. Joly, Elements of quaternions.
Longmans, Green, and Company, 1899, vol. 1.
[25] P. Arena, L. Fortuna, L. Occhipinti, and M. G. Xibilia, “Neural
networks for quaternion-valued function approximation,” in
Proceedings of IEEE International Symposium on Circuits and
Systems-ISCAS’94, vol. 6. IEEE, 1994, pp. 307–310.
[26] P. Arena, L. Fortuna, G. Muscato, and M. G. Xibilia, “Multilayer
perceptrons to approximate quaternion valued functions,” Neural
Networks, vol. 10, no. 2, pp. 335–342, 1997.
[27] X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proceedings of the
thirteenth international conference on artificial intelligence and
statistics, 2010, pp. 249–256.
[28] T. Parcollet, Y. Zhang, M. Morchid, C. Trabelsi, G. Linarès,
R. De Mori, and Y. Bengio, “Quaternion convolutional neural
networks for end-to-end automatic speech recognition,” arXiv
preprint arXiv:1806.07789, 2018.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
Surpassing human-level performance on ImageNet classification,”
in Proceedings of the IEEE international conference on computer
vision, 2015, pp. 1026–1034.
[30] S. Watanabe, M. Mandel, J. Barker, and E. Vincent, “CHiME-6
challenge: Tackling multispeaker speech recognition for unsegmented
recordings,” arXiv preprint arXiv:2004.09249, 2020.
[31] A. Cariow and G. Cariowa, “Fast algorithms for quaternion-valued
convolutional neural networks,” IEEE Transactions on
Neural Networks and Learning Systems, 2020.
[32] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S.
Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus
CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon technical report
n, vol. 93, 1993.
[33] L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad,
M. Hagmüller, and P. Maragos, “The DIRHA simulated corpus,” in
LREC, 2014, pp. 2629–2634.
[34] M. Ravanelli, A. Sosi, P. Svaizer, and M. Omologo, “Impulse
response estimation for robust speech recognition in a reverberant
environment,” in 2012 Proceedings of the 20th European Signal
Processing Conference (EUSIPCO). IEEE, 2012, pp. 1668–
[35] M. Ravanelli and M. Omologo, “On the selection of the impulse
responses for distant-speech recognition based on contaminated
speech training,” in Proc. of Interspeech, 2014.
[36] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al.,
“The Kaldi speech recognition toolkit,” in IEEE 2011 workshop
on automatic speech recognition and understanding.
IEEE Signal Processing Society, 2011.
[37] M. Ravanelli, P. Svaizer, and M. Omologo, “Realistic
multi-microphone data simulation for distant speech recognition,” arXiv
preprint arXiv:1711.09470, 2017.
[38] M. Ravanelli and M. Omologo, “Contaminated speech training
methods for robust DNN-HMM distant speech recognition,” arXiv
preprint arXiv:1710.03538, 2017.
[39] C. J. Gaudet and A. S. Maida, “Deep quaternion networks,”
in 2018 International Joint Conference on Neural Networks
(IJCNN). IEEE, 2018, pp. 1–8.