FusionRNN: Shared Neural Parameters
for Multi-Channel Distant Speech Recognition
Titouan Parcollet1, Xinchi Qiu1, Nicholas Lane1,2
1 University of Oxford, United Kingdom
2 Samsung AI, Cambridge, United Kingdom
Abstract
Distant speech recognition remains a challenging application for modern deep learning based Automatic Speech Recognition (ASR) systems, due to complex recording conditions involving noise and reverberation. Multiple microphones are commonly combined with well-known speech processing techniques to enhance the original signals and thus the speech recognizer performance. These multi-channel signals follow similar input distributions with respect to the global speech information but also contain an important part of noise. Consequently, the robustness of the input representation is key to obtaining reasonable recognition rates. In this work, we propose a Fusion Layer (FL) based on shared neural parameters. We use it to produce an expressive embedding of multiple microphone signals that can easily be combined with any existing ASR pipeline. The proposed model, called FusionRNN, showed promising results on a multi-channel distant speech recognition task, and consistently outperformed baseline models while maintaining an equal training time.
Index Terms: Multi-channel distant speech recognition, shared
neural parameters, light gated recurrent unit neural networks.
1. Introduction
Modern automatic speech recognition (ASR) systems signifi-
cantly struggle in more realistic distant-talking speech scenar-
ios [1, 2], despite their promising results in close-talking and
controlled conditions. Indeed, distant speech recognition is significantly more difficult as it often involves speech signals highly corrupted by noise and reverberation [3].
The use of multi-microphone arrays is a common approach to enhance distant-talking recognizer performance [4, 5]. The variety of microphones enables the model to receive different and complementary views of the same acoustic event. Consequently, it facilitates the differentiation between noise, reverberation, and the relevant acoustic information, thus improving the robustness of the speech recognition system.
Multi-microphone input arrays require the adoption of sig-
nal processing techniques aimed at efficiently combining dif-
ferent signals. Traditionally, the beamforming method [6] is
commonly used to achieve spatial selectivity. It enables the ob-
tained representation to privilege the spatial areas of the target
speaker, and reduces the impact of noise and reverberation. Ex-
amples of beamforming are delay-and-sum and filter-and-sum
[7, 8]. Most of these techniques propose to realign the differ-
ent signals in the time domain, and to enhance, increase or filter
the energy observed for the relevant speech information by per-
forming specific operations.
More recently, deep multi-microphone signal processing methods have been developed on the back of the current deep learning renaissance [9, 10, 11, 12, 13, 14]. They integrate DSP
techniques into the pre-existing neural ASR systems. Such inte-
gration of deep learning and traditional signal processing tends
to deliver speech pipelines that are more straightforward, trans-
parent, and better performing. For instance, a first paper in this direction has shown that simply concatenating multi-channel features along one dimension and feeding a neural acoustic model with them is sufficient to achieve competitive results [15].
Nevertheless, complex inter- and intra- dependencies exist-
ing between different signals remain difficult to capture with
such simple approaches. Indeed, the different microphones are
connected to various independent weights and both inter- and
intra- relations are considered at the same level, while we may assume that the input representation is made of M different microphones sharing slightly different but close input distributions. For instance, R, G, B color components characterizing a single pixel may be considered as an analogy to the 3 Mel filter bank energies obtained from 3 different microphones for
a single time frame. In this example, a good model must cre-
ate an expressive and robust latent representation of close but
noisy input distributions (i.e. the variations between the differ-
ent microphones) to enable a better understanding of the speech
information carried by the signal (i.e. what is being spoken).
One approach to modelling such dependencies is to inject prior knowledge [11]. In that work, the authors proposed an adaptive neural beamformer with learned filters to perform delay-and-sum beamforming. Despite good performance, this approach relies on a specific set of beamforming functions and is therefore limited by their definitions. Furthermore, such jointly-trained models often increase the overall complexity, thus prolonging the training time [16].
Our proposal is to consider other inductive biases related to
the multi-dimensionality of the data rather than just to its speech
representation. First, this proposal draws on the treatment of
multidimensional features in deep learning applications includ-
ing image processing (3D pixels), natural language processing
(N-dimensional vectors for a given token), robotics (3D coordi-
nates). In these different contexts, high-dimensional neural net-
works are applied partly due to the fact that their algebras enable
the learning of embeddings that take into account both intra-
and inter- dependencies. Examples are complex-valued neu-
ral networks [17, 18], quaternion neural networks [19, 20], and
models relying on the Clifford algebra [21, 22]. All these archi-
tectures have demonstrated promising performances on a wide
range of applications due to a specific weight sharing scheme,
ensuring natural encodings of the internal relations [23]. Re-
cently, shared weights have been investigated in the context of
heterogeneous audio datasets processing [24], showing better
out-of-distribution generalisation capabilities than recent trans-
fer learning methods. Siamese neural networks also heavily rely
on shared weights [25, 26] to project their input vectors into a
shared latent subspace.
Figure 1: Illustration of a FusionRNN combining a Fusion Layer (left) and a Light Gated Recurrent Unit RNN (right). A new embedding of the M different microphones is learned through a sum of M non-linear projections of the input signals with shared parameters. This embedding is then used as the new input representation for the neural acoustic model.
Our proposal delivers a unified speech pipeline for multi-channel distant ASR referred to as FusionRNN. It is composed of a novel Fusion Layer (see Section 2) and a Light Gated Recurrent Unit (liGRU) neural network. We hypothesized that our Fusion Layer would produce a more expressive and robust embedding from different microphone signals, and increase the accuracy of the recurrent acoustic model for distant ASR. The experiments conducted on the DIRHA-English [27] dataset support this and highlight consistent improvements in terms of Word Error Rate (WER). We compared the FusionRNN to baseline approaches in the same required training time category. Finally, we release the code under PyTorch-Kaldi [28] to facilitate reproducibility¹.
2. FusionRNN
This Section first motivates and presents the fusion layer (Sec-
tion 2.1). Then, the FusionRNN is introduced as a combina-
tion of a FL and a liGRU network to compose a neural acoustic
model (Section 2.2).
2.1. Fusion Layer
Our Fusion Layer (FL) is at the core of the FusionRNN. We
expect that if multi-channel signals have close but noisy input
distributions then building a common latent representation for
these data will help to reduce both the noise and the final tran-
scription error rate. Therefore, we made our FL project each mi-
crophone signal to a latent sub-space with a weight matrix that
is shared among the different channels (Figure 1). The power
of this approach lies in its versatility as it can efficiently encode
multi-channel input features for any well-known neural archi-
tecture while maintaining an equivalent training time.
Let x_n^{m,t} be the input to a node n coming from microphone m (0 ≤ m ≤ M) at a time-step t, where M is the total number of microphones (M ∈ [1, +∞[). The shared weight matrix W is composed of N × H weight parameters w, with N and H corresponding to the input and hidden vector sizes respectively. The output x̃_h^t describing the fusion of the M microphones obtained from the shared weights of the FL at the output node h is computed as follows:

\tilde{x}_h^t = \sum_{m=0}^{M} \alpha\Big( \sum_{n=0}^{N} w_{n,h}\, x_n^{m,t} + b_n \Big),   (1)

with α any non-linear activation function and b_n a bias term.
The introduction of α before the summation over the different signals allows the FL to exhibit non-linear responses with respect to different inputs while sharing the same weight parameters. Indeed, without α, a fusion layer can be reduced to a common fully-connected layer with a specific constraint and three times fewer degrees of freedom.
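As a concrete illustration, Eq. 1 can be sketched in NumPy as follows. This is a minimal sketch, not the released PyTorch-Kaldi implementation; the PReLU slope and the bias handling (shown here per output unit) are illustrative assumptions:

```python
import numpy as np

def prelu(x, a=0.25):
    # Parametric ReLU, the non-linearity alpha used for the FL in this work.
    return np.where(x > 0, x, a * x)

def fusion_layer(x, W, b, alpha=prelu):
    """Eq. 1 sketch.  x has shape (M, N): one N-dim feature vector per
    microphone for a single time frame.  W (N, H) and b (H,) are shared
    across the M channels; the M non-linear projections are summed."""
    return sum(alpha(x[m] @ W + b) for m in range(x.shape[0]))

rng = np.random.default_rng(0)
M, N, H = 6, 40, 512                 # 6 mics, 40 Mel energies, 512 hidden units
x = rng.standard_normal((M, N))
W = rng.standard_normal((N, H)) * 0.01
b = np.zeros(H)
emb = fusion_layer(x, W, b)          # H-dim fused embedding of the M channels
```

With the identity in place of α, the layer collapses onto a constrained fully-connected layer over the summed channels, which is the degenerate case mentioned above.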
Note that Eq. 1 can be transformed into a 1D convolutional layer in the context of a precise setup. For instance, we implement Eq. 1 by concatenating the different channels along one dimension and by applying a 1D convolution with a kernel size and a stride equal to N. The resulting M outputs of dimension H can then be summed to obtain x̃_h^t.
The FL reduces the number of neural parameters of the input layers from (N × M) × H for an equivalent fully-connected layer to N × H, while preserving the number of computations. We believe that this reduction will make a speech recognizer equipped with a FL faster to converge and better at generalizing from noisy inputs.
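The parameter saving can be checked with a line of arithmetic; the sizes below match the experimental setup of Section 3 and are otherwise illustrative:

```python
N, H, M = 40, 512, 6        # Mel energies per channel, hidden size, microphones

fc_params = (N * M) * H     # fully-connected layer over concatenated channels
fl_params = N * H           # shared weight matrix of the Fusion Layer

# The FL input layer is exactly M times smaller than its FC counterpart.
assert fc_params == M * fl_params
```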
Following prior works that investigate weight sharing archi-
tectures such as high-dimensional neural networks [29, 23, 30]
or Siamese neural networks [25, 26], our FL is expected to pro-
vide an efficient way to capture global variations affecting dif-
ferent microphones, thus increase the robustness of the speech
recognizer. In comparison, an embedding obtained with a fully-
connected layer may exhibit higher variances across different
dimensions of the generated vector, in presence of strong per-
turbations, which potentially may harm the final performance.
2.2. Integration to Light Gated Recurrent Units
Due to its simple formulation, the fusion layer can easily be integrated into existing neural architectures by simply replacing all the fully-connected input layers. In this work, we propose to extend the FL to the light gated recurrent unit (liGRU), first introduced in [31], to compose the acoustic modelling part of a hybrid DNN-HMM automatic speech recognizer.
LiGRUs are a revised version of the well-known GRU recurrent neural networks that accounts for the specificities of speech recognition. More precisely, liGRU models remove the reset gate, replace the hyperbolic tangent across the hidden state with a rectified linear unit, and add a specific input-to-hidden batch-normalisation to stabilize and speed up the training. In practice, liGRUs have been shown to consistently outperform both GRU and LSTM RNNs in different speech recognition contexts, including distant ASR [31, 28], both in terms of training speed and accuracy. The liGRU equations are summarised as follows:
z_t = \sigma( BN(W_z x_t) + U_z h_{t-1} ),   (2)
\tilde{h}_t = \mathrm{ReLU}( BN(W_h x_t) + U_h h_{t-1} ),   (3)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,   (4)
with z_t, h̃_t and h_t the update gate, the candidate state and the hidden state at time-step t respectively. The batch-normalisation denoted BN(x) follows the definition given in [32] and normalises the processed mini-batch by considering internal statistics. Biases are integrated into the BN and are therefore omitted from the liGRU equations.
The FusionRNN inference process is obtained by replacing the input layers of z_t and h̃_t with fusion layers:

z_t = \sigma( BN(FL(x_t)) + U_z h_{t-1} ),   (5)
\tilde{h}_t = \mathrm{ReLU}( BN(FL(x_t)) + U_h h_{t-1} ).   (6)
The hidden state htis then updated following the standard
liGRU formulation. Finally, the FusionRNN is trained follow-
ing the backpropagation through time with respect to common
cost functions.
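A single recurrent step of Eqs. 5-6 followed by the standard liGRU state update can be sketched as follows. This is a NumPy sketch under simplifying assumptions: batch normalisation is reduced to a per-vector standardisation, and the fusion layers are toy bias-free callables:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def bn(v, eps=1e-5):
    # Stand-in for mini-batch normalisation: standardises one vector.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def fusion_ligru_step(x, h_prev, fl_z, fl_h, Uz, Uh):
    """One time step: x is the (M, N) multi-channel frame, h_prev the
    previous H-dim hidden state, fl_z / fl_h the fusion layers feeding
    the update gate and the candidate state."""
    z = sigmoid(bn(fl_z(x)) + Uz @ h_prev)               # Eq. 5
    h_cand = np.maximum(0.0, bn(fl_h(x)) + Uh @ h_prev)  # Eq. 6
    # Standard liGRU update: convex combination of old and candidate state.
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(1)
M, N, H = 6, 40, 8
Wz, Wh = rng.standard_normal((2, N, H)) * 0.1
fl_z = lambda x: sum(np.maximum(0.0, x[m] @ Wz) for m in range(M))
fl_h = lambda x: sum(np.maximum(0.0, x[m] @ Wh) for m in range(M))
Uz, Uh = np.eye(H) * 0.5, np.eye(H) * 0.5
h = fusion_ligru_step(rng.standard_normal((M, N)), np.zeros(H),
                      fl_z, fl_h, Uz, Uh)
```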
3. Experimental Protocol
We propose to evaluate the performance of the FusionRNN on
the multi-channel distant speech recognition task on the DIRHA
dataset [27] (Section 3.1). Our FusionRNN is compared to
equivalent neural networks by varying the number of considered microphones and the initial acoustic feature representation (Section 3.2).
3.1. The DIRHA Dataset
Experiments are conducted on the DIRHA-English corpus [33].
This dataset models a domestic environment characterized by
the presence of non-stationary noise and acoustic reverberation
enabling various benchmarks of speech-based systems in more
realistic conditions. 40 Mel filter bank energies and 13 MFCCs were computed with windows of 25 ms and an overlap of 10 ms to be used as initial acoustic representations [34]. These two common acoustic features are considered to illustrate the independence of the results with respect to the initial input representation of the signal. Then, delay-and-sum beamforming is applied over the six different microphones to serve as a baseline.
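For reference, delay-and-sum beamforming realigns the channels in time before averaging them. A minimal sketch, assuming integer sample delays estimated elsewhere (e.g. by cross-correlation against a reference microphone):

```python
import numpy as np

def delay_and_sum(signals, delays):
    """signals: (M, T) array of raw waveforms; delays: per-microphone
    integer sample delays.  Each channel is shifted back by its delay so
    that the target speech lines up, then the channels are averaged."""
    M, _ = signals.shape
    aligned = [np.roll(signals[m], -delays[m]) for m in range(M)]
    return np.mean(aligned, axis=0)

# Toy check: the same waveform observed with different delays is recovered.
rng = np.random.default_rng(2)
clean = rng.standard_normal(1000)
delays = [0, 3, 7, 12]
obs = np.stack([np.roll(clean, d) for d in delays])
recovered = delay_and_sum(obs, delays)
```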
The training is based on the original Wall-Street-Journal-5k
(WSJ) corpus (i.e. consisting of 7138 sentences uttered by 83
speakers) contaminated with a set of impulse responses mea-
sured in a real apartment [35, 36]. Both a real (Test Real) and
a simulated (Test Sim) dataset are used for testing, each con-
sisting of 409 WSJ sentences uttered by six native American
speakers. Note that a validation set of 310 WSJ sentences is
used for hyper-parameter tuning.
The full circular array of six microphones is considered to
evaluate the impact of the number of microphones available on
our models.
3.2. Models Architectures
LiGRU and FusionRNN are parametrized following the best
model corresponding to the DIRHA recipe proposed in [28].
Models are fed with m microphone signals (1 ≤ m ≤ M) corresponding to a single time frame from each microphone (i.e. no right or left context). The bidirectional liGRU layers are
composed of 512 neurons and are stacked before being fed to
the last softmax-based layer for classification. Then the out-
put labels are the different HMM states of the Kaldi decoder.
The fusion layer activation functions are all parametric ReLU
(PReLU) [37].
Recurrent weights are initialized orthogonally [38] while
input to hidden weights are sampled from a normal distribu-
tion following the Glorot criterion [39]. Both FusionRNN and
liGRU models are composed of roughly 8M neural parameters
and are trained with RMSprop for 20 epochs with an initial learning rate of 1.6e-3. The learning rate is halved every time the loss on the validation set increases to ensure an optimal convergence. A dropout rate of 0.2 is applied on all the
recurrent layers. Input sequences are chunked to 100 time-steps
to warm up the training and doubled at the end of each epoch
up to 500. The models are based on the same PyTorch imple-
mentation to alleviate the variations that could be observed with
different source codes.
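The two scheduling rules above can be sketched as plain helpers (hypothetical names; the released PyTorch-Kaldi recipe implements them in its own way):

```python
def next_learning_rate(lr, val_losses):
    # Halve the learning rate whenever the validation loss increased
    # relative to the previous epoch, otherwise keep it unchanged.
    if len(val_losses) >= 2 and val_losses[-1] > val_losses[-2]:
        return lr / 2.0
    return lr

def chunk_length(epoch, start=100, cap=500):
    # Input sequences start at 100 time-steps and double after each
    # epoch, capped at 500.
    return min(start * (2 ** epoch), cap)
```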
4. Results and Discussions
First, the results obtained with different numbers of microphones on the DIRHA dataset are reported in Table 1. The FusionRNN always outperforms the equivalent standard liGRU conditioned on the same number of microphones on both the real and simulated test sets. Hence, average absolute improvements of 0.7% and 0.6% are obtained with the FusionRNN on the real test set with MFCC and FBANK features respectively. The same results are observed on the simulated test set, with an average gain of 0.9% and 1% based on the same input conditions. This phenomenon highlights the transferability of the fusion layer to different initial acoustic representations. In particular, a best WER of 24.5% is reported on the real test set with a FusionRNN fed with 6 microphones, compared to 25.0% for the standard liGRU. It is worth underlining that this represents a decrease of 2.9% in WER over prior experiments with DIRHA and non-speaker-adapted acoustic features [31].
The gap in transcription error rate between FusionRNN and liGRU increases with the number of microphones, as shown in the last column of Table 1. An initial absolute gain of 0.2% on average (i.e. with respect to all test sets and features) is observed with 2 microphones, increasing to 1.1% with 5 microphones. This behaviour tends to validate the assumption that weight sharing is helpful in the context of an increasing number of acoustic sources, by learning an expressive latent representation of the different input distributions.
The introduction of the sixth microphone (LA6) slightly
harms the liGRU results, while the FusionRNN performance is
unchanged. This may be because the sixth microphone (LA6) is
positioned at the centre of the circular array, while LA1–LA5
form the circle. This finding is a first step towards supporting
the increased robustness offered by the fusion layer.
Table 1: Results are expressed in terms of Word Error Rate (WER) (i.e. lower is better) for different models on the DIRHA dataset
with different acoustic features. “Test Sim.” corresponds to the simulated test set of the corpus, while “Test Real” is the set composed
of real recordings. Beam-liGRU is a liGRU fed with delay-and-sum beamformed acoustic features. “Gain” is the absolute average
improvement observed with FusionRNN on all test sets and features.
Models        Nb. of Mic.   MFCC                    FBANK                   Gain
                            Test Real   Test Sim.   Test Real   Test Sim.
Beam-liGRU         6          27.2        22.0        27.9        21.9       --
liGRU [31]         1          27.8        21.3        27.6        21.4       --
liGRU              2          26.4        20.1        27.1        21.2       --
FusionRNN          2          26.4        20.0        26.8        20.8      -0.2
liGRU              3          26.4        19.3        26.2        20.2       --
FusionRNN          3          25.3        18.5        25.7        19.3      -0.9
liGRU              4          26.0        19.1        26.1        20.1       --
FusionRNN          4          25.0        18.5        25.5        19.0      -1.0
liGRU              5          25.2        19.8        25.8        20.1       --
FusionRNN          5          24.7        18.4        24.9        18.5      -1.1
liGRU              6          25.0        19.9        25.9        19.5       --
FusionRNN          6          24.5        18.4        25.1        18.5      -1.0
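For reference, the Beam-liGRU baseline in Table 1 relies on delay-and-sum beamforming. A toy sketch of the idea, assuming the per-channel integer sample delays are already known (real systems estimate them, e.g. with the generalized cross-correlation method [7]):

```python
def delay_and_sum(signals, delays):
    """Align each channel by its delay (in samples) and average them."""
    n = min(len(s) - d for s, d in zip(signals, delays))
    return [sum(s[d + t] for s, d in zip(signals, delays)) / len(signals)
            for t in range(n)]

clean = [0.0, 1.0, 0.0, -1.0, 0.0]
mic1 = clean + [0.0]   # no propagation delay
mic2 = [0.0] + clean   # same wavefront, one sample later
print(delay_and_sum([mic1, mic2], [0, 1]))  # → the clean signal
```

Summing the aligned channels reinforces the coherent speech component, while uncorrelated noise is partially averaged out.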
A crucial benefit of the Fusion Layer lies in the consistency
of the WER improvements it delivers. The proposed
FusionRNN always performs better. Furthermore, the FusionRNN
transfers well to different acoustic representations, achieving
superior results with both MFCC and FBANK features. There-
fore, similar improvements may be expected with other common
extraction techniques, including fMLLR and PLP, that have been
shown to be particularly helpful in noisy and reverberated
conditions [31].
Second, we investigated the impact of the fusion layer (Eq.
1) on the training time of the FusionRNN compared to a stan-
dard liGRU equipped with a plain fully-connected layer. Figure
2 reports the durations in seconds needed to complete each
epoch by the different models. This quantifies the latency
introduced by the FL as described in Section 2.1. We find
that the cost, in both absolute and relative terms, is marginal.
The FusionRNN completes one epoch in 935 seconds on aver-
age, compared to 929 for the liGRU. This 0.7% difference could
be further reduced with a well-designed PyTorch optimi-
sation of the fusion layer.
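As a rough illustration of why the overhead is so small, one plausible reading of a shared-parameter fusion layer is a single affine transform applied to every microphone channel, with the projections then merged; this is a pure-Python sketch under that assumption, and the exact combination rule of Eq. 1 may differ.

```python
def affine(x, W, b):
    """y = W x + b for a single feature vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def fusion_layer(channels, W, b):
    """Project each microphone channel with the SAME (W, b) and sum."""
    out = [0.0] * len(b)
    for x in channels:
        out = [o + yi for o, yi in zip(out, affine(x, W, b))]
    return out

# Two identical 3-dim channels with identity weights: output is their sum.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
b = [0.0, 0.0, 0.0]
mics = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
print(fusion_layer(mics, W, b))  # → [2.0, 4.0, 6.0]
```

Because W and b are reused across channels, adding a microphone adds no parameters, consistent with the near-identical training times reported in Figure 2.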
Figure 2: Training time in seconds recorded for each epoch with
both FusionRNN and liGRU ASR systems. Models have been
trained on a single Nvidia RTX 2080 Ti.
Note that the significant increase in duration observed be-
tween the first epoch and the fourth one with both models is due
to the training strategy. Sequence lengths are gradually increased
to warm up the training of the acoustic models. Here, the up-
per limit is 500 time frames and is reached at the fourth epoch.
Finally, and as shown in the experiments, a fusion layer
can easily be embedded in any pre-existing multi-channel neu-
ral acoustic model to reduce the number of transcription errors,
at the sole cost of its implementation.
5. Conclusion
Summary. This paper first introduced a fusion layer that
can easily be plugged into any existing multi-channel ASR
systems to learn expressive embeddings from multiple micro-
phone signals while maintaining the training time effectively
unaltered. Furthermore, its shared neural parameters allow
the FL to project close but different input distributions com-
ing from different microphones in a shared latent subspace,
which is more robust to perturbations induced by distant ASR.
The FL, when included in the FusionRNN, delivered a consistent,
material, and robust improvement in transcription error rate over
resource-equivalent architectures on a multi-channel distant
speech recognition task.
Perspectives. Despite a few prior works on shared weights, it re-
mains unclear how this mechanism affects the filtering of the
acoustic signal in the fusion layer. Therefore, future work
will measure the response of the FusionRNN to out-of-
distribution speech samples. In this context, we expect the
obtained embedding to exhibit less variance across the differ-
ent recordings compared to traditional approaches, hence induc-
ing better generalisation capabilities.
6. Acknowledgements
This work was supported by the EPSRC through MOA
(EP/S001530/) and Samsung AI. We would also like to thank
Filip Svoboda for his numerous and useful comments.
7. References
[1] M. Wölfel and J. W. McDonough, Distant speech recognition.
Wiley Online Library, 2009.
[2] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic
Speech Recognition - A Bridge to Practical Applications (1st Edi-
tion), October 2015.
[3] M. Ravanelli, Deep learning for Distant Speech Recognition.
PhD Thesis, Unitn, 2017.
[4] M. Brandstein and D. Ward, Microphone arrays: signal process-
ing techniques and applications. Springer Science & Business
Media, 2013.
[5] J. Benesty, J. Chen, and Y. Huang, Microphone array signal pro-
cessing. Springer Science & Business Media, 2008, vol. 1.
[6] W. Kellermann, Beamforming for Speech and Audio Signals,
[7] C. H. Knapp and G. C. Carter, “The generalized correlation
method for estimation of time delay,” IEEE Transactions on
Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–
327, 1976.
[8] M. Kajala and M. Hamalainen, “Filter-and-sum beamformer with
adjustable filter characteristics,” in Proc. of ICASSP, 2001, pp.
[9] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-
Umbach, “Beamnet: End-to-end training of a beamformer-
supported multi-channel asr system,” in 2017 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2017, pp. 5325–5329.
[10] S. Braun, D. Neil, J. Anumula, E. Ceolini, and S.-C. Liu, “Multi-
channel attention for end-to-end speech recognition,” 2018 Inter-
speech, pp. 0–0, 2018.
[11] B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani,
“Neural network adaptive beamforming for robust multichannel
speech recognition,” 2016.
[12] T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao, “Uni-
fied architecture for multichannel end-to-end speech recognition
with neural beamforming,” IEEE Journal of Selected Topics in
Signal Processing, vol. 11, no. 8, pp. 1274–1288, 2017.
[13] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L.
Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, “Deep beam-
forming networks for multi-channel speech recognition,” in Proc.
of ICASSP, 2016, pp. 5745–5749.
[14] S. Kim and I. Lane, “End-to-end speech recognition with auditory
attention for multi-microphone distance speech recognition,” in
Proc. Interspeech 2017, 2017.
[15] Y. Liu, P. Zhang, and T. Hain, “Using neural network front-ends
on far field multiple microphones based speech recognition,” in
Proc. of ICASSP, 2014, pp. 5542–5546.
[16] T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao, “Uni-
fied architecture for multichannel end-to-end speech recognition
with neural beamforming,” IEEE Journal of Selected Topics in
Signal Processing, vol. 11, no. 8, pp. 1274–1288, 2017.
[17] A. Hirose, Complex-valued neural networks: theories and appli-
cations. World Scientific, 2003, vol. 5.
[18] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian,
J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J.
Pal, “Deep complex networks,” arXiv preprint arXiv:1705.09792, 2017.
[19] T. Parcollet, M. Morchid, and G. Linarès, “A survey of quaternion
neural networks,” Artificial Intelligence Review, pp. 1–26, 2019.
[20] C. J. Gaudet and A. S. Maida, “Deep quaternion networks,”
in 2018 International Joint Conference on Neural Networks
(IJCNN). IEEE, 2018, pp. 1–8.
[21] S. Buchholz and G. Sommer, “On clifford neurons and clifford
multi-layer perceptrons,” Neural Networks, vol. 21, no. 7, pp.
925–935, 2008.
[22] Y. Liu, P. Xu, J. Lu, and J. Liang, “Global stability of clifford-
valued recurrent neural networks with time delays,Nonlinear
Dynamics, vol. 84, no. 2, pp. 767–777, 2016.
[23] T. Parcollet, M. Morchid, and G. Linarès, “Quaternion convolu-
tional neural networks for heterogeneous image processing,” in
ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8514–
[24] J. S. Larsen and L. Clemmensen, “Weight sharing and deep
learning for spectral data,” in ICASSP 2020-2020 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2020, pp. 4227–4231.
[25] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural net-
works for one-shot image recognition,” in ICML deep learning
workshop, vol. 2. Lille, 2015.
[26] A. Gresse, M. Quillot, R. Dufour, V. Labatut, and J.-F. Bonas-
tre, “Similarity metric based on siamese neural networks for voice
casting,” in ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019,
pp. 6585–6589.
[27] L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad,
M. Hagmüller, and P. Maragos, “The DIRHA simulated corpus,” in
LREC, 2014, pp. 2629–2634.
[28] M. Ravanelli, T. Parcollet, and Y. Bengio, “The pytorch-kaldi
speech recognition toolkit,” in ICASSP 2019-2019 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2019, pp. 6465–6469.
[29] T. Isokawa, T. Kusakabe, N. Matsui, and F. Peper, “Quaternion
neural network and its application,” in International conference
on knowledge-based and intelligent information and engineering
systems. Springer, 2003, pp. 318–324.
[30] M. Morchid, G. Linares, M. El-Beze, and R. De Mori, “Theme
identification in telephone service conversations using quater-
nions of speech features.” in INTERSPEECH, 2013, pp. 1394–
[31] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Light
gated recurrent units for speech recognition,” IEEE Transactions
on Emerging Topics in Computational Intelligence, vol. 2, no. 2,
pp. 92–102, 2018.
[32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
deep network training by reducing internal covariate shift,” arXiv
preprint arXiv:1502.03167, 2015.
[33] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and
M. Omologo, “The dirha-english corpus and related tasks for
distant-speech recognition in domestic environments,” in 2015
IEEE Workshop on Automatic Speech Recognition and Under-
standing (ASRU), 2015, pp. 275–282.
[34] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al.,
“The kaldi speech recognition toolkit,” in IEEE 2011 workshop
on automatic speech recognition and understanding, no. CONF.
IEEE Signal Processing Society, 2011.
[35] M. Ravanelli, P. Svaizer, and M. Omologo, “Realistic multi-
microphone data simulation for distant speech recognition,” arXiv
preprint arXiv:1711.09470, 2017.
[36] M. Ravanelli and M. Omologo, “Contaminated speech training
methods for robust dnn-hmm distant speech recognition,” arXiv
preprint arXiv:1710.03538, 2017.
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
Surpassing human-level performance on ImageNet classification,”
in Proceedings of the IEEE international conference on computer
vision, 2015, pp. 1026–1034.
[38] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions
to the nonlinear dynamics of learning in deep linear neural net-
works,” arXiv preprint arXiv:1312.6120, 2013.
[39] X. Glorot and Y. Bengio, “Understanding the difficulty of train-
ing deep feedforward neural networks,” in Proceedings of the
thirteenth international conference on artificial intelligence and
statistics, 2010, pp. 249–256.