THE PYTORCH-KALDI SPEECH RECOGNITION TOOLKIT
Mirco Ravanelli1, Titouan Parcollet2, Yoshua Bengio1
1 Mila, Université de Montréal, CIFAR Fellow
2 LIA, Université d'Avignon
ABSTRACT
The availability of open-source software is playing a remarkable role
in the popularization of speech recognition and deep learning. Kaldi,
for instance, is nowadays an established framework used to develop
state-of-the-art speech recognizers. PyTorch is used to build neural
networks with the Python language and has recently spawned tremendous
interest within the machine learning community thanks to its
simplicity and flexibility.
The PyTorch-Kaldi project aims to bridge the gap between these
popular toolkits, trying to inherit the efficiency of Kaldi and the
flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface
between these toolkits; it also embeds several useful features
for developing modern speech recognizers. For instance, the code is
specifically designed to naturally plug in user-defined acoustic models.
As an alternative, users can exploit several pre-implemented
neural networks that can be customized using intuitive configuration
files. PyTorch-Kaldi supports multiple feature and label streams as
well as combinations of neural networks, enabling the use of com-
plex neural architectures. The toolkit is publicly released along with
rich documentation and is designed to work properly either locally or on
HPC clusters.
Experiments conducted on several datasets and tasks
show that PyTorch-Kaldi can effectively be used to develop modern
state-of-the-art speech recognizers.
Index Terms: speech recognition, deep learning, Kaldi, PyTorch.
1. INTRODUCTION
Over the last few years, we have witnessed a progressive improvement and
maturation of Automatic Speech Recognition (ASR) technologies
[1, 2], which have reached unprecedented performance levels and are
nowadays used by millions of users worldwide.
A key role in this technological breakthrough is being played by
deep learning [3], which contributed to overcoming previous speech
recognizers based on Gaussian Mixture Models (GMMs). Beyond
deep learning, other factors have played a role in the progress of
the field. A number of speech-related projects such as AMI [4],
DICIT [5], DIRHA [6] and speech recognition challenges such
as CHiME [7], Babel, and Aspire, have remarkably fostered the
progress in ASR. The public distribution of large datasets such
as Librispeech [8] has also played an important role to establish
common evaluation frameworks and tasks.
Among other factors, the development of open-source soft-
ware such as HTK [9], Julius [10], CMU-Sphinx, RWTH-ASR [11],
LIA-ASR [12] and, more recently, the Kaldi toolkit [13] have further
helped popularize ASR, making both research and development of
novel ASR applications significantly easier.
Kaldi currently represents the most popular ASR toolkit. It re-
lies on finite-state transducers (FSTs) [14] and provides a set of C++
libraries for efficiently implementing state-of-the-art speech recogni-
tion systems. Moreover, the toolkit includes a large set of recipes that
cover all the most popular speech corpora. In parallel to the devel-
opment of this ASR-specific software, several general-purpose deep
learning frameworks, such as Theano [15], TensorFlow [16], and
CNTK [17], have gained popularity in the machine learning com-
munity. These toolkits offer huge flexibility in neural network
design and can be used for a variety of deep learning applications.
PyTorch [18] is an emerging Python package that implements ef-
ficient GPU-based tensor computations and facilitates the design of
neural architectures, thanks to proper routines for automatic gradi-
ent computation. An interesting feature of PyTorch lies in its modern
and flexible design, which naturally supports dynamic neural networks.
In fact, the computational graph is dynamically constructed on-the-
fly at running time rather than being statically compiled.
The PyTorch-Kaldi project aims to bridge the gap between Kaldi
and PyTorch1. Our toolkit implements acoustic models in PyTorch,
while feature extraction, label/alignment computation, and decod-
ing are performed with the Kaldi toolkit, making it a perfect fit to
develop state-of-the-art DNN-HMM speech recognizers. PyTorch-
Kaldi natively supports several DNN, CNN, and RNN models.
Combinations between deep learning models, acoustic features, and
labels are also supported, enabling the use of complex neural archi-
tectures. For instance, users can employ a cascade between CNNs,
LSTMs, and DNNs, or run in parallel several models that share some
hidden layers. Users can also explore different acoustic features,
context duration, neuron activations (e.g., ReLU, leaky ReLU), nor-
malizations (e.g., batch [19] and layer normalization [20]), cost func-
tions, regularization strategies (e.g., L2, dropout [21]), optimization
algorithms (e.g., SGD, Adam [22], RMSPROP), and many other
hyper-parameters of an ASR system through simple edits of a con-
figuration file.
The toolkit is designed to make the integration of user-defined
acoustic models as simple as possible. In practice, users can em-
bed their deep learning model and conduct ASR experiments even
without being fully familiar with the complex speech recognition
pipeline. The toolkit can perform computations on both local machines
and HPC clusters, and supports multi-GPU training, recovery
strategies, and automatic data chunking.
The experiments, conducted on several datasets and tasks, have
shown that PyTorch-Kaldi makes it possible to easily develop com-
petitive state-of-the-art speech recognition systems.
2. THE PYTORCH-KALDI PROJECT
An overview of the architecture adopted in PyTorch-Kaldi is re-
ported in Fig. 1. The main script run_exp.py is written in Python
and manages all the phases involved in an ASR system, including
feature and label extraction, training, validation, decoding, and scoring.
The toolkit is detailed in the following sub-sections.
1 The code is available on GitHub (github.com/mravanelli/PyTorch-kaldi/).
Fig. 1: An overview of the PyTorch-Kaldi architecture: the speech waveform
goes through feature computation and label computation (Kaldi), feature and
label reading and processing, the DNN acoustic model, and posterior
processing (PyTorch-Kaldi), followed by decoding into the final text (Kaldi).
2.1. Configuration file
The main script takes as input a configuration file in INI format2,
that is composed of several sections. The [exp] section specifies
some high-level information such as the folder used for the experiment,
the number of training epochs, and the random seed. It also allows
users to specify whether the experiments have to be conducted on a
CPU, a GPU, or on multiple GPUs. The configuration file continues
with the [dataset] sections, which specify information on features
and labels, including the paths where they are stored, the characteristics
of the context window [23], and the number of chunks in
which the speech dataset must be split. The neural models are described
in the [architecture] sections, while the [model] section
defines how these neural networks are combined. The latter section
exploits a simple meta-language that is automatically interpreted by
the run_exp.py script. Finally, the configuration file defines the
decoding parameters in the [decoding] section.
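To give a concrete flavor of this layout, a minimal configuration sketch is reported below. The section names mirror those described above, while the individual field names are hypothetical placeholders and may differ from the ones used in the released recipes (the exact syntax is given in the project documentation).

[exp]
# high-level settings (field names are illustrative)
out_folder = exp/TIMIT_example
n_epochs = 24
seed = 1234
use_gpu = True

[dataset1]
# one section per feature/label stream
features = mfcc
labels = lab_cd
context_left = 5
context_right = 5
n_chunks = 5

[architecture1]
# one section per neural model
arch_type = LiGRU
n_layers = 4
n_neurons = 550

[model]
# meta-language combining architectures, features, and labels
out = architecture1(dataset1)

[decoding]
beam = 13
lm_weight = 1.0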
2.2. Features
The feature extraction is performed with Kaldi, which natively provides
C++ command-line tools (e.g., compute-mfcc-feats, compute-fbank-feats,
compute-plp-feats) to efficiently extract the most popular speech
recognition features. The computed coefficients are stored in binary
archives (with extension .ark) and are later imported into the
Python environment using the kaldi-io utilities inherited from the
kaldi-io-for-python project3. The features are then processed by the
function load-chunk, which performs context window composition,
shuffling, as well as mean and variance normalization.
2 The configuration file is fully described in the project documentation.
3 github.com/vesis84/kaldi-io-for-python
As outlined before, PyTorch-Kaldi can manage multiple feature streams.
For instance, users can define models that exploit combinations of
MFCCs, FBANKs, PLP, and fMLLR [24] coefficients.
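As a rough illustration of this import and processing step (not the toolkit's actual load-chunk implementation), the following Python sketch reads a Kaldi archive with the kaldi-io utilities and applies context-window composition and per-utterance mean/variance normalization; the file path and window sizes are purely illustrative.

import numpy as np
import kaldi_io  # from the kaldi-io-for-python project

def context_window(feats, left=5, right=5):
    # Concatenate each frame with its left/right neighbours (edges are replicated).
    padded = np.pad(feats, ((left, right), (0, 0)), mode='edge')
    frames = [padded[i:i + feats.shape[0]] for i in range(left + right + 1)]
    return np.concatenate(frames, axis=1)

chunk = {}
for utt_id, mat in kaldi_io.read_mat_scp('data/train/feats.scp'):
    mat = context_window(mat, left=5, right=5)
    mat = (mat - mat.mean(axis=0)) / (mat.std(axis=0) + 1e-8)  # per-utterance CMVN
    chunk[utt_id] = mat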
2.3. Labels
The main labels used for training the acoustic model derive from a
forced alignment procedure between the speech features and the se-
quence of context-dependent phone states computed by Kaldi with
a phonetic decision tree. To enable multi-task learning, PyTorch-Kaldi
supports multiple labels. For instance, it is possible to jointly
load both context-dependent and context-independent targets and
use the latter ones to perform monophone regularization [25, 26].
It is also possible to employ models based on an ecosystem of neu-
ral networks performing different tasks, as done in the context of
joint training between speech enhancement and speech recognition
[27–29] or in the context of the recently-proposed cooperative net-
works of deep neural networks [30].
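As a minimal sketch of such a multi-task setup, one could attach two softmax classifiers to a shared encoder, one predicting context-dependent states and one predicting monophones, and sum the two cross-entropy losses. The model below is a simplified illustration rather than the toolkit's implementation, and the weighting factor is an assumption.

import torch
import torch.nn as nn

class MultiTaskAM(nn.Module):
    def __init__(self, feat_dim, hidden_dim, n_cd_states, n_monophones):
        super(MultiTaskAM, self).__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.cd_head = nn.Linear(hidden_dim, n_cd_states)      # context-dependent targets
        self.mono_head = nn.Linear(hidden_dim, n_monophones)   # monophone targets

    def forward(self, x):
        h = self.encoder(x)
        return self.cd_head(h), self.mono_head(h)

def multitask_loss(cd_logits, mono_logits, cd_lab, mono_lab, mono_weight=1.0):
    ce = nn.CrossEntropyLoss()
    # Monophone regularization: auxiliary loss on context-independent targets.
    return ce(cd_logits, cd_lab) + mono_weight * ce(mono_logits, mono_lab)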
2.4. Chunk and Mini-batch Composition
PyTorch-Kaldi automatically splits the full dataset into a number
of chunks, which are composed of labels and features randomly
sampled from the full corpus. Each chunk is then stored in GPU or CPU
memory and processed by the neural training script
run_nn.py. The toolkit dynamically composes different chunks at
each epoch. A set of mini-batches is then derived from them. Mini-batches
are composed of a few training examples that are used for gradient
computation and parameter optimization.
The way mini-batches are gathered strongly depends on the type
of neural network. For feed-forward models, the mini-
batches are composed of randomly shuffled features and labels sam-
pled from the chunk. For recurrent networks, the minibatches must
be composed of full sentences. Different sentences, however, are
likely to have different durations, making zero-padding necessary to
form mini-batches of the same size. PyTorch-Kaldi sorts the speech
sequences in ascending order according to their lengths (i.e., short
sentences are processed first). This approach minimizes the need for
zero-padding and turned out to be helpful in avoiding possible biases
in the batch normalization statistics. Moreover, it proved useful for
slightly boosting the performance and improving the numerical
stability of gradients.
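The following simplified sketch illustrates this batching strategy for recurrent models, sorting utterances by length and zero-padding them with standard PyTorch utilities; it is not the toolkit's exact batching code.

import torch
from torch.nn.utils.rnn import pad_sequence

def make_minibatches(utterances, batch_size):
    # utterances: list of (features, labels) tensors of shape (T_i, D) and (T_i,).
    # Sort by ascending length so that utterances in a batch have similar duration,
    # which minimizes the amount of zero-padding.
    utterances = sorted(utterances, key=lambda u: u[0].shape[0])
    for i in range(0, len(utterances), batch_size):
        batch = utterances[i:i + batch_size]
        feats = pad_sequence([f for f, _ in batch])                    # (T_max, B, D)
        labs = pad_sequence([l for _, l in batch], padding_value=-1)   # -1 = ignore index
        lengths = torch.tensor([f.shape[0] for f, _ in batch])
        yield feats, labs, lengths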
2.5. DNN acoustic modeling
Each mini-batch is processed by a neural network implemented with
PyTorch, which takes the features as input and outputs a set of posterior
probabilities over the context-dependent phone states. The code
is designed to easily plug in customized models. As shown in the
pseudo-code of Fig. 2, a new model can simply be defined
by adding a new class to neural_nets.py. The class must
be composed of an initialization method, which specifies the parameters
and their initialization, and a forward method that defines the
computations to perform.
As an alternative, a number of pre-defined state-of-the-art neural
models are natively implemented within the toolkit. The current ver-
sion supports standard MLPs, CNNs, RNNs, LSTM, and GRU mod-
els. Moreover, it supports some advanced recurrent architectures,
such as the recently-proposed Light GRU [31] and twin-regularized
RNNs [32]. The SincNet model [33, 34] is also implemented to per-
form speech recognition from raw waveform directly.

Fig. 2: Adding a user model into PyTorch-Kaldi.

import torch.nn as nn

class my_NN(nn.Module):
    def __init__(self, options):
        super(my_NN, self).__init__()
        # Definition of model parameters
        # Parameter initialization

    def forward(self, minibatch):
        # Definition of model computations
        return [output_prob]

The hyper-parameters of the model (such as the learning rate, number of
neurons, number of layers, dropout factor, etc.) can be tuned using a utility
that implements the random search algorithm [35].
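The random search mentioned above essentially samples hyper-parameter values from user-defined ranges and keeps the configuration that performs best on the development set. A bare-bones sketch of this procedure is shown below; the ranges and the train_and_eval callback are assumptions, not the toolkit's tuning utility.

import random

def random_search(train_and_eval, n_trials=20, seed=1234):
    # train_and_eval(cfg) is assumed to train a model with the given
    # hyper-parameters and return the error rate on the development set.
    rng = random.Random(seed)
    best_cfg, best_err = None, float('inf')
    for _ in range(n_trials):
        cfg = {
            'lr': 10 ** rng.uniform(-4, -2),      # log-uniform learning rate
            'n_layers': rng.randint(3, 6),
            'n_neurons': rng.choice([256, 512, 1024]),
            'dropout': rng.uniform(0.1, 0.3),
        }
        err = train_and_eval(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err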
2.6. Decoding and Scoring
The acoustic posterior probabilities generated by the neural network
are normalized by their priors before being fed to an HMM-based decoder.
The decoder merges the acoustic scores with the language probabilities
derived from an n-gram language model and tries to retrieve the
sequence of words uttered in the speech signal using a beam-search
algorithm. The final Word Error Rate (WER) score is computed with
the NIST SCTK scoring toolkit.
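In practice, the prior normalization amounts to subtracting, in the log domain, the log state priors (estimated from the training alignment counts) from the DNN log posteriors, yielding scaled log-likelihoods for the HMM decoder. A schematic version of this step is given below; it illustrates the standard hybrid recipe rather than the toolkit's exact code.

import numpy as np

def posteriors_to_scaled_loglik(log_posteriors, state_counts):
    # log_posteriors: (T, n_states) log-softmax outputs of the acoustic model.
    # state_counts: occupancy counts of each state in the training alignments.
    log_priors = np.log(state_counts / state_counts.sum())
    # Pseudo log-likelihoods passed to the HMM/WFST decoder.
    return log_posteriors - log_priors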
3. EXPERIMENTAL SETUP
In the following sub-sections, the corpora and the DNN settings
adopted for the experimental activity are described.
3.1. Corpora and Tasks
The first set of experiments was performed with the TIMIT corpus,
considering the standard phoneme recognition task (aligned with the
Kaldi s5 recipe [13]).
To validate our model in a more challenging scenario, exper-
iments were also conducted in distant-talking conditions with the
DIRHA-English dataset4 [36, 37]. Training was based on the original
WSJ-5k corpus (consisting of 7,138 sentences uttered by 83 speak-
ers) that was contaminated with a set of impulse responses measured
in a domestic environment [37]. The test phase was carried out with
the real part of the dataset, consisting of 409 WSJ sentences uttered
in the aforementioned environment by six native American speakers.
Additional experiments were conducted with the CHiME 4
dataset [7], which is based on speech data recorded in four noisy
environments (bus, cafe, pedestrian area, and street junction). The
training set is composed of 43,690 noisy WSJ sentences recorded
by five microphones (arranged on a tablet) and uttered by a total of
87 speakers. The test set ET-real considered in this work is based
on 1,320 real sentences uttered by four speakers, while the subset
DT-real has been used for hyperparameter tuning. The CHiME
experiments were based on the single channel setting [7].
Finally, experiments were performed with the LibriSpeech [8]
dataset. We used the training subset composed of 100 hours and
the dev-clean set for the hyperparameter search. Test results are re-
ported on the test-clean part using the fglarge decoding graph inher-
ited from the Kaldi s5 recipe.
4This dataset is distributed by the Linguistic Data Consortium (LDC).
Table 1: PER(%) obtained for the test set of TIMIT with various
neural architectures.
MFCC FBANK fMLLR
MLP 18.2 18.7 16.7
RNN 17.7 17.2 15.9
LSTM 15.1 14.3 14.5
GRU 16.0 15.2 14.9
Li-GRU 15.3 14.6 14.2
3.2. DNN setting
The experiments consider different acoustic features, i.e., 39 MFCCs
(13 static + ∆ + ∆∆), 40 log-mel filter-bank features (FBANKs), as
well as 40 fMLLR features [24] (extracted as reported in the s5
recipe of Kaldi), all computed using windows of 25 ms with
a frame shift of 10 ms.
The feed-forward weights were initialized according to the Glorot
scheme [38], while recurrent weights were initialized with or-
thogonal matrices [39]. Recurrent dropout was used as a regular-
ization technique [40]. Batch normalization was adopted for feed-
forward connections only, as proposed in [41, 42]. The optimiza-
tion was done using the RMSprop algorithm running for 24 epochs.
The performance on the development set was monitored after each
epoch and the learning rate was halved when the relative perfor-
mance improvement went below 0.1%. The main hyperparameters
of the model (i.e., learning rate, number of hidden layers, hidden
neurons per layer, dropout factor, as well as the twin regularization
term λ) were tuned on the development datasets.
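A condensed sketch of these training choices (Glorot initialization of feed-forward weights, orthogonal initialization of recurrent weights, RMSprop, and learning-rate halving based on the relative improvement on the development set) is reported below; the model dimensions are illustrative and the code is only a simplified view of the actual training script.

import torch
import torch.nn as nn

def init_weights(model):
    for name, param in model.named_parameters():
        if 'weight_hh' in name:                 # recurrent weights
            nn.init.orthogonal_(param)
        elif 'weight' in name and param.dim() >= 2:
            nn.init.xavier_uniform_(param)      # Glorot scheme

def maybe_halve_lr(optimizer, prev_err, curr_err, threshold=0.001):
    # Halve the learning rate when the relative improvement on the
    # development set falls below 0.1%.
    rel_impr = (prev_err - curr_err) / prev_err
    if rel_impr < threshold:
        for group in optimizer.param_groups:
            group['lr'] *= 0.5

model = nn.LSTM(input_size=40, hidden_size=550, num_layers=4)
init_weights(model)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)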
4. BASELINES
In this section, we discuss the baselines obtained with the TIMIT,
DIRHA, CHiME, and LibriSpeech datasets. As a showcase to illus-
trate the main functionalities of the PyTorch-Kaldi toolkit, we first
report the experimental validation conducted on TIMIT.
Table 1 shows the performance obtained with several feed-
forward and recurrent models using different features. To ensure
a more accurate comparison between the architectures, five exper-
iments varying the initialization seeds were conducted for each
model and feature. The table thus reports the average phone error
rates (PER)5. Results show that, as expected, fMLLR features outperform
MFCC and FBANK coefficients, thanks to the speaker
adaptation process. Recurrent models significantly outperform
the standard MLP, especially the LSTM, GRU, and
Li-GRU architectures, which effectively address the vanishing gradient
problem through multiplicative gates. The best result (PER = 14.2%) is
obtained with the Li-GRU model [31], which is based on a single gate
and thus saves 33% of the computations over a standard GRU.
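For reference, the Li-GRU recurrence can be summarized as z_t = σ(BN(W_z x_t) + U_z h_{t-1}), h̃_t = ReLU(BN(W_h x_t) + U_h h_{t-1}), h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t. A minimal single-cell PyTorch sketch of this update is shown below; it is a didactic illustration, not the toolkit's optimized implementation.

import torch
import torch.nn as nn

class LiGRUCell(nn.Module):
    # Single-gate GRU variant: no reset gate, ReLU candidate activation,
    # batch normalization applied to the feed-forward (input) connections only.
    def __init__(self, input_size, hidden_size):
        super(LiGRUCell, self).__init__()
        self.wz = nn.Linear(input_size, hidden_size, bias=False)
        self.wh = nn.Linear(input_size, hidden_size, bias=False)
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.bn_z = nn.BatchNorm1d(hidden_size)
        self.bn_h = nn.BatchNorm1d(hidden_size)

    def forward(self, x, h):
        # x: (batch, input_size), h: (batch, hidden_size)
        z = torch.sigmoid(self.bn_z(self.wz(x)) + self.uz(h))    # update gate
        h_cand = torch.relu(self.bn_h(self.wh(x)) + self.uh(h))  # candidate state
        return z * h + (1 - z) * h_cand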
Table 2 details the impact of some popular techniques imple-
mented in PyTorch-Kaldi for improving the ASR performance. The
first row (Baseline) reports the performance achieved with a basic re-
current model, where powerful techniques such as dropout and batch
normalization are not adopted. The second row highlights the per-
formance gain that is achieved when progressively increasing the se-
quence length during training. In this case, we started the training by
truncating the speech sentences at 100 steps (i.e., approximately 1 second
of speech) and progressively doubling the maximum sequence
duration at every epoch.
5 Standard deviations range between 0.15 and 0.2 for all the experiments.
Table 2: PER(%) obtained on TIMIT when progressively applying
some techniques implemented within PyTorch-Kaldi.
RNN LSTM GRU Li-GRU
Baseline 16.5 16.0 16.6 16.3
+ Incr. Seq. length 16.6 15.3 16.1 15.4
+ Recurrent Dropout 16.4 15.1 15.4 14.5
+ Batch Normalization 16.0 14.8 15.3 14.4
+ Monophone Reg. 15.9 14.5 14.9 14.2
Table 3: PER(%) obtained by combining multiple neural networks
and acoustic features.
Architecture Features PER (%)
Li-GRU fMLLR 14.2
MLP+Li-GRU+MLP MFCC+FBANK+fMLLR 13.8
Table 4: PER(%) obtained with standard convolutional and with the
SincNet architectures.
Model Features PER (%)
CNN FBANK 18.3
CNN Raw waveform 18.3
SincNet Raw waveform 18.1
This simple strategy generally improves the system performance since
it encourages the model to first focus on short-term dependencies and
learn longer-term ones only at a later stage. The third row shows the
improvement achieved when adding
recurrent dropout. Similarly to [40,42], we applied the same dropout
mask for all the time steps to avoid gradient vanishing problems. The
fourth line, instead, shows the benefits derived from batch normal-
ization [19]. Finally, the last line shows the performance achieved
when also applying monophone regularization [26]. In this case, we
employ a multi-task learning strategy by means of two softmax clas-
sifiers: the first one estimates context-dependent states, while the
second one predicts monophone targets. As observed in [26], our
results confirm that this technique can successfully be used as an
effective regularizer.
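A simplified sketch of the incremental sequence-length schedule used in the second row of Table 2 (truncating sentences to a maximum length that starts at 100 frames and doubles at every epoch) is shown below; the training callback and data layout are assumptions, not the toolkit's actual training loop.

def train_with_incremental_seq_length(training_batches, train_step,
                                      n_epochs=24, start_len=100):
    # training_batches: list of mini-batches, each a list of (features, labels)
    # pairs with features of shape (T, D) and labels of shape (T,).
    # train_step(batch): assumed callback performing one forward/backward/update.
    max_len = start_len  # about 1 second of speech with a 10 ms frame shift
    for epoch in range(n_epochs):
        for batch in training_batches:
            truncated = [(f[:max_len], l[:max_len]) for f, l in batch]
            train_step(truncated)
        max_len *= 2     # progressively double the maximum sequence duration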
The experiments discussed so far are based on single neural
models. In Table 3 we compare our best Li-GRU system with a
more complex architecture based on a combination of feed-forward
and recurrent models fed by a concatenation of features. To the best
of our knowledge, the PER = 13.8% achieved by the latter system
yields the best published performance on the TIMIT test set.
Previous achievements were based on standard acoustic features
computed with Kaldi. However, within PyTorch-Kaldi users can em-
ploy their own features. Table 4 shows the results achieved with convolutional
models fed by standard FBANK coefficients or directly by the
raw acoustic waveform. The standard CNN based on raw samples
performs similarly to the one fed by FBANK features. A slight
performance improvement is observed with SincNet [33], whose
effectiveness in processing raw waveforms for speech recognition is
highlighted here for the first time.
We now extend our experimental validation to other datasets.
In this regard, Table 5 shows the performance achieved on the
DIRHA, CHiME, and LibriSpeech (100h) datasets. The table
consistently shows better performance with the Li-GRU model,
confirming our previous achievements on TIMIT. The results on
DIRHA and CHiME show the effectiveness of the proposed toolkit
also in noisy conditions.
Table 5: WER(%) obtained for the DIRHA, CHiME, and Lib-
riSpeech (100h) datasets with various neural architectures.
DIRHA CHiME LibriSpeech
MLP 26.1 18.7 6.5
LSTM 24.8 15.5 6.4
GRU 24.8 15.2 6.3
Li-GRU 23.9 14.6 6.2
In particular, DIRHA represents a very challenging task, which is
characterized by the presence of considerable levels of noise and
reverberation. The WER = 23.9% obtained on this dataset represents
the best performance published so far on the single-microphone task.
Finally, the performance obtained on LibriSpeech outperforms the
corresponding p-norm Kaldi baseline (WER = 6.5%) on the considered
100-hour subset.
5. CONCLUSIONS
This paper described the PyTorch-Kaldi project, a new initiative that
aims to bridge the gap between Kaldi and PyTorch. The toolkit
is designed to make the development of an ASR system simpler
and more flexible, allowing users to easily plug-in their customized
acoustic models. PyTorch-Kaldi also supports combinations of neu-
ral architectures, features, and labels, allowing users to possibly em-
ploy complex ASR pipelines. The experiments have confirmed that
PyTorch-Kaldi can achieve state-of-the-art results in some popular
speech recognition tasks and datasets.
The current version of PyTorch-Kaldi is already publicly
available along with detailed documentation. The project is still in
its initial phase and we invite all potential contributors to participate
in it. We hope to build a community of developers large enough to
progressively maintain, improve, and expand the functionalities of
our current toolkit. In the future, we plan to increase the number of
pre-implemented models coded in our framework, and we would like
to extend the current project by integrating neural language model
training, as well as proper support for end-to-end ASR systems.
6. ACKNOWLEDGMENT
We would like to thank Maurizio Omologo for his helpful comments.
This research was enabled in part by support provided by Calcul
Québec and Compute Canada.
7. REFERENCES
[1] D. Yu and L. Deng, Automatic Speech Recognition – A Deep
Learning Approach, Springer, 2015.
[2] M. Ravanelli, Deep learning for Distant Speech Recognition,
PhD Thesis, Unitn, 2017.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning,
MIT Press, 2016.
[4] S. Renals, T. Hain, and H. Bourlard, “Interpretation of Mul-
tiparty Meetings the AMI and Amida Projects,” in Proc. of
HSCMA, 2008, pp. 115–118.
[5] M. Omologo, “A prototype of distant-talking interface for control
of interactive TV,” in Proceedings of Asilomar Conference
on Signals, Systems and Computers, 2010.
[6] L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad,
M. Hagmueller, and P. Maragos, “The DIRHA simulated cor-
pus,” in Proc. of LREC, 2014, pp. 2629–2634.
[7] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The
third CHiME Speech Separation and Recognition Challenge:
Dataset, task and baselines,” in Proc. of ASRU, 2015, pp. 504–
511.
[8] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib-
rispeech: An ASR corpus based on public domain audio
books,” in Proc. of ICASSP, 2015, pp. 5206–5210.
[9] S. Young et al., HTK – Hidden Markov Model Toolkit, 2006.
[10] A. Lee and T. Kawahara., “Recent development of open-source
speech recognition engine julius,” in Proc. of APSIPA-ASC,
2008.
[11] D. Rybach, S. Hahn, P. Lehnen, D. Nolden, M. Sundermeyer,
Z. Tüske, S. Wiesler, R. Schlüter, and H. Ney, “RASR – The
RWTH Aachen University Open Source Speech Recognition
Toolkit,” in Proc. of ASRU, 2011.
[12] G. Linarès, P. Nocera, D. Massonié, and D. Matrouf, “The
LIA speech recognition system: From 10xRT to 1xRT,” in Text,
Speech and Dialogue, Václav Matoušek and Pavel Mautner,
Eds. 2007, pp. 302–308, Springer Berlin Heidelberg.
[13] D. Povey et al., “The Kaldi Speech Recognition Toolkit,” in
Proc. of ASRU, 2011.
[14] M. Mohri, “Finite-state transducers in language and speech
processing,” Computational Linguistics, vol. 23, no. 2, pp.
269–311, 1997.
[15] Theano Development Team, “Theano: A Python framework
for fast computation of mathematical expressions,” arXiv e-prints,
vol. abs/1605.02688, May 2016.
[16] M. Abadi et al., “Tensorflow: A system for large-scale machine
learning,” in Proc. of USENIX-OSDI Symposium, 2016, pp.
265–283.
[17] F. Seide and A. Agarwal, “CNTK: Microsoft’s Open-Source
Deep-Learning Toolkit,” in Proceedings of ACM SIGKDD,
2016, pp. 2135–2135.
[18] A. Paszke et al., “Automatic differentiation in PyTorch,” 2017.
[19] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
deep network training by reducing internal covariate shift,” in
Proc. of ICML, 2015, pp. 448–456.
[20] L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,”
CoRR, vol. abs/1607.06450, 2016.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: A simple way to prevent neural
networks from overfitting,” Journal of Machine Learning
Research, vol. 15, pp. 1929–1958, 2014.
[22] D.P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in Proc. of ICLR, 2015.
[23] M. Ravanelli and M. Omologo, “Automatic context window
composition for distant speech recognition,” Speech Commu-
nication, vol. 101, pp. 34 – 44, 2018.
[24] M.J.F. Gales, “Maximum Likelihood Linear Transformations
for HMM-Based Speech Recognition,” Computer Speech and
Language, vol. 12, no. 4, pp. 75–98, 1998.
[25] M. Ravanelli and M. Omologo, “Contaminated speech training
methods for robust DNN-HMM distant speech recognition,” in
Proc. of Interspeech, 2015, pp. 756–760.
[26] P. Bell, P. Swietojanski, and S. Renals, “Multitask learning
of context-dependent targets in deep neural network acoustic
models,” IEEE/ACM Trans. Audio, Speech & Language Pro-
cessing, vol. 25, no. 2, pp. 238–247, 2017.
[27] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Batch-
normalized joint training for dnn-based distant speech recog-
nition,” in Proc. of SLT, 2016, pp. 28–34.
[28] A. Narayanan and D. Wang, “Joint noise adaptive training
for robust automatic speech recognition, in Proc. of ICASSP,
2014, pp. 4380–4384.
[29] X. Xiao et al., “Deep beamforming networks for multi-channel
speech recognition,” in Proc. of ICASSP, 2016.
[30] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “A network
of deep neural networks for distant speech recognition,”
in Proc. of ICASSP, 2017, pp. 4880–4884.
[31] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Light
gated recurrent units for speech recognition,” IEEE Transac-
tions on Emerging Topics in Computational Intelligence, vol.
2, no. 2, pp. 92–102, April 2018.
[32] M. Ravanelli, D. Serdyuk, and Y. Bengio, “Twin regularization
for online speech recognition,” in Proc. of Interspeech, 2018.
[33] M. Ravanelli and Y. Bengio, “Speaker Recognition from raw
waveform with SincNet,” in Proc. of SLT, 2018.
[34] M. Ravanelli and Y. Bengio, “Interpretable Convolutional Filters
with SincNet,” in Proc. of NIPS@IRASL, 2018.
[35] J. Bergstra and Y. Bengio, “Random search for hyper-
parameter optimization,” Journal of Machine Learning Re-
search, vol. 13, pp. 281–305, 2012.
[36] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi,
and M. Omologo, “The DIRHA-English corpus and re-
lated tasks for distant-speech recognition in domestic environ-
ments,” in Proc. of ASRU, 2015, pp. 275–282.
[37] M. Ravanelli, P. Svaizer, and M. Omologo, “Realistic multi-
microphone data simulation for distant speech recognition,” in
Proc. of Interspeech, 2016, pp. 2786–2790.
[38] X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proc. of AISTATS,
2010, pp. 249–256.
[39] Q.V. Le, N. Jaitly, and G.E. Hinton, “A simple way
to initialize recurrent networks of rectified linear units,”
arXiv:1504.00941, 2015.
[40] T. Moon, H. Choi, H. Lee, and I. Song, “RNNDROP: A novel
dropout for RNNS in ASR,” in Proc. of ASRU, 2015.
[41] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio,
“Batch normalized recurrent neural networks,” in Proc. of
ICASSP, 2016, pp. 2657–2661.
[42] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Improving
speech recognition by revising gated recurrent units,”
in Proc. of Interspeech, 2017.