arXiv:1907.01030v1 [eess.AS] 1 Jul 2019
LSTM Language Models for LVCSR in First-Pass Decoding and
Lattice-Rescoring
Eugen Beck1,2, Wei Zhou1,2, Ralf Schlüter1, Hermann Ney1,2
1Human Language Technology and Pattern Recognition, Computer Science Department,
RWTH Aachen University, 52074 Aachen, Germany
2AppTek GmbH, 52062 Aachen, Germany
{beck, zhou, schlueter, ney}@cs.rwth-aachen.de
Abstract
LSTM-based language models are an important part of modern LVCSR systems, as they significantly improve performance over traditional backoff language models. Incorporating them efficiently into decoding has been notoriously difficult. In this paper we present an approach based on a combination of one-pass decoding and lattice rescoring. We perform decoding with the LSTM-LM in the first pass, but recombine hypotheses that share the last two words; afterwards we rescore the resulting lattice. We run our systems on GPGPU-equipped machines and produce competitive results on the Hub5’00 and Librispeech evaluation corpora at a runtime better than real-time. In addition, we briefly investigate the possibility of carrying out the full sum over all state sequences belonging to a given word hypothesis during decoding without recombination.
Index Terms: speech recognition, decoding, LSTM language
models, lattice rescoring
1. Introduction
In recent years, language models (LMs) based on long short-term memory (LSTM) neural networks have become an integral part of many state-of-the-art automatic speech recognition systems [1, 2, 3]. LSTMs thus supersede traditional backoff models, which are based on word counts: relative frequencies of word n-grams are computed and stored. Due to the sparseness of the training data, many n-grams are never observed, and probability mass has to be allocated to them by smoothing the probability distribution, e.g. by Kneser-Ney smoothing [4]. During decoding this sparseness can be exploited to recombine word-end hypotheses. LSTM LMs, on the other hand, use an internal state that is updated for each observed word to produce word posterior probabilities. In theory this gives them an unlimited context that has to be considered, which leads to a large increase in required computation time.
In this work we show how to deal with this problem from a decoding point of view. We have chosen to use General Purpose Graphics Processing Units (GPGPUs) as our inference platform. This is reasonable, as more and more dedicated co-processors for machine learning workloads become available. On the server side, GPGPUs by Nvidia/AMD, TPUs by Google and (soon) AWS Inferentia chips provide highly parallel computation platforms; even for low-power devices, chips like the Edge TPU (by Google) or the Kirin 970 (by Huawei) do so. We show that using an LSTM-LM in first-pass decoding is better than rescoring lattices generated with a backoff LM. In addition, forcing recombination of histories that share a trigram context during the first pass, followed by lattice rescoring, yields the same WER at a lower RTF.
This paper is organized as follows: First we give an
overview of existing work. Afterwards we present some im-
plementation details. This is followed by a description of the
models, corpora and hardware used for our experiments. Then
we present a series of experiments on the Switchboard and Lib-
rispeech corpora.
2. Related Work
Using neural-network-based language models (NN-LMs) in decoding is computationally more expensive than using backoff language models. In this section we give a short overview of how other researchers have dealt with this problem.
Early approaches to introducing NN-LMs into decoding involve some form of conversion to a more traditional backoff LM. A very straightforward way to convert complex models is to sample from them to create large training corpora on which backoff LMs can be trained; this is the approach of [5]. In [6] the continuous states of an RNN-LM are discretized to create a weighted finite state transducer. The authors of [7] trained feed-forward LMs of different orders and extracted the probabilities for the backoff LM directly from the neural network. [8] compares different conversion techniques and [9] uses these techniques to investigate the conversion of domain-adapted LSTM LMs.
Another option is to reduce the number of operations required for NN-LMs. For models with a large vocabulary the lion's share of the computation occurs in the final layer, where a large matrix multiplication is required. For models using the softmax activation function, the probabilities of all words need to be computed even if only the probability of one word is required, so there is great interest in techniques to avoid this. One popular approach is noise-contrastive estimation (NCE) [10, 11], an adaptation of the training loss that guides the model into a state where the output before the softmax is already approximately normalized, so that only a dot product is required to compute the probability of one word. It is used by [12, 13, 14]. In [14] other methods, such as caching and the choice of activation functions in the hidden layers, are investigated as well.
Yet another way to employ NN-LMs to improve ASR per-
formance is to use them in a second pass for lattice rescoring.
This is more efficient as only word sequences which are not
pruned away need to be scored by the NN-LM. Examples for
this approach are [15, 16, 17, 18, 19, 20] where a variety of
heuristics is proposed to speed up the rescoring process.
Closest to the work presented in this paper are publications in which the authors integrated LSTM-LMs into first-pass decoding. In [21] a set of caches was introduced to minimize unnecessary computations when evaluating the LSTM-LM. In [22], an on-the-fly rescoring approach to integrate LSTM-LMs into first-pass decoding is presented. The authors of [23] use a hybrid CPU/GPGPU architecture for real-time decoding: the HCL transducer is composed with a small n-gram model and expanded on the GPU, while rescoring with an LSTM LM happens on the CPU; caching of previous outputs enables real-time decoding. All three papers use a hierarchical softmax / word classes to reduce the number of computations in the output layer [24] and, with the exception of [21], interpolate the LSTM-LM with a max-entropy LM [25]. The work of [23] is extended in [26]: the LSTM units are replaced with GRUs, NCE replaces the hierarchical softmax, and GRU states are quantized to reduce the number of necessary computations.
3. Implementation
For this work we extended the decoder of the RWTH ASR toolkit, described extensively in [27]. The decoder uses tree-conditioned search, which differs from the more common HCLG-based decoders in that we do not statically compose the grammar WFST with the rest of the search network. Instead, hypotheses from the HCL part of the decoder are grouped by their LM history. Because these histories are opaque objects to the decoder, we do not need to build a static WFST to represent the LM, which would be infeasible for LSTM-LMs anyway. Instead we only have to store the sequence of words and the state of the LSTM layers for each history. The language model itself can be any TensorFlow graph as long as it is compatible with the general idea of a recurrent LM: the LM receives a state and one word, and produces a new state and the probabilities for the next word. One effect of statically combining the grammar and the HCL automaton is the pushing of the grammar weights towards the start state of the transducer. In our decoder this early LM information is retrieved via language model look-ahead, which is computed dynamically at runtime. We found that it is not necessary to use the LSTM-LM when computing this look-ahead information to achieve the best possible WER; only for very small beam sizes do we observe a difference in WER. For our rescoring experiments we use push-forward rescoring as described in [17].
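To make this contract concrete, the following is a minimal sketch of such a state-passing LM in Python with tf.keras; the class and its methods are our own illustration and not the actual RWTH ASR or TensorFlow-graph interface used in the toolkit.

```python
import tensorflow as tf


class StatefulWordLM(tf.keras.Model):
    """Illustrative sketch of the recurrent-LM contract the decoder relies
    on: consume one word plus the carried LSTM state, return the next-word
    log-probabilities together with the successor state."""

    def __init__(self, vocab_size, embed_dim=128, units=1024, num_layers=2):
        super().__init__()
        self.units = units
        self.embed = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.cells = [tf.keras.layers.LSTMCell(units) for _ in range(num_layers)]
        self.proj = tf.keras.layers.Dense(vocab_size)

    def initial_state(self, batch_size):
        # One (h, c) pair of zero vectors per LSTM layer.
        return [[tf.zeros([batch_size, self.units]),
                 tf.zeros([batch_size, self.units])] for _ in self.cells]

    def call(self, word_ids, states):
        # word_ids: [batch] int32; states: list of [h, c] per layer.
        x = self.embed(word_ids)
        new_states = []
        for cell, state in zip(self.cells, states):
            x, state = cell(x, state)
            new_states.append(state)
        return tf.nn.log_softmax(self.proj(x), axis=-1), new_states


# Usage: one decoding step for a batch of four histories.
lm = StatefulWordLM(vocab_size=30000)
states = lm.initial_state(batch_size=4)
log_probs, states = lm(tf.constant([17, 3, 256, 9]), states)
```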
4. Experiments
4.1. Hardware and Measurement Methodology
Each node used for our experiments has two sockets with Intel Xeon E5-2620 v4 CPUs with a base clock speed of 2.1 GHz and 4 Nvidia GeForce 1080Ti GPUs. Unless stated otherwise, our decoder ran in a single thread. The TensorFlow runtime spawns more threads as it sees fit; as we primarily use the GPU for computations, we set the intra-/inter-op parallelism threads to 1. To compute the real time factor (RTF) we measure the total wallclock time required by the recognizer/rescorer to process all segments of the corpus and divide it by their total audio duration. This includes loading features from disk, forwarding them through the acoustic model, and decoding / rescoring. Startup time is not included. Features are not extracted on the fly, as this creates a higher load on our file server and is not a major part of decoding anyway. In a research context, pre-extracting features for a common task is useful as they are required for many experiments. In a production streaming system, feature extraction can be offloaded to a separate thread and will only contribute to latency, but not (significantly) to the RTF.
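As a small illustration, the RTF measurement described above amounts to the following; the recognizer call and corpus iterator are hypothetical placeholders, not the actual toolkit interface.

```python
import time


def real_time_factor(recognize_segment, segments, total_audio_seconds):
    """RTF as used in this paper: total wallclock time spent processing all
    segments (feature loading, AM forwarding, decoding/rescoring) divided
    by the total audio duration. Startup time is excluded by starting the
    clock only here."""
    start = time.perf_counter()
    for segment in segments:
        recognize_segment(segment)
    elapsed = time.perf_counter() - start
    return elapsed / total_audio_seconds
```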
Corpus      | Model | Topology | #out | #param
Switchboard | AM    | 6x500    | 9K   | 47.6M
Switchboard | LM    | 2x1024   | 30K  | 82.8M
Librispeech | AM    | 6x1000   | 12K  | 152.5M
Librispeech | LM    | 2x2048   | 200K | 486.8M
Table 1: Sizes of the acoustic models (AM) and language models (LM) used in this paper. The topology column gives #layers x #units.
Corpus           | +LSTM-LM | PPL
Hub 5’00         | no       | 79.45
Hub 5’00         | yes      | 50.94
Librispeech dev  | no       | 146.18
Librispeech dev  | yes      | 70.59
Librispeech test | no       | 151.81
Librispeech test | yes      | 73.96
Table 2: Perplexities of the language models used: backoff models, optionally combined with the LSTM-LM.
4.2. Corpora and Models
In this paper we present results on two tasks. The first is the 300h Switchboard-1 Release 2 corpus, which is evaluated on Hub5’00. The second corpus is Librispeech [28].

All our systems use 40-dimensional Gammatone features [29]. The acoustic model is a multilayer BLSTM neural network trained with the state-level minimum Bayes risk (sMBR) criterion [30]. The output units of the acoustic models are tied triphone states obtained using a classification and regression tree (CART). When we use an LSTM-LM, we interpolate it with the backoff LM using log-linear combination. The perplexities of the models can be found in Table 2, and the sizes of all models in Table 1. More information about the Librispeech system can be found in [31].
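For illustration, a log-linear combination of the two LM scores could look like the sketch below; the scales are hypothetical (the paper does not state the values used), and the history-dependent normalization term is dropped, as is common during search.

```python
def log_linear_lm_score(logp_lstm, logp_backoff, lam_lstm=0.7, lam_backoff=0.3):
    """Log-linear (weighted log-probability) combination of an LSTM-LM and
    a backoff-LM score for the same word and history:
        log p(w|h) ~ lam_lstm * log p_lstm(w|h) + lam_backoff * log p_backoff(w|h)
    The normalization over the vocabulary is omitted here; it depends only
    on the history and is typically ignored during search."""
    return lam_lstm * logp_lstm + lam_backoff * logp_backoff
```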
4.3. Baseline
Our baseline uses a 4-gram count model with an sMBR-trained acoustic model. It saturates at a WER of 13.9% at an RTF of 0.23. A detailed WER/RTF plot can be found in Figure 1.
Figure 1: Backoff-LM baseline (WER [%] over RTF for the count LM).
Figure 2: Time to process one batch for one step with the Switchboard LM, divided by the number of histories in the batch (forwarding time [ms] over #histories per batch).
4.4. Parallelism
As GPGPUs are massively parallel architectures, it is important to provide them with enough opportunities for parallelization. In Figure 2 we show the time it takes to forward one batch for various batch sizes, divided by the number of histories in the batch, for the Switchboard LM, i.e. the total computation time per history. We can clearly see that it is much more efficient to forward many histories at the same time. Thus, once the LM receives a request to compute a word probability that has not been computed yet, we look for other histories that have not been forwarded yet and add them to the batch. We prioritize histories that have a hypothesis close to a word-end state and with a score close to the currently best hypothesis.
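A rough sketch of this batching heuristic is given below; the history attributes and the exact priority function are our own illustration of the idea, not the decoder's actual data structures.

```python
def select_histories_for_batch(requested, pending, batch_size, current_best_score):
    """Fill the LM batch: the history that triggered the request is always
    forwarded; the remaining slots are given to other not-yet-forwarded
    histories, preferring those with a hypothesis close to a word-end
    state and a score close to the currently best hypothesis."""
    def priority(hist):
        # Smaller is better; scores are negative log-probabilities (costs).
        return (hist.frames_to_word_end, hist.best_score - current_best_score)

    batch = [requested]
    for hist in sorted(pending, key=priority):
        if len(batch) >= batch_size:
            break
        if hist is not requested:
            batch.append(hist)
    return batch
```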
4.5. Effect of recombination
The biggest difference between traditional backoff models and LSTM language models from the decoding point of view is that backoff LMs allow for recombination of hypotheses at word ends. For the LSTM-LM this is not possible, as each sequence of words forms a unique history that in principle can encode all words seen so far. But of course the state of the LSTM layers is finite, so the LSTM cannot store an arbitrary amount of information. Furthermore, some information about the past context might not be relevant for the search process anyway. Thus, it is reasonable to assume that even for an LSTM LM we can recombine hypotheses if the last n words match. To determine n empirically, we conducted experiments in which we recombined word-end hypotheses whose last n words matched. We did not recompute the state of the LSTM for this reduced context, but kept the context of the word end with the lowest score (i.e. highest probability). This forced recombination changes the lattice structure from a tree back to a directed graph.
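The forced recombination can be summarized by the following sketch (the hypothesis fields are illustrative); note that the LSTM state of the surviving hypothesis is reused unchanged rather than recomputed for the truncated context.

```python
def recombine_word_ends(word_end_hyps, n):
    """Merge word-end hypotheses whose histories agree in the last n words,
    keeping the hypothesis with the lowest cost (highest probability) and
    therefore also its LSTM state."""
    merged = {}
    for hyp in word_end_hyps:
        key = tuple(hyp.words[-n:])  # truncated recombination context
        best = merged.get(key)
        if best is None or hyp.score < best.score:  # scores are costs
            merged[key] = hyp
    return list(merged.values())
```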
In Figure 3 we show the results of our experiments. We can see that for a recombination limit of n = 5 we already reach the best WER (11.7%) when rounding to one digit, as is usual for this task. Larger recombination limits do not yield significant improvements in terms of the best achievable WER, but they allow us to reach it faster, or to obtain a better WER for a fixed RTF value. This is because for n = 5 we need a larger beam to reach the best possible WER, while larger n allow for smaller beam sizes. For an RTF around 1, a recombination limit of n >= 9 should be selected.
Figure 3: Different recombination limits for the Switchboard system (WER [%] over RTF; curves for n = 2, 3, 5, 9, 13, and no recombination limit).
4.6. One-pass vs Two-pass with Rescoring
As a next step, we want to measure the performance of our system when the LSTM-LM is applied during lattice rescoring. As a baseline, we rescore the lattices generated by decoding with the backoff LM. As can be seen in Figure 4, this is very fast for small beam sizes. For larger beam sizes it is still faster than using the LSTM-LM in the first pass, but it yields slightly worse results (11.8% vs. 11.7%). In theory, if the beam pruning were wide enough, the lattice would contain the same word sequences that receive the best scores from the LSTM-LM. However, this does not happen for the beam sizes usually used during recognition, as Figure 4 shows. As lattice rescoring is relatively fast (0.02-0.12 RTF, depending on lattice size), we also use it in combination with one-pass decoding with the LSTM-LM. In order to keep the complexity of the first pass low, we use a small recombination limit (n = 2). This produces lattices with many recombinations and thus many opportunities for lattice rescoring to find new word sequences. Our experiments show that this is indeed more efficient than performing one-pass decoding with a high recombination limit.
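The push-forward rescoring of [17] propagates LM states along lattice arcs in topological order while keeping only a limited number of partial hypotheses per lattice node. The sketch below captures this idea under assumed lattice and LM interfaces (lm.score returning the word's LM cost and the successor state); it is a simplification, not the exact algorithm of [17].

```python
def push_forward_rescore(lattice, lm, max_hyps_per_node=50):
    """Simplified push-forward lattice rescoring: each hypothesis is a
    tuple (total cost, LM state, backpointer); per lattice node only the
    cheapest hypotheses survive."""
    hyps = {lattice.start_node: [(0.0, lm.initial_state(), None)]}
    for node in lattice.topological_order():
        for arc in lattice.outgoing_arcs(node):
            target_hyps = hyps.setdefault(arc.target, [])
            for cost, state, back in hyps.get(node, []):
                lm_cost, new_state = lm.score(state, arc.word)
                total = cost + arc.acoustic_cost + lm_cost
                target_hyps.append((total, new_state, (node, arc, back)))
            # Histogram pruning per lattice node.
            target_hyps.sort(key=lambda h: h[0])
            del target_hyps[max_hyps_per_node:]
    # Best rescored hypothesis at the final lattice node.
    return min(hyps[lattice.final_node], key=lambda h: h[0])
```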
4.7. Full-sum decoding
One-pass decoding without recombination makes it possible to apply the full sum over all state sequences in the HMM instead of the Viterbi approximation in the Bayes decision rule. We also explore this effect on the Hub 5’00 data set. We use the same decoder framework as implementation baseline and change the auxiliary function of the dynamic programming to sum the probabilities over all incoming paths instead of maximizing over them. The pruning threshold of the beam search has to be set somewhat higher to keep most of the contributing paths. Paths of pronunciation variants are normalized and merged as well. The simple hand-crafted time distortion penalties are normalized back to the probability domain. Grid search is used to find the optimal acoustic and language model scales. To our surprise, although mathematically more accurate, full-sum decoding leads to the same accuracy with respect to the single best hypothesis in the lattice, while the search space is generally larger. However, applying confusion network decoding to the lattice further improves the full-sum result from 11.7% to 11.4%, while no such difference is obtained for Viterbi decoding. We will further investigate the effect of full-sum decoding in future work.
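In log space, this change to the auxiliary function amounts to replacing the maximization of the Viterbi recursion by a log-sum-exp over the incoming path scores, roughly as sketched below (decoder details such as pruning and score scaling are omitted).

```python
import math


def viterbi_update(incoming_log_scores):
    """Viterbi approximation: keep only the best incoming path."""
    return max(incoming_log_scores)


def full_sum_update(incoming_log_scores):
    """Full-sum decoding: accumulate the probability mass of all incoming
    paths, computed stably in log space (log-sum-exp)."""
    m = max(incoming_log_scores)
    return m + math.log(sum(math.exp(s - m) for s in incoming_log_scores))
```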
Figure 4: Comparison of first-pass recognition (recombination n = 9), lattice rescoring of backoff-LM lattices, and lattice rescoring of LSTM+backoff lattices (n = 2) on Hub5’00 (WER [%] over RTF).
Corpus     | 1st-pass LM  | Recomb. n | Rescoring LM | WER  | RTF
dev-clean  | backoff      | N/A       | -            | 3.72 | 0.56
dev-clean  | backoff      | N/A       | interpolated | 2.55 | 0.97
dev-clean  | interpolated | 10        | -            | 2.39 | 6.95
dev-clean  | interpolated | 2         | interpolated | 2.40 | 4.06
dev-other  | backoff      | N/A       | -            | 8.73 | 2.25
dev-other  | interpolated | none      | -            | 5.75 | 13.38
test-clean | backoff      | N/A       | -            | 4.18 | 0.56
test-clean | interpolated | none      | -            | 2.78 | 7.14
test-other | backoff      | N/A       | -            | 9.31 | 2.59
test-other | interpolated | none      | -            | 6.23 | 15.62
Table 3: Results of various decoding strategies with backoff and LSTM-LM for the Librispeech dataset ("interpolated" denotes the backoff LM combined with the LSTM-LM; "none" means no forced recombination).
4.8. Librispeech
We conducted initial experiments on Librispeech but have not yet completed a full analysis. Our results can be found in Table 3. Due to the significantly larger vocabulary compared with the Switchboard system (200k vs. 30k words), our absolute RTF values are much higher when using the LSTM-LM in one-pass mode (if one wants to obtain the best possible WER). The best-performing strategy on Switchboard, using an LSTM LM in the first pass with a short recombination limit and rescoring the resulting lattice, is again the best-performing strategy at an RTF of one. However, the best WER is only reached at an RTF far above one. This could be alleviated by training the LSTM LMs with noise-contrastive estimation [11], which is reserved for future work. Another observation of interest is that decoding on the "other" condition takes much longer than on the "clean" condition. This is not too surprising, as decoding an utterance on which the acoustic model is less confident yields more scores that are close together. Here the difference in WER is a factor of around 2, while the difference in RTF is a factor of around 4.
Figure 5: Comparison of first-pass recognition (recombination n = 10), lattice rescoring of backoff-LM lattices, and lattice rescoring of LSTM+backoff lattices (n = 2) for the Librispeech dev-clean corpus (WER [%] over RTF).
5. Conclusions
In this paper we have shown how to use LSTM-LMs in decoding using a GPGPU. We have shown that using the LSTM LM in the first pass with a small recombination limit and doing lattice rescoring afterwards yields the most efficient decoding process. This approach yields a WER of 11.7% on the Hub5’00 task at an RTF of 1. Further work is required for systems with very large vocabularies, where the best possible WER is only reached at an RTF well above 1. A further improvement in WER (from 11.7% to 11.4%) was obtained by full-sum decoding with subsequent confusion network decoding.
6. Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 694537, project "SEQCLAS") and from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 644283. The work reflects only the authors' views and the European Research Council Executive Agency (ERCEA) is not responsible for any use that may be made of the information it contains. Eugen Beck was partially funded by the 2016 Google PhD Fellowship for North America, Europe and the Middle East. We also want to thank our colleagues Kazuki Irie, Christoph Lüscher and Wilfried Michel for providing us with the acoustic/language models.
7. References
[1] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke,
“The Microsoft 2017 conversational speech recognition system,”
in 2018 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), April 2018, pp. 5934–5938.
[2] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dim-
itriadis, X. Cui, B. Ramabhadran, M. Picheny, L. Lim, B. Roomi,
and P. Hall, “English conversational telephone speech recognition
by humans and machines,” in Interspeech 2017, 18th Annual Con-
ference of the International Speech Communication Association,
Stockholm, Sweden, August 20-24, 2017, F. Lacerda, Ed. ISCA,
2017, pp. 132–136.
[3] K. J. Han, A. Chandrashekaran, J. Kim, and I. R. Lane, “The CA-
PIO 2017 conversational speech recognition system,” CoRR, vol.
abs/1801.00059, 2018.
[4] R. Kneser and H. Ney, “Improved backing-off for m-gram lan-
guage modeling,” in 1995 International Conference on Acoustics,
Speech, and Signal Processing, vol. 1, May 1995, pp. 181–184
vol.1.
[5] A. Deoras, T. Mikolov, S. Kombrink, M. Karafiát, and S. Khu-
danpur, “Variational approximation of long-span language models
for LVCSR,” in 2011 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), May 2011, pp. 5532–
5535.
[6] G. Lecorvé and P. Motlíček, “Conversion of recurrent neural net-
work language models to weighted finite state transducers for au-
tomatic speech recognition,” in INTERSPEECH 2012, 13th An-
nual Conference of the International Speech Communication As-
sociation, Portland, Oregon, USA, September 9-13, 2012. ISCA,
2012, pp. 1668–1671.
[7] E. Arisoy, S. F. Chen, B. Ramabhadran, and A. Sethy, “Con-
verting neural network language models into back-off language
models for efficient decoding in automatic speech recognition,”
IEEE/ACM Transactions on Audio, Speech, and Language Pro-
cessing, vol. 22, no. 1, pp. 184–192, Jan 2014.
[8] H. Adel, K. Kirchhoff, N. T. Vu, D. Telaar, and T. Schultz,
“Comparing approaches to convert recurrent neural networks
into backoff language models for efficient decoding,” in INTER-
SPEECH 2014, 15th Annual Conference of the International
Speech Communication Association, Singapore, September 14-
18, 2014, H. Li, H. M. Meng, B. Ma, E. Chng, and L. Xie, Eds.
ISCA, 2014, pp. 651–655.
[9] M. Singh, Y. Oualil, and D. Klakow, “Approximated and domain-
adapted lstm language models for first-pass decoding in speech
recognition,” in Proc. Interspeech 2017. ISCA, 2017, pp. 2720–
2724.
[10] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A
new estimation principle for unnormalized statistical models,” in
Proceedings of the Thirteenth International Conference on Artifi-
cial Intelligence and Statistics, AISTATS 2010, Chia Laguna Re-
sort, Sardinia, Italy, May 13-15, 2010, ser. JMLR Proceedings,
Y. W. Teh and D. M. Titterington, Eds., vol. 9. JMLR.org, 2010,
pp. 297–304.
[11] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation of
unnormalized statistical models, with applications to natural im-
age statistics,” Journal of Machine Learning Research, vol. 13,
pp. 307–361, 2012.
[12] X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland, “Recurrent
neural network language model training with noise contrastive es-
timation for speech recognition,” in 2015 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP),
April 2015, pp. 5411–5415.
[13] A. Sethy, S. Chen, E. Arisoy, and B. Ramabhadran, “Unnormal-
ized exponential and neural network language models,” in 2015
IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), April 2015, pp. 5416–5420.
[14] Y. Huang, A. Sethy, and B. Ramabhadran, “Fast neural network
language model lookups at n-gram speeds,” in Proc. Interspeech
2017, 2017, pp. 274–278.
[15] A. Deoras, T. Mikolov, and K. Church, “A fast re-scoring strat-
egy to capture long-distance dependencies,” in Proceedings of the
Conference on Empirical Methods in Natural Language Process-
ing, ser. EMNLP ’11. Stroudsburg, PA, USA: Association for
Computational Linguistics, 2011, pp. 1116–1127.
[16] X. Liu, Y. Wang, X. Chen, M. J. F. Gales, and P. C. Wood-
land, “Efficient lattice rescoring using recurrent neural network
language models,” in 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), May 2014,
pp. 4908–4912.
[17] M. Sundermeyer, Z. Tüske, R. Schlüter, and H. Ney, “Lattice
decoding and rescoring with long-span neural network language
models,” in INTERSPEECH 2014, 15th Annual Conference of
the International Speech Communication Association, Singapore,
September 14-18, 2014, H. Li, H. M. Meng, B. Ma, E. Chng, and
L. Xie, Eds. ISCA, 2014, pp. 661–665.
[18] X. Liu, X. Chen, Y. Wang, M. J. F. Gales, and P. C. Woodland,
“Two efficient lattice rescoring methods using recurrent neural
network language models,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 24, no. 8, pp. 1438–1449,
Aug 2016.
[19] S. Kumar, M. Nirschl, D. N. Holtmann-Rice, H. Liao, A. T.
Suresh, and F. X. Yu, “Lattice rescoring strategies for long short
term memory language models in speech recognition,” in 2017
IEEE Automatic Speech Recognition and Understanding Work-
shop, ASRU 2017, Okinawa, Japan, December 16-20, 2017.
IEEE, 2017, pp. 165–172.
[20] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel,
D. Povey, and S. Khudanpur, “A pruned RNNLM lattice-rescoring
algorithm for automatic speech recognition,” in 2018 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing
(ICASSP), April 2018, pp. 5929–5933.
[21] Z. Huang, G. Zweig, and B. Dumoulin, “Cache based recurrent
neural network language model inference for first pass speech
recognition,” in 2014 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP), May 2014, pp.
6354–6358.
[22] T. Hori, Y. Kubo, and A. Nakamura, “Real-time one-pass de-
coding with recurrent neural network language model for speech
recognition,” in 2014 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP), May 2014, pp.
6364–6368.
[23] K. Lee, C. Park, I. Kim, N. Kim, and J. Lee, “Applying GPGPU
to recurrent neural network language model based fast network
search in the real-time LVCSR,” in Proc. Interspeech 2015.
ISCA, 2015, pp. 2102–2106.
[24] F. Morin and Y. Bengio, “Hierarchical probabilistic neural net-
work language model,” in Proceedings of the Tenth International
Workshop on Artificial Intelligence and Statistics, AISTATS 2005,
Bridgetown, Barbados, January 6-8, 2005, R. G. Cowell and
Z. Ghahramani, Eds. Society for Artificial Intelligence and
Statistics, 2005.
[25] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký,
“Strategies for training large scale neural network language mod-
els,” in 2011 IEEE Workshop on Automatic Speech Recognition &
Understanding, ASRU 2011, Waikoloa, HI, USA, December 11-
15, 2011, D. Nahamoo and M. Picheny, Eds. IEEE, 2011, pp.
196–201.
[26] K. Lee, C. Park, N. Kim, and J. Lee, “Accelerating recurrent
neural network language model based online speech recognition
system,” in 2018 IEEE International Conference on Acoustics,
Speech and Signal Processing, ICASSP 2018. IEEE, 2018, pp.
5904–5908.
[27] D. Nolden, “Progress in decoding for large vocabulary continuous
speech recognition,” Ph.D. dissertation, RWTH Aachen Univer-
sity, Computer Science Department, RWTH Aachen University,
Aachen, Germany, Apr. 2017.
[28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib-
rispeech: An ASR corpus based on public domain audio books,” in
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2015, pp. 5206–5210.
[29] R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney, “Gamma-
tone features and feature combination for large vocabulary speech
recognition,” in IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), vol. 4, April 2007, pp.
IV–649–IV–652.
[30] M. Gibson and T. Hain, “Hypothesis spaces for minimum Bayes
risk training in large vocabulary speech recognition,” in INTER-
SPEECH 2006 - ICSLP, Ninth International Conference on Spo-
ken Language Processing, Pittsburgh, PA, USA, September 17-21,
2006. ISCA, 2006.
[31] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer,
R. Schlüter, and H. Ney, “RWTH ASR systems for LibriSpeech: Hy-
brid vs attention,” submitted to Interspeech, 2019.