Bidirectional LSTM Networks for Improved Phoneme
Classification and Recognition
Alex Graves1, Santiago Fernández1, and Jürgen Schmidhuber1,2
1IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland
{alex, santiago, juergen}@idsia.ch
2TU Munich, Boltzmannstr. 3, 85748 Garching, Munich, Germany
Abstract. In this paper, we carry out two experiments on the TIMIT speech corpus with bidirectional and unidirectional Long Short Term Memory (LSTM) networks. In the first experiment (framewise phoneme classification) we find that bidirectional LSTM outperforms both unidirectional LSTM and conventional Recurrent Neural Networks (RNNs). In the second (phoneme recognition) we find that a hybrid BLSTM-HMM system improves on an equivalent traditional HMM system, as well as unidirectional LSTM-HMM.
1 Introduction
Because the human articulatory system blurs together adjacent sounds in order to produce them rapidly and smoothly (a process known as co-articulation), contextual information is important to many tasks in speech processing. For example, when classifying a frame of speech data, it helps to look at the frames after it as well as those before, especially if it occurs near the end of a word or segment. In general, recurrent neural networks (RNNs) are well suited to such tasks, where the range of contextual effects is not known in advance. However, they do have some limitations: firstly, since they process inputs in temporal order, their outputs tend to be mostly based on previous context; secondly, they have trouble learning time-dependencies more than a few timesteps long [8]. An elegant solution to the first problem is provided by bidirectional networks [11,1]. In this model, the input is presented forwards and backwards to two separate recurrent nets, both of which are connected to the same output layer. For the second problem, an alternative RNN architecture, LSTM, has been shown to be capable of learning long time-dependencies (see Section 2).
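The bidirectional arrangement described above can be illustrated with a minimal NumPy sketch. It uses plain tanh recurrent units rather than LSTM, and the function names, layer sizes and random weights are assumptions made for illustration only; they are not taken from the experiments in this paper.

```python
import numpy as np

def rnn_pass(x, w_in, w_rec, reverse=False):
    """Run a simple tanh RNN over x (T x n_in) and return its hidden states (T x n_hid)."""
    T, n_hid = x.shape[0], w_rec.shape[0]
    h = np.zeros(n_hid)
    states = np.zeros((T, n_hid))
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = np.tanh(x[t] @ w_in + h @ w_rec)
        states[t] = h
    return states

def bidirectional_logits(x, fwd, bwd, w_out_f, w_out_b):
    """Feed both hidden layers into one shared output layer (pre-softmax values)."""
    h_f = rnn_pass(x, *fwd)                # sees frames 0..t
    h_b = rnn_pass(x, *bwd, reverse=True)  # sees frames t..T-1
    return h_f @ w_out_f + h_b @ w_out_b

# Hypothetical sizes: 4 frames, 5 input features, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 4, 5, 8, 3
x = rng.normal(size=(T, n_in))
fwd = (rng.normal(scale=0.5, size=(n_in, n_hid)), rng.normal(scale=0.5, size=(n_hid, n_hid)))
bwd = (rng.normal(scale=0.5, size=(n_in, n_hid)), rng.normal(scale=0.5, size=(n_hid, n_hid)))
w_out_f = rng.normal(scale=0.5, size=(n_hid, n_out))
w_out_b = rng.normal(scale=0.5, size=(n_hid, n_out))

logits = bidirectional_logits(x, fwd, bwd, w_out_f, w_out_b)
print(logits.shape)
```

Because the reverse net starts from the last frame, the output at the first frame depends on the whole utterance, whereas the forward hidden state at that frame depends only on the first input.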
In this paper, we extend our previous work on bidirectional LSTM (BLSTM) [7] with experiments on both framewise phoneme classification and phoneme recognition. For phoneme recognition we use the hybrid approach, combining Hidden Markov Models (HMMs) and RNNs in an iterative training procedure (see Section 3). This gives us an insight into the likely impact of bidirectional training on speech recognition, and also allows us to compare our results directly with a traditional HMM system.
2 LSTM
LSTM [9,6] is an RNN architecture designed to deal with long time-dependencies. It was motivated by an analysis of error flow in existing RNNs [8], which found that long time lags were inaccessible to existing architectures, because the backpropagated error either blows up or decays exponentially.
W. Duch et al. (Eds.): ICANN 2005, LNCS 3697, pp. 799–804, 2005. © Springer-Verlag Berlin Heidelberg 2005
An LSTM hidden layer consists of a set of recurrently connected blocks, known as memory blocks. These blocks can be thought of as a differentiable version of the memory chips in a digital computer. Each of them contains one or more recurrently connected memory cells and three multiplicative units (the input, output and forget gates) that provide continuous analogues of write, read and reset operations for the cells. More precisely, the input to the cells is multiplied by the activation of the input gate, the output to the net is multiplied by the output gate, and the previous cell values are multiplied by the forget gate. The net can only interact with the cells via the gates.
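The gating arithmetic described above can be written out as a single timestep for a layer of one-cell memory blocks. This is a minimal sketch, not the training algorithm of [7]: the weight layout, the tanh squashing functions and the absence of peephole connections are simplifying assumptions, and the names `lstm_step` and `W` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One timestep for a layer of single-cell memory blocks.

    W maps 'i', 'f', 'o', 'c' to matrices of shape (n_in + n_hid + 1, n_hid);
    the last row of each matrix acts as a bias.
    """
    z = np.concatenate([x, h_prev, [1.0]])
    i = sigmoid(z @ W["i"])   # input gate: scales what is written to the cell
    f = sigmoid(z @ W["f"])   # forget gate: scales the previous cell value (reset)
    o = sigmoid(z @ W["o"])   # output gate: scales what is read out to the net
    g = np.tanh(z @ W["c"])   # squashed cell input (tanh is an assumption here)
    c = f * c_prev + i * g    # new cell state
    h = o * np.tanh(c)        # block output seen by the rest of the network
    return h, c

# With the forget gate held open and the input gate held shut,
# the cell keeps its previous contents regardless of the input.
n_in, n_hid = 3, 2
W = {k: np.zeros((n_in + n_hid + 1, n_hid)) for k in "ifoc"}
W["f"][-1] = 50.0    # bias row: forget gate ~ 1
W["i"][-1] = -50.0   # bias row: input gate ~ 0
c_prev = np.array([0.7, -0.3])
h, c = lstm_step(np.ones(n_in), np.zeros(n_hid), c_prev, W)
print(c)
```

The example at the end shows the "memory chip" behaviour: a saturated forget gate carries the cell value forward unchanged, which is what lets error flow across long time lags.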
Some modifications of the original LSTM training algorithm were required for bidirectional LSTM. See [7] for full details and pseudocode.
3 Hybrid LSTM-HMM Phoneme Recognition
Hybrid artificial neural net (ANN)/HMM systems are extensively documented in the
literature (see, e.g. [3]). The hybrid approach benefits, on the one hand, from the use of
neural networks as estimators of the acoustic probabilities and, on the other hand, from
access to higher-level linguistic knowledge, in a unified mathematical framework.
The parameters of the HMM are typically estimated by Viterbi training [10], which
also provides new targets (in the form of a new segmentation of the speech signal) to
re-train the network. This process is repeated until convergence. Alternatively, Bourlard
et al. developed an algorithm to increase iteratively the global posterior probability
of word sequences [2]. The REMAP algorithm, which is similar to the Expectation-
Maximization algorithm, estimates local posterior probabilities that are used as targets
to train the network.
In this paper, we implement a hybrid LSTM/HMM system based on Viterbi training and compare it to traditional HMMs on the task of phoneme recognition.
4 Experiments
All experiments were carried out on the TIMIT database [5]. TIMIT contains sentences of prompted English speech, accompanied by full phonetic transcripts. It has a lexicon of 61 distinct phonemes. The training and test sets contain 4620 and 1680 utterances respectively. For all experiments we used 5% (184) of the training utterances as a validation set and trained on the rest.
We preprocessed all the audio data into frames using 12 Mel-Frequency Cepstrum
Coefficients (MFCCs) from 26 filter-bank channels. We also extracted the log-energy
and the first order derivatives of it and the other coefficients, giving a vector of 26
coefficients per frame in total.
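The count of 26 coefficients follows from 12 MFCCs plus the log-energy (13 static features) together with one first-order derivative for each of them. The sketch below shows how such a vector could be assembled. The random placeholder data and the use of central differences for the derivatives are assumptions for illustration; they are not necessarily the derivative formula used in the actual preprocessing.

```python
import numpy as np

def add_deltas(static):
    """Append first-order time derivatives to a (T x D) matrix of static features."""
    deltas = np.gradient(static, axis=0)  # central differences (an assumption)
    return np.hstack([static, deltas])

T = 100  # hypothetical number of frames in one utterance
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(T, 12))        # placeholder for 12 MFCCs per frame
log_energy = rng.normal(size=(T, 1))   # placeholder for the log-energy per frame

static = np.hstack([mfcc, log_energy])  # 13 static features
features = add_deltas(static)           # 13 static + 13 derivatives
print(features.shape)  # (100, 26)
```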
4.1 Experiment 1: Framewise Phoneme Classification
Our first experimental task was the classification of frames of speech data into phonemes. The targets were the hand labelled transcriptions provided with the data, and the recorded scores were the percentage of frames in the training and test sets for which the output classification coincided with the target.
[Figure 1: output activations over time, in four panels: Target, Bidirectional Output, Forward Net Only, Reverse Net Only.]
Fig. 1. A bidirectional LSTM net classifying the utterance "one oh five" from the Numbers95 corpus. The different lines represent the activations (or targets) of different output nodes. The bidirectional output combines the predictions of the forward and reverse subnets; it closely matches the target, indicating accurate classification. To see how the subnets work together, their contributions to the output are plotted separately ("Forward Net Only" and "Reverse Net Only"). As we would expect, the forward net is more accurate. However, there are places where its substitutions ('w'), insertions (at the start of 'ow') and deletions ('f') are corrected by the reverse net. In addition, both are needed to accurately locate phoneme boundaries, with the reverse net tending to find the starts and the forward net tending to find the ends ('ay' is a good example of this).
We evaluated the following architectures on this task: bidirectional LSTM (BLSTM), unidirectional LSTM (LSTM), bidirectional standard RNN (BRNN), and unidirectional RNN (RNN). For some of the unidirectional nets a delay of 4 timesteps was introduced between the target and the current input, i.e. the net always tried to predict the phoneme of 4 timesteps ago. For BLSTM we also experimented with duration weighted error, where the error injected on each frame is scaled by the duration of the current phoneme.
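The two training variants above both amount to simple manipulations of the framewise targets. The following sketch shows one way to pair inputs with delayed targets and to compute, for every frame, the duration of the phoneme segment it belongs to. The function names are hypothetical, and the text does not give the exact formula by which the duration scales the error, so the sketch stops at computing the durations.

```python
import numpy as np

def delay_targets(inputs, labels, delay):
    """Pair the input at frame t with the label of frame t - delay."""
    if delay == 0:
        return inputs, labels
    return inputs[delay:], labels[:-delay]

def frame_durations(labels):
    """For each frame, the length in frames of the phoneme segment containing it."""
    labels = np.asarray(labels)
    durations = np.empty(len(labels))
    start = 0
    for t in range(1, len(labels) + 1):
        # A segment ends at the end of the utterance or when the label changes.
        if t == len(labels) or labels[t] != labels[start]:
            durations[start:t] = t - start
            start = t
    return durations

frames = np.arange(10)                       # stand-in for 10 input frames
labels = np.array(list("aaabbbbccc"))        # stand-in for framewise phoneme labels
xs, ys = delay_targets(frames, labels, delay=4)
print(len(xs), xs[0], ys[0])                 # input frame 4 is paired with label of frame 0
print(frame_durations(labels))
```

A delay gives a unidirectional net a few frames of future context before it must commit to a label, at the cost of discarding the first few input/target pairs of each utterance.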
We used standard RNN topologies for all experiments, with one recurrently connected hidden layer and no direct connections between the input and output layers. The LSTM (BLSTM) hidden layers contained 140 (93) blocks of one cell each, and the RNN (BRNN) hidden layers contained 275 (185) units. This gave approximately 100,000 weights for each network.
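As a rough check on the figure of approximately 100,000 weights, the counts can be reproduced under some stated assumptions: 26 inputs and 61 outputs (as used in these experiments), full connectivity from the inputs and the hidden layer's own outputs to every hidden unit, one bias per unit, four units per LSTM block (the cell input and three gates), and no peephole weights. These connectivity details are assumptions for this sketch, so the totals are indicative only.

```python
def rnn_weights(n_in, n_hid, n_out, bidirectional=False):
    """Weight count for a standard RNN with one hidden layer per direction."""
    dirs = 2 if bidirectional else 1
    hidden = dirs * n_hid * (n_in + n_hid + 1)   # input + recurrent + bias
    output = n_out * (dirs * n_hid + 1)          # all hidden units + bias
    return hidden + output

def lstm_weights(n_in, n_blocks, n_out, bidirectional=False):
    """Weight count for one-cell LSTM blocks: cell input plus three gates per block."""
    dirs = 2 if bidirectional else 1
    hidden = dirs * 4 * n_blocks * (n_in + n_blocks + 1)
    output = n_out * (dirs * n_blocks + 1)
    return hidden + output

counts = {
    "LSTM":  lstm_weights(26, 140, 61),                      # 102121
    "BLSTM": lstm_weights(26, 93, 61, bidirectional=True),   # 100687
    "RNN":   rnn_weights(26, 275, 61),                       # 99886
    "BRNN":  rnn_weights(26, 185, 61, bidirectional=True),   # 101071
}
for name, n in counts.items():
    print(name, n)
```

Under these assumptions all four topologies land within a few percent of 100,000, which is consistent with the hidden layer sizes having been chosen to equalise the number of weights.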
All LSTM blocks had the following activation functions: logistic sigmoids in the range [-2,2] for the input and output squashing functions of the cell, and in the range [0,1] for the gates. The non-LSTM nets had logistic sigmoid activations in the range [0,1] in the hidden layer.
All nets were trained with gradient descent (error gradient calculated with Backpropagation Through Time), using a learning rate of 10^-5 and a momentum of 0.9. At the end of each utterance, weight updates were carried out and network activations were reset to 0.
As is standard for 1-of-K classification, the output layers had softmax activations, and the cross entropy objective function was used for training. There were 61 output nodes, one for each phoneme. At each frame, the output activations were interpreted as the posterior probabilities of the respective phonemes, given the input signal. The phoneme with highest probability was recorded as the network's classification for that frame.
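The output layer and scoring described above can be summarised in a few lines of NumPy. This is a minimal sketch with made-up numbers and hypothetical function names; it shows the softmax, the cross entropy for 1-of-K targets, and the framewise classification score.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over (T x K) output-layer activations."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets):
    """Summed cross entropy for 1-of-K targets given as integer class indices."""
    return -np.log(probs[np.arange(len(targets)), targets]).sum()

def frame_accuracy(probs, targets):
    """Percentage of frames whose most probable class matches the target."""
    return 100.0 * np.mean(probs.argmax(axis=1) == targets)

# Two frames, three classes (the experiments use 61 classes).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 3.0, 0.0]])
targets = np.array([0, 2])
probs = softmax(logits)
print(probs.sum(axis=1))                # each row sums to 1
print(cross_entropy(probs, targets))
print(frame_accuracy(probs, targets))   # 50.0
```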
4.2 Experiment 2: Phoneme Recognition
A traditional HMM was developed with the HTK Speech Recognition Toolkit (http://htk.eng.cam.ac.uk/). Both context independent (mono-phone) and context dependent (tri-phone) models were trained and tested. Both were left-to-right models with three states. Models representing silence (h#, pau, epi) included two extra transitions: from the first to the final state and vice versa, in order to make them more robust. Observation probabilities were modelled by eight Gaussian mixtures.
Sixty-one context-independent models and 5491 tied context-dependent models were used. Context-dependent models for which the left/right context coincides with the central phone were included since they appear in the TIMIT transcription (e.g. "my eyes" is transcribed as /m ay ay z/). During recognition, only sequences of context-dependent models with matching context were allowed.
In order to make a fair comparison of the acoustic modelling capabilities of the
traditional and hybrid LSTM/HMM, no linguistic information or probabilities of partial
phone sequences were included in the system.
For the hybrid LSTM/HMM system, the following networks (trained in the previous experiment) were used: LSTM with no frame delay, BLSTM and BLSTM trained with weighted error. Sixty-one models of one state each, with a self-transition and an exit transition probability, were trained using Viterbi-based forced-alignment. Initial estimation of transition and prior probabilities was done using the correct transcription for the training set. Network output probabilities were divided by prior probabilities to obtain likelihoods for the HMM. The system was trained until no improvement was observed or the segmentation of the signal did not change. Due to time limitations, the networks were not re-trained to convergence.
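Two steps of the hybrid procedure lend themselves to a short illustration: dividing network posteriors by class priors to obtain scaled likelihoods, and aligning the frames of an utterance to its known phone sequence with one-state models. The sketch below is a single Viterbi alignment pass under simplifying assumptions (log-domain scores, made-up posteriors, uniform priors and transition probabilities, at least as many frames as phones); it is not the full iterative training loop, and all names are hypothetical.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, eps=1e-10):
    """log(p(class | frame) / p(class)), proportional to the log likelihood."""
    return np.log(posteriors + eps) - np.log(priors + eps)

def forced_align(loglik, phones, log_self, log_exit):
    """Viterbi alignment of T frames to a known phone sequence (one state per phone).

    Returns, for each frame, the position in `phones` it is assigned to.
    """
    T, S = loglik.shape[0], len(phones)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = loglik[0, phones[0]]          # must start in the first phone
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + log_self[phones[s]]
            move = score[t - 1, s - 1] + log_exit[phones[s - 1]] if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + loglik[t, phones[s]]
    path = [S - 1]                               # must end in the last phone
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [int(s) for s in path[::-1]]

# Made-up posteriors for 6 frames and 3 phone classes.
posteriors = np.full((6, 3), 0.05)
for t, p in enumerate([0, 0, 2, 2, 2, 1]):
    posteriors[t, p] = 0.9
priors = np.full(3, 1.0 / 3.0)
log_self = np.log(np.full(3, 0.5))
log_exit = np.log(np.full(3, 0.5))

loglik = scaled_log_likelihoods(posteriors, priors)
alignment = forced_align(loglik, phones=[0, 2, 1], log_self=log_self, log_exit=log_exit)
print(alignment)  # [0, 0, 1, 1, 1, 2]
```

In Viterbi training, a segmentation obtained this way would supply new framewise targets and new estimates of the transition and prior probabilities for the next iteration.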
Since the output of both HMM-based systems is a string of phones, a dynamic programming-based string alignment procedure (HTK's HResults tool) was used to compare the output of the system with the correct transcription of the utterance. The accuracy measure counts not only the number of hits but also the number of insertions in the output string: accuracy = ((Hits - Insertions) / Total number of labels) x 100%. For both the traditional and hybrid systems, an insertion penalty was estimated and applied during recognition.
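The accuracy measure can be evaluated directly once the alignment has produced the hit and insertion counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def phoneme_accuracy(hits, insertions, total_labels):
    """Accuracy in percent: ((hits - insertions) / total number of labels) * 100."""
    return 100.0 * (hits - insertions) / total_labels

# Made-up example: 50 reference labels, 40 recognised correctly, 5 spurious insertions.
print(phoneme_accuracy(hits=40, insertions=5, total_labels=50))  # 70.0
```

Because insertions are subtracted, a recogniser that outputs many spurious phones is penalised even when most reference labels are hit, and the measure can become negative.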
5 Results
From Table 1, we can see that bidirectional nets outperformed unidirectional ones in
framewise classification. From Table 2 we can also see that for BLSTM this advantage
carried over into phoneme recognition.
Overall, the hybrid systems outperformed the equivalent HMM systems on
phoneme recognition. Also, for the context dependent HMM, they did so with far fewer
trainable parameters.
The LSTM nets were 8 to 10 times faster to train than the standard RNNs, as well
as slightly more accurate. They were also considerably more prone to overfitting, as
can be seen from the greater difference between their training and test set scores in
Table 1. The highest classification score we recorded on the TIMIT training set with a
bidirectional LSTM net was 86.4% — almost 17% better than we managed on the test
set. This degree of overfitting is remarkable given the high proportion of training frames
to weights (20 to 1, for unidirectional LSTM). Clearly, better generalisation would be
desirable.
Using duration weighted error slightly decreased the classification performance of
BLSTM, but increased its recognition accuracy. This is what we would expect, since its
effect is to make short phones as significant to training as longer ones [4].
Table 1. Framewise Phoneme Classification
Network                 Training Set   Test Set   Epochs
BLSTM                   77.4%          69.8%      21
BRNN                    76.0%          69.0%      170
BLSTM Weighted Error    75.7%          68.9%      15
LSTM (4 frame delay)    77.5%          65.5%      33
RNN (4 frame delay)     70.8%          65.1%      144
LSTM (0 frame delay)    70.9%          64.6%      15
RNN (0 frame delay)     69.9%          64.5%      120
Table 2. Phoneme Recognition Accuracy for Traditional HMM and Hybrid LSTM/HMM
System                      Number of parameters   Accuracy
Context-independent HMM     80 K                   53.7%
Context-dependent HMM       >600 K                 64.4%
LSTM/HMM                    100 K                  60.4%
BLSTM/HMM                   100 K                  65.7%
Weighted error BLSTM/HMM    100 K                  66.9%
6 Conclusion
In this paper, we found that bidirectional recurrent neural nets outperformed unidirectional ones in framewise phoneme classification. We also found that LSTM networks were faster and more accurate than conventional RNNs at the same task. Furthermore, we observed that the advantage of bidirectional training carried over into phoneme recognition with hybrid HMM/LSTM systems. With these systems, we recorded better phoneme accuracy than with equivalent traditional HMMs, and did so with fewer parameters. Lastly, we improved the phoneme recognition score of BLSTM by using a duration weighted error function.
Acknowledgments
The authors would like to thank Nicole Beringer for her expert advice on linguistics and speech recognition. This work was supported by SNF, grant number 200020-100249.
References
1. P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the past and the future
in protein secondary structure prediction. BIOINF: Bioinformatics, 15, 1999.
2. H. Bourlard, Y. Konig, and N. Morgan. REMAP: Recursive estimation and maximization of a posteriori probabilities in connectionist speech recognition. In Proceedings of Eurospeech'95, Madrid, 1995.
3. H.A. Bourlard and N. Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, 1994.
4. R. Chen and L. Jamieson. Experiments on the implementation of recurrent neural networks
for speech phone recognition. In Proceedings of the Thirtieth Annual Asilomar Conference
on Signals, Systems and Computers, pages 779–782, 1996.
5. J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, 1993.
6. F. Gers, N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent
networks. Journal of Machine Learning Research, 3:115–143, 2002.
7. A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, August 2005. In press.
8. S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets:
the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors,
A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
9. S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation,
9(8):1735–1780, 1997.
10. A. J. Robinson. An application of recurrent nets to phone probability estimation. IEEE
Transactions on Neural Networks, 5(2):298–305, March 1994.
11. M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions
on Signal Processing, 45:2673–2681, November 1997.