Article (PDF available)

Learning to Forget: Continual Prediction with LSTM

The MIT Press
Neural Computation
Authors: Felix A. Gers, Jürgen Schmidhuber, Fred Cummins

Abstract and Figures

Long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) can solve numerous tasks not solvable by previous learning algorithms for recurrent neural networks (RNNs). We identify a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel, adaptive “forget gate” that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources. We review illustrative benchmark problems on which standard LSTM outperforms other RNN algorithms. All algorithms (including LSTM) fail to solve continual versions of these problems. LSTM with forget gates, however, easily solves them, and in an elegant way.
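As a purely illustrative numerical example (not taken from the paper, and using the cell notation of the figures below): if the effective cell input g(net_c) · y^in equals 0.5 at every time step, the standard update s_c(t) = s_c(t-1) + 0.5 grows without bound, reaching 500 after 1000 steps, whereas with a constant forget gate activation y^φ = 0.9 the update s_c(t) = 0.9 · s_c(t-1) + 0.5 stays bounded and converges to 0.5 / (1 − 0.9) = 5.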
[Figure: Standard LSTM memory cell. The constant error carousel (CEC, "memorizing") has a fixed self-connection of 1.0; the cell input is squashed by g and gated by the input gate (activation y^in), and the output is squashed by h and gated by the output gate (activation y^out). Internal state update: s_c = s_c + g(net_c) y^in.]
[Figure: LSTM memory cell extended with a forget gate ("memorizing and forgetting"). The forget gate activation y^φ scales the recurrent self-connection, giving the state update s_c = s_c y^φ + g(net_c) y^in; input gating, output gating, and the squashing functions g and h are as in the standard cell.]
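For readers who prefer code to diagrams, the update in the figure above can be sketched in a few lines. This is an illustrative NumPy reconstruction, not code from the paper: the weight names (W_in, W_phi, W_out, W_c), the logistic gate activations, the tanh squashing functions, and the omission of bias terms and of per-block gate sharing are all simplifying assumptions made here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, y_prev, s_prev, W_in, W_phi, W_out, W_c):
    # Each gate sees the external input and the previous cell output.
    z = np.concatenate([x, y_prev])
    y_in  = sigmoid(W_in @ z)        # input gate activation  y^in
    y_phi = sigmoid(W_phi @ z)       # forget gate activation y^phi
    y_out = sigmoid(W_out @ z)       # output gate activation y^out
    g     = np.tanh(W_c @ z)         # squashed cell input g(net_c)
    s = y_phi * s_prev + y_in * g    # state update: the forget gate scales the old state
    y = y_out * np.tanh(s)           # squashed cell output, gated by the output gate
    return y, s

With y_phi close to 1 the cell behaves like standard LSTM and retains its state; driving y_phi towards 0 resets the cell, which is exactly the behaviour the forget gate was introduced to provide.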
[Figure: Transition diagrams of the Reber grammar and the embedded Reber grammar (symbols B, T, P, S, X, V, E), with an added recurrent connection for continuous prediction, yielding the continual embedded Reber grammar.]
[Figure: Statistics of embedded Reber grammar (ERG) strings: percentage of ERG strings vs. string length (logarithmic scale, with exponential fit); probability vs. maximum ERG string length; and expected maximum ERG string length vs. number of samples, on linear and logarithmic axes.]
[Figure: LSTM network topology with input, hidden, and output layers and seven input and seven output units; the hidden layer consists of memory blocks (two are shown) of two cells each, each block with its own input gate, forget gate, and output gate.]
[Figure: Internal cell states and forget gate activations of selected memory cells (cells of blocks 1 and 3) while processing a continual symbol stream; the x-axes show the input symbols at the T/P decision points for stream positions 0-130 and 680-850, with the lengths of the intervening substrings marked between them.]
[Figure: Correctly processed stream length vs. number of stream presentations during training (two panels, stream length on a logarithmic scale).]
... The architecture of an LSTM network includes three main gates: the input gate, the forget gate, and the output gate. These gates regulate the flow of information through the network, enabling it to selectively remember or forget information, which is critical in modeling the dependencies in sequential data like energy prices and their influencing factors (e.g., weather conditions, market dynamics) (Gers et al., 2000). ...
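For reference, the gating structure summarized in this excerpt is usually written in the now-standard notation (which differs slightly from the original paper's), with \sigma the logistic function and \odot elementwise multiplication:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad\text{(forget gate)}
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad\text{(input gate)}
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad\text{(output gate)}
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t \odot \tanh(c_t)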
Article
Full-text available
This study proposes a hybrid approach that integrates econometric and deep learning models—specifically, Vector Autoregression (VAR), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU)—to enhance electricity price forecasting. By combining historical data with external factors like weather and market indicators, this hybrid approach aims to improve prediction accuracy in volatile energy markets. The model captures complex temporal dependencies through a hybrid VAR, LSTM, and GRU structure and is tested on historical electricity price data supplemented with weather and market variables. Performance is evaluated using mean absolute error (MAE), root mean square error (RMSE), symmetric mean absolute percentage error (SMAPE), and root mean squared logarithmic error (RMSLE). Results show that deep learning models, particularly GRU, outperform VAR regarding MAE, RMSE, and RMSLE, suggesting superior predictive accuracy for absolute and relative forecasting tasks. However, SMAPE results highlight that the VAR model performs better in capturing proportional errors, suggesting its relative robustness in volatile price environments. Including weather and market data significantly improves the model’s robustness and accuracy. This study’s hybrid approach combines the interpretability of econometric models with the predictive power of deep learning, offering insights into the impact of external factors on energy prices. The model supports better decision-making and risk management for energy market participants in dynamic market environments.
... To solve these problems, an LSTM model with Feed-Forward Error Correction (FFEC) was developed. The LSTM architecture, originally introduced by [26] and subsequently enhanced by [27], is a specialized type of RNN that introduces gating mechanisms to control the flow of information within the neuron, together with a second transferable state in addition to the hidden state, called the cell state c_t, as shown in Fig. 5 ...
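To make the "second transferable state" mentioned in this excerpt concrete, here is a minimal sketch of how the hidden state h_t and the cell state c_t are both carried from step to step. It assumes PyTorch (which the cited work does not necessarily use), the sizes are arbitrary, and the FFEC correction itself is not shown:

import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=7, hidden_size=4)   # arbitrary toy sizes
h = torch.zeros(1, 4)                             # hidden state h_t
c = torch.zeros(1, 4)                             # cell state c_t, the second transferable state
for x_t in torch.eye(7).unsqueeze(1):             # a toy stream of one-hot inputs, one per step
    h, c = cell(x_t, (h, c))                      # both states are passed on to the next step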
Preprint
Full-text available
This study investigates the performance of machine learning models in forecasting electricity Day-Ahead Market (DAM) prices using short historical training windows, with a focus on detecting seasonal trends and price spikes. We evaluate four models, namely LSTM with Feed Forward Error Correction (FFEC), XGBoost, LightGBM, and CatBoost, across three European energy markets (Greece, Belgium, Ireland) using feature sets derived from ENTSO-E forecast data. Training window lengths range from 7 to 90 days, allowing assessment of model adaptability under constrained data availability. Results indicate that LightGBM consistently achieves the highest forecasting accuracy and robustness, particularly with 45 and 60 day training windows, which balance temporal relevance and learning depth. Furthermore, LightGBM demonstrates superior detection of seasonal effects and peak price events compared to LSTM and other boosting models. These findings suggest that short-window training approaches, combined with boosting methods, can effectively support DAM forecasting in volatile, data-scarce environments.
Article
The particle identification (PID) of hadrons plays a crucial role in particle physics experiments, especially in flavor physics and jet tagging. The cluster counting method, which measures the number of primary ionizations in gaseous detectors, is a promising breakthrough in PID. However, developing an effective reconstruction algorithm for cluster counting remains challenging. To address this challenge, we propose a cluster counting algorithm based on long short-term memory and dynamic graph convolutional neural networks for the CEPC drift chamber. Experiments on Monte Carlo simulated samples demonstrate that our machine learning-based algorithm surpasses traditional methods. It improves the K/π separation of PID by 10%, meeting the PID requirements of CEPC.
Article
Full-text available
Heavy vehicle rollover plays a pivotal role in road safety scenarios. Numerous researchers have addressed the topic, with particular focus on driver-related injuries. Given these and other connected implications, techniques able to estimate and predict overturning events are clearly needed. Different methodologies have been explored, with notable achievements obtained by neural network-based algorithms. At the same time, their heavy data requirements need to be addressed to allow practical applications in terms of time and cost. Consequently, exploring the interaction between simulation and experimental data becomes extremely important, which motivates the methodology proposed by this paper. In detail, a heavy vehicle model was designed in IPG Carmaker, while experimental data on its physical counterpart were acquired. This led to the generation of a synthetic dataset and the collection of an empirical one. Both were used to define a Long Short-Term Memory architecture, with a dual purpose: first, to estimate the vehicle roll angle as a typical rollover indicator; second, to compare the performance of the neural networks, aiming to obtain at least the same order of magnitude in terms of RMSE, MSE, and MAE. The goal was to demonstrate that synthetic data can not only be used in combination with real data, but also as a substitute able to address the time and cost constraints inevitably linked to the latter, allowing more efficient experiments for rollover prevention.
Article
In the domain of financial markets, deep learning techniques have emerged as a significant tool for the development of investment strategies. The present study investigates the potential of time series forecasting (TSF) in financial application scenarios, aiming to predict future spreads and inform investment decisions more effectively. However, the inherent nonlinearity and high volatility of financial time series pose significant challenges for accurate forecasting. To address these issues, this paper proposes the IGWO-MALSTM model, a hybrid framework that integrates Improved Grey Wolf Optimization (IGWO) for hyperparameter tuning and a multi-head attention (MA) mechanism to enhance long-term sequence modeling within the long short-term memory (LSTM) architecture. The IGWO algorithm improves population diversity during initialization using the Mersenne Twister, thereby enhancing the convergence speed and search capability of the optimizer. Simultaneously, the MA mechanism mitigates gradient vanishing and explosion problems, enabling the model to better capture long-range dependencies in financial sequences. Experimental results on real futures market data demonstrate that the proposed model reduces Mean Square Error (MSE) by up to 61.45% and Mean Absolute Error (MAE) by 44.53%, and increases the R² score by 0.83% compared to existing benchmark models. These findings confirm that IGWO-MALSTM offers improved predictive accuracy and stability for financial time series forecasting tasks.
Article
Question answering (QA) systems are a leading and rapidly advancing field of natural language processing (NLP) research. One of their key advantages is that they enable more natural interactions between humans and machines, such as in virtual assistants or search engines. Over the past few decades, many QA systems have been developed to handle diverse QA tasks. However, the evaluation of these systems is intricate, as many of the available evaluation scores are not task-agnostic. Furthermore, translating human judgment into measurable metrics continues to be an open issue. These complexities add challenges to their assessment. This survey provides a systematic overview of evaluation scores and introduces a taxonomy with two main branches: Human-Centric Evaluation Scores (HCES) and Automatic Evaluation Scores (AES). Since many of these scores were originally designed for specific tasks but have been applied more generally, we also cover the basics of QA frameworks and core paradigms to provide a deeper understanding of their capabilities and limitations. Lastly, we discuss benchmark datasets that are critical for conducting systematic evaluations across various QA tasks.
Article
Full-text available
We explore a network architecture introduced by Elman (1988) for predicting successive elements of a sequence. The network uses the pattern of activation over a set of hidden units from time-step t−1, together with element t, to predict element t + 1. When the network is trained with strings from a particular finite-state grammar, it can learn to be a perfect finite-state recognizer for the grammar. When the network has a minimal number of hidden units, patterns on the hidden units come to correspond to the nodes of the grammar, although this correspondence is not necessary for the network to act as a perfect finite-state recognizer. We explore the conditions under which the network can carry information about distant sequential contingencies across intervening elements. Such information is maintained with relative ease if it is relevant at each intermediate step; it tends to be lost when intervening elements do not depend on it. At first glance this may suggest that such networks are not relevant to natural language, in which dependencies may span indefinite distances. However, embeddings in natural language are not completely independent of earlier information. The final simulation shows that long distance sequential contingencies can be encoded by the network even if only subtle statistical properties of embedded strings depend on the early information.
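In now-standard notation (an illustration, not part of the abstract), the simple recurrent network described here computes

h_t = f(W_{xh} x_t + W_{hh} h_{t-1}), \qquad \hat{x}_{t+1} = g(W_{hy} h_t),

that is, the prediction of element t+1 combines element t with the hidden activations carried over from time step t-1.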
Article
Full-text available
Several strategies are described that overcome limitations of basic network models as steps towards the design of large connectionist speech recognition systems. The two major areas of concern are the problem of time and the problem of scaling. Speech signals continuously vary over time and encode and transmit enormous amounts of human knowledge. To decode these signals, neural networks must be able to use appropriate representations of time and it must be possible to extend these nets to almost arbitrary sizes and complexity within finite resources. The problem of time is addressed by the development of a Time-Delay Neural Network; the problem of scaling by Modularity and Incremental Design of large nets based on smaller subcomponent nets. It is shown that small networks trained to perform limited tasks develop time invariant, hidden abstractions that can subsequently be exploited to train larger, more complex nets efficiently. Using these techniques, phoneme recognition networks of increasing complexity can be constructed that all achieve superior recognition performance.
Article
This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
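The contrast drawn in this abstract can be written compactly. For a sequence of predictions P_1, ..., P_m of a final outcome z (with the convention P_{m+1} := z), the two update rules are, roughly in the form Sutton uses:

\Delta w_t \propto (z - P_t)\, \nabla_w P_t \qquad \text{(conventional, outcome-based)}
\Delta w_t \propto (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k}\, \nabla_w P_k \qquad \text{(temporal-difference, TD(\lambda))}

so each prediction is nudged toward the immediately following prediction rather than toward the distant final outcome.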
Article
Time is at the heart of many pattern recognition tasks, e.g., speech recognition. However, connectionist learning algorithms to date are not well suited for dealing with time-varying input patterns. This paper introduces a specialized connectionist architecture and corresponding specialization of the backpropagation learning algorithm that operates efficiently on temporal sequences. The key feature of the architecture is a layer of self-connected hidden units that integrate their current value with the new input at each time step to construct a static representation of the temporal input sequence. This architecture avoids two deficiencies found in other models of sequence recognition: first, it reduces the difficulty of temporal credit assignment by focusing the backpropagated error signal; second, it eliminates the need for a buffer to hold the input sequence and/or intermediate activity levels. The latter property is due to the fact that during the forward (activation) phase, incremental activity traces can be locally computed that hold all information necessary for backpropagation in time. It is argued that this architecture should scale better than conventional recurrent architectures with respect to sequence length. The architecture has been used to implement a temporal version of Rumelhart and McClelland's verb past-tense model [D. E. Rumelhart and J. L. McClelland, On learning the past tenses of English verbs, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. II, MIT Press/Bradford Books, Cambridge, 216-271 (1986)]. The hidden units learn to behave something like Rumelhart and McClelland's "Wickelphones", a rich and flexible representation of temporal information.
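A hedged sketch of the self-connected integrating hidden units described above (the notation is mine, not the article's): each unit keeps a decaying trace of its own past activity,

c_i(t) = d_i\, c_i(t-1) + f\!\left(\sum_j w_{ij}\, x_j(t)\right), \qquad 0 \le d_i \le 1,

so the layer accumulates a static summary of the whole input sequence that the output layer can read at any time.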
Technical Report
Error propagation networks are able to learn a variety of tasks in which a static input pattern is mapped onto a static output pattern. This paper presents a generalisation of these nets to deal with time-varying, or dynamic, patterns. Three possible architectures are explored which deal with learning sequences of known finite length and sequences of unknown and possibly infinite length. Several examples are given and an application to speech coding is discussed. A further development of dynamic nets is made which allows them to be trained by a signal which expresses the correctness of the output of the net, the utility signal. One possible architecture for such a utility-driven dynamic net is given and a simple example is presented. Utility-driven dynamic nets are potentially able to calculate and maximise any function of the input and output data streams, within the considered context. This is a very powerful property, and an appendix presents a comparison of the information processing in utility-driven dynamic nets and that in the human brain.