Article (PDF available)

Long Short-term Memory

Authors: Sepp Hochreiter and Jürgen Schmidhuber

Abstract and Figures

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
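To make the mechanism described in the abstract concrete, below is a minimal NumPy sketch of a single memory cell in the spirit of the original formulation: a multiplicative input gate and output gate around a constant error carousel (the unit-weight self-connection). It is a simplified sketch, not the paper's exact equations: the gates here see only the current input (not recurrent connections), the squashing functions are taken as tanh/sigmoid, and the later-added forget gate is omitted; all weight names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm1997_cell_step(x, s_prev, W_c, W_in, W_out):
    """One step of a single original-style LSTM memory cell (no forget gate).

    x      : input vector at the current time step
    s_prev : previous internal state of the cell (the CEC activation)
    W_c, W_in, W_out : illustrative weight vectors for the cell input,
                       the input gate, and the output gate
    """
    y_in  = sigmoid(W_in  @ x)   # input gate activation
    y_out = sigmoid(W_out @ x)   # output gate activation
    g     = np.tanh(W_c @ x)     # squashed cell input
    s     = s_prev + y_in * g    # CEC: self-connection with fixed weight 1.0
    y_c   = y_out * np.tanh(s)   # gated cell output (h squashing)
    return y_c, s
```

Because the internal state s is carried forward with weight 1.0, error flowing back through it is neither scaled up nor down, which is the "constant error flow" the abstract refers to.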
[Figure: architecture of memory cell c_j and its multiplicative gate units in_j and out_j. The cell input net_c_j is squashed by g and scaled by the input gate activation y^in_j before being added to the internal state s_c_j, which feeds back to itself through a fixed self-connection of weight 1.0 (the constant error carousel). The cell output is y^c_j = y^out_j h(s_c_j).]

[Figure: example network topology with input, hidden, and output layers; the hidden layer contains two memory cell blocks of size 2, each with its own input gate (in 1, in 2) and output gate (out 1, out 2).]

[Figure: the Reber grammar and the embedded Reber grammar benchmark; transitions emit the symbols B, T, P, S, X, V, E, and the embedded version routes through two copies of the REBER GRAMMAR sub-graph via an initial T/P branch.]
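The figures above reference the (embedded) Reber grammar benchmark. As a hedged illustration, the sketch below generates strings from the standard Reber grammar using the commonly cited transition table, and wraps such a string between a shared T/P symbol pair for the embedded variant. The transition graph and the exact composition of the embedded strings are assumptions drawn from the usual formulation of the benchmark, not read off the figure.

```python
import random

# Assumed standard Reber grammar: state -> list of (emitted symbol, next state)
REBER_TRANSITIONS = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 4)],
    3: [("X", 2), ("S", 5)],
    4: [("P", 3), ("V", 5)],
    5: [],  # terminal state; the string ends with E
}

def reber_string(rng=random):
    """Generate one string from the (assumed) standard Reber grammar."""
    out, state = ["B"], 0
    while REBER_TRANSITIONS[state]:
        symbol, state = rng.choice(REBER_TRANSITIONS[state])
        out.append(symbol)
    out.append("E")
    return "".join(out)

def embedded_reber_string(rng=random):
    """Embedded variant: the second symbol (T or P) must be recalled at the end."""
    outer = rng.choice(["T", "P"])
    return "B" + outer + reber_string(rng) + outer + "E"

if __name__ == "__main__":
    print(reber_string())           # e.g. BTSSXXTVVE
    print(embedded_reber_string())  # e.g. BPBTXSEPE
```

The embedded version is the hard case: the network must remember the second symbol across the entire inner string before it becomes relevant again.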
... Kumar et al. (2019) applied RNN models to predict the monthly rainfall in India. As one class of RNN, long short-term memory (LSTM) was introduced to improve the performance of training on long sequences and to solve the vanishing and exploding gradient problems by enforcing constant error flow through constant error carousels within special memory cells (Hochreiter and Schmidhuber 1997). LSTM has been successfully employed in hydrological studies in many areas. ...
... However, RNNs may fail to capture long-term dependencies because of gradient explosion and gradient vanishing during back-propagation through long sequences. As a special form of RNN, LSTM was developed to overcome this obstacle of traditional RNNs and to improve prediction performance and efficiency for data with long-term dependencies by introducing three gates (Hochreiter and Schmidhuber 1997). LSTM has a chain structure with an input gate, an output gate, and a forget gate in each cell. ...
Article
Full-text available
A continuous and complete spring discharge record is critical in understanding the hydrodynamic behavior of karst aquifers and the variability of freshwater resources. However, due to equipment errors, failures of observation, and other reasons, missing data are a common problem in spring discharge monitoring and in further hydrological investigations and data analysis. In this study, a novel approach that integrates deep learning algorithms and ensemble empirical mode decomposition (EEMD) is proposed to reconstruct missing spring discharge data from a given local precipitation record. Using EEMD, the local precipitation data are decomposed into several intrinsic mode functions (IMFs) from high to low frequencies and a residual function, which serve as the input of convolutional neural network (CNN), long short-term memory (LSTM), and hybrid CNN-LSTM models to reconstruct the missing discharge data. Evaluation metrics, including the root mean squared error (RMSE), mean absolute error (MAE), and Nash–Sutcliffe efficiency coefficient (NSE), are calculated to evaluate the reconstruction performance. The monthly spring discharge and precipitation data from March 1978 to October 2021 collected at Barton Springs in Texas are used for the validation and evaluation of the newly proposed deep learning models. The results indicate that the deep learning models coupled with EEMD outperform the models without EEMD and significantly improve the reconstruction results. The LSTM-EEMD model obtains the best reconstruction results among the three deep learning algorithms. For models using monthly data, the missing rate affects the reconstruction performance because it determines the number of available data samples: the best reconstruction results are achieved when the missing rate is low, and at a missing rate of 50% the reconstruction results become notably poorer. However, when daily precipitation and discharge data are used, the models obtain satisfactory reconstruction results with missing rates ranging from 10% to 50%.
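The reconstruction study above scores its models with RMSE, MAE, and NSE. The short sketch below shows the usual definitions of these metrics, assuming NumPy arrays of observed and simulated discharge; it illustrates the standard formulas, not the authors' evaluation code.

```python
import numpy as np

def rmse(obs, sim):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def mae(obs, sim):
    """Mean absolute error."""
    return float(np.mean(np.abs(obs - sim)))

def nse(obs, sim):
    """Nash–Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean of obs."""
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2))
```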
... Recently, hydrologists have started to use Long Short-Term Memory (LSTM) in different applications. Introduced by Hochreiter and Schmidhuber [31], LSTM models are a powerful tool for solving time-series prediction problems [32]. LSTMs have been shown to capture various time-series trends and behaviors and to learn their long-range dependencies with high performance [33]. ...
... To overcome this flaw, LSTM models were introduced to preserve long-term dependencies in time-series data thanks to their purpose-built memory cell [45]. In addition, LSTM models provide a solution to the vanishing gradient problem related to long-term context memorization [31], as their architecture trims the gradients in the network using continuous error flows through constant error carousels within special multiplicative units [44]. The cells are composed of a sigmoid neural net layer and a multiplication operation, as depicted in Fig. 3b [44]. The input at time step t is (X t), the hidden state from the previous time step that is introduced to the LSTM block is (H t), and the hidden state (S t) is then computed as follows: ...
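The snippet above is truncated before the update equations. As a hedged sketch, one step of a standard modern LSTM block with input, forget, and output gates (the common formulation used in such applications, not necessarily the exact equations of reference [44]) can be written as follows; W, U, and b are illustrative input weights, recurrent weights, and biases.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM block (dict keys 'i', 'f', 'o', 'g' are illustrative)."""
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # candidate cell input
    c_t = f * c_prev + i * g                              # cell state update
    h_t = o * np.tanh(c_t)                                # hidden state / block output
    return h_t, c_t
```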
Article
Full-text available
This study aims to develop a Machine Learning (ML)-based technique to infer reservoir management rules and predict downstream discharge values. The case study is the Hackensack River Watershed in New Jersey, USA. A Long Short-Term Memory (LSTM) model was used to predict streamflow values at the USGS station at New Milford, right downstream of Oradell reservoir. A good agreement between observed and simulated streamflow values was obtained during the 2020–2021 testing period. An NSE value of 0.93 was obtained with the 48-h precipitation lead time, suggesting that the 48-h precipitation forecast mostly drives releases from Oradell reservoir. The developed model was tested during Hurricane Ida. The analysis revealed that a similar NSE of 0.95 was obtained with the 48-h precipitation lead time, followed by the 12-h lead-time model, which was based on the watershed response time. In addition, the conducted feature analysis revealed that only four out of the seven upstream USGS stations in the watershed have a significant impact on the model's performance. This work implies that ML can capture reservoir management rules and predict reservoir releases using precipitation and upstream flow data as input variables. This study lays the groundwork for a generalization of the method over the CONUS to infer reservoirs' operation rules for streamflow simulation.
... RNNs are in theory capable of handling sequences of arbitrary length by repeating the cell to cover the entire length, but for long sequences they are subject to strong exploding or vanishing gradient effects (Hochreiter et al., 2001). To solve this issue, Hochreiter and Schmidhuber (1997) introduced long short-term memory networks (LSTMs), with cells that output not only a prediction but also a "cell state" that can be linearly altered by the two other inputs of the cell (the prediction from the previous cell and the current input from the data sequence). This mechanism is similar to residual connections in residual networks (He et al., 2016), as layers can be skipped by part of the data flowing through the network. ...
... Although the "cell state" of LSTM (Hochreiter and Schmidhuber, 1997) can be considered a form of such a mechanism, the term "attention mechanism" was first introduced with the Transformer architecture (Vaswani et al., 2017) to tackle natural language processing (NLP) problems. Bigger architectures based on similar mechanisms, such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), have shown impressive results on NLP tasks. ...
Thesis
Constant improvement of DNA sequencing technology that produces large quantities of genetic data should greatly enhance our knowledge of evolution, particularly demographic history. However, the best way to extract information from this large-scale data is still an open problem. Neural networks are a strong candidate to attain this goal, considering their recent success in machine learning. These methods have the advantages of handling high-dimensional data, adapting to most applications, and scaling efficiently to available computing resources. However, their performance depends on their architecture, which should match the data properties to extract the maximum information. In this context, this thesis presents new approaches based on deep learning, as well as principles for designing architectures adapted to the characteristics of genomic data. The use of convolution layers and attention mechanisms allows the presented networks to be invariant to permutations of the sampled haplotypes and to adapt to data of different dimensions (number of haplotypes and polymorphism sites). Experiments conducted on simulated data demonstrate the efficiency of these approaches by comparing them to more classical network architectures, as well as to state-of-the-art methods. Moreover, coupling neural networks with methods already proven in population genetics, such as approximate Bayesian computation, improves the results and combines their advantages. The practicality of neural networks for demographic inference is tested on whole-genome sequence data from real populations of Bos taurus and Homo sapiens. Finally, the scenarios obtained are compared with current knowledge of the demographic history of these populations.
... This makes training more difficult due to vanishing or exploding gradient problems during the back-propagation process and the difficulty of capturing long-term dependencies. A type of RNN that can process and capture the dynamics of time and retain them as a state is the long short-term memory (LSTM) network (Hochreiter and Schmidhuber 1997). LSTM was therefore applied earlier (Duan et al. 2016) for traffic prediction, and it was shown that LSTM could outperform vanilla RNN in terms of accuracy and stability for traffic velocity prediction. ...
... The gated recurrent unit (GRU), proposed in 2014, is another improvement of RNN. Compared to LSTM, GRUs share the same basic idea of gating mechanisms to learn long-term dependencies (Hochreiter and Schmidhuber 1997), albeit with some differences. The main difference is that the GRU has only a reset gate r_t and an update gate z_t. ...
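For contrast with the LSTM step above, here is a minimal sketch of one GRU step with only a reset gate r_t and an update gate z_t. It follows one common convention for the update interpolation; weight names are illustrative, and this is not tied to any particular cited implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step (dict keys 'z', 'r', 'h' are illustrative)."""
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])               # update gate
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])               # reset gate
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                            # interpolated new state
```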
Article
Full-text available
The fast growth of information and communication technology has increased the application of deep learning technology in many areas. Traffic has become one of the leading problems of modern life in urban settings because of the steady growth in the number of vehicles. Tracking congestion throughout the road network is important for achieving intelligent transportation systems. However, predicting traffic flow is quite difficult due to its nonlinear characteristics. In this study, a model is proposed that uses an attention mechanism on modified recurrent neural networks (RNNs). The attention mechanism addresses the limitation in modeling long-range dependencies and the efficient use of memory for computation. The modified RNN, which combines a residual module and a deep stacked GRU-type RNN, is applied as the encoder–decoder network to improve the prediction performance of the model by reducing the potential for vanishing gradients and enhancing the ability to capture longer dependencies. The proposed method is evaluated on two real-world road sensor datasets from the open-access PeMS database, covering the San Jose Bay area and the Northbound Interstate 405 area. The results show how deep learning features with attention mechanisms can provide precise short-term and long-term traffic prediction compared to classical and modern deep neural network-based baselines.
... The first processes the video frame by frame: image quality assessment models (e.g., 2-dimensional convolutional neural networks, 2D-CNNs) are combined with a temporal regression model (e.g., recurrent neural networks such as RNN or LSTM [6]) to model the relationship between frames, and the final video quality score is output by temporal pooling. Algorithms such as VSFA [7] and CNN+LSTM [8], for example, can achieve good results, but these methods rely heavily on spatial perceptual features and lose considerable useful temporal information. ...
Article
Full-text available
Quality assessment of real, user-generated content videos without reference videos is a challenging problem. For such scenarios, we propose an objective no-reference video quality assessment method based on the spatio-temporal perception characteristics of the video. First, a dual-branch network is constructed from distorted video frames and frame difference maps generated from a global perspective, considering the interaction between spatial and temporal information, incorporating a motion-guided attention module, and fusing spatio-temporal perceptual features from a multiscale perspective. Second, an InceptionTime network is introduced to further perform long-term sequence fusion to obtain the final perceptual quality score. Finally, the results are evaluated on four user-generated content video databases, KoNViD-1k, CVD2014, LIVE_VQC and LIVE_Qualcomm, and the experimental results show that the network outperforms several recent no-reference VQA methods.
... The LSTM, introduced by Hochreiter and Schmidhuber (1997), is an evolution of the Recurrent Neural Network (RNN). This model takes advantage of RNN capabilities, including modeling sequential data, while addressing some of their drawbacks, such as the exploding and vanishing gradient issue, by adding extra interactions per cell. ...
Article
Full-text available
As a primary input in meteorology, the accuracy of solar radiation simulations affects hydrological, climatological, and agricultural studies as well as sustainable development practices and plans. With the advent of machine learning models and their proven capabilities in modelling hydro-meteorological phenomena, it is necessary to find the best model suited to each phenomenon. Model performance depends upon model structure and the input data set. Therefore, several well-known and more recent machine learning models with different inputs are tested here for solar radiation simulation in Illinois, USA. The data mining models of Support Vector Machine (SVM), Gene Expression Programming (GEP), and Long Short-Term Memory (LSTM), and their combinations with the wavelet transformation, building a total of six model structures, are applied to five data sets to examine their suitability for solar radiation simulation. The five input data sets (SCN_1 to SCN_5) are based on five readily accessible parameters, namely extraterrestrial radiation (Ra), maximum and minimum air temperature (Tmax, Tmin), corrected clear-sky solar irradiation (ICSKY), and day of year (DOY). The LSTM outperformed the other models according to the performance measures RMSE, SI, MAE, SSRMSE, and SSMAE. Of the different input data sets, SCN_4 was in general the best input scenario for predicting global daily solar radiation, using the Ra, Tmax, Tmin, and DOY variables. Overall, the six machine-learning-based models showed acceptable performance for estimating solar radiation, with the LSTM technique being the most recommended.
Article
Adaboost-based mental health prediction is a method that utilizes an ensemble learning algorithm to address the current state of mental health issues among graduates entering the workforce. The method first extracts the features of mental health test data and, after data cleaning and normalization, the data are mined and analyzed using a decision tree classifier. The Adaboost algorithm is then used to train the decision tree classifier over multiple iterations in order to improve its classification performance, and a mental health prognosis model is constructed. Using the model, 2780 students in the class of 2022 at a university were analyzed. The trial results demonstrated that the strategy was capable of identifying sensitive psychological disorders in a timely manner, providing a basis for decision-making and planning around graduate students' mental health.
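The pipeline described above (normalized features, a decision-tree base learner, boosted with AdaBoost) corresponds to a standard ensemble setup. Below is a hedged scikit-learn sketch of that setup; the random feature matrix and labels are placeholders for the (unavailable) mental-health test data, and the hyperparameters are illustrative, not the authors' values.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: stands in for the cleaned mental-health test features and labels.
X = np.random.rand(500, 20)          # hypothetical feature matrix
y = np.random.randint(0, 2, 500)     # hypothetical binary risk labels

X = MinMaxScaler().fit_transform(X)  # normalization step mentioned in the abstract
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# AdaBoost over shallow decision trees (use base_estimator= on scikit-learn < 1.2).
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                         n_estimators=100, learning_rate=0.5, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```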
Chapter
Natural language interfaces are gaining popularity as an alternative interface for non-technical users. Natural language interface to database (NLIDB) systems have been attracting considerable interest recently; they are developed to accept a user's query in natural language (NL) and convert it to an SQL query, which is then executed to extract the resulting data from the database. This Text-to-SQL task is a long-standing, open problem, and the standard approach towards solving it is to implement a sequence-to-sequence model. In this paper, I recast the Text-to-SQL task as a machine translation problem using sequence-to-sequence-style neural network models. To this end, I have introduced a parallel corpus that I developed from the WikiSQL dataset. Though a lot of work has been done in this area using sequence-to-sequence-style models, most of the state-of-the-art models use semantic parsing or a variation of it, and none of these models exceeds 90% accuracy. In contrast, my model is based on a very simple architecture, as it uses the open-source neural machine translation toolkit OpenNMT, which implements a standard SEQ2SEQ model. Though my model's performance in predicting on the test and development datasets is not better than that of the said models, its training accuracy is higher than that of any existing NLIDB system, to the best of my knowledge.
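Framing Text-to-SQL as machine translation amounts to writing the NL questions and their SQL queries as parallel source/target files that any seq2seq toolkit (OpenNMT included) can train on. The snippet below is only an illustrative sketch of that framing; the file names and the example question/SQL pair are hypothetical and not taken from the author's preprocessing.

```python
# Illustrative pairing of WikiSQL-style questions with their SQL queries as a
# parallel corpus; file names and the example pair are hypothetical.
pairs = [
    ("how many singers are from france ?",
     "SELECT COUNT(singer) FROM table WHERE country = 'france'"),
]

with open("src-train.txt", "w") as src, open("tgt-train.txt", "w") as tgt:
    for question, sql in pairs:
        src.write(question.strip() + "\n")   # source side: natural-language question
        tgt.write(sql.strip() + "\n")        # target side: SQL query to be "translated" into
```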
Article
College ideological and political education has always been the primary content of national spiritual civilization construction. Current teaching methods are increasingly flexible, with the result that the quality of ideological and political teaching cannot be reasonably assessed. To address this problem, we propose a method for assessing the quality of ideological and political teaching based on a gated recurrent unit (GRU) network and construct an automatic assessment system for ideological and political teaching. We draw on a transfer learning model to improve the loss function, using a generalized intersection-over-union term in the joint loss function to compensate for the small number of ideological and political teaching datasets. We use a masking algorithm to enhance the local features of teaching data sequences for different classes of ideological and political teaching quality assessment metrics. In addition, we use a minimum outer matrix algorithm to extract the sequence features of different assessment dimensions to improve the accuracy of the model in assessing the quality of ideological and political teaching. To meet the conditions for quality assessment of ideological and political teaching, we compiled ideological and political teaching datasets according to the teaching data coverage. The experimental results showed that our method performed best in comprehensive quality assessment accuracy for ideological and political teaching, with an assessment accuracy above 90%. Compared with traditional machine learning methods and deep learning methods, our method has higher accuracy and better robustness.
Article
Demand forecasting is one of managers' main concerns in service supply chain management. With accurate passenger flow forecasting, station-level service suppliers can make better service plans accordingly. However, existing forecasting models cannot distinguish the different future passenger flows at different types of stations. As a result, service suppliers cannot make service plans according to the demands of different stations. In this article, we propose a deep learning architecture called DeepSPF (Deep Learning for Subway Passenger Forecasting) to predict subway passenger flow while considering the different functional types of stations. We also propose sliding long short-term memory (LSTM) neural networks, which combine LSTM and one-dimensional convolution, as an important component of our model. In experiments on the Beijing subway, DeepSPF outperforms the baseline models at three time granularities (10, 15, and 30 minutes). Moreover, a comparison between variants of DeepSPF indicates that, with information on stations' functional types, DeepSPF remains robust when an abnormal situation happens.
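As a rough illustration of combining one-dimensional convolution with an LSTM for a passenger-flow-like series, here is a generic PyTorch sketch. It is not the authors' DeepSPF or sliding-LSTM architecture; the class name, layer sizes, and sequence length are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMForecaster(nn.Module):
    """Generic 1-D convolution + LSTM forecaster (illustrative, not DeepSPF)."""
    def __init__(self, n_features=1, conv_channels=32, hidden_size=64):
        super().__init__()
        self.conv = nn.Conv1d(n_features, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                  # x: (batch, time, n_features)
        z = self.conv(x.transpose(1, 2))   # convolve over time: (batch, channels, time)
        z = torch.relu(z).transpose(1, 2)  # back to (batch, time, channels)
        out, _ = self.lstm(z)
        return self.head(out[:, -1])       # predict the next value from the last step

model = ConvLSTMForecaster()
y_hat = model(torch.randn(8, 48, 1))       # e.g. 48 past flow intervals per sample
```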
Article
Full-text available
Numerous recent papers (including many NIPS papers) focus on standard recurrent nets' inability to deal with long time lags between relevant input signals and teacher signals. Rather sophisticated, alternative methods were proposed. We first show: problems used to promote certain algorithms in numerous previous papers can be solved more quickly by random weight guessing than by the proposed algorithms. This does not mean that guessing is a good algorithm. It just casts doubt on whether the other algorithms are, or whether the chosen problems are meaningful. We then use long short term memory (LSTM), our own recent algorithm, to solve hard problems that can neither be quickly solved by random weight guessing nor by any other recurrent net algorithm we are aware of.
1 Introduction / Outline
Many recent papers focus on standard recurrent nets' inability to deal with long time lags between relevant signals. See, e.g., Bengio et al., El Hihi and Bengio, and others [3, 1, 6, 15]. Rather s...
Article
Full-text available
Numerous recent papers focus on standard recurrent nets' problems with long time lags between relevant signals. Some propose rather sophisticated, alternative methods. We show: many problems used to test previous methods can be solved more quickly by random weight guessing.
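The "random weight guessing" baseline in the two abstracts above is simple enough to sketch: repeatedly sample all weights of a small recurrent net from a fixed range and keep the first draw that solves the task. The network shapes, weight range, and the evaluate_on_task stand-in below are assumptions for illustration, not the authors' experimental setup.

```python
import numpy as np

def guess_weights(evaluate_on_task, shapes, weight_range=10.0,
                  max_trials=100_000, seed=0):
    """Random weight guessing: sample full weight sets until one solves the task.

    evaluate_on_task(weights) -> bool is a hypothetical stand-in for checking
    whether a recurrent net with those weights meets the task's success criterion.
    shapes is a list of weight-array shapes, e.g. [(10, 10), (10, 4)].
    """
    rng = np.random.default_rng(seed)
    for trial in range(1, max_trials + 1):
        weights = [rng.uniform(-weight_range, weight_range, size=s) for s in shapes]
        if evaluate_on_task(weights):
            return trial, weights   # number of guesses needed, winning weights
    return None, None
```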
Article
Full-text available
We explore a network architecture introduced by Elman (1988) for predicting successive elements of a sequence. The network uses the pattern of activation over a set of hidden units from time-step t−1, together with element t, to predict element t + 1. When the network is trained with strings from a particular finite-state grammar, it can learn to be a perfect finite-state recognizer for the grammar. When the network has a minimal number of hidden units, patterns on the hidden units come to correspond to the nodes of the grammar, although this correspondence is not necessary for the network to act as a perfect finite-state recognizer. We explore the conditions under which the network can carry information about distant sequential contingencies across intervening elements. Such information is maintained with relative ease if it is relevant at each intermediate step; it tends to be lost when intervening elements do not depend on it. At first glance this may suggest that such networks are not relevant to natural language, in which dependencies may span indefinite distances. However, embeddings in natural language are not completely independent of earlier information. The final simulation shows that long distance sequential contingencies can be encoded by the network even if only subtle statistical properties of embedded strings depend on the early information.
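The architecture described in this abstract (hidden activations from time step t−1 combined with element t to predict element t+1) is the simple recurrent ("Elman") network. A minimal sketch of one prediction step, with illustrative weight names and tanh/softmax nonlinearities assumed:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def srn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """Elman-style simple recurrent network step.

    The hidden state mixes the current element x_t with the copied-back hidden
    activations h_prev from the previous step; the output is a distribution
    over possible next elements (e.g. grammar symbols).
    """
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = softmax(W_hy @ h_t + b_y)   # prediction of element t + 1
    return y_t, h_t
```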
Technical Report
Full-text available
Error propagation networks are able to learn a variety of tasks in which a static input pattern is mapped onto a static output pattern. This paper presents a generalisation of these nets to deal with time varying, or dynamic, patterns. Three possible architectures are explored which deal with learning sequences of known finite length and sequences of unknown and possibly infinite length. Several examples are given and an application to speech coding is discussed. A further development of dynamic nets is made which allows them to be trained by a signal which expresses the correctness of the output of the net, the utility signal. One possible architecture for such a utility driven dynamic net is given and a simple example is presented. Utility driven dynamic nets are potentially able to calculate and maximise any function of the input and output data streams, within the considered context. This is a very powerful property, and an appendix presents a comparison of the information processing in utility driven dynamic nets and that in the human brain.
Article
Recurrent connections in neural networks potentially allow information about events occurring in the past to be preserved and used in current computations. How effectively this potential is realized depends on the power of the learning algorithm used. As an example of a task requiring recurrency, Servan-Schreiber, Cleeremans, and McClelland [1] have applied a simple recurrent learning algorithm to the task of recognizing finite-state grammars of increasing difficulty. These nets showed considerable power and were able to learn fairly complex grammars by emulating the state machines that produced them. However, there was a limit to the difficulty of the grammars that could be learned. We have applied a more powerful recurrent learning procedure, called real-time recurrent learning [2,6] (RTRL), to some of the same problems studied by Servan-Schreiber, Cleeremans, and McClelland. The RTRL algorithm solved more difficult forms of the task than the simple recurrent networks. The internal representations developed by RTRL networks revealed that they learn a rich set of internal states that represent more about the past than is required by the underlying grammar. The dynamics of the networks are determined by the state structure and are not chaotic.
Article
An adaptive neural network with asymmetric connections is introduced. This network is related to the Hopfield network with graded neurons and uses a recurrent generalization of the δ rule of Rumelhart, Hinton, and Williams to modify adaptively the synaptic weights. The new network bears a resemblance to the master/slave network of Lapedes and Farber but it is architecturally simpler.