Article (PDF available)

Learning long-term dependencies with gradient descent is difficult

Authors:
  • Yoshua Bengio
  • Patrice Simard
  • Paolo Frasconi

Abstract

Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.
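The trade-off stated in the abstract can be illustrated numerically: for the network to latch information robustly, the recurrent map must be contracting, but then the Jacobian of the state at time T with respect to the state at time 0 shrinks geometrically with T. A minimal sketch with a linear recurrence (the matrix, its size, and the spectral radius 0.9 are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
# Random recurrent weights rescaled to spectral radius 0.9 -- a contracting
# map, i.e. the regime in which stored information can persist robustly.
W = rng.standard_normal((n, n))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

# For the linear recurrence h_t = W h_{t-1}, d h_T / d h_0 = W^T, so the
# gradient carried back across a delay of T steps decays geometrically in T.
norms = {T: np.linalg.norm(np.linalg.matrix_power(W, T), 2) for T in (1, 10, 50, 100)}
for T, g in norms.items():
    print(f"T={T:4d}  ||dh_T/dh_0|| = {g:.2e}")
```

With a nonlinearity the picture is the same near a stable attractor: the per-step Jacobian has norm below one, so long-delay error signals vanish before they can drive learning.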
[Figure residue from the original PDF; only captions and axis labels are recoverable:
  (a) a recurrent network with states h_t0, ..., h_t and recurrent weight w; (b) the latching subsystem LT.
  A plot of S against W0 (W0 from 0.5 to 2.5, S from 0 to 1).
  A plot of Freq. against T (T from 5 to 60, Freq. from 0.2 to 1).
  (a), (b) The domain of a_t, partitioned by X, β, Γ into regions with |M'| < 1 and |M'| > 1.
  Final classification error and number of sequence presentations (log scale, 100 to 51200; up to 409600 for the Parity problem) versus T (0 to 100) for the Latch, 2-Sequence, and Parity problems.]
... However, an excessive depth of the network can result in performance stagnation or degradation [6]. Additionally, gradient vanishing or exploding issues may arise in deeper networks, making these networks harder to train [7,8]. To address these issues, the authors in [9] recommended using residual learning to enable the training of significantly deeper architectures. ...
... Let f : R × [0, T] → R be measurable and satisfy (8). If there exists u0 ∈ R such that f(u0, ·) ∈ L∞(0, T), then there exists a unique solution u(t) to (7). In particular, if f satisfies (8) and (9), then there exists a unique solution u(t) to (7). ...
... The auto-regressive and recurrent nature of RNNs often leads to gradient and error accumulation during training, and this can result in the model converging to suboptimal solutions (Dai et al. 2022). While GRU and LSTM mitigate gradient-related issues through gating mechanisms, they may also introduce temporal information loss (Bai et al. 2018; Bengio et al. 1994; Pascanu et al. 2012). When applied to highly nonlinear datasets, such losses can hinder the full potential of DL models. ...
... While various enhancements to LSTMs can alleviate gradient-related challenges, they do not entirely eliminate these issues (Kolen and Kremer 2001; Pascanu et al. 2013). As a result, the persistence of gradient problems continues to hinder the effective learning of long-term dependencies (Bengio et al. 1994; Chung et al. 2014). Recently, transformer-based networks have shown significant potential in time series forecasting tasks (Yin et al. 2022a, 2023). ...
Article
Full-text available
Urban real-time rainfall-runoff forecasting (URRF) offers an economical and efficient approach to assessing flood risks in urban areas. However, the hydrological processes of urban rainfall are characterized by high nonlinearity and long-term dependencies due to strong uncertainties and significant human influences, making URRF a challenging task in hydrological simulation. Existing methods often fall short in meeting the requirements for real-time response and accuracy. To address these limitations, this paper proposes a novel global-encoder and local-decoder (GL-ED) model. The global encoder extracts global temporal features, while the local decoder focuses on forecasting. A temporal fully connected (TFC) module is introduced within the global encoder to capture the global features of runoff sequences, overcoming the limitation of convolutional operations that primarily focus on local information. Additionally, to tackle the uneven distribution of urban rainfall-runoff data, a novel RD loss function is proposed, combining dynamic time warping (DTW) with RMSE to better guide the training of complex features. The GL-ED model was evaluated using observed urban rainfall events from January 2018 to December 2019 in a 3.52 km² complex terrain area in Chongqing, China. Experimental results demonstrate that the GL-ED model outperforms conventional deep learning models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks, in terms of NSE, MAE, and RMSE. In ablation experiments, the GL-ED model trained with the RD loss function achieved an average improvement of 2.51% in NSE and 5.62% in KGE, while reducing RMSE, MAE, and Pbias by 6.08%, 11.31%, and 93%, respectively.
These findings highlight the model’s capability to provide reliable and accurate real-time rainfall-runoff forecasting, offering significant potential for enhancing urban flood risk management and decision-making processes.
... However, a significant shortcoming of RNNs is that they may be prone to the problem of vanishing gradients when processing long input sequences, which means the model struggles to properly learn patterns from earlier time steps during training. This happens because the model tends to forget earlier inputs as it moves through the subsequent parts of the sequence [6]. This issue is particularly concerning in early prediction tasks, where information from the initial weeks may be crucial. ...
... Here, h_t ∈ R^d is the hidden state at the t-th time step, d denotes the size of the hidden state, and σ is a non-linear activation function within the RNN. However, RNNs are prone to vanishing gradient problems [6]. Therefore, σ could alternatively be an LSTM [21] or GRU [11] activation function. ...
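The recurrence described in the excerpt above can be written out as a single vanilla-RNN step; the names rnn_step, W, U, b and the sizes below are illustrative, not taken from the cited work:

```python
import numpy as np

def rnn_step(h_prev, x, W, U, b):
    """One vanilla-RNN update: h_t = tanh(W h_{t-1} + U x_t + b)."""
    return np.tanh(W @ h_prev + U @ x + b)

rng = np.random.default_rng(1)
d, k = 8, 3                          # hidden size and input size (illustrative)
W = rng.standard_normal((d, d)) * 0.1
U = rng.standard_normal((d, k)) * 0.1
b = np.zeros(d)

h = np.zeros(d)
for x in rng.standard_normal((5, k)):  # roll the cell over 5 input steps
    h = rnn_step(h, x, W, U, b)
print(h.shape)                         # the hidden state keeps size d
```

Replacing the tanh update with an LSTM or GRU cell, as the excerpt suggests, changes only the body of rnn_step; the outer loop over time steps is the same.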
... Additionally, the value of χ1 determines the transition from vanishing gradients (χ1 < 1) to exploding gradients (χ1 > 1). These two phases have well-known consequences for training: vanishing gradients hinder learning by causing a long persistence of the initial conditions, while exploding gradients lead to instability in the training dynamics [21,22]. At the transition point, the gradients are stable and the depth-scale of signal propagation diverges exponentially. ...
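The two phases described above can be observed numerically by accumulating per-step Jacobians of a tanh recurrence at two weight scales; the function gradient_norm and the scales 0.5 and 2.0 below are illustrative assumptions, not from the cited preprint:

```python
import numpy as np

def gradient_norm(scale, T=50, n=32, seed=0):
    """2-norm of d h_T / d h_0 for h_t = tanh(W h_{t-1}), W ~ scale * N(0, 1/n)."""
    rng = np.random.default_rng(seed)
    W = scale * rng.standard_normal((n, n)) / np.sqrt(n)
    h = rng.standard_normal(n) * 0.1
    J = np.eye(n)
    for _ in range(T):
        h = np.tanh(W @ h)
        # Chain rule: the local Jacobian of one step is diag(1 - tanh^2) @ W.
        J = (1 - h**2)[:, None] * W @ J
    return np.linalg.norm(J, 2)

print(gradient_norm(0.5))   # contracting regime: the gradient vanishes
print(gradient_norm(2.0))   # expanding regime: the gradient is far larger
```

Sweeping the scale between these two values locates the transition point at which the gradient norm neither collapses nor blows up.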
Preprint
Full-text available
Understanding the statistical properties of deep neural networks (DNNs) at initialization is crucial for elucidating both their trainability and the intrinsic architectural biases they encode prior to data exposure. Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly initialized networks dictates whether gradients vanish or explode. Concurrently, untrained DNNs were found to exhibit an initial-guessing bias (IGB), in which large regions of the input space are assigned to a single class. In this work, we derive a theoretical proof establishing the correspondence between IGB and previous MF theories, thereby connecting a network prejudice toward specific classes with the conditions for fast and accurate learning. This connection yields the counter-intuitive conclusion: the initialization that optimizes trainability is necessarily biased, rather than neutral. Furthermore, we extend the MF/IGB framework to multi-node activation functions, offering practical guidelines for designing initialization schemes that ensure stable optimization in architectures employing max- and average-pooling layers.
... Consequently, the learned memory from one dataset may not be applicable or effective for others, rendering the transfer of such memory ineffective. Furthermore, inherent challenges in training RNNs, such as vanishing gradients [60] and exploding gradients [61], may exacerbate the difficulty of fine-tuning these networks. ...
Article
Full-text available
The adoption of deep learning in ECG diagnosis is often hindered by the scarcity of large, well-labeled datasets in real-world scenarios, leading to the use of transfer learning to leverage features learned from larger datasets. Yet the prevailing assumption that transfer learning consistently outperforms training from scratch has never been systematically validated. In this study, we conduct the first extensive empirical study on the effectiveness of transfer learning in multi-label ECG classification, comparing fine-tuning performance with that of training from scratch across a variety of ECG datasets and deep neural networks. Firstly, we confirm that fine-tuning is the preferable choice for small downstream datasets; however, it does not necessarily improve performance. Secondly, the improvement from fine-tuning declines as the downstream dataset grows. With a sufficiently large dataset, training from scratch can achieve comparable performance, albeit requiring a longer training time to catch up. Thirdly, fine-tuning can accelerate convergence, resulting in a faster training process and lower computing cost. Finally, we find that transfer learning exhibits better compatibility with convolutional neural networks than with recurrent neural networks, which are the two most prevalent architectures for time-series ECG applications. Our results underscore the importance of transfer learning in ECG diagnosis, yet depending on the amount of available data, researchers may opt not to use it, considering the non-negligible cost associated with pre-training.
... Although the ACT algorithm has made significant progress in solving compound errors, there is still room for improvement in memory capability and long-range dependency modeling, especially in tasks with long time spans. Currently, ACT relies on the traditional Transformer structure, which, although capable of handling time-series data, may encounter issues with long-range dependencies [9] when dealing with tasks spanning long durations. Specifically, the self-attention mechanism in Transformers, while effective at capturing short-term relationships, is not as effective for modeling dependencies across long time steps, which may lead to suboptimal performance in long-duration tasks. ...
Article
Full-text available
Robot manipulation technology has always been a core area of research, and imitation learning algorithms are crucial for improving task accuracy and efficiency. The ACT algorithm proposed by Zhao et al. performs excellently in fine-grained tasks; however, when handling complex tasks with long time spans, its memory capability and ability to capture long-range temporal dependencies need improvement. Additionally, the increased task complexity leads to a surge in computational resource demands. This paper proposes an improved algorithm based on the ACT algorithm, incorporating a long-short distance attention mechanism. The aim is to enhance memory efficiency for long-duration tasks and improve the ability to capture long-range temporal dependencies, optimize the understanding and utilization of long-term dependencies, improve the overall performance of complex tasks, and reduce the consumption of computational resources such as CPU and GPU.
... Neural Networks (RNNs) have been foundational to sequence modeling tasks in machine learning. However, standard RNNs suffer from the vanishing gradient problem, which limits their ability to capture long-range dependencies in sequential data (Bengio et al., 1994; Hochreiter, 1998). Long short-term memory (LSTM) networks, first introduced by Hochreiter and Schmidhuber (1997), were specifically designed to overcome this limitation through a gating mechanism that regulates information flow through the network. ...
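The gating mechanism mentioned above can be sketched as a single LSTM step. All names and sizes are illustrative, and a shared zero bias is used for brevity, so this is a simplification of the original formulation, not a faithful reproduction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One LSTM update: gates regulate what is written to and read from the cell c."""
    Wf, Wi, Wo, Wc, b = params
    z = np.concatenate([h, x])
    f = sigmoid(Wf @ z + b)              # forget gate: how much of c to keep
    i = sigmoid(Wi @ z + b)              # input gate: how much new content to write
    o = sigmoid(Wo @ z + b)              # output gate: how much of c to expose
    c_new = f * c + i * np.tanh(Wc @ z)  # additive cell update: gradients can
    h_new = o * np.tanh(c_new)           # flow through f without repeated squashing
    return h_new, c_new

rng = np.random.default_rng(2)
d, k = 4, 3                              # hidden and input sizes (illustrative)
params = tuple(rng.standard_normal((d, d + k)) * 0.1 for _ in range(4)) + (np.zeros(d),)
h = c = np.zeros(d)
for x in rng.standard_normal((6, k)):    # roll the cell over 6 input steps
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

The key design choice is the additive update of c: when the forget gate f stays near one, the cell state and its gradient persist across many steps instead of shrinking at every step.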
Conference Paper
Full-text available
This study aimed to evaluate the antifungal activity of Heracleum platytaenium essential oil against Pestalotiopsis theae, a significant phytopathogen responsible for leaf spot disease in tea plants. Due to the environmental and resistance issues caused by chemical fungicides, interest in plant-derived antifungal agents has increased. For this purpose, H. platytaenium essential oil was extracted in different years and tested at 5 μL, 10 μL, and 20 μL concentrations using the agar well diffusion method. The results indicated that the antifungal activity of the essential oil increased in a dose-dependent manner, with the highest inhibition observed at the 20 μL concentration. Moreover, some variation in antifungal activity was observed between essential oils extracted in different years, suggesting that the chemical composition of the plant might vary yearly, potentially affecting its antifungal efficacy. These findings suggest that H. platytaenium essential oil has potential as a natural antifungal agent against phytopathogenic fungi. However, further in vivo studies are required to confirm its effectiveness for agricultural applications.
... Comprehensive investigations have been carried out regarding the theoretical underpinnings, design, and practical applications of recurrent neural networks, as documented in the work by Haykin et al. [18], Kolen and Kremer [19], and Medsker and Jain [20]. However, traditional RNNs face challenges when processing longer sequences due to the problem known as "vanishing gradients," leading to difficulties in learning long-term dependencies in the data [21]. ...
Article
Full-text available
As digitization continues to expand nowadays, the accurate capture and comprehension of public sentiment on social media has become vital for various stakeholders such as government, businesses, and researchers. In this paper, we aim to perform sentiment analysis on Tweets and explore effective methods to classify sentiment into joy, sadness, anger, and fear from textual content. We utilized datasets from Kaggle containing textual tweets and employed models such as RNN, LSTM, Transformer, and SVM to compare their performance in sentiment analysis. The findings of this paper may serve as a valuable reference for the selection of models in sentiment analysis, particularly when working with medium-sized datasets. Additionally, they may offer guidance on selecting universally applicable model choices for conducting sentiment analysis across social media platforms.
Article
With the continuous expansion of low-Earth-orbit (LEO) satellite networks, the services within these networks have exhibited diverse and differentiated demand characteristics. Due to the limited onboard resources, efficient network resource allocation is required to ensure high-quality network performance. However, the dynamic topology and differentiated resource requirements for diversified services pose great challenges when existing resource awareness or prediction methods are applied to satellite networks, resulting in poor awareness latency and the inaccurate prediction of resource status. To solve these problems, a network resource allocation method based on awareness–prediction joint compensation is proposed. The method utilizes the node awareness latency as a prediction step and employs a long short-term memory model for resource status prediction. A dynamic compensation model is also proposed to compensate for the prediction results, which is achieved by adjusting compensation weights according to the awareness latencies and prediction accuracies. Furthermore, an efficient, accelerated alternating-direction method of multipliers (ADMM) resource allocation algorithm is proposed with the aim of maximizing the satisfaction of service resources requirements. The simulation results indicate that the relative error between the compensation data and onboard resource status does not exceed 5%, and the resource allocation method can improve the service resource coverage by 15.8%, thus improving the evaluation and allocation capabilities of network resources.
Conference Paper
Full-text available
A simple method for training the dynamical behavior of a neural network is derived. It is applicable to any training problem in discrete-time networks with arbitrary feedback. The algorithm resembles back-propagation in that an error function is minimized using a gradient-based method, but the optimization is carried out in the hidden part of state space either instead of, or in addition to weight space. A straightforward adaptation of this method to feedforward networks offers an alternative to training by conventional back-propagation. Computational results are presented for some simple dynamical training problems, one of which requires response to a signal 100 time steps in the past.
Conference Paper
Full-text available
A learning algorithm is presented that uses internal representations, which are continuous random variables, for the training of multilayer networks whose neurons have Heaviside characteristics. This algorithm is an improvement in that it is applicable to networks with any number of layers of variable weights and does not require `bit flipping' on the internal representations to reduce output error. The algorithm is extended to apply to recurrent networks. Some illustrative results are given.
Conference Paper
Full-text available
The authors seek to train recurrent neural networks in order to map input sequences to output sequences, for applications in sequence recognition or production. Results are presented showing that learning long-term dependencies in such recurrent networks using gradient descent is a very difficult task. It is shown how this difficulty arises when robustly latching bits of information with certain attractors. The derivatives of the output at time t with respect to the unit activations at time zero tend rapidly to zero as t increases for most input values. In such a situation, simple gradient descent techniques appear inappropriate. The consideration of alternative optimization methods and architectures is suggested.
Chapter
Threshold functions and related operators are widely used as basic elements of adaptive and associative networks [Nakano 72, Amari 72, Hopfield 82]. There exist numerous learning rules for finding a set of weights to achieve a particular correspondence between input-output pairs. But early works in the field have shown that the number of threshold functions (or linearly separable functions) in N binary variables is small compared to the number of all possible boolean mappings in N variables, especially if N is large. This problem is one of the main limitations of most neural networks models where the state is fully specified by the environment during learning: they can only learn linearly separable functions of their inputs. Moreover, a learning procedure which requires the outside world to specify the state of every neuron during the learning session can hardly be considered as a general learning rule because in real-world conditions, only a partial information on the “ideal” network state for each task is available from the environment. It is possible to use a set of so-called “hidden units” [Hinton,Sejnowski,Ackley. 84], without direct interaction with the environment, which can compute intermediate predicates. Unfortunately, the global response depends on the output of a particular hidden unit in a highly non-linear way, moreover the nature of this dependence is influenced by the states of the other cells.
Article
Time is at the heart of many pattern recognition tasks, e.g., speech recognition. However, connectionist learning algorithms to date are not well suited for dealing with time-varying input patterns. This paper introduces a specialized connectionist architecture and corresponding specialization of the backpropagation learning algorithm that operates efficiently on temporal sequences. The key feature of the architecture is a layer of self-connected hidden units that integrate their current value with the new input at each time step to construct a static representation of the temporal input sequence. This architecture avoids two deficiencies found in other models of sequence recognition: first, it reduces the difficulty of temporal credit assignment by focusing the backpropagated error signal; second, it eliminates the need for a buffer to hold the input sequence and/or intermediate activity levels. The latter property is due to the fact that during the forward (activation) phase, incremental activity traces can be locally computed that hold all information necessary for backpropagation in time. It is argued that this architecture should scale better than conventional recurrent architectures with respect to sequence length. The architecture has been used to implement a temporal version of Rumelhart and McClelland’s verb past-tense model [D. E. Rumelhart and J. L. McClelland, On learning the past tenses of English verbs, in Parallel distributed processing: Explorations in the microstructure of cognition. Vol. II, MIT Press/Bradford Books, Cambridge, 216-271 (1986)]. The hidden units learn to behave something like Rumelhart and McClelland’s “Wickelphones”, a rich and flexible representation of temporal information.
Chapter
This paper presents a generalization of the perceptron learning procedure for learning the correct sets of connections for arbitrary networks. The rule, called the generalized delta rule, is a simple scheme for implementing a gradient descent method for finding weights that minimize the sum squared error of the system's performance. The major theoretical contribution of the work is the procedure called error propagation, whereby the gradient can be determined by individual units of the network based only on locally available information. The major empirical contribution of the work is to show that the problem of local minima is not serious in this application of gradient descent. Keywords: Learning; networks; Perceptrons; Adaptive systems; Learning machines; and Back propagation