Conference Paper

Pseudorehearsal in actor-critic agents with neural network function approximation

Abstract

Catastrophic forgetting has a significant negative impact in reinforcement learning. The purpose of this study is to investigate how pseudorehearsal can change the performance of an actor-critic agent with neural-network function approximation. We tested the agent in a pole-balancing task and compared different pseudorehearsal approaches. We found that pseudorehearsal can assist learning and decrease forgetting.
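As a rough illustration of the mechanism studied in the abstract, the sketch below shows one way pseudorehearsal can be combined with a one-step actor-critic update: random pseudo-states are labelled with the networks' current outputs, and a rehearsal term keeps those outputs stable while new transitions are learned. The network sizes, learning rate and `rehearsal_weight` are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (PyTorch) of pseudorehearsal in an actor-critic agent.
# All hyperparameters and network shapes are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, n_actions, n_pseudo = 4, 2, 32    # cart-pole-like sizes (assumed)

actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(list(actor.parameters()) + list(critic.parameters()), lr=1e-2)

# Pseudo-item buffer: random inputs labelled with the networks' current outputs.
# Rehearsing these pairs later pulls the networks back towards the mapping they
# had when the buffer was created.
pseudo_states = torch.rand(n_pseudo, state_dim) * 2 - 1
with torch.no_grad():
    pseudo_logits = actor(pseudo_states)
    pseudo_values = critic(pseudo_states)

def update(state, action, reward, next_state, gamma=0.99, rehearsal_weight=1.0):
    # Standard one-step actor-critic losses on the new transition.
    td_target = reward + gamma * critic(next_state).detach()
    td_error = td_target - critic(state)
    critic_loss = td_error.pow(2).mean()
    log_prob = torch.log_softmax(actor(state), dim=-1)[0, action]
    actor_loss = -(td_error.detach().squeeze() * log_prob)

    # Pseudorehearsal terms: keep outputs on the pseudo-items close to the stored ones.
    rehearsal_loss = (nn.functional.mse_loss(actor(pseudo_states), pseudo_logits)
                      + nn.functional.mse_loss(critic(pseudo_states), pseudo_values))

    opt.zero_grad()
    (actor_loss + critic_loss + rehearsal_weight * rehearsal_loss).backward()
    opt.step()

# Example call with dummy tensors of shape (1, state_dim):
update(torch.rand(1, state_dim), action=0, reward=1.0, next_state=torch.rand(1, state_dim))
```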


... Neural networks are particularly prone to the dilemma, most often demonstrating low stability with high plasticity, leading to Catastrophic Forgetting (CF) [34] of old information when re-trained with newly-acquired data. Replay of previously-learned examples has been posed as a solution for minimising the detrimental effect of CF, with notable recent successes using generative [35] and pseudorehearsal-based [36] approaches. Where feasible, replay with a subset of the actual original examples is ideal, in a process known simply as rehearsal. ...
Article
Full-text available
Error motion trajectory data are routinely collected on multi-axis machine tools to assess their operational state. There is a wealth of literature devoted to advances in modelling, identification and correction using such data, as well as the collection and processing of alternative data streams for the purpose of machine tool condition monitoring. Until recently, there has been minimal focus on combining these two related fields. This paper presents a general approach to identifying both kinematic and non-kinematic faults in error motion trajectory data, by framing the issue as a generic pattern recognition problem. Because of the typically sparse nature of datasets in this domain – due to their infrequent, offline collection procedures – the foundation of the approach involves training on a purely simulated dataset, which defines the theoretical fault-states observable in the trajectories. Ensemble methods are investigated and shown to improve the generalisation ability when predicting on experimental data. Machine tools often have unique ‘signatures’ which can significantly affect their error motion trajectories, which are largely repeatable, but specific to the individual machine. As such, experimentally obtained data will not necessarily be easily defined in a theoretical simulation. A transfer learning approach is introduced to incorporate experimentally obtained error motion trajectories into classifiers which were trained primarily on a simulation domain. The approach was shown to significantly improve experimental test set performance, whilst also maintaining all theoretical information learned in the initial, simulation-only training phase. The ultimate approach represents a viable and powerful automated classifier for error motion trajectory data, which can encode theoretical fault-states with efficacy whilst also remaining adaptable to machine-specific signatures.
Article
Neural networks can achieve excellent results in a wide variety of applications. However, when they attempt to sequentially learn, they tend to learn the new task while catastrophically forgetting previous ones. We propose a model that overcomes catastrophic forgetting in sequential reinforcement learning by combining ideas from continual learning in both the image classification domain and the reinforcement learning domain. This model features a dual memory system which separates continual learning from reinforcement learning and a pseudo-rehearsal system that “recalls” items representative of previous tasks via a deep generative network. Our model sequentially learns Atari 2600 games without demonstrating catastrophic forgetting and continues to perform above human level on all three games. This result is achieved without: demanding additional storage requirements as the number of tasks increases, storing raw data or revisiting past tasks. In comparison, previous state-of-the-art solutions are substantially more vulnerable to forgetting on these complex deep reinforcement learning tasks.
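A minimal sketch of the generative pseudo-rehearsal step described above, assuming a `generator` trained on earlier tasks, a frozen copy `old_net` of the network taken before the new task begins, and simple mean-squared distillation losses; all names, shapes and the mixing weight are illustrative, not the paper's actual architecture.

```python
# Sketch of generative pseudo-rehearsal: a generator "recalls" inputs from
# earlier tasks, a frozen copy of the old network labels them, and the current
# network is trained on a mix of new data and recalled pairs.
import copy
import torch
import torch.nn as nn

def generative_replay_step(net, old_net, generator, new_x, new_y, opt,
                           latent_dim=8, n_recall=64, recall_weight=1.0):
    with torch.no_grad():
        recalled_x = generator(torch.randn(n_recall, latent_dim))
        recalled_y = old_net(recalled_x)          # soft targets from the old network
    loss_new = nn.functional.mse_loss(net(new_x), new_y)
    loss_old = nn.functional.mse_loss(net(recalled_x), recalled_y)
    opt.zero_grad()
    (loss_new + recall_weight * loss_old).backward()
    opt.step()

# Before starting a new task, one would typically freeze a copy of the network:
# old_net = copy.deepcopy(net); old_net.eval()
```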
Conference Paper
Full-text available
Catastrophic forgetting is of special importance in reinforcement learning, as the data distribution is generally non-stationary over time. We study and compare several pseudorehearsal approaches for Q-learning with function approximation in a pole balancing task. We have found that pseudorehearsal seems to assist learning even in such very simple problems, given proper initialization of the rehearsal parameters.
Article
Full-text available
Significance: Deep neural networks are currently the most successful machine-learning technique for solving a variety of tasks, including language translation, image classification, and image generation. One weakness of such models is that, unlike humans, they are unable to learn multiple tasks sequentially. In this work we propose a practical solution to train such models sequentially by protecting the weights important for previous tasks. This approach, inspired by synaptic consolidation in neuroscience, enables state-of-the-art results on multiple reinforcement learning problems experienced sequentially.
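A minimal sketch of the weight-protection idea described in this abstract (an elastic-weight-consolidation-style quadratic penalty), assuming per-parameter importance estimates `fisher` and a snapshot `old_params` taken after the previous task; the penalty strength `lam` is an illustrative value.

```python
import torch

def consolidation_penalty(model, old_params, fisher, lam=100.0):
    # Quadratic penalty that pulls each weight towards its value after the
    # previous task, scaled by that weight's estimated importance.
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]).pow(2)).sum()
    return 0.5 * lam * penalty

# Usage: total_loss = task_loss + consolidation_penalty(model, old_params, fisher)
```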
Conference Paper
Full-text available
We propose a novel biologically plausible actor-critic algorithm using policy gradients in order to achieve practical, model-free reinforcement learning. It does not rely on backpropagation and is the first neural actor-critic relying only on locally available information. We show it has an advantage over pure policy gradient methods for motor learning performance in the polecart problem. We are also able to closely simulate the dopaminergic signaling patterns in rats when confronted with a two-cue problem, showing that local, connectionist models can effectively model the functioning of the intrinsic reward system.
Article
Full-text available
A major problem with connectionist networks is that newly-learned information may completely destroy previously-learned information unless the network is continually retrained on the old information. This phenomenon, known as catastrophic forgetting, is unacceptable both for practical purposes and as a model of mind. This paper advances the claim that catastrophic forgetting is in part the result of the overlap of the system's distributed representations and can be reduced by reducing this overlap. A simple algorithm, called activation sharpening, is presented that allows a standard feed-forward backpropagation network to develop semi-distributed representations, thereby reducing the problem of catastrophic forgetting. Activation sharpening is discussed in light of recent work done by other researchers who have experimented with this and other techniques for reducing catastrophic forgetting.
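A rough sketch of activation sharpening as described above, under the assumption that sharpening means nudging the k most active hidden units towards 1 and damping the rest towards 0 by a factor alpha, with the sharpened vector then used as an extra training target for the hidden layer; k and alpha are illustrative.

```python
import numpy as np

def sharpen(hidden, k=1, alpha=0.1):
    """Return a sharpened copy of one hidden-activation vector."""
    sharpened = hidden * (1.0 - alpha)           # damp every unit towards 0
    top = np.argsort(hidden)[-k:]                # indices of the k most active units
    sharpened[top] = hidden[top] + alpha * (1.0 - hidden[top])  # push winners towards 1
    return sharpened

h = np.array([0.2, 0.9, 0.4, 0.6])
print(sharpen(h))   # the most active unit moves up, the others move down
```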
Article
Full-text available
The discovery of facts and practices concerning reinforcement in the past 25 years "have increased our power to predict and control behavior and in so doing have left no doubt of their reality and importance." In the acquisition of a bowling response in pigeons 3 points are relevant: (a) The temporal relationships between behavior and reinforcement are very important. (b) Behavior was set up through successive approximations. (c) Behavior gradually "shapes up" by "reinforcing crude approximations of the final topography instead of waiting for the complete response." The maintenance of behavior through various schedules of reinforcement is discussed. "The world in which man lives may be regarded as an extraordinarily complex set of positive and negative reinforcing contingencies… . In any social situation we must discover who is reinforcing whom with what and to what effect." The modern study of reinforcement is: (a) difficult and relatively expensive; (b) usually single-organism research, in which a statistical program is "unnecessary" and "wrong"; (c) not theoretical. "The new principles and methods of analysis which are emerging from the study of reinforcement may prove to be among the most productive social instruments of the twentieth century." (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
This paper compares the performance of policy gradient techniques with traditional value function approximation methods for reinforcement learning in a difficult problem domain. We introduce the Spacewar task, a continuous, stochastic, partially-observable, competitive multi-agent environment. We demonstrate that a neural-network based implementation of an online policy gradient algorithm (OLGARB (Weaver & Tao, 2001)) is able to perform well in this task and is competitive with the more well-established value function approximation algorithms (Sarsa(λ) and Q-learning (Sutton & Barto, 1998)).
Conference Paper
Full-text available
We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
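A minimal sketch of the incremental actor-critic scheme this abstract refers to, using linear function approximation and a vanilla (rather than natural) gradient for brevity: the critic is updated by temporal-difference learning and the actor by a stochastic policy-gradient step scaled by the TD error. Feature sizes, step sizes and the softmax policy are illustrative assumptions.

```python
import numpy as np

n_features, n_actions = 6, 2
w = np.zeros(n_features)                    # critic weights,  v(s) = w . phi(s)
theta = np.zeros((n_actions, n_features))   # actor weights, softmax policy

def policy(phi):
    prefs = theta @ phi
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def update(phi, a, reward, phi_next, gamma=0.99, alpha_w=0.1, alpha_theta=0.01):
    global w, theta
    td_error = reward + gamma * w @ phi_next - w @ phi
    w += alpha_w * td_error * phi                    # TD(0) critic step
    grad_log_pi = -np.outer(policy(phi), phi)        # d log pi(a|s) / d theta
    grad_log_pi[a] += phi
    theta += alpha_theta * td_error * grad_log_pi    # policy-gradient actor step

# Example: update(phi, a, reward, phi_next) with phi vectors of length n_features.
```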
Conference Paper
Full-text available
We address the problem of computing the optimal Q-function in Markov decision problems with infinite state-space. We analyze the convergence properties of several variations of Q-learning when combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion on the general applicability of our results and compare them with several related works.
Article
Full-text available
Algorithms for learning the optimal policy of a Markov decision process (MDP) based on simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptive critic") algorithm in the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis involves two time scale stochastic approximations.
Article
Full-text available
Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable.
Article
Catastrophic forgetting is a major problem for sequential learning in neural networks. One very general solution to this problem, known as ‘pseudorehearsal’, works well in practice for nonlinear networks but has not been analysed before. This paper formalizes pseudorehearsal in linear networks. We show that the method can fail in low dimensions but is guaranteed to succeed in high dimensions under fairly general conditions. In this case an optimal version of the method is equivalent to a simple modification of the ‘delta rule’.
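A small sketch of pseudorehearsal in a linear network trained with the delta rule, in the spirit of the analysis above: pseudoitems are random inputs labelled with the network's current outputs and are interleaved with the new item during retraining. Dimensions, the learning rate and the number of pseudoitems are illustrative; this is not the optimal variant derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                    # linear network, y = W x

# Pseudoitems capture the current input-output mapping.
pseudo_x = rng.normal(size=(10, 5))
pseudo_y = pseudo_x @ W.T

def delta_rule(W, x, target, lr=0.05):
    return W + lr * np.outer(target - W @ x, x)

new_x, new_y = rng.normal(size=5), rng.normal(size=3)
for _ in range(100):
    W = delta_rule(W, new_x, new_y)            # learn the new item
    for x, y in zip(pseudo_x, pseudo_y):       # rehearse the pseudoitems
        W = delta_rule(W, x, y)
```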
Article
Introduction: Debate continues over the precise causal contribution made by mesolimbic dopamine systems to reward. There are three competing explanatory categories: 'liking', learning, and 'wanting'. Does dopamine mostly mediate the hedonic impact of reward ('liking')? Does it instead mediate learned predictions of future reward, prediction error teaching signals and stamp in associative links (learning)? Or does dopamine motivate the pursuit of rewards by attributing incentive salience to reward-related stimuli ('wanting')? Each hypothesis is evaluated here, and it is suggested that the incentive salience or 'wanting' hypothesis of dopamine function may be consistent with more evidence than either learning or 'liking'. In brief, recent evidence indicates that dopamine is neither necessary nor sufficient to mediate changes in hedonic 'liking' for sensory pleasures. Other recent evidence indicates that dopamine is not needed for new learning, and not sufficient to directly mediate learning by causing teaching or prediction signals. By contrast, growing evidence indicates that dopamine does contribute causally to incentive salience. Dopamine appears necessary for normal 'wanting', and dopamine activation can be sufficient to enhance cue-triggered incentive salience. Drugs of abuse that promote dopamine signals short circuit and sensitize dynamic mesolimbic mechanisms that evolved to attribute incentive salience to rewards. Such drugs interact with incentive salience integrations of Pavlovian associative information with physiological state signals. That interaction sets the stage to cause compulsive 'wanting' in addiction, but also provides opportunities for experiments to disentangle 'wanting', 'liking', and learning hypotheses. Results from studies that exploited those opportunities are described here. Conclusion: In short, dopamine's contribution appears to be chiefly to cause 'wanting' for hedonic rewards, more than 'liking' or learning for those rewards.
Article
Neural networks encounter serious catastrophic forgetting when information is learned sequentially, which is unacceptable for both a model of human memory and practical engineering applications. In this study, we propose a novel biologically inspired dual-network memory model that can significantly reduce catastrophic forgetting. The proposed model consists of two distinct neural networks: hippocampal and neocortical networks. Information is first stored in the hippocampal network, and thereafter, it is transferred to the neocortical network. In the hippocampal network, chaotic behavior of neurons in the CA3 region of the hippocampus and neuronal turnover in the dentate gyrus region are introduced. Chaotic recall by CA3 enables retrieval of stored information in the hippocampal network. Thereafter, information retrieved from the hippocampal network is interleaved with previously stored information and consolidated by using pseudopatterns in the neocortical network. The computer simulation results show the effectiveness of the proposed dual-network memory model.
Article
The principle of estimation and control was introduced and studied independently by Kurano and Mandl under the average return criterion for models in which some of the data depend on an unknown parameter. Kurano and Mandl considered Markov decision models with finite state space and bounded rewards. Conditions are established for the existence of an optimal policy based on a consistent estimator for the unknown parameter which is optimal uniformly in the parameter. These results were extended by Kolonko to semi-Markov models with denumerable state space and unbounded rewards. The present paper considers the same principle of estimation and control for the discounted return criterion. The underlying semi-Markov decision model may have a denumerable state space and unbounded rewards. Conditions are established for the existence of a policy which is asymptotically discount optimal uniformly in the unknown parameter. The essential conditions are continuity and compactness conditions and a multiplicative form of the Foster criterion for positive recurrence of Markov chains formulated here for Markov decision models. An application to the control of an M|G|1-queue is discussed.
Article
Reviews the article "Animal Intelligence: An Experimental Study of the Associative Processes in Animals" by E. L. Thorndike. In this monograph are presented the results of some experiments which the author has been carrying on during two years, and some theories which these results seem to support. The subjects of the experiments were dogs, cats and chicks, and the method was to put them, when hungry, in boxes from which they could escape and so get food by manipulating some simple mechanism (e.g., by pulling down a loop of wire, depressing a lever, turning a button). The author reports on the behavior of the animals. The author's conception of mental evolution is briefly explained, and applications of his results to education, anthropology and theoretical psychology are made. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
In this review we explore the topic of sequential learning, where information to be learned and retained arrives in separate episodes over time, in the context of artificial neural networks. Most neural networks handle this kind of task very badly, as new learning completely disrupts information previously learned by the network. This problem, known as "catastrophic forgetting", has received a lot of attention in the literature. We illustrate the catastrophic forgetting effect, and summarise possible solutions. In particular, we review the literature relating to the pseudorehearsal mechanism, which is an effective solution to the catastrophic forgetting problem in back propagation type networks. We then review similar issues of capacity, forgetting, and the use of pseudorehearsal in Hopfield type networks. Finally, we briefly discuss these issues in the context of cognition, and summarise interesting topics for further research.
Article
Research in cognitive psychology has made a significant contribution to our understanding of how acute and chronic stress affect performance. It has done so by identifying some of the factors that contribute to operator error and by suggesting how operators might be trained to respond more effectively in a variety of circumstances. The major purpose of this paper was to review the literature of cognitive psychology as it relates to these questions and issues. Based on the existence of earlier reviews (e.g., Hamilton, & Warburton, 1979; Hockey, 1983) the following investigation was limited to the last 15 years (1988-2002) and restricted to a review of the primary peer-reviewed literature. The results of this examination revealed that while cognitive psychology has contributed in a substantive way to our understanding of stress impact on various cognitive processes, it has also left many questions unanswered. Concerns about how we define and use the term stress and the gaps that remain in our knowledge about the specific effects of stressors on cognitive processes are discussed in the text.
Article
Many interesting problems in reinforcement learning (RL) are continuous and/or high dimensional, and in this instance, RL techniques require the use of function approximators for learning value functions and policies. Often, local linear models have been preferred over distributed nonlinear models for function approximation in RL. We suggest that one reason for the difficulties encountered when using distributed architectures in RL is the problem of negative interference, whereby learning of new data disrupts previously learned mappings. The continuous temporal difference (TD) learning algorithm TD(λ) was used to learn a value function in a limited-torque pendulum swing-up task using a multilayer perceptron (MLP) network. Three different approaches were examined for learning in the MLP networks; 1) simple gradient descent; 2) vario-eta; and 3) a pseudopattern rehearsal strategy that attempts to reduce the effects of interference. Our results show that MLP networks can be used for value function approximation in this task but require long training times. We also found that vario-eta destabilized learning and resulted in a failure of the learning process to converge. Finally, we showed that the pseudopattern rehearsal strategy drastically improved the speed of learning. The results indicate that interference is a greater problem than ill conditioning for this task.
Article
We suggest that any brain-like (artificial neural network based) learning system will need a sleep-like mechanism for consolidating newly learned information if it wishes to cope with the sequential/ongoing learning of significantly new information. We summarise and explore two possible candidates for a computational account of this consolidation process in Hopfield type networks. The "pseudorehearsal" method is based on the relearning of randomly selected attractors in the network as the new information is added from some second system. This process is supposed to reinforce old information within the network and protect it from the disruption caused by learning new inputs. The "unlearning" method is based on the unlearning of randomly selected attractors in the network after new information has already been learned. This process is supposed to locate and remove the unwanted associations between information that obscure the learned inputs. We suggest that as a computational model of sleep consolidation, the pseudorehearsal approach is better supported by the psychological, evolutionary, and neurophysiological data (in particular accounting for the role of the hippocampus in consolidation).
Article
Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911) The idea of learning to make appropriate responses based on reinforcing events has its roots in early psychological theories such as Thorndike's "law of effect" (quoted above). Although several important contributions were made in the 1950s, 1960s and 1970s by illustrious luminaries such as Bellman, Minsky, Klopf and others (Farley and Clark, 1954; Bellman, 1957; Minsky, 1961; Samuel, 1963; Michie and Chambers, 1968; Grossberg, 1975; Klopf, 1982), the last two decades have witnessed perhaps the strongest advances in the mathematical foundations of reinforcement learning, in addition to several impressive demonstrations of the performance of reinforcement learning algorithms in real world tasks. The introductory book by Sutton and Barto, two of the most influential and recognized leaders in the field, is therefore both timely and welcome. The book is divided into three parts. In the first part, the authors introduce and elaborate on the essential characteristics of the reinforcement learning problem, namely, the problem of learning "policies" or mappings from environmental states to actions so as to maximize the amount of "reward"
Conference Paper
The author describes how genetic algorithms (GAs) were used to create recurrent neural networks to control a series of unstable systems. The systems considered are variations of the pole balancing problem: network controllers with two, one, and zero inputs, variable length pole, multiple poles on one cart, and a jointed pole. GAs were able to quickly evolve networks for the one- and two-input pole balancing problems. Networks with zero inputs were only able to balance poles for a few seconds of simulated time due to the network's inability to maintain accurate estimates of their position and pole angle. Also, work in progress on a two-legged walker is briefly described.
J. Vallverdú, M. Talanov, S. Distefano, M. Mazzara, A. Tchitchigin, and I. Nurgaliev, "A cognitive architecture for the implementation of emotions in computing systems," Biologically Inspired Cognitive Architectures, vol. 15, no. Supplement C, pp. 34-40, 2016.
R. Gmehlich, K. Grau, A. Iliasov, M. Jackson, F. Loesch, and M. Mazzara, "Towards a formalism-based toolkit for automotive applications," 1st FME Workshop on Formal Methods in Software Engineering (FormaliSE), 2013.