Preprint

Prefrontal Cortex as a Meta-Reinforcement Learning System


Abstract

Over the past twenty years, neuroscience research on reward-based learning has converged on a canonical model, under which the neurotransmitter dopamine ‘stamps in’ associations between situations, actions and rewards by modulating the strength of synaptic connections between neurons. However, a growing number of recent findings have placed this standard model under strain. In the present work, we draw on recent advances in artificial intelligence to introduce a new theory of reward-based learning. According to this new theory, the dopamine system trains another part of the brain, the prefrontal cortex, to operate as its own free-standing learning system. This new perspective accommodates the findings that motivated the standard model, but also deals gracefully with a wider range of observations, providing a fresh foundation for future research.
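The meta-RL setup summarized in the abstract can be made concrete with a small toy model. Below is a minimal sketch (not the authors' implementation) of an LSTM agent whose weights are slowly adjusted by a scalar reward signal across many bandit episodes; after training, fast within-episode learning is carried by the recurrent activity alone. The class names, hyperparameters, and REINFORCE-style outer loop are illustrative assumptions.

```python
# Minimal meta-RL sketch: slow outer-loop RL trains an LSTM that comes to
# implement fast, in-activity learning across two-armed bandit tasks.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class MetaRLAgent(nn.Module):
    def __init__(self, n_actions=2, hidden=48):
        super().__init__()
        # Input at each step: one-hot previous action + previous reward.
        self.core = nn.LSTMCell(n_actions + 1, hidden)
        self.policy = nn.Linear(hidden, n_actions)

    def init_state(self):
        z = torch.zeros(1, self.core.hidden_size)
        return (z, z.clone())

    def forward(self, x, state):
        h, c = self.core(x, state)
        return self.policy(h), (h, c)

def run_episode(agent, p_reward, steps=100):
    """One bandit episode with fixed but hidden reward probabilities."""
    n_actions = len(p_reward)
    state = agent.init_state()
    x = torch.zeros(1, n_actions + 1)          # no previous action/reward yet
    log_probs, rewards = [], []
    for _ in range(steps):
        logits, state = agent(x, state)
        dist = Categorical(logits=logits)
        a = dist.sample()
        r = float(torch.rand(1).item() < p_reward[a.item()].item())
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
        x = torch.zeros(1, n_actions + 1)      # feed back last action and reward
        x[0, a.item()] = 1.0
        x[0, -1] = r
    return torch.cat(log_probs), torch.tensor(rewards)

agent = MetaRLAgent()
opt = torch.optim.Adam(agent.parameters(), lr=1e-3)
for episode in range(2000):                    # slow, "dopamine-like" outer loop
    p = torch.rand(2)                          # a new task each episode
    log_probs, rewards = run_episode(agent, p)
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    loss = -(log_probs * (returns - returns.mean())).sum()  # REINFORCE
    opt.zero_grad()
    loss.backward()
    opt.step()
```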


... The DRL framework is constantly being improved by neuromorphic approaches [8] that take inspiration from the prefrontal cortex (PFC), modeled using Long Short-Term Memory (LSTM) [15]. Although the critic network in meta-Reinforcement Learning (RL) is an LSTM, here the critic network is a bidirectional array-based LSTM [16], which plays the discriminator role in an FGAN [17]. ...
Preprint
Full-text available
One of the key challenges in classifying multiple cancer types is the complexity of Tumor Protein p53 (TP53) mutation patterns and their individual effects on tumors. However, far too little attention has been paid to deep reinforcement learning on TP53 mutation patterns because its results are extremely difficult to interpret. We introduce a critic network based on a long short-term memory, which is suited to discriminating the noise samples from a Feedback Generative Adversarial Network and to analyzing the actor network. The correlation analysis of the results in a belief network demonstrates significant relations between mutations and disease risk in cancer subtype identification. In other words, the results indicate statistically significant differences between the primary and secondary subtype groups of the most probable tumor.
... Our proposal for basal ganglia control of information routing in prefrontal cortical networks may also have implications for models of "learning to learn" in the joint prefrontal cortex-basal ganglia system, e.g., for models in which plasticity-mediated reinforcement learning in the basal ganglia gives rise to an activity-mediated learning system in the prefrontal cortex (Wang et al., 2018). In our model, the PFC would not only serve as an input to basal ganglia action selection systems, but would also be subject to latch and relay control by the basal ganglia outputs; this might introduce additional possibilities for how an activity-mediated learning system in the PFC could be trained. ...
Preprint
The thalamus appears to be involved in the flexible routing of information among cortical areas, yet the computational implications of such routing are only beginning to be explored. Here we create a connectionist model of how selectively gated cortico-thalamo-cortical relays could underpin both symbolic and sub-symbolic computations. We first show how gateable relays can be used to create a Dynamically Partitionable Auto-Associative Network (DPAAN) (Hayworth, 2012) consisting of a set of cross-connected cortical memory buffers. All buffers and relays in a DPAAN are trained simultaneously to have a common set of stable attractor states that become the symbol vocabulary of the DPAAN. We show via simulations that such a DPAAN can support operations necessary for syntactic rule-based computation, namely buffer-to-buffer copying and equality detection. We then provide each DPAAN module with a multilayer input network trained to map sensory inputs to the DPAAN’s symbol vocabulary, and demonstrate how gateable thalamic relays can provide recall and clamping operations to train this input network by Contrastive Hebbian Learning (CHL) (Xie and Seung, 2003). We suggest that many such DPAAN modules may exist at the highest levels of the brain’s sensory hierarchies and show how a joint snapshot of the contents of multiple DPAAN modules can be stored as a declarative memory in a simple model of the hippocampus. We speculate that such an architecture might first have been ‘discovered’ by evolution as a means to bootstrap learning of more meaningful cortical representations feeding the striatum, eventually leading to a system that could support symbolic computation. Our model serves as a bridging hypothesis for linking controllable thalamo-cortical information routing with computations that could underlie aspects of both learning and symbolic reasoning in the brain.
... GPU acceleration of deep learning primitives has been a major contributor to this success [10], as their massively parallel operation enables rapid processing of layers of independent nodes. Since the biological plausibility of deep neural networks is often disputed [38], interest in integrating the algorithms of deep learning with long-studied ideas in neuroscience has been mounting [29], both as a means to increase machine learning performance and to better model learning and decision-making in biological brains [44]. ...
Preprint
Full-text available
The development of spiking neural network simulation software is a critical component enabling the modeling of neural systems and the development of biologically inspired algorithms. Existing software frameworks support a wide range of neural functionality, software abstraction levels, and hardware devices, yet are typically not suitable for rapid prototyping or application to problems in the domain of machine learning. In this paper, we describe a new Python package for the simulation of spiking neural networks, specifically geared towards machine learning and reinforcement learning. Our software, called BindsNET, enables rapid building and simulation of spiking networks and features user-friendly, concise syntax. BindsNET is built on top of the PyTorch deep neural networks library, enabling fast CPU and GPU computation for large spiking networks. The BindsNET framework can be adjusted to meet the needs of other existing computing and hardware environments, e.g., TensorFlow. We also provide an interface into the OpenAI gym library, allowing for training and evaluation of spiking networks on reinforcement learning problems. We argue that this package facilitates the use of spiking networks for large-scale machine learning experimentation, and show some simple examples of how we envision BindsNET can be used in practice. BindsNET code is available at https://github.com/Hananel-Hazan/bindsnet
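For readers unfamiliar with what such simulators integrate at each time step, the toy leaky integrate-and-fire (LIF) neuron below shows the basic spiking dynamics involved. It is plain NumPy and deliberately does not use BindsNET's actual API; see the linked repository for that.

```python
# Toy LIF neuron: leaky integration of an input current, spike on threshold,
# reset after the spike. Parameter values are illustrative.
import numpy as np

def simulate_lif(input_current, dt=1.0, tau=20.0, v_rest=-65.0,
                 v_reset=-65.0, v_thresh=-52.0):
    """Simulate one LIF neuron; returns the membrane trace and spike times."""
    v = v_rest
    voltages, spikes = [], []
    for t, I in enumerate(input_current):
        # Leaky integration: dv/dt = (v_rest - v + I) / tau
        v += dt * (v_rest - v + I) / tau
        if v >= v_thresh:
            spikes.append(t)
            v = v_reset                      # reset after a spike
        voltages.append(v)
    return np.array(voltages), spikes

rng = np.random.default_rng(0)
current = 14.0 + rng.normal(0.0, 2.0, size=500)   # noisy constant drive
v_trace, spike_times = simulate_lif(current)
print(f"{len(spike_times)} spikes in 500 ms")
```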
... However, recent advances in recurrent neural network (RNN) model algorithms have opened an entirely new avenue to study the putative neural mechanisms underlying various cognitive functions. Crucially, RNN models have successfully reproduced the patterns of neural activity and behavioral output that are observed in vivo, and have generated novel insights into neural circuit function that would otherwise be unattainable through direct experimental measurement [23][24][25][26][27][28][29] . ...
Preprint
Full-text available
Recently it has been proposed that information in short-term memory may not always be stored in persistent neuronal activity, but can be maintained in "activity-silent" hidden states such as synaptic efficacies endowed with short-term plasticity (STP). However, working memory involves manipulation as well as maintenance of information in the absence of external stimuli. In this work, we investigated working memory representation using recurrent neural network (RNN) models trained to perform several working memory dependent tasks. We found that STP can support the short-term maintenance of information provided that the memory delay period is sufficiently short. However, in tasks that require actively manipulating information, persistent neuronal activity naturally emerges from learning, and the amount of persistent neuronal activity scales with the degree of manipulation required. These results shed insight into the current debate on working memory encoding, and suggest that persistent neural activity can vary markedly between tasks used in different experiments.
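The "activity-silent" idea can be illustrated with a toy facilitation/depression model in the spirit of classic STP accounts (this is not one of the RNNs trained in the work above): after a brief encoding burst, the facilitation variable keeps synaptic efficacy elevated for on the order of a second even though firing has stopped. All parameters below are illustrative.

```python
# Toy short-term plasticity trace: a brief spike burst leaves an elevated
# synaptic efficacy (u * x) that decays slowly, carrying a "silent" memory.
import numpy as np

dt, T = 1.0, 2000                       # ms
tau_f, tau_d, U = 1500.0, 200.0, 0.2    # facilitation/depression time constants
u, x = U, 1.0
trace = []
for t in range(T):
    spike = 1.0 if 100 <= t < 150 else 0.0        # brief encoding burst
    u += dt * (U - u) / tau_f + U * (1 - u) * spike   # facilitation builds up
    x += dt * (1 - x) / tau_d - u * x * spike         # resources deplete/recover
    trace.append(u * x)                               # effective efficacy
print(f"efficacy 1 s after the burst: {trace[1150]:.3f} (baseline {U:.3f})")
```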
Article
Full-text available
Repetitive industrial tasks can be easily performed by traditional robotic systems. However, many other tasks require cognitive knowledge that only humans can provide. Human-Robot Collaboration (HRC) emerges as an ideal concept of co-working between a human operator and a robot, and represents one of the most significant subjects for improving human life. The ultimate goal is to achieve physical interaction, where handing over an object plays a crucial role in effective task accomplishment. Considerable research has been carried out in this particular field in recent years, and several solutions have already been proposed. Nonetheless, some particular issues regarding Human-Robot Collaboration still leave an open path to important research improvements. This paper provides a literature overview, defining the HRC concept, enumerating the distinct human-robot communication channels, and discussing the physical interaction that this collaboration entails. Moreover, future challenges for a natural and intuitive collaboration are exposed: the machine must behave like a human, especially in the pre-grasping/grasping phases, and the handover procedure should be fluent and bidirectional to allow an articulated development of the task. These are the focus of near-future investigation aiming to shed light on the complex combination of predictive and reactive control mechanisms promoting coordination and understanding. Following recent progress in artificial intelligence, learning through exploration stands as the key element enabling the generation of coordinated actions and their shaping by experience.
Preprint
Full-text available
In this work, we analyze the reinstatement mechanism introduced by Ritter et al. (2018) to reveal two classes of neurons that emerge in the agent's working memory (an epLSTM cell) when trained using episodic meta-RL on an episodic variant of the Harlow visual fixation task. Specifically, Abstract neurons encode knowledge shared across tasks, while Episodic neurons carry information relevant for a specific episode's task.
Article
Full-text available
Importance: The tools and insights of behavioral neuroscience grow apace, yet their clinical application is lagging. Observations: This article suggests that associative learning theory may be the algorithmic bridge to connect a burgeoning understanding of the brain with the challenges to the mind with which all clinicians and researchers are concerned. Conclusions and Relevance: Instead of giving up, talking past one another, or resting on the laurels of face validity, a consilient and collaborative approach is suggested: visiting laboratory meetings and clinical rounds and attempting to converse in the language of behavior and cognition to better understand and ultimately treat patients.
Conference Paper
Full-text available
On our path towards artificial general intelligence, video games have become excellent tools for research. Reinforcement learning (RL) algorithms are particularly successful in this domain, with the added benefit of having fairly well established biological foundations. To improve how artificial intelligence research and the cognitive sciences can inform each other, we argue the StarCraft II Learning Environment is an ideal candidate for an environment where humans and artificial agents can be tested on the same tasks. We present an upcoming study using this environment, where the goal is to investigate how RL can be extended to enable abstract human abilities such as moments of insight. We claim this is valuable for advancing our understanding of both artificial and natural intelligence, thereby leading to improved models of player behaviour and for general video game playing.
Article
Full-text available
Planning can be defined as action selection that leverages an internal model of the outcomes likely to follow each possible action. Its neural mechanisms remain poorly understood. Here we adapt recent advances from human research for rats, presenting for the first time an animal task that produces many trials of planned behavior per session, making multitrial rodent experimental tools available to study planning. We use part of this toolkit to address a perennially controversial issue in planning: the role of the dorsal hippocampus. Although prospective hippocampal representations have been proposed to support planning, intact planning in animals with damaged hippocampi has been repeatedly observed. Combining formal algorithmic behavioral analysis with muscimol inactivation, we provide causal evidence directly linking dorsal hippocampus with planning behavior. Our results and methods open the door to new and more detailed investigations of the neural mechanisms of planning in the hippocampus and throughout the brain.
Article
Full-text available
Neuronal reward valuations provide the physiological basis for economic behavior. Yet, how such valuations relate to economic decisions remains unclear. Here we show that the dorsolateral prefrontal cortex (DLPFC) implements a flexible value code based on object-specific valuations by single neurons. As monkeys perform a reward-based foraging task, individual DLPFC neurons signal the value of specific choice objects derived from recent experience. These neuronal object values satisfy principles of competitive choice mechanisms, track performance fluctuations, and follow predictions of a classical behavioral model (Herrnstein’s matching law). Individual neurons dynamically encode both the updating of object values from recently experienced rewards and their subsequent conversion to object choices during decision-making. Decoding from unselected populations enables a read-out of motivational and decision variables not emphasized by individual neurons. These findings suggest a dynamic single-neuron and population value code in DLPFC that advances from reward experiences to economic object values and future choices.
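A minimal sketch of the kind of object-value updating and choice rule described here: each option's value is an exponentially weighted average of its recent rewards, and choices follow a softmax over those values, which yields approximately matching behavior. The schedule and parameters below are invented for illustration, not taken from the study.

```python
# Toy object-value learning and softmax choice on a two-option schedule.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 0.3, 5.0
values = np.zeros(2)
p_reward = np.array([0.7, 0.3])           # foraging-style reward schedule
choices = []
for trial in range(500):
    p_choice = np.exp(beta * values) / np.exp(beta * values).sum()
    a = rng.choice(2, p=p_choice)
    r = float(rng.random() < p_reward[a])
    values[a] += alpha * (r - values[a])  # update only the chosen object's value
    choices.append(a)
print("fraction of choices to richer option:", np.mean(np.array(choices) == 0))
```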
Article
Full-text available
Primates display a remarkable ability to adapt to novel situations. Determining what is most pertinent in these situations is not always possible based only on the current sensory inputs, and often also depends on recent inputs and behavioral outputs that contribute to internal states. Thus, one can ask how cortical dynamics generate representations of these complex situations. It has been observed that mixed selectivity in cortical neurons contributes to represent diverse situations defined by a combination of the current stimuli, and that mixed selectivity is readily obtained in randomly connected recurrent networks. In this context, these reservoir networks reproduce the highly recurrent nature of local cortical connectivity. Recombining present and past inputs, random recurrent networks from the reservoir computing framework generate mixed selectivity which provides pre-coded representations of an essentially universal set of contexts. These representations can then be selectively amplified through learning to solve the task at hand. We thus explored their representational power and dynamical properties after training a reservoir to perform a complex cognitive task initially developed for monkeys. The reservoir model inherently displayed a dynamic form of mixed selectivity, key to the representation of the behavioral context over time. The pre-coded representation of context was amplified by training a feedback neuron to explicitly represent this context, thereby reproducing the effect of learning and allowing the model to perform more robustly. This second version of the model demonstrates how a hybrid dynamical regime combining spatio-temporal processing of reservoirs, and input driven attracting dynamics generated by the feedback neuron, can be used to solve a complex cognitive task. We compared reservoir activity to neural activity of dorsal anterior cingulate cortex of monkeys which revealed similar network dynamics. We argue that reservoir computing is a pertinent framework to model local cortical dynamics and their contribution to higher cognitive function.
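A minimal echo state network illustrates the reservoir-computing framework referred to above: a fixed random recurrent network expands the input history into a high-dimensional, mixed-selective state, and only a linear readout is trained. The toy memory task and parameters below are assumptions, not the article's monkey-task model.

```python
# Minimal echo state network: fixed random reservoir + ridge-regression readout.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 300, 2000
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.normal(0, 1, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius below 1

u = rng.uniform(-1, 1, (T, n_in))                 # input stream
target = np.roll(u[:, 0], 3)                      # toy task: recall input 3 steps back
x = np.zeros(n_res)
states = np.zeros((T, n_res))
for t in range(T):
    x = np.tanh(W @ x + W_in @ u[t])              # reservoir update (mixed selectivity)
    states[t] = x

lam = 1e-3                                        # ridge regularization
W_out = np.linalg.solve(states.T @ states + lam * np.eye(n_res), states.T @ target)
pred = states @ W_out
print("readout correlation:", np.corrcoef(pred[50:], target[50:])[0, 1])
```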
Article
Full-text available
Dopaminergic (DA) neurons in the midbrain provide rich topographic innervation of the striatum and are central to learning and to generating actions. Despite the importance of this DA innervation, it remains unclear whether and how DA neurons are specialized on the basis of the location of their striatal target. Thus, we sought to compare the function of subpopulations of DA neurons that target distinct striatal subregions in the context of an instrumental reversal learning task. We identified key differences in the encoding of reward and choice in dopamine terminals in dorsal versus ventral striatum: DA terminals in ventral striatum responded more strongly to reward consumption and reward-predicting cues, whereas DA terminals in dorsomedial striatum responded more strongly to contralateral choices. In both cases the terminals encoded a reward prediction error. Our results suggest that the DA modulation of the striatum is spatially organized to support the specialized function of the targeted subregion.
Article
Full-text available
The recently developed 'two-step' behavioural task promises to differentiate model-based from model-free reinforcement learning, while generating neurophysiologically-friendly decision datasets with parametric variation of decision variables. These desirable features have prompted its widespread adoption. Here, we analyse the interactions between a range of different strategies and the structure of transitions and outcomes in order to examine constraints on what can be learned from behavioural performance. The task involves a trade-off between the need for stochasticity, to allow strategies to be discriminated, and a need for determinism, so that it is worth subjects' investment of effort to exploit the contingencies optimally. We show through simulation that under certain conditions model-free strategies can masquerade as being model-based. We first show that seemingly innocuous modifications to the task structure can induce correlations between action values at the start of the trial and the subsequent trial events in such a way that analysis based on comparing successive trials can lead to erroneous conclusions. We confirm the power of a suggested correction to the analysis that can alleviate this problem. We then consider model-free reinforcement learning strategies that exploit correlations between where rewards are obtained and which actions have high expected value. These generate behaviour that appears model-based under these, and also more sophisticated, analyses. Exploiting the full potential of the two-step task as a tool for behavioural neuroscience requires an understanding of these issues.
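The signature analysis at issue can be reproduced in a few lines: simulate a purely model-free learner on a simplified two-step task and compute stay probabilities split by reward and transition type. The sketch below uses fixed second-step reward probabilities rather than the slowly drifting ones of the real task; all parameters are illustrative.

```python
# Two-step task with a purely model-free TD(1) learner and the standard
# stay-probability analysis (reward x transition type).
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.4, 4.0
p_common = 0.7
p_reward = np.array([0.8, 0.2])      # reward probability at second-step states
Q = np.zeros(2)                      # model-free values of first-step actions
stats = {}                           # (rewarded, common) -> (stays, trials)

prev = None
for trial in range(10000):
    p = np.exp(beta * Q) / np.exp(beta * Q).sum()
    a = rng.choice(2, p=p)
    common = rng.random() < p_common
    s2 = a if common else 1 - a      # action 0 usually leads to state 0, etc.
    r = float(rng.random() < p_reward[s2])
    Q[a] += alpha * (r - Q[a])       # TD(1): first-step value updated by final reward
    if prev is not None:
        key = (prev["r"], prev["common"])
        stays, n = stats.get(key, (0, 0))
        stats[key] = (stays + (a == prev["a"]), n + 1)
    prev = {"a": a, "r": r, "common": common}

for (rewarded, common), (stays, n) in sorted(stats.items()):
    print(f"rewarded={bool(rewarded)}, common={common}: stay prob {stays / n:.2f}")
```

A model-free learner of this kind repeats rewarded first-step choices regardless of transition type, whereas a model-based learner shows the reward-by-transition interaction; the abstract's point is that task modifications can blur this distinction.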
Article
Full-text available
Correlative studies have strongly linked phasic changes in dopamine activity with reward prediction error signaling. But causal evidence that these brief changes in firing actually serve as error signals to drive associative learning is more tenuous. Although there is direct evidence that brief increases can substitute for positive prediction errors, there is no comparable evidence that similarly brief pauses can substitute for negative prediction errors. In the absence of such evidence, the effect of increases in firing could reflect novelty or salience, variables also correlated with dopamine activity. Here we provide evidence in support of the proposed linkage, showing in a modified Pavlovian over-expectation task that brief pauses in the firing of dopamine neurons in rat ventral tegmental area at the time of reward are sufficient to mimic the effects of endogenous negative prediction errors. These results support the proposal that brief changes in the firing of dopamine neurons serve as full-fledged bidirectional prediction error signals.
Article
Full-text available
The prefrontal cortex (PFC) subserves reasoning in the service of adaptive behavior. Little is known, however, about the architecture of reasoning processes in the PFC. Using computational modeling and neuroimaging, we show here that the human PFC has two concurrent inferential tracks: (i) one from ventromedial to dorsomedial PFC regions that makes probabilistic inferences about the reliability of the ongoing behavioral strategy and arbitrates between adjusting this strategy versus exploring new ones from long-term memory, and (ii) another from polar to lateral PFC regions that makes probabilistic inferences about the reliability of two or three alternative strategies and arbitrates between exploring new strategies versus exploiting these alternative ones. The two tracks interact and, along with the striatum, realize hypothesis testing for accepting versus rejecting newly created strategies.
Article
Full-text available
Prefrontal cortex is thought to have a fundamental role in flexible, context-dependent behaviour, but the exact nature of the computations underlying this role remains largely unknown. In particular, individual prefrontal neurons often generate remarkably complex responses that defy deep understanding of their contribution to behaviour. Here we study prefrontal cortex activity in macaque monkeys trained to flexibly select and integrate noisy sensory inputs towards a choice. We find that the observed complexity and functional roles of single neurons are readily understood in the framework of a dynamical process unfolding at the level of the population. The population dynamics can be reproduced by a trained recurrent neural network, which suggests a previously unknown mechanism for selection and integration of task-relevant inputs. This mechanism indicates that selection and integration are two aspects of a single dynamical process unfolding within the same prefrontal circuits, and potentially provides a novel, general framework for understanding context-dependent computations.
Article
Full-text available
An enduring and richly elaborated dichotomy in cognitive neuroscience is that of reflective versus reflexive decision making and choice. Other literatures refer to the two ends of what is likely to be a spectrum with terms such as goal-directed versus habitual, model-based versus model-free or prospective versus retrospective. One of the most rigorous traditions of experimental work in the field started with studies in rodents and graduated via human versions and enrichments of those experiments to a current state in which new paradigms are probing and challenging the very heart of the distinction. We review four generations of work in this tradition and provide pointers to the forefront of the field's fifth generation.
Article
Full-text available
Situations in which rewards are unexpectedly obtained or withheld represent opportunities for new learning. Often, this learning includes identifying cues that predict reward availability. Unexpected rewards strongly activate midbrain dopamine neurons. This phasic signal is proposed to support learning about antecedent cues by signaling discrepancies between actual and expected outcomes, termed a reward prediction error. However, it is unknown whether dopamine neuron prediction error signaling and cue-reward learning are causally linked. To test this hypothesis, we manipulated dopamine neuron activity in rats in two behavioral procedures, associative blocking and extinction, that illustrate the essential function of prediction errors in learning. We observed that optogenetic activation of dopamine neurons concurrent with reward delivery, mimicking a prediction error, was sufficient to cause long-lasting increases in cue-elicited reward-seeking behavior. Our findings establish a causal role for temporally precise dopamine neuron signaling in cue-reward learning, bridging a critical gap between experimental evidence and influential theoretical frameworks.
Article
Full-text available
Converging evidence suggests that the medial prefrontal cortex (MPFC) is involved in feedback categorization, performance monitoring, and task monitoring, and may contribute to the online regulation of reinforcement learning (RL) parameters that would affect decision-making processes in the lateral prefrontal cortex (LPFC). Previous neurophysiological experiments have shown MPFC activities encoding error likelihood, uncertainty, and reward volatility, as well as neural responses categorizing different types of feedback, for instance, distinguishing between choice errors and execution errors. Rushworth and colleagues have proposed that the involvement of MPFC in tracking the volatility of the task could contribute to the regulation of one of the RL parameters, the learning rate. We extend this hypothesis by proposing that MPFC could contribute to the regulation of other RL parameters such as the exploration rate and default action values in case of task shifts. Here, we analyze the sensitivity of behavioral performance to RL parameters in two monkey decision-making tasks, one with a deterministic reward schedule and the other with a stochastic one. We show that there exist optimal parameter values specific to each of these tasks, which need to be found for optimal performance and which are usually hand-tuned in computational models. In contrast, automatic online regulation of these parameters using simple heuristics can produce good, although non-optimal, behavioral performance in each task. We finally describe our computational model of MPFC-LPFC interaction used for online regulation of the exploration rate and its application to a human-robot interaction scenario. There, unexpected uncertainties are produced by the human introducing cued task changes or by cheating. The model enables the robot to autonomously learn to reset exploration in response to such uncertain cues and events. The combined results provide concrete evidence specifying how prefrontal cortical subregions may cooperate to regulate RL parameters. They also show how such neurophysiologically inspired mechanisms can control advanced robots in the real world. Finally, the model's learning mechanisms, which were challenged in the last robotic scenario, provide testable predictions on the way monkeys may learn the structure of the task during the pretraining phase of the previous laboratory experiments.
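As a toy illustration of the kind of online meta-regulation discussed above, the sketch below lowers the softmax inverse temperature (more exploration) whenever recent reward falls below the long-run average, as after an unsignalled task shift. The heuristic and parameters are invented for illustration and are not the authors' MPFC-LPFC model.

```python
# Toy meta-regulation of exploration: compare fast vs slow reward averages and
# reduce the inverse temperature (explore more) when performance drops.
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 0.2, 5.0
Q = np.zeros(3)
fast_avg = slow_avg = 0.0
p_reward = np.array([0.9, 0.1, 0.1])
for trial in range(2000):
    if trial == 1000:                          # unsignalled task shift
        p_reward = np.array([0.1, 0.1, 0.9])
    p = np.exp(beta * Q) / np.exp(beta * Q).sum()
    a = rng.choice(3, p=p)
    r = float(rng.random() < p_reward[a])
    Q[a] += alpha * (r - Q[a])
    fast_avg += 0.2 * (r - fast_avg)           # recent performance
    slow_avg += 0.02 * (r - slow_avg)          # long-run performance
    # Meta-regulation: explore more when recent reward falls below expectation.
    beta = 1.0 + 9.0 / (1.0 + np.exp(10.0 * (slow_avg - fast_avg)))
print("final beta:", round(float(beta), 2), "final Q:", np.round(Q, 2))
```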
Article
Full-text available
The cortico-basal ganglia network has been proposed to consist of parallel loops serving distinct functions. However, it is still uncertain how the content of processed information varies across different loops and how it is related to the functions of each loop. We investigated this issue by comparing neuronal activity in the dorsolateral (sensorimotor) and dorsomedial (associative) striatum, which have been linked to habitual and goal-directed action selection, respectively, in rats performing a dynamic foraging task. Both regions conveyed significant neural signals for the animal's goal choice and its outcome. Moreover, both regions conveyed similar levels of neural signals for action value before the animal's goal choice and chosen value after the outcome of the animal's choice was revealed. However, a striking difference was found in the persistence of neural signals for the animal's chosen action. Signals for the animal's goal choice persisted in the dorsomedial striatum until the outcome of the animal's next goal choice was revealed, whereas they dissipated rapidly in the dorsolateral striatum. These persistent choice signals might be used for causally linking temporally discontiguous responses and their outcomes in the dorsomedial striatum, thereby contributing to its role in goal-directed action selection.
Article
Full-text available
Activation of dopamine receptors in forebrain regions, for minutes or longer, is known to be sufficient for positive reinforcement of stimuli and actions. However, the firing rate of dopamine neurons is increased for only about 200 milliseconds following natural reward events that are better than expected, a response which has been described as a "reward prediction error" (RPE). Although RPE drives reinforcement learning (RL) in computational models, it has not been possible to directly test whether the transient dopamine signal actually drives RL. Here we have performed optical stimulation of genetically targeted ventral tegmental area (VTA) dopamine neurons expressing Channelrhodopsin-2 (ChR2) in mice. We mimicked the transient activation of dopamine neurons that occurs in response to natural reward by applying a light pulse of 200 ms in VTA. When a single light pulse followed each self-initiated nose poke, it was sufficient in itself to cause operant reinforcement. Furthermore, when optical stimulation was delivered in separate sessions according to a predetermined pattern, it increased locomotion and contralateral rotations, behaviors that are known to result from activation of dopamine neurons. All three of the optically induced operant and locomotor behaviors were tightly correlated with the number of VTA dopamine neurons that expressed ChR2, providing additional evidence that the behavioral responses were caused by activation of dopamine neurons. These results provide strong evidence that the transient activation of dopamine neurons provides a functional reward signal that drives learning, in support of RL theories of dopamine function.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Full-text available
The orbitofrontal cortex has been hypothesized to carry information regarding the value of expected rewards. Such information is essential for associative learning, which relies on comparisons between expected and obtained reward for generating instructive error signals. These error signals are thought to be conveyed by dopamine neurons. To test whether orbitofrontal cortex contributes to these error signals, we recorded from dopamine neurons in orbitofrontal-lesioned rats performing a reward learning task. Lesions caused marked changes in dopaminergic error signaling. However, the effect of lesions was not consistent with a simple loss of information regarding expected value. Instead, without orbitofrontal input, dopaminergic error signals failed to reflect internal information about the impending response that distinguished externally similar states leading to differently valued future rewards. These results are consistent with current conceptualizations of orbitofrontal cortex as supporting model-based behavior and suggest an unexpected role for this information in dopaminergic error signaling.
Article
Full-text available
The mesostriatal dopamine system is prominently implicated in model-free reinforcement learning, with fMRI BOLD signals in ventral striatum notably covarying with model-free prediction errors. However, latent learning and devaluation studies show that behavior also shows hallmarks of model-based planning, and the interaction between model-based and model-free values, prediction errors, and preferences is underexplored. We designed a multistep decision task in which model-based and model-free influences on human choice behavior could be distinguished. By showing that choices reflected both influences we could then test the purity of the ventral striatal BOLD signal as a model-free report. Contrary to expectations, the signal reflected both model-free and model-based predictions in proportions matching those that best explained choice behavior. These results challenge the notion of a separate model-free learner and suggest a more integrated computational architecture for high-level human decision-making.
Article
Full-text available
Author Summary Every decision-making experiment has a structure that specifies how rewards are obtained, which is usually explained to the subject at the beginning of the experiment. Participants frequently fail to act as if they understand the experimental structure, even in tasks as simple as determining which of two biased coins they should choose to maximize the number of trials that produce “heads”. We hypothesize that participants' behavior is not driven by top-down instructions—rather, participants must learn through experience how the rewards are generated. We formalize this hypothesis using a fully rational optimal Bayesian reinforcement learning approach that models optimal structure learning in sequential decision making. In an experimental test of structure learning in humans, we show that humans learn reward structure from experience in a near optimal manner. Our results demonstrate that behavior purported to show that humans are error-prone and suboptimal decision makers can result from an optimal learning approach. Our findings provide a compelling new family of rational hypotheses for behavior previously deemed irrational, including under- and over-exploration.
Article
Full-text available
Maintaining appropriate beliefs about variables needed for effective decision making can be difficult in a dynamic environment. One key issue is the amount of influence that unexpected outcomes should have on existing beliefs. In general, outcomes that are unexpected because of a fundamental change in the environment should carry more influence than outcomes that are unexpected because of persistent environmental stochasticity. Here we use a novel task to characterize how well human subjects follow these principles under a range of conditions. We show that the influence of an outcome depends on both the error made in predicting that outcome and the number of similar outcomes experienced previously. We also show that the exact nature of these tendencies varies considerably across subjects. Finally, we show that these patterns of behavior are consistent with a computationally simple reduction of an ideal-observer model. The model adjusts the influence of newly experienced outcomes according to ongoing estimates of uncertainty and the probability of a fundamental change in the process by which outcomes are generated. A prior that quantifies the expected frequency of such environmental changes accounts for individual variability, including a positive relationship between subjective certainty and the degree to which new information influences existing beliefs. The results suggest that the brain adaptively regulates the influence of decision outcomes on existing beliefs using straightforward updating rules that take into account both recent outcomes and prior expectations about higher-order environmental structure.
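A simplified delta-rule sketch in the spirit of the reduced ideal-observer model described here (all details invented for illustration): the learning rate applied to each outcome increases with an estimate of the probability that the environment just changed and with the current uncertainty about the estimate.

```python
# Toy adaptive-learning-rate delta rule: outcomes that are surprising relative
# to the expected noise level are treated as likely change-points and given
# more influence over the current belief.
import numpy as np

rng = np.random.default_rng(3)
hazard, noise_sd = 0.05, 5.0
belief, uncertainty = 0.0, 1.0          # current estimate and relative uncertainty
mean = 0.0
for t in range(300):
    if rng.random() < hazard:
        mean = rng.uniform(-50, 50)     # occasional change-point
    outcome = mean + rng.normal(0, noise_sd)
    error = outcome - belief
    # Heuristic change-point probability: large errors relative to the noise
    # level signal a likely change in the generative process.
    p_change = 1 - np.exp(-0.5 * (error / (3 * noise_sd)) ** 2)
    lr = p_change + (1 - p_change) * uncertainty
    belief += lr * error
    # Uncertainty is high right after a suspected change, then shrinks.
    uncertainty = np.clip(0.9 * uncertainty + p_change, 0.05, 1.0)
print("final belief:", round(float(belief), 1), "true mean:", round(float(mean), 1))
```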
Article
Full-text available
The reward value of a stimulus can be learned through two distinct mechanisms: reinforcement learning through repeated stimulus-reward pairings and abstract inference based on knowledge of the task at hand. The reinforcement mechanism is often identified with midbrain dopamine neurons. Here we show that a neural pathway controlling the dopamine system does not rely exclusively on either stimulus-reward pairings or abstract inference but instead uses a combination of the two. We trained monkeys to perform a reward-biased saccade task in which the reward values of two saccade targets were related in a systematic manner. Animals used each trial's reward outcome to learn the values of both targets: the target that had been presented and whose reward outcome had been experienced (experienced value) and the target that had not been presented but whose value could be inferred from the reward statistics of the task (inferred value). We then recorded from three populations of reward-coding neurons: substantia nigra dopamine neurons; a major input to dopamine neurons, the lateral habenula; and neurons that project to the lateral habenula, located in the globus pallidus. All three populations encoded both experienced values and inferred values. In some animals, neurons encoded experienced values more strongly than inferred values, and the animals showed behavioral evidence of learning faster from experience than from inference. Our data indicate that the pallidus-habenula-dopamine pathway signals reward values estimated through both experience and inference.
Article
Full-text available
Game theory analyses optimal strategies for multiple decision makers interacting in a social group. However, the behaviours of individual humans and animals often deviate systematically from the optimal strategies described by game theory. The behaviours of rhesus monkeys (Macaca mulatta) in simple zero-sum games showed similar patterns, but their departures from the optimal strategies were well accounted for by a simple reinforcement-learning algorithm. During a computer-simulated zero-sum game, neurons in the dorsolateral prefrontal cortex often encoded the previous choices of the animal and its opponent as well as the animal's reward history. By contrast, the neurons in the anterior cingulate cortex predominantly encoded the animal's reward history. Using simple competitive games, therefore, we have demonstrated functional specialization between different areas of the primate frontal cortex involved in outcome monitoring and action selection. Temporally extended signals related to the animal's previous choices might facilitate the association between choices and their delayed outcomes, whereas information about the choices of the opponent might be used to estimate the reward expected from a particular action. Finally, signals related to the reward history might be used to monitor the overall success of the animal's current decision-making strategy.
Article
Full-text available
We develop a theoretical framework that shows how mesencephalic dopamine systems could distribute to their targets a signal that represents information about future expectations. In particular, we show how activity in the cerebral cortex can make predictions about future receipt of reward and how fluctuations in the activity levels of neurons in diffuse dopamine systems above and below baseline levels would represent errors in these predictions that are delivered to cortical and subcortical targets. We present a model for how such errors could be constructed in a real brain that is consistent with physiological results for a subset of dopaminergic neurons located in the ventral tegmental area and surrounding dopaminergic neurons. The theory also makes testable predictions about human choice behavior on a simple decision-making task. Furthermore, we show that, through a simple influence on synaptic plasticity, fluctuations in dopamine release can act to change the predictions in an appropriate manner.
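The core computation of this account is the temporal-difference prediction error. The minimal TD(0) sketch below, using the usual time-since-cue state representation and arbitrary parameters, shows the error migrating from the time of reward to the time of the cue over training, as in dopamine recordings.

```python
# TD(0) prediction errors for a cue followed by reward after a fixed delay.
import numpy as np

T, rew_delay, alpha, gamma = 12, 5, 0.1, 1.0
V = np.zeros(T + 1)                  # value as a function of time since the cue
for episode in range(500):
    # Cue onset is unpredictable, so the value just before it is taken as 0.
    cue_delta = 0.0 + gamma * V[0] - 0.0
    deltas = np.zeros(T)
    for k in range(T):
        r = 1.0 if k + 1 == rew_delay else 0.0
        deltas[k] = r + gamma * V[k + 1] - V[k]
        V[k] += alpha * deltas[k]
print("error at cue onset:", round(cue_delta, 2),
      "error at reward:", round(deltas[rew_delay - 1], 2))
```

After training the error at the (unpredicted) cue is close to 1 while the error at the (now predicted) reward is close to 0, which is the dopamine-like pattern the theory describes.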
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
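For reference, here is a single LSTM step in plain NumPy, written in the now-standard form with a forget gate (the original 1997 formulation lacked one): multiplicative gates control writes to and reads from a cell state, which is what carries error across long time lags. Weight shapes and initialization are illustrative.

```python
# One LSTM cell step: input, forget and output gates plus a candidate update.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """x: input vector; h, c: previous hidden and cell states."""
    z = W @ np.concatenate([x, h]) + b          # all four gate pre-activations
    n = len(h)
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2*n]), sigmoid(z[2*n:3*n])
    g = np.tanh(z[3*n:])                         # candidate cell update
    c = f * c + i * g                            # gated write to the cell state
    h = o * np.tanh(c)                           # gated read from the cell state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8
W = rng.normal(0, 0.1, (4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(20):
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print("hidden state after 20 steps:", np.round(h, 3))
```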
Article
Full-text available
Most natural actions are chosen voluntarily from many possible choices. An action is often chosen based on the reward that it is expected to produce. What kind of cellular activity in which area of the cerebral cortex is involved in selecting an action according to the expected reward value? Results of an analysis in monkeys of cellular activity during the performance of reward-based motor selection and the effects of chemical inactivation are presented. We suggest that cells in the rostral cingulate motor area, one of the higher order motor areas in the cortex, play a part in processing the reward information for motor selection.
Article
Full-text available
To what extent do we learn from the positive versus negative outcomes of our decisions? The neuromodulator dopamine plays a key role in these reinforcement learning processes. Patients with Parkinson's disease, who have depleted dopamine in the basal ganglia, are impaired in tasks that require learning from trial and error. Here, we show, using two cognitive procedural learning tasks, that Parkinson's patients off medication are better at learning to avoid choices that lead to negative outcomes than they are at learning from positive outcomes. Dopamine medication reverses this bias, making patients more sensitive to positive than negative outcomes. This pattern was predicted by our biologically based computational model of basal ganglia–dopamine interactions in cognition, which has separate pathways for “Go” and “NoGo” responses that are differentially modulated by positive and negative reinforcement.
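A toy abstraction of the Go/NoGo account (not the full biologically based network model cited above): separate Go and NoGo weights learn preferentially from positive and negative outcomes, and a single dopamine parameter scales the balance between the two pathways. The update rules and parameters below are illustrative.

```python
# Toy Go/NoGo learner: dopamine level biases learning toward gains or losses.
import numpy as np

def train(dopamine, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    alpha = 0.1
    go, nogo = np.zeros(2), np.zeros(2)
    p_reward = np.array([0.8, 0.2])
    for _ in range(trials):
        net = go - nogo
        p = np.exp(3 * net) / np.exp(3 * net).sum()
        a = rng.choice(2, p=p)
        delta = float(rng.random() < p_reward[a]) - 0.5   # outcome vs neutral expectation
        if delta > 0:
            go[a] += dopamine * alpha * (delta - go[a])        # "Go" pathway: learn from gains
        else:
            nogo[a] += (1 - dopamine) * alpha * (-delta - nogo[a])  # "NoGo": learn from losses
    return go, nogo

for level, label in [(0.2, "low dopamine"), (0.8, "high dopamine")]:
    go, nogo = train(level)
    print(label, "-> Go:", np.round(go, 2), "NoGo:", np.round(nogo, 2))
```

With the low-dopamine setting the NoGo weights dominate (better avoidance learning), and with the high-dopamine setting the Go weights dominate, mirroring the medication-dependent bias the abstract describes.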
Article
Full-text available
Human cognitive control is uniquely flexible and has been shown to depend on prefrontal cortex (PFC). But exactly how the biological mechanisms of the PFC support flexible cognitive control remains a profound mystery. Existing theoretical models have posited powerful task-specific PFC representations, but not how these develop. We show how this can occur when a set of PFC-specific neural mechanisms interact with breadth of experience to self-organize abstract rule-like PFC representations that support flexible generalization in novel tasks. The same model is shown to apply to benchmark PFC tasks (Stroop and Wisconsin card sorting), accurately simulating the behavior of neurologically intact and frontally damaged people.
Article
Midbrain dopamine neurons signal reward prediction error (RPE), or actual minus expected reward. The temporal difference (TD) learning model has been a cornerstone in understanding how dopamine RPEs could drive associative learning. Classically, TD learning imparts value to features that serially track elapsed time relative to observable stimuli. In the real world, however, sensory stimuli provide ambiguous information about the hidden state of the environment, leading to the proposal that TD learning might instead compute a value signal based on an inferred distribution of hidden states (a 'belief state'). Here we asked whether dopaminergic signaling supports a TD learning framework that operates over hidden states. We found that dopamine signaling showed a notable difference between two tasks that differed only with respect to whether reward was delivered in a deterministic manner. Our results favor an associative learning rule that combines cached values with hidden-state inference.
Article
Artificial neural networks are remarkably adept at sensory processing, sequence learning and reinforcement learning, but are limited in their ability to represent variables and data structures and to store data over long timescales, owing to the lack of an external memory. Here we introduce a machine learning model called a differentiable neural computer (DNC), which consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Like a conventional computer, it can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data. When trained with supervised learning, we demonstrate that a DNC can successfully answer synthetic questions designed to emulate reasoning and inference problems in natural language. We show that it can learn tasks such as finding the shortest path between specified points and inferring the missing links in randomly generated graphs, and then generalize these tasks to specific graphs such as transport networks and family trees. When trained with reinforcement learning, a DNC can complete a moving blocks puzzle in which changing goals are specified by sequences of symbols. Taken together, our results demonstrate that DNCs have the capacity to solve complex, structured tasks that are inaccessible to neural networks without external read-write memory.
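The content-based addressing at the heart of the DNC's external memory can be sketched in a few lines: a read key is compared with every memory row by cosine similarity, and the read vector is a softly weighted sum of rows. This shows only the addressing step, with invented shapes, not the full differentiable computer.

```python
# Content-based read from an external memory matrix via soft cosine attention.
import numpy as np

def content_read(memory, key, sharpness=10.0):
    """memory: (rows, width); key: (width,). Returns read weights and read vector."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    similarity = memory @ key / norms                 # cosine similarity per row
    weights = np.exp(sharpness * similarity)
    weights /= weights.sum()                          # soft attention over rows
    return weights, weights @ memory

rng = np.random.default_rng(0)
memory = rng.normal(size=(16, 8))
key = memory[3] + 0.1 * rng.normal(size=8)            # noisy cue for row 3
weights, read_vec = content_read(memory, key)
print("most-attended row:", int(weights.argmax()),
      "weight:", round(float(weights.max()), 2))
```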
Article
[download pdf for free at http://authors.elsevier.com/a/1Tlf63BtfGhopS] Although the orbitofrontal cortex (OFC) has been studied intensely for decades, its precise functions have remained elusive. We recently hypothesized that the OFC contains a “cognitive map” of task space in which the current state of the task is represented, and this representation is especially critical for behavior when states are unobservable from sensory input. To test this idea, we apply pattern-classification techniques to neuroimaging data from humans performing a decision-making task with 16 states. We show that unobservable task states can be decoded from activity in OFC, and decoding accuracy is related to task performance and the occurrence of individual behavioral errors. Moreover, similarity between the neural representations of consecutive states correlates with behavioral accuracy in corresponding state transitions. These results support the idea that OFC represents a cognitive map of task space and establish the feasibility of decoding state representations in humans using non-invasive neuroimaging.
Article
We review the psychology and neuroscience of reinforcement learning (RL), which has experienced significant progress in the past two decades, enabled by the comprehensive experimental study of simple learning and decision-making tasks. However, one challenge in the study of RL is computational: the simplicity of these tasks ignores important aspects of reinforcement learning in the real world: (a) state spaces are high-dimensional, continuous, and partially observable; this implies that (b) data are relatively sparse and, indeed, precisely the same situation may never be encountered twice; furthermore, (c) rewards depend on the long-term consequences of actions in ways that violate the classical assumptions that make RL tractable. A seemingly distinct challenge is that, cognitively, theories of RL have largely involved procedural and semantic memory, the way in which knowledge about action values or world models extracted gradually from many experiences can drive choice. This focus on semantic memory leaves out many aspects of memory, such as episodic memory, related to the traces of individual events. We suggest that these two challenges are related. The computational challenge can be dealt with, in part, by endowing RL systems with episodic memory, allowing them to (a) efficiently approximate value functions over complex state spaces, (b) learn with very little data, and (c) bridge long-term dependencies between actions and rewards. We review the computational theory underlying this proposal and the empirical evidence to support it. Our proposal suggests that the ubiquitous and diverse roles of memory in RL may function as part of an integrated learning system.
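One simple way to make the episodic proposal concrete is a non-parametric value estimate: store every experienced (state, return) pair and estimate the value of a new state from its k nearest stored neighbors. The sketch below illustrates this general idea, not the authors' specific model.

```python
# Episodic value estimation by k-nearest-neighbor lookup over stored experiences.
import numpy as np

class EpisodicValue:
    def __init__(self, k=5):
        self.k = k
        self.states, self.returns = [], []

    def store(self, state, ret):
        self.states.append(np.asarray(state, dtype=float))
        self.returns.append(float(ret))

    def value(self, state):
        if not self.states:
            return 0.0
        dists = np.linalg.norm(np.array(self.states) - np.asarray(state), axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean(np.array(self.returns)[nearest]))

rng = np.random.default_rng(0)
mem = EpisodicValue(k=5)
for _ in range(200):                       # toy environment: return = sum of state
    s = rng.uniform(-1, 1, size=3)
    mem.store(s, s.sum() + rng.normal(0, 0.1))
print("estimated value of [0.5, 0.5, 0.5]:", round(mem.value([0.5, 0.5, 0.5]), 2))
```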
Article
The orbitofrontal cortex (OFC) has been implicated in both the representation of "state," in studies of reinforcement learning and decision making, and also in the representation of "schemas," in studies of episodic memory. Both of these cognitive constructs require a similar inference about the underlying situation or "latent cause" that generates our observations at any given time. The statistically optimal solution to this inference problem is to use Bayes' rule to compute a posterior probability distribution over latent causes. To test whether such a posterior probability distribution is represented in the OFC, we tasked human participants with inferring a probability distribution over four possible latent causes, based on their observations. Using fMRI pattern similarity analyses, we found that BOLD activity in the OFC is best explained as representing the (log-transformed) posterior distribution over latent causes. Furthermore, this pattern explained OFC activity better than other task-relevant alternatives, such as the most probable latent cause, the most recent observation, or the uncertainty over latent causes. Significance statement: Our world is governed by hidden (latent) causes that we cannot observe, but which generate the observations we see. A range of high-level cognitive processes require inference of a probability distribution (or "belief distribution") over the possible latent causes that might be generating our current observations. This is true for reinforcement learning and decision making (where the latent cause comprises the true "state" of the task), and for episodic memory (where memories are believed to be organized by the inferred situation or "schema"). Using fMRI, we show that this belief distribution over latent causes is encoded in patterns of brain activity in the orbitofrontal cortex, an area that has been separately implicated in the representations of both states and schemas.
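The inference problem described here has a compact Bayesian form: maintain a posterior over a small set of latent causes and update it with each observation via Bayes' rule. The generative probabilities in the sketch below are invented for illustration.

```python
# Posterior over four latent causes, updated online by Bayes' rule.
import numpy as np

rng = np.random.default_rng(0)
# P(observation = 1 | cause) for four hypothetical latent causes.
likelihood_1 = np.array([0.1, 0.4, 0.6, 0.9])
posterior = np.full(4, 0.25)               # uniform prior over causes
true_cause = 2
for t in range(30):
    obs = rng.random() < likelihood_1[true_cause]
    like = likelihood_1 if obs else 1 - likelihood_1
    posterior = posterior * like
    posterior /= posterior.sum()           # Bayes' rule: normalize
print("posterior over causes:", np.round(posterior, 3))
```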
Article
Often the world is structured such that distinct sensory contexts signify the same abstract rule set. Learning from feedback thus informs us not only about the value of stimulus-action associations but also about which rule set applies. Hierarchical clustering models suggest that learners discover structure in the environment, clustering distinct sensory events into a single latent rule set. Such structure enables a learner to transfer any newly acquired information to other contexts linked to the same rule set, and facilitates re-use of learned knowledge in novel contexts. Here, we show that humans exhibit this transfer, generalization and clustering during learning. Trial-by-trial model-based analysis of EEG signals revealed that subjects' reward expectations incorporated this hierarchical structure; these structured neural signals were predictive of behavioral transfer and clustering. These results further our understanding of how humans learn and generalize flexibly by building abstract, behaviorally relevant representations of the complex, high-dimensional sensory environment.
Article
Environmental stimuli and objects, including rewards, are often processed sequentially in the brain. Recent work suggests that the phasic dopamine reward prediction-error response follows a similar sequential pattern. An initial brief, unselective and highly sensitive increase in activity unspecifically detects a wide range of environmental stimuli, then quickly evolves into the main response component, which reflects subjective reward value and utility. This temporal evolution allows the dopamine reward prediction-error signal to optimally combine speed and accuracy.
Article
In order to choose advantageously in many circumstances, the values of choice alternatives have to be learned from experience. We provide an introduction to theoretical and experimental work on reinforcement learning, that is, trial-and-error learning to obtain rewards or avoid punishments. We introduce one version, the temporal-difference learning model, and review evidence that its predictions relate to the firing properties of midbrain dopamine neurons and to activity recorded with functional neuroimaging in humans. We also present evidence that this computational and neurophysiological mechanism affects human and animal behavior in decision and conditioning tasks.
Article
The application of ideas from computational reinforcement learning has recently enabled dramatic advances in behavioral and neuroscientific research. For the most part, these advances have involved insights concerning the algorithms underlying learning and decision making. In the present article, we call attention to the equally important but relatively neglected question of how problems in learning and decision making are internally represented. To articulate the significance of representation for reinforcement learning we draw on the concept of efficient coding, as developed in perception research. The resulting perspective exposes a range of novel goals for behavioral and neuroscientific research, highlighting in particular the need for research into the statistical structure of naturalistic tasks.
Article
Effective reinforcement learning hinges on having an appropriate state representation. But where does this representation come from? We argue that the brain discovers state representations by trying to infer the latent causal structure of the task at hand, and assigning each latent cause to a separate state. In this paper, we review several implications of this latent cause framework, with a focus on Pavlovian conditioning. The framework suggests that conditioning is not the acquisition of associations between cues and outcomes, but rather the acquisition of associations between latent causes and observable stimuli. A latent cause interpretation of conditioning enables us to begin answering questions that have frustrated classical theories: Why do extinguished responses sometimes return? Why do stimuli presented in compound sometimes summate and sometimes do not? Beyond conditioning, the principles of latent causal inference may provide a general theory of structure learning across cognitive domains.
Article
The contexts for action may be only transiently visible, accessible, and relevant. The cortico-basal ganglia (BG) circuit addresses these demands by allowing the right motor plans to drive action at the right times, via a BG-mediated gate on motor representations. A long-standing hypothesis posits these same circuits are replicated in more rostral brain regions to support gating of cognitive representations. Key evidence now supports the prediction that BG can act as a gate on the input to working memory, as a gate on its output, and as a means of reallocating working memory representations rendered irrelevant by recent events. These discoveries validate key tenets of many computational models, circumscribe motor and cognitive models of recurrent cortical dynamics alone, and identify novel directions for research on the mechanisms of higher-level cognition.
Article
The midbrain dopamine (DA) neurons play a central role in developing appropriate goal-directed behaviors, including the motivation and cognition to develop appropriate actions to obtain a specific outcome. Indeed, subpopulations of DA neurons have been associated with these different functions: the mesolimbic, mesocortical, and nigrostriatal pathways. The mesolimbic and nigrostriatal pathways are an integral part of the basal ganglia through its reciprocal connections to the ventral and dorsal striatum respectively. This chapter reviews the connections of the midbrain DA cells and their role in integrating information across limbic, cognitive and motor functions. Emphasis is placed on the interface between these functional domains within the striatum through corticostriatal connections, through the striato-nigro-striatal connection, and through the lateral habenula projection to the midbrain.
Article
Phasic increases and decreases in dopamine (DA) transmission encode reward prediction errors thought to facilitate reward-related learning, yet how these signals guide action selection in more complex situations requiring the evaluation of different rewards remains unclear. We manipulated phasic DA signals while rats performed a risk/reward decision-making task, using temporally discrete stimulation of either the lateral habenula (LHb) or the rostromedial tegmental nucleus (RMTg) to suppress DA bursts (confirmed with neurophysiological studies), or of the ventral tegmental area (VTA) to override phasic dips. When rats chose between small/certain and larger/risky rewards, LHb or RMTg stimulation, time-locked to delivery of one of these rewards, redirected bias toward the alternative option, whereas VTA stimulation after nonrewarded choices increased risky choice. LHb stimulation prior to choices shifted bias away from more preferred options. Thus, phasic DA signals provide feedback on whether recent actions were rewarded, which is used to update decision policies and direct actions toward more desirable rewards.
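A toy Python sketch of the computational interpretation: reward prediction errors after each choice update the values of a small/certain and a large/risky option, and artificially suppressing the "burst" after a rewarded risky choice drags that option's value down and redirects bias toward the alternative. The task parameters and the forced-dip rule are assumptions for illustration, not the experimental analysis.

```python
import random

values = {"certain": 0.5, "risky": 0.5}   # toy option values
ALPHA = 0.2                               # learning rate (assumed)

def choose():
    """Mostly greedy choice with a little exploration."""
    if random.random() < 0.1:
        return random.choice(list(values))
    return max(values, key=values.get)

def update(option, reward, forced_dip=False):
    rpe = reward - values[option]
    if forced_dip:               # stimulation that suppresses the DA burst:
        rpe = min(rpe, -0.5)     # the outcome is treated as worse than predicted
    values[option] += ALPHA * rpe

for t in range(200):
    opt = choose()
    reward = 1.0 if opt == "certain" else (2.0 if random.random() < 0.5 else 0.0)
    # Suppress the burst whenever the risky option pays off:
    update(opt, reward, forced_dip=(opt == "risky" and reward > 0))

print(values)  # the risky option's value collapses, biasing choice away from it
```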
Conference Paper
Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large-vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However, the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation is needed to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.
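For orientation, the following PyTorch sketch shows the general shape of such a bidirectional-LSTM acoustic model: frame-level features in, per-frame posteriors over HMM states out, which a hybrid system would rescale into observation likelihoods for decoding. The layer sizes, feature dimensionality, and state inventory are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """Bidirectional LSTM over acoustic frames with a framewise state classifier."""
    def __init__(self, n_features=40, n_hidden=256, n_layers=3, n_states=2000):
        super().__init__()
        self.blstm = nn.LSTM(n_features, n_hidden, num_layers=n_layers,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * n_hidden, n_states)  # forward + backward

    def forward(self, frames):               # frames: (batch, time, n_features)
        hidden, _ = self.blstm(frames)
        return self.classifier(hidden)       # per-frame logits over HMM states

model = BLSTMAcousticModel()
logits = model(torch.randn(4, 100, 40))      # e.g. 4 utterances, 100 frames each
log_post = torch.log_softmax(logits, dim=-1)
print(log_post.shape)                        # torch.Size([4, 100, 2000])
```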
Article
Orbitofrontal cortex (OFC) has long been known to play an important role in decision making. However, the exact nature of that role has remained elusive. Here, we propose a unifying theory of OFC function. We hypothesize that OFC provides an abstraction of currently available information in the form of a labeling of the current task state, which is used for reinforcement learning (RL) elsewhere in the brain. This function is especially critical when task states include unobservable information, for instance, from working memory. We use this framework to explain classic findings in reversal learning, delayed alternation, extinction, and devaluation as well as more recent findings showing the effect of OFC lesions on the firing of dopaminergic neurons in ventral tegmental area (VTA) in rodents performing an RL task. In addition, we generate a number of testable experimental predictions that can distinguish our theory from other accounts of OFC function.
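The core claim, that the RL "state" should bundle observable input with unobservable task information, can be sketched in a few lines of Python using a delayed-alternation-like toy task. The task, the working-memory variable, and the tabular learner are assumptions made for illustration; they are not the authors' model.

```python
import random
from collections import defaultdict

Q = defaultdict(float)        # value of (task_state, action) pairs
ALPHA, EPS = 0.1, 0.1
ACTIONS = ["left", "right"]

def task_state(observation, memory):
    """An OFC-like state label: the observable cue plus working-memory content."""
    return (observation, memory)

memory = None
for trial in range(500):
    state = task_state("go-cue", memory)
    if random.random() < EPS:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    # Delayed alternation: reward only for switching away from the last response.
    reward = 1.0 if (memory is not None and action != memory) else 0.0
    Q[(state, action)] += ALPHA * (reward - Q[(state, action)])
    memory = action           # unobservable information carried to the next trial

print(sorted(Q.items(), key=lambda kv: -kv[1])[:2])
# Without `memory` in the state label, both actions look 50% rewarded and the
# alternation rule is unlearnable; with it, simple tabular RL solves the task.
```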
Article
Predicting outcomes is a critical ability of humans and animals. The dopamine reward prediction error hypothesis, the driving force behind recent progress in neural "value-based" decision making, states that dopamine activity encodes the signal used to learn to predict reward, namely the difference between the actual and predicted reward, called the reward prediction error. However, this hypothesis and its underlying assumptions treat the prediction and its error as reactively triggered by momentary environmental events. Reviewing these assumptions and some of the latest findings, we suggest that the internal state representation is learned to reflect the environmental reward structure, and we propose a new hypothesis - the dopamine reward structural learning hypothesis - in which dopamine activity encodes multiplex learning signals that build a representation of reward structure into the internal state, leading to better reward prediction.
Article
The role that frontal-striatal circuits play in normal behavior remains unclear. Two of the leading hypotheses suggest that these circuits are important for action selection or for reinforcement learning. To examine these hypotheses, we carried out an experiment in which monkeys had to select actions in two different task conditions. In the first (random) condition, actions were selected on the basis of perceptual inference. In the second (fixed) condition, the animals used reinforcement from previous trials to select actions. Examination of neural activity showed that the representation of the selected action was stronger in the lateral prefrontal cortex (lPFC) than in the dorsal striatum (dSTR), and emerged earlier in the lPFC. In contrast, the representation of action values, in both the random and fixed conditions, was stronger in the dSTR. Thus, the dSTR contains an enriched representation of action value, but it follows the frontal cortex in action selection.
Article
Instrumental learning involves corticostriatal circuitry and the dopaminergic system. This system is typically modeled in the reinforcement learning (RL) framework by incrementally accumulating reward values of states and actions. However, human learning also implicates prefrontal cortical mechanisms involved in higher level cognitive functions. The interaction of these systems remains poorly understood, and models of human behavior often ignore working memory (WM) and therefore incorrectly assign behavioral variance to the RL system. Here we designed a task that highlights the profound entanglement of these two processes, even in simple learning problems. By systematically varying the size of the learning problem and delay between stimulus repetitions, we separately extracted WM-specific effects of load and delay on learning. We propose a new computational model that accounts for the dynamic integration of RL and WM processes observed in subjects' behavior. Incorporating capacity-limited WM into the model allowed us to capture behavioral variance that could not be captured in a pure RL framework even if we (implausibly) allowed separate RL systems for each set size. The WM component also allowed for a more reasonable estimation of a single RL process. Finally, we report effects of two genetic polymorphisms having relative specificity for prefrontal and basal ganglia functions. Whereas the COMT gene coding for catechol-O-methyl transferase selectively influenced model estimates of WM capacity, the GPR6 gene coding for G-protein-coupled receptor 6 influenced the RL learning rate. Thus, this study allowed us to specify distinct influences of the high-level and low-level cognitive functions on instrumental learning, beyond the possibilities offered by simple RL models.
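A rough sketch of the kind of hybrid the abstract describes (inspired by, but not reproducing, the authors' model): choices mix a fast, capacity-limited working-memory store of recent stimulus-action-outcome pairs with a slow, incremental RL learner. The capacity, learning rate, and mixture weight below are illustrative guesses.

```python
import random
from collections import OrderedDict, defaultdict

CAPACITY, ALPHA, W_WM = 3, 0.1, 0.8        # illustrative meta-parameters
wm = OrderedDict()                         # stimulus -> last rewarded action
q = defaultdict(lambda: 0.5)               # (stimulus, action) -> incremental value

def choose(stimulus, actions):
    if stimulus in wm and random.random() < W_WM:
        return wm[stimulus]                               # one-shot WM retrieval
    return max(actions, key=lambda a: q[(stimulus, a)])   # slow RL policy

def learn(stimulus, action, reward):
    q[(stimulus, action)] += ALPHA * (reward - q[(stimulus, action)])
    if reward > 0:
        wm[stimulus] = action
        wm.move_to_end(stimulus)
        if len(wm) > CAPACITY:             # capacity limit: oldest item is lost
            wm.popitem(last=False)

# Larger set sizes overwhelm the WM store, so behavior leans on the slower
# RL component, mimicking the load and delay effects described above.
correct = {f"s{i}": random.choice(["a1", "a2", "a3"]) for i in range(6)}
for _ in range(300):
    s = random.choice(list(correct))
    a = choose(s, ["a1", "a2", "a3"])
    learn(s, a, 1.0 if a == correct[s] else 0.0)
```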
Article
A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn.
Article
Damage to the frontal lobe can cause severe decision-making impairments. A mechanism that may underlie this is that neurons in the frontal cortex encode many variables that contribute to the valuation of a choice, such as its costs, benefits and probability of success. However, optimal decision-making requires that one considers these variables, not only when faced with the choice, but also when evaluating the outcome of the choice, in order to adapt future behaviour appropriately. To examine the role of the frontal cortex in encoding the value of different choice outcomes, we simultaneously recorded the activity of multiple single neurons in the anterior cingulate cortex (ACC), orbitofrontal cortex (OFC) and lateral prefrontal cortex (LPFC) while subjects evaluated the outcome of choices involving manipulations of probability, payoff and cost. Frontal neurons encoded many of the parameters that enabled the calculation of the value of these variables, including the onset and offset of reward and the amount of work performed, and often encoded the value of outcomes across multiple decision variables. In addition, many neurons encoded both the predicted outcome during the choice phase of the task as well as the experienced outcome in the outcome phase of the task. These patterns of selectivity were more prevalent in ACC relative to OFC and LPFC. These results support a role for the frontal cortex, principally ACC, in selecting between choice alternatives and evaluating the outcome of that selection, thereby ensuring that choices are optimal and adaptive.
Article
To make a visual discrimination, the brain must extract relevant information from the retina, represent appropriate variables in the visual cortex and read out this representation to decide which of two or more alternatives is more likely. We recorded from neurons in the dorsolateral prefrontal cortex (areas 8 and 46) of the rhesus monkey while it performed a motion discrimination task. The monkey indicated its judgment of direction by making appropriate eye movements. As the monkey viewed the motion stimulus, the neural response predicted the monkey's subsequent gaze shift, hence its judgment of direction. The response comprised a mixture of high-level oculomotor signals and weaker visual sensory signals that reflected the strength and direction of motion. This combination of sensory integration and motor planning could reflect the conversion of visual motion information into a categorical decision about direction and thus give insight into the neural computations behind a simple cognitive act.
Article
Stimulus-specific persistent neural activity is the neural process underlying active (working) memory. Since its discovery 30 years ago, mnemonic activity has been hypothesized to be sustained by synaptic reverberation in a recurrent circuit. Recently, experimental and modeling work has begun to test the reverberation hypothesis at the cellular level. Moreover, theory has been developed to describe memory storage of an analog stimulus (such as spatial location or eye position), in terms of continuous 'bump attractors' and 'line attractors'. This review summarizes new studies, and discusses insights and predictions from biophysically based models. The stability of a working memory network is recognized as a serious problem; stability can be achieved if reverberation is largely mediated by NMDA receptors at recurrent synapses.
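The reverberation idea can be caricatured with a single self-exciting rate unit: a brief cue pushes the unit from a low-activity fixed point to a high-activity one, and recurrent excitation then sustains the elevated firing after the cue is gone. The gain function and weights below are arbitrary choices that merely make the unit bistable; biophysical models place the slow recurrent excitation at NMDA receptors, which this sketch abstracts away.

```python
import numpy as np

def f(x):
    """Sigmoidal neuronal gain function (parameters chosen only for bistability)."""
    return 1.0 / (1.0 + np.exp(-(x - 3.0)))

w, dt_over_tau = 8.0, 0.1        # recurrent weight and integration step
r, trace = 0.1, []
for t in range(400):
    stim = 3.0 if 100 <= t < 150 else 0.0     # brief cue presentation
    r += dt_over_tau * (-r + f(w * r + stim)) # rate dynamics with recurrence
    trace.append(r)

# Low activity before the cue, persistently high activity long after it ends.
print(round(trace[90], 2), round(trace[399], 2))
```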
Article
In reinforcement learning (RL), the trade-off between exploitation and exploration has long been an important issue. This paper presents a new method that controls the balance between exploitation and exploration. Our learning scheme is based on model-based RL, in which Bayesian inference with a forgetting effect estimates the state-transition probabilities of the environment. The balance parameter, which corresponds to the randomness in action selection, is controlled based on variation in action results and perception of environmental change. When applied to maze tasks, our method achieves good control by adapting to environmental changes. Recently, Usher et al. [Science 283 (1999) 549] have suggested that noradrenergic neurons in the locus coeruleus may control the exploitation-exploration balance in a real brain and that this balance may correspond to the animal's level of selective attention. According to this scenario, we also discuss a possible implementation in the brain.
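A schematic Python sketch of the balance-control idea: softmax action selection whose inverse temperature falls when recent outcomes become surprising (suggesting the environment has changed) and rises again as predictions stabilize. The surprise estimate and temperature rule below are assumptions for illustration, not the paper's equations.

```python
import math, random

q = [0.0, 0.0]                 # action values for a two-armed bandit
beta, alpha = 1.0, 0.2         # inverse temperature and learning rate
surprise_avg = 0.0             # forgetting-weighted average of |prediction error|

def softmax_choice(values, beta):
    prefs = [math.exp(beta * v) for v in values]
    z, r, acc = sum(prefs), random.random() * sum(prefs), 0.0
    for i, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return i
    return len(values) - 1

p_reward = [0.8, 0.2]
for t in range(2000):
    if t == 1000:
        p_reward = [0.2, 0.8]                  # unsignalled reversal
    a = softmax_choice(q, beta)
    reward = 1.0 if random.random() < p_reward[a] else 0.0
    err = reward - q[a]
    q[a] += alpha * err
    surprise_avg = 0.95 * surprise_avg + 0.05 * abs(err)   # forgetting effect
    beta = 5.0 / (1.0 + 4.0 * surprise_avg)    # surprising world -> explore more

print(round(beta, 2), [round(v, 2) for v in q])
```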
Article
Meta-parameters in reinforcement learning should be tuned to the environmental dynamics and the animal performance. Here, we propose a biologically plausible meta-reinforcement learning algorithm for tuning these meta-parameters in a dynamic, adaptive manner. We tested our algorithm in both a simulation of a Markov decision task and in a non-linear control task. Our results show that the algorithm robustly finds appropriate meta-parameter values, and controls the meta-parameter time course, in both static and dynamic environments. We suggest that the phasic and tonic components of dopamine neuron firing can encode the signal required for meta-learning of reinforcement learning.
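In very simplified form, the proposed scheme can be sketched as follows: a meta-parameter (here the learning rate) is perturbed with noise, and perturbations are reinforced when a faster-decaying running average of reward exceeds a slower one, loosely mirroring the phasic/tonic division of labor suggested above. All constants and the toy non-stationary bandit are assumptions made for illustration.

```python
import random

alpha = 0.05                  # meta-parameter being tuned (learning rate)
q = [0.0, 0.0]                # action values for a two-armed bandit
r_mid, r_long, noise = 0.0, 0.0, 0.0

for t in range(5000):
    if t % 50 == 0:
        noise = random.gauss(0.0, 0.01)        # resample the perturbation
    a_eff = max(1e-3, alpha + noise)           # perturbed learning rate in use
    action = 0 if q[0] >= q[1] else 1
    if random.random() < 0.1:
        action = random.randrange(2)           # occasional exploration
    p = [0.9, 0.1] if (t // 1000) % 2 == 0 else [0.1, 0.9]   # drifting task
    reward = 1.0 if random.random() < p[action] else 0.0
    q[action] += a_eff * (reward - q[action])
    r_mid = 0.99 * r_mid + 0.01 * reward       # faster, "phasic"-like average
    r_long = 0.999 * r_long + 0.001 * reward   # slower, "tonic"-like average
    alpha += 0.01 * (r_mid - r_long) * noise   # reinforce helpful perturbations

print(round(alpha, 3))                         # the adapted learning rate
```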