Preprint

Recurrent Neural Network Exploration Strategies During Reinforcement Learning Depend on Network Capacity


Abstract

Artificial neural networks constitute simplified computational models of neural circuits that might help us understand how the biological brain solves and represents complex tasks. Previous research revealed that recurrent neural networks (RNNs) with 48 hidden units show human-level performance in restless four-armed bandit tasks but differ from humans with respect to the task strategy employed. Here we systematically examined the impact of network capacity (number of hidden units) on computational mechanisms and performance. Computational modeling was applied to investigate and compare network behavior between capacity levels as well as between RNNs and human learners. Using a task frequently employed in human cognitive neuroscience work as well as in animal systems neuroscience work, we show that high-capacity networks displayed increased directed exploration and attenuated random exploration relative to low-capacity networks. RNNs with 576 hidden units approached human-like exploration strategies, but the overall switch rate and the level of perseveration still deviated from those of human learners. In the context of the resource-rational framework, which posits a trade-off between reward and policy complexity, human learners may devote more resources to solving the task, albeit without performance benefits over RNNs. Taken together, this work reveals the impact of network capacity on exploration strategies during reinforcement learning and therefore contributes to the goal of building neural networks that behave in a human-like manner, with the aim of gaining insights into computational mechanisms in the human brain.
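The restless four-armed bandit task referred to above can be summarized concisely: on every trial the learner picks one of four arms, and each arm's expected payoff drifts across trials according to an independent Gaussian random walk. The sketch below is a minimal, hypothetical Python implementation of such an environment; the walk parameters (decay toward a center, diffusion noise, payoff noise) are illustrative assumptions, not the exact values used in the study.

```python
import numpy as np

class RestlessBandit:
    """Restless four-armed bandit with Gaussian random-walk payoffs.

    Illustrative parameters only; the original task's settings may differ.
    """

    def __init__(self, n_arms=4, n_trials=300, decay=0.9836, decay_center=50.0,
                 diffusion_sd=2.8, payoff_sd=4.0, seed=None):
        self.rng = np.random.default_rng(seed)
        self.n_arms = n_arms
        self.n_trials = n_trials
        self.decay = decay                # pull of each latent mean back toward the center
        self.decay_center = decay_center
        self.diffusion_sd = diffusion_sd  # trial-to-trial drift of the latent means
        self.payoff_sd = payoff_sd        # observation noise around the current mean
        self.means = self.rng.uniform(20, 80, size=n_arms)  # latent payoff means
        self.t = 0

    def step(self, action):
        """Return a noisy payoff for the chosen arm and advance the random walk."""
        reward = self.rng.normal(self.means[action], self.payoff_sd)
        # Decaying Gaussian random walk commonly used in restless bandit studies.
        self.means = (self.decay * self.means
                      + (1 - self.decay) * self.decay_center
                      + self.rng.normal(0.0, self.diffusion_sd, size=self.n_arms))
        self.t += 1
        done = self.t >= self.n_trials
        return reward, done

# Example: a random policy interacting with the task.
env = RestlessBandit(seed=0)
total, done = 0.0, False
while not done:
    reward, done = env.step(env.rng.integers(env.n_arms))
    total += reward
```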


References
Article
Full-text available
A key feature of animal and human decision-making is to balance the exploration of unknown options for information gain (directed exploration) versus selecting known options for immediate reward (exploitation), which is often examined using restless bandit tasks. Recurrent neural network models (RNNs) have recently gained traction in both human and systems neuroscience work on reinforcement learning, due to their ability to show meta-learning of task domains. Here we comprehensively compared the performance of a range of RNN architectures as well as human learners on restless four-armed bandit problems. The best-performing architecture (LSTM network with computation noise) exhibited human-level performance. Computational modeling of behavior first revealed that both human and RNN behavioral data contain signatures of higher-order perseveration, i.e., perseveration beyond the last trial, but this effect was more pronounced in RNNs. In contrast, human learners, but not RNNs, exhibited a positive effect of uncertainty on choice probability (directed exploration). RNN hidden unit dynamics revealed that exploratory choices were associated with a disruption of choice predictive signals during states of low state value, resembling a win-stay-lose-shift strategy, and resonating with previous single unit recording findings in monkey prefrontal cortex. Our results highlight both similarities and differences between exploration behavior as it emerges in meta-learning RNNs, and computational mechanisms identified in cognitive and systems neuroscience work.
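
The computational modeling described here typically combines a trial-by-trial value estimate with an uncertainty bonus (directed exploration), an inverse temperature (random exploration), and a perseveration term. The sketch below illustrates one common form of such a model; the Kalman-filter learning rule and the parameter names (beta, phi, rho) are generic conventions used for illustration, not the specific model fitted in the cited work.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def choice_probabilities(mu, sigma, prev_choice, beta=0.3, phi=1.0, rho=2.0):
    """Softmax choice rule with an uncertainty bonus and perseveration.

    mu, sigma  : posterior mean and SD of each arm's payoff (e.g., from a Kalman filter)
    prev_choice: index of the previously chosen arm (perseveration target)
    beta       : inverse temperature (random exploration shrinks as beta grows)
    phi        : uncertainty bonus weight (directed exploration)
    rho        : bonus for repeating the last choice (perseveration)
    """
    bonus = np.zeros_like(mu)
    bonus[prev_choice] += rho
    return softmax(beta * (mu + phi * sigma) + bonus)

def kalman_update(mu, var, choice, reward, payoff_var=16.0,
                  diffusion_var=7.84, decay=0.9836, decay_center=50.0):
    """One-trial Kalman-filter update for the chosen arm, then diffuse all arms."""
    mu, var = mu.copy(), var.copy()
    gain = var[choice] / (var[choice] + payoff_var)
    mu[choice] += gain * (reward - mu[choice])
    var[choice] *= (1.0 - gain)
    # All arms drift between trials, so uncertainty grows for unchosen arms.
    mu = decay * mu + (1 - decay) * decay_center
    var = decay ** 2 * var + diffusion_var
    return mu, var
```
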
Article
Full-text available
Reinforcement learning is a core facet of motivation and alterations have been associated with various mental disorders. To build better models of individual learning, repeated measurement of value-based decision-making is crucial. However, the focus on lab-based assessment of reward learning has limited the number of measurements and the test-retest reliability of many decision-related parameters is therefore unknown. In this paper, we present an open-source cross-platform application Influenca that provides a novel reward learning task complemented by ecological momentary assessment (EMA) of current mental and physiological states for repeated assessment over weeks. In this task, players have to identify the most effective medication by integrating reward values with changing probabilities to win (according to random Gaussian walks). Participants can complete up to 31 runs with 150 trials each. To encourage replay, in-game screens provide feedback on the progress. Using an initial validation sample of 384 players (9729 runs), we found that reinforcement learning parameters such as the learning rate and reward sensitivity show poor to fair intra-class correlations (ICC: 0.22–0.53), indicating substantial within- and between-subject variance. Notably, items assessing the psychological state showed comparable ICCs as reinforcement learning parameters. To conclude, our innovative and openly customizable app framework provides a gamified task that optimizes repeated assessments of reward learning to better quantify intra- and inter-individual differences in value-based decision-making over time.
Article
Full-text available
What drives children to explore and learn when external rewards are uncertain or absent? Across three studies, we tested whether information gain itself acts as an internal reward and suffices to motivate children's actions. We measured 24–56‐month‐olds' persistence in a game where they had to search for an object (animal or toy), which they never find, hidden behind a series of doors, manipulating the degree of uncertainty about which specific object was hidden. We found that children were more persistent in their search when there was higher uncertainty, and therefore, more information to be gained with each action, highlighting the importance of research on artificial intelligence to invest in curiosity‐driven algorithms. Research Highlights Across three studies, we tested whether information gain itself acts as an internal reward and suffices to motivate preschoolers' actions. We measured preschoolers' persistence when searching for an object behind a series of doors, manipulating the uncertainty about which specific object was hidden. We found that preschoolers were more persistent when there was higher uncertainty, and therefore, more information to be gained with each action. Our results highlight the importance of research on artificial intelligence to invest in curiosity‐driven algorithms.
Article
Full-text available
Researchers across cognitive, neuro- and computer sciences increasingly reference ‘human-like’ artificial intelligence and ‘neuroAI’. However, the scope and use of the terms are often inconsistent. Contributed research ranges widely from mimicking behaviour, to testing machine learning methods as neurally plausible hypotheses at the cellular or functional levels, or solving engineering problems. However, it cannot be assumed nor expected that progress on one of these three goals will automatically translate to progress in others. Here, a simple rubric is proposed to clarify the scope of individual contributions, grounded in their commitments to human-like behaviour, neural plausibility or benchmark/engineering/computer science goals. This is clarified using examples of weak and strong neuroAI and human-like agents, and discussing the generative, corroborative and corrective ways in which the three dimensions interact with one another. The author maintains that future progress in artificial intelligence will need strong interactions across the disciplines, with iterative feedback loops and meticulous validity tests—leading to both known and yet-unknown advances that may span decades to come. This article is part of a discussion meeting issue ‘New approaches to 3D vision’.
Article
Full-text available
Gambling disorder (GD) is a behavioral addiction associated with impairments in value-based decision-making and behavioral flexibility and might be linked to changes in the dopamine system. Maximizing long-term rewards requires a flexible tradeoff between the exploitation of known options and the exploration of novel options for information gain. This exploration-exploitation trade-off is thought to depend on dopamine neurotransmission. We hypothesized that human gamblers would show a reduction in directed (uncertainty-based) exploration, accompanied by changes in brain activity in a fronto-parietal exploration-related network. Twenty-three frequent, non-treatment seeking gamblers and twenty-three healthy matched controls (all male) performed a four-armed bandit task during functional magnetic resonance imaging (fMRI). Computational modeling using hierarchical Bayesian parameter estimation revealed signatures of directed exploration, random exploration, and perseveration in both groups. Gamblers showed a reduction in directed exploration, whereas random exploration and perseveration were similar between groups. Neuroimaging revealed no evidence for group differences in neural representations of basic task variables (expected value, prediction errors). Our hypothesis of reduced frontal pole (FP) recruitment in gamblers was not supported. Exploratory analyses showed that during directed exploration, gamblers showed reduced parietal cortex and substantia-nigra/ventral-tegmental-area activity. Cross-validated classification analyses revealed that connectivity in an exploration-related network was predictive of group status, suggesting that connectivity patterns might be more predictive of problem gambling than univariate effects. Findings reveal specific reductions of strategic exploration in gamblers that might be linked to altered processing in a fronto-parietal network and/or changes in dopamine neurotransmission implicated in GD.
Article
Full-text available
Explore-exploit decisions require us to trade off the benefits of exploring unknown options to learn more about them, with exploiting known options, for immediate reward. Such decisions are ubiquitous in nature, but from a computational perspective, they are notoriously hard. There is therefore much interest in how humans and animals make these decisions and recently there has been an explosion of research in this area. Here we provide a biased and incomplete snapshot of this field focusing on the major finding that many organisms use two distinct strategies to solve the explore-exploit dilemma: a bias for information (‘directed exploration’) and the randomization of choice (‘random exploration’). We review evidence for the existence of these strategies, their computational properties, their neural implementations, as well as how directed and random exploration vary over the lifespan. We conclude by highlighting open questions in this field that are ripe to both explore and exploit.
Article
Full-text available
Bayesian inference for rank-order problems is frustrated by the absence of an explicit likelihood function. This hurdle can be overcome by assuming a latent normal representation that is consistent with the ordinal information in the data: the observed ranks are conceptualized as an impoverished reflection of an underlying continuous scale, and inference concerns the parameters that govern the latent representation. We apply this generic data-augmentation method to obtain Bayes factors for three popular rank-based tests: the rank sum test, the signed rank test, and Spearman's ρ.
Article
Full-text available
Over the past 20 years, neuroscience research on reward-based learning has converged on a canonical model, under which the neurotransmitter dopamine 'stamps in' associations between situations, actions and rewards by modulating the strength of synaptic connections between neurons. However, a growing number of recent findings have placed this standard model under strain. We now draw on recent advances in artificial intelligence to introduce a new theory of reward-based learning. Here, the dopamine system trains another part of the brain, the prefrontal cortex, to operate as its own free-standing learning system. This new perspective accommodates the findings that motivated the standard model, but also deals gracefully with a wider range of observations, providing a fresh foundation for future research.
Article
Full-text available
Bayesian hypothesis testing presents an attractive alternative to p value hypothesis testing. Part I of this series outlined several advantages of Bayesian hypothesis testing, including the ability to quantify evidence and the ability to monitor and update this evidence as data come in, without the need to know the intention with which the data were collected. Despite these and other practical advantages, Bayesian hypothesis tests are still reported relatively rarely. An important impediment to the widespread adoption of Bayesian tests is arguably the lack of user-friendly software for the run-of-the-mill statistical problems that confront psychologists for the analysis of almost every experiment: the t-test, ANOVA, correlation, regression, and contingency tables. In Part II of this series we introduce JASP (http://www.jasp-stats.org), an open-source, cross-platform, user-friendly graphical software package that allows users to carry out Bayesian hypothesis tests for standard statistical problems. JASP is based in part on the Bayesian analyses implemented in Morey and Rouder’s BayesFactor package for R. Armed with JASP, the practical advantages of Bayesian hypothesis testing are only a mouse click away.
Article
Full-text available
Although models of exploratory decision making implicate a suite of strategies that guide the pursuit of information, the developmental emergence of these strategies remains poorly understood. This study takes an interdisciplinary perspective, merging computational decision making and developmental approaches to characterize age-related shifts in exploratory strategy from adolescence to young adulthood. Participants were 149 12-28-year-olds who completed a computational explore-exploit paradigm that manipulated reward value, information value, and decision horizon (i.e., the utility that information holds for future choices). Strategic directed exploration, defined as information seeking selective for long time horizons, emerged during adolescence and maintained its level through early adulthood. This age difference was partially driven by adolescents valuing immediate reward over new information. Strategic random exploration, defined as stochastic choice behavior selective for long time horizons, was invoked at comparable levels over the age range, and predicted individual differences in attitudes toward risk taking in daily life within the adolescent portion of the sample. Collectively, these findings reveal an expansion of the diversity of strategic exploration over development, implicate distinct mechanisms for directed and random exploratory strategies, and suggest novel mechanisms for adolescent-typical shifts in decision making.
Article
Full-text available
Leave-one-out cross-validation (LOO) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values. LOO and WAIC have various advantages over simpler estimates of predictive error such as AIC and DIC but are less used in practice because they involve additional computational steps. Here we lay out fast and stable computations for LOO and WAIC that can be performed using existing simulation draws. We introduce an efficient computation of LOO using Pareto-smoothed importance sampling (PSIS), a new procedure for regularizing importance weights. Although WAIC is asymptotically equal to LOO, we demonstrate that PSIS-LOO is more robust in the finite case with weak priors or influential observations. As a byproduct of our calculations, we also obtain approximate standard errors for estimated predictive errors and for comparison of predictive errors between two models. We implement the computations in an R package called loo and demonstrate using models fit with the Bayesian inference package Stan.
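
At its core, LOO by importance sampling reweights existing posterior draws so that each observation's own likelihood is "removed". The snippet below is a bare-bones NumPy illustration using raw importance ratios; it deliberately omits the Pareto-smoothing (PSIS) step that is the paper's actual contribution and that the loo R package implements, so it is a didactic sketch rather than a substitute for that package.

```python
import numpy as np
from scipy.special import logsumexp

def elpd_loo_raw(log_lik):
    """Naive importance-sampling LOO (no Pareto smoothing).

    log_lik: array of shape (S, N) with pointwise log-likelihoods
             log p(y_i | theta_s) for S posterior draws and N observations.
    Returns the total and pointwise expected log predictive density estimates.
    """
    S = log_lik.shape[0]
    # For each observation i: p(y_i | y_-i) ~= 1 / mean_s[ 1 / p(y_i | theta_s) ]
    loo_i = np.log(S) - logsumexp(-log_lik, axis=0)
    return loo_i.sum(), loo_i

# Example with fake log-likelihoods from 2000 draws and 50 observations.
rng = np.random.default_rng(1)
fake_log_lik = rng.normal(-1.0, 0.2, size=(2000, 50))
elpd, pointwise = elpd_loo_raw(fake_log_lik)
```
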
Article
Full-text available
This article provides a Bayes factor approach to multiway analysis of variance (ANOVA) that allows researchers to state graded evidence for effects or invariances as determined by the data. ANOVA is conceptualized as a hierarchical model where levels are clustered within factors. The development is comprehensive in that it includes Bayes factors for fixed and random effects and for within-subjects, between-subjects, and mixed designs. Different model construction and comparison strategies are discussed, and an example is provided. We show how Bayes factors may be computed with the BayesFactor package in R and with the JASP statistical package.
Article
Full-text available
Analysis of variance (ANOVA), the workhorse analysis of experimental designs, consists of F-tests of main effects and interactions. Yet, testing, including traditional ANOVA, has been recently critiqued on a number of theoretical and practical grounds. In light of these critiques, model comparison and model selection serve as an attractive alternative. Model comparison differs from testing in that one can support a null or nested model vis-a-vis a more general alternative by penalizing more flexible models. We argue this ability to support more simple models allows for more nuanced theoretical conclusions than provided by traditional ANOVA F-tests. We provide a model comparison strategy and show how ANOVA models may be reparameterized to better address substantive questions in data analysis.
Article
Full-text available
All adaptive organisms face the fundamental tradeoff between pursuing a known reward (exploitation) and sampling lesser-known options in search of something better (exploration). Theory suggests at least two strategies for solving this dilemma: a directed strategy in which choices are explicitly biased toward information seeking, and a random strategy in which decision noise leads to exploration by chance. In this work we investigated the extent to which humans use these two strategies. In our "Horizon task," participants made explore-exploit decisions in two contexts that differed in the number of choices that they would make in the future (the time horizon). Participants were allowed to make either a single choice in each game (horizon 1), or 6 sequential choices (horizon 6), giving them more opportunity to explore. By modeling the behavior in these two conditions, we were able to measure exploration-related changes in decision making and quantify the contributions of the two strategies to behavior. We found that participants were more information seeking and had higher decision noise with the longer horizon, suggesting that humans use both strategies to solve the exploration-exploitation dilemma. We thus conclude that both information seeking and choice variability can be controlled and put to use in the service of exploration.
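
In this framework, directed and random exploration are typically read off two parameters of a simple choice rule applied to the first free choice: an information bonus that grows with the horizon (directed exploration) and a decision-noise term that also grows with the horizon (random exploration). The function below is a hedged illustration of that logic; the logistic form, variable names, and numbers are placeholders, not the authors' fitted model.

```python
import numpy as np

def p_choose_right(mean_left, mean_right, info_right, info_bonus, decision_noise, bias=0.0):
    """Probability of choosing the right option on the first free choice.

    info_right     : +1 if the right option is the more informative (less sampled) one,
                     -1 if the left option is, 0 if both were sampled equally often.
    info_bonus     : directed exploration (larger for longer horizons).
    decision_noise : random exploration (larger for longer horizons).
    """
    dq = (mean_right - mean_left) + info_bonus * info_right + bias
    return 1.0 / (1.0 + np.exp(-dq / decision_noise))

# Same reward difference, but more exploration of the informative option with a long horizon.
short_horizon = p_choose_right(50, 45, info_right=+1, info_bonus=2.0, decision_noise=4.0)
long_horizon = p_choose_right(50, 45, info_right=+1, info_bonus=8.0, decision_noise=9.0)
```
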
Article
Full-text available
Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
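
A concrete example of a simple policy that achieves logarithmic regret uniformly over time is UCB1: play each arm once, then always pick the arm with the largest empirical mean plus a confidence bonus that shrinks as that arm is sampled. The sketch below follows that recipe on a generic Bernoulli bandit; the specific arm probabilities are illustrative.

```python
import numpy as np

def ucb1(arm_probs, n_trials=10_000, seed=0):
    """UCB1 on a Bernoulli bandit: empirical mean plus sqrt(2 ln t / n_a) bonus."""
    rng = np.random.default_rng(seed)
    k = len(arm_probs)
    counts = np.zeros(k)
    means = np.zeros(k)
    total_reward = 0.0
    for t in range(1, n_trials + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            ucb = means + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        r = float(rng.random() < arm_probs[arm])
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
        total_reward += r
    regret = n_trials * max(arm_probs) - total_reward
    return regret

# Regret should grow roughly logarithmically in the number of plays.
print(ucb1([0.2, 0.5, 0.7]))
```
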
Article
Full-text available
The computational framework of reinforcement learning has been used to forward our understanding of the neural mechanisms underlying reward learning and decision-making behavior. It is known that humans vary widely in their performance in decision-making tasks. Here, we used a simple four-armed bandit task in which subjects are almost evenly split into two groups on the basis of their performance: those who do learn to favor choice of the optimal action and those who do not. Using models of reinforcement learning we sought to determine the neural basis of these intrinsic differences in performance by scanning both groups with functional magnetic resonance imaging. We scanned 29 subjects while they performed the reward-based decision-making task. Our results suggest that these two groups differ markedly in the degree to which reinforcement learning signals in the striatum are engaged during task performance. While the learners showed robust prediction error signals in both the ventral and dorsal striatum during learning, the nonlearner group showed a marked absence of such signals. Moreover, the magnitude of prediction error signals in a region of dorsal striatum correlated significantly with a measure of behavioral performance across all subjects. These findings support a crucial role of prediction error signals, likely originating from dopaminergic midbrain neurons, in enabling learning of action selection preferences on the basis of obtained rewards. Thus, spontaneously observed individual differences in decision making performance demonstrate the suggested dependence of this type of learning on the functional integrity of the dopaminergic striatal system in humans.
Article
Full-text available
Any nonassociative reinforcement learning algorithm can be viewed as a method for performing function optimization through (possibly noise-corrupted) sampling of function values. We describe the results of simulations in which the optima of several deterministic functions studied by Ackley (1987) were sought using variants of REINFORCE algorithms (Williams, 1987; 1988). Some of the algorithms used here incorporated additional heuristic features resembling certain aspects of some of the algorithms used in Ackley's studies. Differing levels of performance were achieved by the various algorithms investigated, but a number of them performed at a level comparable to the best found in Ackley's studies on a number of the tasks, in spite of their simplicity. One of these variants, called REINFORCE/MENT, represents a novel but principled approach to reinforcement learning in nontrivial networks which incorporates an entropy maximization strategy. This was found to perform especially well on more hierarchically organized tasks.
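
For a stateless problem, the REINFORCE update for a softmax policy has a simple closed form, and the REINFORCE/MENT variant adds the gradient of the policy entropy to encourage exploration. The snippet below is an illustrative reconstruction of that update on a toy Bernoulli bandit, not the original simulation code; the learning rates and entropy weight are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_ment(arm_probs, n_trials=5000, lr=0.1, entropy_weight=0.01, seed=0):
    """REINFORCE with a running baseline plus entropy maximization on a bandit."""
    rng = np.random.default_rng(seed)
    k = len(arm_probs)
    logits = np.zeros(k)
    baseline = 0.0
    for _ in range(n_trials):
        pi = softmax(logits)
        a = rng.choice(k, p=pi)
        r = float(rng.random() < arm_probs[a])
        baseline += 0.05 * (r - baseline)            # running-average reward baseline
        grad_logpi = -pi
        grad_logpi[a] += 1.0                         # d/dz log pi(a) = onehot(a) - pi
        entropy = -np.sum(pi * np.log(pi + 1e-12))
        grad_entropy = -pi * (np.log(pi + 1e-12) + entropy)  # d/dz of policy entropy
        logits += lr * (r - baseline) * grad_logpi + entropy_weight * grad_entropy
    return softmax(logits)

# The learned policy should concentrate on the best arm while retaining some entropy.
print(reinforce_ment([0.2, 0.5, 0.8]))
```
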
Preprint
Natural environments often exhibit various degrees of volatility, ranging from slowly changing to rapidly changing contingencies. How learners adapt to changing environments is a central issue in both reinforcement learning theory and psychology. For example, learners may adapt to changes in volatility by increasing learning if volatility increases, and reducing it if volatility decreases. Computational neuroscience and neural network modeling work suggests that this adaptive capacity may result from a meta-reinforcement learning process (implemented for example in the prefrontal cortex), where past experience endows the system with the capacity to rapidly adapt to environmental conditions in new situations. Here we provide direct evidence for a meta-reinforcement learning account of adaptation to environmental volatility. Recurrent neural networks (RNNs) were trained on a restless four-armed bandit reinforcement learning problem under three different training regimes (low volatility training only, medium volatility training only, or meta-volatility training across a range of volatilities). We show that, in contrast to RNNs trained in the low volatility regime, RNNs trained under the medium or meta-volatility regimes exhibited superior adaptation to environmental volatility. This was reflected in better performance, and computational modeling of the network’s behavior revealed a more adaptive adjustment of learning and exploration to varying levels of volatility during test. Results extend the meta-RL account to volatility adaptation and confirm past experience as a crucial factor of adaptive decision-making.
Article
The emergence of powerful artificial intelligence (AI) is defining new research directions in neuroscience. To date, this research has focused largely on deep neural networks trained using supervised learning in tasks such as image classification. However, there is another area of recent AI work that has so far received less attention from neuroscientists but that may have profound neuroscientific implications: deep reinforcement learning (RL). Deep RL offers a comprehensive framework for studying the interplay among learning, representation, and decision making, offering to the brain sciences a new set of research tools and a wide range of novel hypotheses. In the present review, we provide a high-level introduction to deep RL, discuss some of its initial applications to neuroscience, and survey its wider implications for research on brain and behavior, concluding with a list of opportunities for next-stage research.
Article
How do humans search for rewards? This question is commonly studied using multi-armed bandit tasks, which require participants to trade off exploration and exploitation. Standard multi-armed bandits assume that each option has an independent reward distribution. However, learning about options independently is unrealistic, since in the real world options often share an underlying structure. We study a class of structured bandit tasks, which we use to probe how generalization guides exploration. In a structured multi-armed bandit, options have a correlation structure dictated by a latent function. We focus on bandits in which rewards are linear functions of an option's spatial position. Across 5 experiments, we find evidence that participants utilize functional structure to guide their exploration, and also exhibit a learning-to-learn effect across rounds, becoming progressively faster at identifying the latent function. Our experiments rule out several heuristic explanations and show that the same findings obtain with non-linear functions. Comparing several models of learning and decision making, we find that the best model of human behavior in our tasks combines three computational mechanisms: (1) function learning, (2) clustering of reward distributions across rounds, and (3) uncertainty-guided exploration. Our results suggest that human reinforcement learning can utilize latent structure in sophisticated ways to improve efficiency.
Article
Bayes/frequentist correspondences between the p-value and the posterior probability of the null hypothesis have been studied in univariate hypothesis testing situations. This paper extends these comparisons to multiple testing and in particular to the Bonferroni multiple testing method, in which p-values are adjusted by multiplying by k, the number of tests considered. In the Bayesian setting, prior assessments may need to be adjusted to account for multiple hypotheses, resulting in corresponding adjustments to the posterior probabilities. Conditions are given for which the adjusted posterior probabilities roughly correspond to Bonferroni adjusted p-values.
Article
Bayes factors have been advocated as superior to p-values for assessing statistical evidence in data. Despite the advantages of Bayes factors and the drawbacks of p-values, inference by p-values is still nearly ubiquitous. One impediment to the adoption of Bayes factors is a lack of practical development, particularly a lack of ready-to-use formulas and algorithms. In this paper, we discuss and expand a set of default Bayes factor tests for ANOVA designs. These tests are based on multivariate generalizations of Cauchy priors on standardized effects, and have the desirable properties of being invariant with respect to linear transformations of measurement units. Moreover, these Bayes factors are computationally convenient, and straightforward sampling algorithms are provided. We cover models with fixed, random, and mixed effects, including random interactions, and do so for within-subject, between-subject, and mixed designs. We extend the discussion to regression models with continuous covariates. We also discuss how these Bayes factors may be applied in nonlinear settings, and show how they are useful in differentiating between the power law and the exponential law of skill acquisition. In sum, the current development makes the computation of Bayes factors straightforward for the vast majority of designs in experimental psychology.
Article
The simple, but general formal theory of fun and intrinsic motivation and creativity (1990-2010) is based on the concept of maximizing intrinsic reward for the active creation or discovery of novel, surprising patterns allowing for improved prediction or data compression. It generalizes the traditional field of active learning, and is related to old, but less formal ideas in aesthetics theory and developmental psychology. It has been argued that the theory explains many essential aspects of intelligence including autonomous development, science, art, music, and humor. This overview first describes theoretically optimal (but not necessarily practical) ways of implementing the basic computational principles on exploratory, intrinsically motivated agents or robots, encouraging them to provoke event sequences exhibiting previously unknown, but learnable algorithmic regularities. Emphasis is put on the importance of limited computational resources for online prediction and compression. Discrete and continuous time formulations are given. Previous practical, but nonoptimal implementations (1991, 1995, and 1997-2002) are reviewed, as well as several recent variants by others (2005-2010). A simplified typology addresses current confusion concerning the precise nature of intrinsic motivation.
Article
How do individuals decide to act based on a rewarding status quo versus an unexplored choice that might yield a better outcome? Recent evidence suggests that individuals may strategically explore as a function of the relative uncertainty about the expected value of options. However, the neural mechanisms supporting uncertainty-driven exploration remain underspecified. The present fMRI study scanned a reinforcement learning task in which participants stop a rotating clock hand in order to win points. Reward schedules were such that expected value could increase, decrease, or remain constant with respect to time. We fit several mathematical models to subject behavior to generate trial-by-trial estimates of exploration as a function of relative uncertainty. These estimates were used to analyze our fMRI data. Results indicate that rostrolateral prefrontal cortex tracks trial-by-trial changes in relative uncertainty, and this pattern distinguished individuals who rely on relative uncertainty for their exploratory decisions versus those who do not.
Article
In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion.
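
For reference, the widely applicable information criterion discussed here is usually computed from the same pointwise log-likelihood matrix as LOO; the commonly used form (with S posterior draws and n observations) is:

```latex
\mathrm{lppd} = \sum_{i=1}^{n} \log\!\Big(\tfrac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta^{(s)})\Big), \qquad
p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \operatorname{Var}_{s}\!\big[\log p(y_i \mid \theta^{(s)})\big], \qquad
\mathrm{WAIC} = -2\,\big(\mathrm{lppd} - p_{\mathrm{WAIC}}\big)
```
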
Article
Following damage to the ventromedial prefrontal cortex, humans develop a defect in real-life decision-making, which contrasts with otherwise normal intellectual functions. Currently, there is no neuropsychological probe to detect this impairment in the laboratory, and the cognitive and neural mechanisms responsible for this defect have resisted explanation. Here, using a novel task which simulates real-life decision-making in the way it factors uncertainty of premises and outcomes, as well as reward and punishment, we find that prefrontal patients, unlike controls, are oblivious to the future consequences of their actions, and seem to be guided by immediate prospects only. This finding offers, for the first time, the possibility of detecting these patients' elusive impairment in the laboratory, measuring it, and investigating its possible causes.
Agrawal, S., & Goyal, N. (2012). Analysis of Thompson Sampling for the Multi-armed Bandit Problem. JMLR: Workshop and Conference Proceedings, 23, 39.1-39.26.
Binz, M., & Schulz, E. (2022). Modeling Human Exploration Through Resource-Rational Reinforcement Learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K.
Brändle, F., Wu, C. M., & Schulz, E. (2020). What Are We Curious about? Trends in Cognitive Sciences.
Mnih, V., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. International Conference on Machine Learning (Vol. 48, pp. 1928-1937). PMLR. https://proceedings.mlr.press/v48/mniha16.html
Payzan-LeNestour, E., & Bossaerts, P. (2012). Do not Bet on the Unknown Versus Try to Find Out More: Estimation Uncertainty and "Unexpected Uncertainty" Both Modulate Exploration. Frontiers in Neuroscience, 6. https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2012.
Song, H. F., Yang, G. R., & Wang, X.-J. (2017). Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife, 6, e21492. https://doi.org/10.7554/eLife.21492
Stan Development Team. (2023). RStan: The R interface to Stan. R package version 2.32.3 [Computer software].
Stan Development Team. (2024). Stan Reference Manual (Version 2.35). https://mc-stan.org/docs/reference-manual/
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., & Botvinick, M. (2017). Learning to reinforcement learn (arXiv:1611.05763). arXiv. http://arxiv.org/abs/1611.05763
Zhang, L., Lengersdorff, L., Mikus, N., Gläscher, J., & Lamm, C. (2020). Using reinforcement learning models in social neuroscience: Frameworks, pitfalls and suggestions of best practices. Social Cognitive and Affective Neuroscience, 15(6). https://doi.org/10.1093/scan/nsaa089