
Adaptive properties of differential learning rates for positive and negative outcomes


Biological Cybernetics manuscript No.
(will be inserted by the editor)
Romain D. Cazé · Matthijs A. A. van der Meer
Received: date / Accepted: date
RC is supported by a Marie Curie initial training fellowship. MvdM is supported by the
Natural Sciences and Engineering Research Council of Canada (NSERC).
Romain Cazé
Department of Bioengineering
Imperial College
London, England
Matthijs van der Meer
Department of Biology and Centre for Theoretical Neuroscience
University of Waterloo, Canada
Abstract The concept of the reward prediction error (the difference between
reward obtained and reward predicted) continues to be a focal point for much
theoretical and experimental work in psychology, cognitive science, and
neuroscience. Models that rely on reward prediction errors typically assume a single
learning rate for positive and negative prediction errors. However, behavioral
data indicate that better-than-expected and worse-than-expected outcomes often
do not have symmetric impacts on learning and decision-making. Furthermore,
distinct circuits within cortico-striatal loops appear to support learning
from positive and negative prediction errors respectively. Such differential
learning rates would be expected to lead to biased reward predictions, and
therefore suboptimal choice performance. Contrary to this intuition, we show
that on static “bandit” choice tasks, differential learning rates can be adaptive.
This occurs because asymmetric learning enables a better separation of
learned reward probabilities. We show analytically how the optimal learning
rate asymmetry depends on the reward distribution, and implement a biologically
plausible algorithm that adapts the balance of positive and negative
learning rates from experience. These results suggest specific adaptive advantages
for separate, differential learning rates in simple reinforcement learning
settings, and provide a novel, normative perspective on the interpretation of
associated neural data.
Keywords Reinforcement learning · Reward prediction error · Decision-making ·
Meta-learning · Basal ganglia
1 Introduction
A central element in the major theories of reinforcement learning is the reward
prediction error (RPE) which, in its simplest form, is the difference between
the amount of received and expected reward [38]. Thus, a positive RPE is gen-
erated when more is received than expected, and conversely, negative RPEs
occur when less is received than expected. Proposals based on RPEs can ex-
plain salient features of Pavlovian and instrumental learning [32] and generalize
to powerful algorithms that can learn about states and actions which are not
themselves rewarded, but result in reward later (e.g. temporal-difference re-
inforcement learning, [37]). Essentially, these algorithms attempt to estimate
from experience the expected reward resulting from states and actions, so that
the amount of reward obtained can be maximized.
RPEs appear to be encoded by the neural activity of several brain areas
known to support learning and motivated behavior, such as a population of
dopaminergic neurons in the ventral tegmental area (VTA) [2]. Projection
targets of these neurons encode action and state values [39], and reinforcement
learning model fits to behavioral and neural data can account for trial-by-trial
changes in both [29]. Taken together, these findings support the familiar notion
that circuits centered on the dopaminergic modulation of activity in cortico-
striatal loops implement a reinforcement learning system. Thus, reinforcement
learning models provide an explicit, computational account of the relationship
between RPEs, estimates of expected reward, and subsequent behavior. This
account is a powerful framework for understanding the neural basis of learning
and decision-making in mechanistic, biological detail.
Many reinforcement learning models implicitly assume that positive and
negative RPEs impact estimation of action and state values through a com-
mon gain factor or learning rate [38,29]. However, converging evidence in-
dicates that learning from positive and negative feedback, including positive
and negative RPEs, in fact relies on dissociable mechanisms in the brain. De-
pending on the specific setting, these distinct mechanisms are referred to as
approach/avoid, go/noGo, or direct/indirect pathways in the basal ganglia,
associated with predominantly D1 and D2-expressing projection neurons in
the striatum, respectively [16,23]. This neural separation of the direct and in-
direct pathways raises the possibility that distinct, differential learning rates
are associated with positive and negative RPEs.
In support of this idea, a number of studies have reported behavioral evi-
dence for differential learning rates in humans [15,13]. From a reinforcement
learning perspective, differential learning rates imply biased estimates of the
amount of expected reward, which in turn can lead to suboptimal decisions.
This “irrationality” notion is congruent with observations from psychology
and behavioral economics. For instance, a recent study examined the extent
to which subjects updated their estimates of the likelihood of various events
happening to them (e.g. getting cancer, winning the lottery) after being told the true
probabilities [35]. Strikingly, it was found that subjects tended to update their
estimates more if the odds were better (in their favour) than they thought.
Such differential updating could underlie the formation of a so-called “opti-
mism bias” [34] and has also been proposed as contributing to other biases
such as risk aversion in mean-variance learners [26,5].
Such behavioral evidence for asymmetric updating following positive (better
than expected) and negative (worse than expected) outcomes invites normative
questions. Are these biases suboptimal in an absolute sense, but perhaps
the result of limited cognitive resources? Or alternatively, are they optimized
for situations different from those tested? Here, we aim to address these
questions quantitatively in simple reinforcement learning models. We show
that even in simple, static bandit tasks, agents with differential learning rates
can outperform unbiased agents. These results provide a different view on (1)
the interpretation of behavioral results where such biases are often cast as
irrational, and (2) neurophysiological data comparing neural responses associ-
ated with positive and negative prediction errors. Finally, this work suggests
simple and biologically motivated improvements to boost the performance of
commonly used reinforcement learning models.
2 Impact of differential learning rates on value estimation
To model a tractable decision problem where the effects of differential learning
rates for positive and negative RPEs can be explored, we implement agents
that learn the values of actions on probabilistically rewarded “bandit” tasks.
That is, there is only one state with one (this section) or two possible actions
(the next section), the expected value of which the agent learns incrementally
from probabilistic feedback according to a simple delta rule. Unlike the stan-
dard form of typical reinforcement learning algorithms such as Q-learning [40],
however, our agents employ different learning rates for positive and negative
prediction errors $\Delta Q_t = r_t - Q_t$:
$$Q_{t+1} = Q_t + \begin{cases} \alpha^+ \Delta Q_t & \text{if } \Delta Q_t \geq 0 \\ \alpha^- \Delta Q_t & \text{if } \Delta Q_t < 0 \end{cases}$$
$Q_t$, by convention, corresponds to the expected reward of taking a certain
action at time step $t$ (“Q-value”). $r_t$ is the actual reward received at time step
$t$. The usual state and action indices $s, a$ are omitted because in this section
we only consider one state and one action. Unlike typical “symmetric” update
rules, the existence of two learning rates $\alpha^+$ and $\alpha^-$ can lead to biased
Q-values, as we will show below.
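The asymmetric update rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `update_q` and its argument names are assumptions introduced here.

```python
def update_q(q, reward, alpha_pos, alpha_neg):
    """Apply one step of the asymmetric delta rule to a single Q-value.

    alpha_pos is used when the prediction error is non-negative and
    alpha_neg when it is negative (argument names are illustrative).
    """
    delta = reward - q  # prediction error: reward obtained minus reward predicted
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return q + alpha * delta
```

For example, starting from `q = 0.0` with `alpha_pos = 0.4` and `alpha_neg = 0.1`, a reward of +1 moves the estimate to 0.4, whereas a reward of -1 moves it only to -0.1.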
We first derive an analytical expression for the steady-state Q-value for a
single action with a probabilistic, binary outcome. Without loss of generality,
we set the outcomes to $r_t = 1$ and $r_t = -1$, with probabilities $p$ and $1 - p$
respectively. Then, for sufficiently small $\alpha$ and $Q_0 \notin \{-1, 1\}$, if $r_t = 1$ the
outcome is always superior to the predicted $Q$, whereas if $r_t = -1$ the outcome
is always inferior. Thus, the update rule for each trial is the following:
$$Q_{t+1} = Q_t + \begin{cases} \alpha^+ (1 - Q_t) & \text{with probability } p \\ \alpha^- (-1 - Q_t) & \text{with probability } 1 - p \end{cases}$$
At steady state, $\hat{Q}_{t+1} = \hat{Q}_t = Q_\infty$, where $\hat{Q}$ represents the mean $Q$ when
$t \to \infty$. This yields:
$$Q_\infty = Q_\infty + p\,\alpha^+ (1 - Q_\infty) + (1 - p)\,\alpha^- (-1 - Q_\infty)$$
If we define $x$ as the ratio of the learning rate for positive prediction errors
to that for negative prediction errors, $x = \alpha^+ / \alpha^-$, we can rewrite the
previous equation and obtain $Q_\infty$:
$$Q_\infty = \frac{p x - (1 - p)}{p x + (1 - p)} \tag{1}$$
A number of properties of this result are worth noting. When $\alpha^- > \alpha^+$ the
Q-values are “pessimistic”, i.e. they are below the true value; whereas when
$\alpha^+ = \alpha^-$, the steady-state Q-value converges, as expected, to the true mean
value for this action (equal to $2p - 1$; see Figure 1 where $\alpha^+ = \alpha^-$). Moreover,
when $\alpha^+ > \alpha^-$, the Q-values are “optimistic”, i.e. they are above the true
value. This bias is illustrated in Figure 1 where $\alpha^+ = 4\alpha^-$. Note also the
complex dependence of the bias on the true mean value. The bias is lower for
low $Q$ values than for high $Q$ values in the case of a pessimistic agent, while
this is the opposite for an optimistic agent.
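The dependence of the bias on the true mean value can be checked directly from Equation 1. The sketch below assumes the form $Q_\infty = (px - (1-p)) / (px + (1-p))$ with $x = \alpha^+/\alpha^-$; the helper names `steady_state_q` and `bias` are illustrative and not taken from the paper.

```python
def steady_state_q(p, x):
    """Equation 1: mean steady-state Q for reward probability p and ratio x."""
    return (p * x - (1 - p)) / (p * x + (1 - p))


def bias(p, x):
    """Steady-state estimate minus the true mean value 2p - 1."""
    return steady_state_q(p, x) - (2 * p - 1)


# Tabulate the bias for a pessimistic (x < 1), symmetric (x = 1),
# and optimistic (x > 1) learner across several reward probabilities.
for p in (0.1, 0.2, 0.8, 0.9):
    row = ", ".join(f"x={x}: {bias(p, x):+.3f}" for x in (0.25, 1.0, 4.0))
    print(f"p={p}: {row}")
```

With these inputs the bias vanishes for `x = 1`, is positive for `x = 4` and largest at low `p`, and is negative for `x = 0.25` and largest in magnitude at high `p`, matching the pattern described above.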
To introduce our simulation testbed, we first sought to confirm the above
analytical results numerically. The analytical results predicted with high ac-
curacy the behavior of a modified Q-learning algorithm (5000 iterations of
800 trials each; error bars in Figure 1 show the variance across iterations).
Thus, differential learning rates for positive and negative RPEs lead to biased
estimates of true underlying values, in a heterogeneous manner that depends
on the true mean value as well as the learning rate asymmetry. In the next
section, we examine the impact of this distortion on performance in choice
settings (“two-armed bandits”).
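A numerical check of this kind can be sketched as follows. This is an assumed, simplified re-implementation (a single arm, outcomes in {-1, 1}, initial value 0) with fewer runs and trials than the 5000 by 800 reported in the text, chosen only to keep the example fast; the function names are hypothetical.

```python
import random


def steady_state_q(p, x):
    """Equation 1: analytic mean steady-state Q for ratio x = alpha_pos / alpha_neg."""
    return (p * x - (1 - p)) / (p * x + (1 - p))


def simulate_single_arm(p, alpha_pos, alpha_neg, n_trials=200, n_runs=2000, seed=0):
    """Return the final Q-value averaged over independent runs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        q = 0.0  # initial estimate, strictly between -1 and 1
        for _ in range(n_trials):
            r = 1.0 if rng.random() < p else -1.0
            delta = r - q
            q += (alpha_pos if delta >= 0 else alpha_neg) * delta
        total += q
    return total / n_runs
```

Calling `simulate_single_arm(0.2, 0.4, 0.1)` and comparing the result with `steady_state_q(0.2, 4.0)` gives one such comparison for an optimistic learner; the two should agree up to sampling noise.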
Fig. 1 Differential learning rates result in biased estimates of true expected
values. Analytically derived Q-values (black filled circles, $Q_\infty$) for different true
values of $Q$: $-0.8$, $-0.6$, $0.6$, $0.8$ (grey dotted lines) and for different ratios of
$\alpha^+$ and $\alpha^-$. Note that the true Q-values are the means of probabilistic reward
delivery schedules. Error bars, computed using numerical simulations, show the variance
of the estimated Q-values after 800 trials averaged over 5000 runs (the means converge to
the analytically derived values). When the learning rates for negative and positive
prediction errors are equal, the derived Q-values correspond to the true values, whereas
they are distorted when the learning rates are different. A “pessimistic” learner
($\alpha^+/\alpha^- < 1$) underestimates the true values, and an “optimistic” learner
($\alpha^+/\alpha^- > 1$) overestimates them.
3 Impact of differential learning rates on performance
In this section, we consider the consequences of biased estimates resulting from
differential learning rates in simple choice situations. In particular, we simulate
the performance of three Q-learning agents on two different “two-armed bandit”
problems. The “rational” (R) agent has equal learning rates for negative
and positive prediction errors ($\alpha = 0.1$); the optimistic (O) agent has a higher
learning rate for positive prediction errors ($\alpha^+ = 0.4$) than for negative
prediction errors ($\alpha^- = 0.1$), and the pessimistic (P) agent has a higher learning
rate for negative ($\alpha^- = 0.4$) than for positive prediction errors ($\alpha^+ = 0.1$). All
agents use a standard softmax decision rule with fixed temperature $\beta = 0.3$
to decide which arm to choose given the Q-values; we employ this decision
rule here because it is the de facto standard in applications of reinforcement
learning models to behavioral and neural data. We discuss the applicability of
these results for the epsilon-greedy decision rule below.
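A softmax choice rule of this kind can be sketched as below. The text does not spell out the exact parameterization, so this example assumes that the temperature divides the Q-values (some papers instead multiply by an inverse temperature); `softmax_probs` is a hypothetical helper name.

```python
import math


def softmax_probs(q_values, temperature=0.3):
    """Return choice probabilities proportional to exp(Q / temperature)."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

Under this assumption the probability of picking the better arm depends only on the difference between the two Q-values, so a larger separation between estimates translates directly into more frequent choices of the higher-valued arm.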
The three agents are tested on two versions of a difficult two-armed bandit
task, with large variance in the two outcomes but small differences in the
means [38]. For simplicity, we again consider binary outcome distributions
{−1,1}; however, the results that follow are quite general (see Discussion). In
the first, “low-reward” task, the respective probabilities of r= 1 are 0.2 and
0.1; in the other “high-reward” task, the two arms are rewarded with reward
probabilities of 0.9 and 0.8 respectively.
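The two tasks and three agents can be put together in a small simulation. The sketch below is an assumed re-implementation, not the authors' code: it initializes all Q-values at 0, uses a softmax with the temperature as a divisor, and uses fewer runs and trials than the paper to keep it quick. The function name `run_bandit` and its parameters are hypothetical.

```python
import math
import random


def run_bandit(p_arms, alpha_pos, alpha_neg, temperature=0.3,
               n_trials=300, n_runs=400, tail=150, seed=0):
    """Fraction of best-arm choices over the last `tail` trials of each run."""
    rng = random.Random(seed)
    n_arms = len(p_arms)
    best = max(range(n_arms), key=lambda i: p_arms[i])
    hits = 0
    for _ in range(n_runs):
        q = [0.0] * n_arms
        for t in range(n_trials):
            # Softmax action selection (temperature as divisor: an assumption).
            q_max = max(q)
            weights = [math.exp((v - q_max) / temperature) for v in q]
            arm = rng.choices(range(n_arms), weights=weights)[0]
            # Binary outcome in {-1, 1}.
            r = 1.0 if rng.random() < p_arms[arm] else -1.0
            # Asymmetric delta rule applied to the chosen arm only.
            delta = r - q[arm]
            q[arm] += (alpha_pos if delta >= 0 else alpha_neg) * delta
            if t >= n_trials - tail and arm == best:
                hits += 1
    return hits / (n_runs * tail)
```

For instance, `run_bandit((0.1, 0.2), 0.4, 0.1)` simulates the optimistic agent on the low-reward task and `run_bandit((0.8, 0.9), 0.1, 0.4)` the pessimistic agent on the high-reward task; passing `(0.1, 0.1)` as the two learning rates gives the rational agent.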
When simulated on these two tasks, a striking pattern is apparent (Figure
2a): in the low-reward task, the optimistic agent learns to take the best action
significantly more often than the rational agent, which in turn performs better
than the pessimistic agent (left panel). The agents’ performance is reversed
in the high-reward task (right panel). To understand this pattern of results,
recall that reinforcement learners face a trade-off between exploitation and
exploration [38]: choosing to exploit the option with the highest estimated Q-
value does not guarantee that there is not an (insufficiently explored) better
option available. This problem is particularly pernicious for probabilistic re-
wards such as those in the current setting, where only noisy estimates of true
underlying values are available.
Fig. 2 Differential learning rates increase performance in specific tasks. (A) Mean
probability of choosing the best arm (averaged over 5000 runs) for the three agents,
“rational” (R, $\alpha^+ = \alpha^-$, blue line), “optimist” (O, $\alpha^+ > \alpha^-$, green line), and
“pessimist” (P, $\alpha^+ < \alpha^-$, red line). The left plot shows performance on the low-reward
task (0.1 and 0.2 probability of reward for the two choices), the right plot performance on
the high-reward task (0.8 and 0.9 probability). Note that in the low-reward task the
optimistic agent is the best performer and the pessimistic agent the worst, but this pattern
is reversed for the high-reward task. (B) Probability of switching after 800 episodes for
each agent. This probability depends on the task: in the low-reward task the optimistic
agent is the least likely to switch, with the pattern reversed for the high-reward task.
To quantify differences in how the agents navigate this exploration versus
exploitation trade-off in the steady state, we estimated the probability that
an agent repeats the same choice between time tand t+ 1, after extended
learning (800 trials). For both tasks, we expect that agents will “stay” more
than “switch” owing to differences in mean expected reward for both choices.
However, as shown in Figure 2b, the different agents have a distinct probability
of switching, and this difference between agents depends on the task. Higher
probabilities of switching (“exploring”) are associated with lower performance
(compare with Figure 2a).
Why does this occur? As shown by Equation 1 and illustrated in Figure
1, differential learning rates result in biased value estimates on probabilistic
tasks such as the one simulated here. To understand the implications of this for
choice situations, the dependence of this bias on the true mean value is critical.
In the low-reward task, mean values are low, and therefore the distortion of
these values will be high for the “optimistic” agent (see Figure 1a). Conversely,
in this task, distortion will be low for the “pessimistic” agent (Figure 1b). The
implication of this is that distortion of the true values effectively increases
the contrast between the two choices, leading to (1) increased probability of
choosing the best option using a softmax rule, and (2) increased robustness
to random fluctuations in the outcomes; this latter effect does not rely on the
softmax action selection rule and can also occur for different action selection
rules such as $\epsilon$-greedy.
More precisely, for the results in Figure 2, the rational agent tends to
approach the true mean values, with mean final estimates after 800 trials
approximately equal to $Q_1 = -0.63$, $Q_0 = -0.84$ in the low-reward task, and
approximately equal to $Q_1 = 0.78$, $Q_0 = 0.51$ in the high-reward task. These
estimated Q-values are close to the true Q-values, and thus the $\Delta Q$ is close to the
true value of 0.2. The estimated $\Delta Q$ at steady state for a biased agent, due to
the heterogeneous distortion observed in Section 2, is in excess of 0.5. It is
this distortion (separation) which enables higher steady state performance for
the biased agent. Likewise, distortion in the opposite direction (compression)
is the reason for impaired performance. Specifically, in the low-reward task,
the pessimistic agent will have a smaller $\Delta Q$ than the rational agent, because
the pessimistic bias causes saturation as the Q-values approach $-1$, making
the two Q-values closer than the true values, $\Delta Q < 0.06$. For the same reason,
the opposite symmetric observation holds for an optimistic agent in the high-reward
task, where this agent has a $\Delta Q < 0.04$.
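The separation and compression effects can also be read off Equation 1 directly. The sketch below assumes the steady-state formula $Q_\infty = (px - (1-p))/(px + (1-p))$ and evaluates the difference between the two arms for the three learning-rate ratios used in this section; the function names are illustrative. These analytic values describe the idealized steady state and need not coincide with the simulation-based estimates quoted above.

```python
def steady_state_q(p, x):
    """Equation 1: mean steady-state Q for reward probability p and ratio x."""
    return (p * x - (1 - p)) / (p * x + (1 - p))


def delta_q(p_best, p_other, x):
    """Analytic steady-state separation between the better and the worse arm."""
    return steady_state_q(p_best, x) - steady_state_q(p_other, x)


# x = 0.25: pessimistic, x = 1: rational, x = 4: optimistic.
for name, (p_best, p_other) in {"low": (0.2, 0.1), "high": (0.9, 0.8)}.items():
    for x in (0.25, 1.0, 4.0):
        print(f"{name}-reward task, x={x}: delta Q = {delta_q(p_best, p_other, x):.3f}")
```

The ordering of the printed separations mirrors the performance ordering in Figure 2: in the low-reward task the separation grows with `x`, and in the high-reward task it shrinks with `x`.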
In addition to this effect on performance in the steady state, differential
learning rates likely also impact the early stages of learning, when the agent
takes its first few choices. This idea is illustrated by the following intuition:
in the high-reward task, an agent will have a tendency to over-exploit its first
choice, because it is likely that this first choice will be rewarded. In the low-reward
task, an agent will have the opposite tendency: it will over-explore its
environment because it is likely that neither arm will provide a reward for the
first few trials. To the extent that this effect contributes to performance, it
may be mitigated by appropriately differential learning rates.
These observations raise the obvious question: can we find optimal learning
rates which maximize ∆Q? Can we find an agent which maximizes this ∆Q
in both tasks? We explore these issues in the following section.
4 Derivation of optimally differential learning rates
From Equation 1, we can compute $\Delta Q_\infty$ at steady state between the two arms:
$$\Delta Q_\infty = \frac{2 (p_1 q_0 - p_0 q_1)\, x}{p_1 p_0 x^2 + (p_1 q_0 + p_0 q_1) x + q_0 q_1}$$
where the indices correspond to the different choices (bandit arms); the
index of the arm with the highest probability of reward is 1, with 0 indicating
the other arm. $q_i = 1 - p_i$ is the probability of no-reward for arm $i$. We can
determine the $x$ for which this rational function is maximal and thus find the
ratio for which $\Delta Q_\infty$ is maximal:
$$x^* = \sqrt{\frac{q_0 q_1}{p_0 p_1}}$$
In the limit where $p_0 \to p_1$, the optimal $x$ tends to the ratio between
$p(\text{no reward})$ and $p(\text{reward})$. The results in the previous section indicate
that the best steady state performance is achieved when $\Delta Q_\infty$ is maximal;
thus, we can conclude that the best learning rates for positive (resp. negative)
prediction error should be proportional to the rate of no-reward (resp. rate of
reward). Specifically, in the low-reward task the optimal $x$ is equal to 6. In
the high-reward task the optimal $x$ is equal to $1/6$. Thus, an optimal $x$ in a given task
is also the one which leads to the worst performance in the other task, where
the optimal $x$ is the inverse. To solve this issue we introduce an agent with a
plastic learning rate (meta-learner), adaptable to tasks with different reward
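The optimal ratios quoted in this section can be reproduced numerically. The sketch below assumes that the steady-state separation follows from Equation 1 as $\Delta Q_\infty = 2(p_1 q_0 - p_0 q_1)x / (p_1 p_0 x^2 + (p_1 q_0 + p_0 q_1)x + q_0 q_1)$, and that setting its derivative with respect to $x$ to zero gives $x^* = \sqrt{q_0 q_1 / (p_0 p_1)}$; both function names are hypothetical.

```python
import math


def delta_q_inf(p1, p0, x):
    """Steady-state separation between arm 1 (better) and arm 0 (worse)."""
    q1, q0 = 1 - p1, 1 - p0
    numerator = 2 * (p1 * q0 - p0 * q1) * x
    denominator = p1 * p0 * x ** 2 + (p1 * q0 + p0 * q1) * x + q0 * q1
    return numerator / denominator


def optimal_ratio(p1, p0):
    """Ratio x = alpha_pos / alpha_neg that maximizes delta_q_inf."""
    return math.sqrt((1 - p0) * (1 - p1) / (p0 * p1))


print(optimal_ratio(0.2, 0.1))  # low-reward task
print(optimal_ratio(0.9, 0.8))  # high-reward task
```

The two calls recover the ratios 6 and 1/6 stated in the text (up to floating-point rounding), and evaluating `delta_q_inf` on either side of `optimal_ratio` gives a quick check that the separation is indeed largest there.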
5 Meta-learning of optimally differential learning rates
We have shown in the previous section that behavior is optimal when the
learning rate for positive (resp. negative) prediction error corresponds to the
probability of no-reward (resp. reward) in a given task. Thus, here we study
an agent which adapts its learning rate for positive and negative prediction
errors. Given our results, we choose as targets $\alpha^+ \to w \times p(\text{no reward})$ and
$\alpha^- \to w \times p(\text{reward})$, with the reward and no-reward probabilities estimated
by a running average of reward history. The parameter $w$ (set to 0.1 in Figure
3) replaces the standard learning rate parameter $\alpha$ and captures the sensitivity
of learning to the estimated reward rate. Thus, we define a meta-learning agent
which derives separate learning rates for positive and negative prediction errors
from a running estimate of the task reward rate.
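One way such a meta-learning agent could be written is sketched below. It is an illustration under stated assumptions, not the authors' implementation: the reward rate is estimated from counts of rewarded and unrewarded trials, with one pseudo-count each so that the estimate starts at 0.5, and the class and method names are hypothetical.

```python
class MetaLearner:
    """Q-learner whose two learning rates track a running reward-rate estimate."""

    def __init__(self, n_arms, w=0.1):
        self.q = [0.0] * n_arms
        self.w = w
        # Assumption: one pseudo-count of each outcome gives an initial estimate of 0.5.
        self.n_rewarded = 1
        self.n_unrewarded = 1

    def reward_rate(self):
        """Running estimate of p(reward) across all trials so far."""
        return self.n_rewarded / (self.n_rewarded + self.n_unrewarded)

    def learning_rates(self):
        """Return (alpha_pos, alpha_neg) = (w * p(no reward), w * p(reward))."""
        p_hat = self.reward_rate()
        return self.w * (1.0 - p_hat), self.w * p_hat

    def update(self, arm, reward):
        """Update the chosen arm's Q-value, then the reward-rate estimate."""
        alpha_pos, alpha_neg = self.learning_rates()
        delta = reward - self.q[arm]
        self.q[arm] += (alpha_pos if delta >= 0 else alpha_neg) * delta
        if reward > 0:
            self.n_rewarded += 1
        else:
            self.n_unrewarded += 1
```

In a low-reward environment the estimated reward rate falls, so `alpha_pos` grows relative to `alpha_neg` and the agent becomes optimistic; in a high-reward environment the opposite happens.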
We compared the performance of this meta-learning agent to the rational
agent in both high- and low-reward tasks introduced in the previous section.
As shown in Figure 3, the meta agent outperforms the rational agent in both
tasks. In fact, the meta-agent slightly outperforms the purely pessimistic or
optimistic agent, meaning that in the low reward task this meta-learning agent
is optimistic whereas, in the high reward task, the agent is pessimistic. As
expected the two learning rates converge at steady state toward the mean
probability of reward or no-reward in one given task. For a large range of α
between 0.01 and 0.4 we obtained similar results (Figure 3a). Thus, the meta-
learning agent performed best in both task settings by flexibly adapting its
learning rate based on reward history – it under- or overestimates the true
expected rewards to improve choice performance.
Fig. 3 Meta-learning of differential learning rates results in optimal performance
by the same agent across tasks. As in Figure 2, shown is the probability of choosing the
best option in the two different tasks (left: low reward, right: high reward) for two different
agents. In blue (α= 0.1) is the performance of the “rational” agent R, in teal and navy blue
is the performance when α= 0.01 and α= 0.4 respectively, and in violet is the agent with
two plastic learning rates N. This “meta-learning” agent outperforms the rational agent in
both tasks by finding appropriate, differential learning rates based on a running average of
rewards received. This estimate started with a value of 0.5 for both probabilities, and was
updated by keeping track of the number of rewarded and non-rewarded trials (i.e. an infinite
window for the running average, but similar results are obtained for any sufficiently reliable
estimation method).
The results in section 3 suggest that differential learning rates are most
useful in situations where competing choice values are close together. This effect
is illustrated for the meta-learner in Figure 4b, in a task where the reward
probabilities are 0.75 and 0.25, i.e. where reward probabilities are separated by
0.5 rather than 0.1, as explored previously, and where no-reward and reward
probabilities are equal. In this scenario the advantage of differential learning
rates is negligible. A further setting of interest is the extension to more than
two choices: Figure 4c shows the performance of the meta-learner as compared
to fixed learning rate agents for two different three-armed bandits. In this set-
ting the meta-learner outperforms the rational agent by the same margin as
in the two-armed bandit case, demonstrating that the benefits of differential
learning rates are not restricted to binary choice.
6 Discussion
We simulated agents that make decisions based on learned expected values that
systematically deviate from the true outcome magnitudes. This bias, derived
from differential learning rates for positive and negative reward prediction
errors (RPEs), distinguishes our approach from typical reinforcement learning
models with a single learning rate that attempt to learn the true outcome
magnitudes. In contrast, in economics, divergence between objective values
and subjective utilities is foundational [20]. In line with this idea, our agents
can be said to learn something akin to subjective utilities, in the sense that
actions are based on a distorted (subjective) representation of the true values.
Fig. 4 The performance advantage of differential learning rate agents depends
on task structure. (A) Performance of the different agents (Meta-learner,
Rational, Optimist, and Pessimist) where the probabilities of reward are 0.75 and 0.25 for
the two choices. In tasks with more than two choices, differential learning rates continue to
outperform the other agents. (B) Performance of agents in a “three-armed bandit” task, with
reward probabilities 0.1, 0.15, 0.2 in the low reward task and 0.8, 0.85, 0.9 in the high
reward task.
Surprisingly, however, we show that subjective or biased representations
based on the biologically motivated idea of differential processing of positive
and negative prediction errors can perform objectively better on simple
probabilistic learning tasks. This result suggests that the presence of separate
direct and indirect (or “Go”/“NoGo”, “approach”/“avoid”) pathways in the
nervous system enables adaptive value derived from distinct learning rates in
these pathways. Behavioral evidence for differential learning rates has been
reported in a number of studies [15,13,35] but to our knowledge this work is
the first to explore the implications of these results from a normative perspec-
tive. Similarly, a number of models have employed separate parameters for
learning from positive and negative outcomes (e.g. [9,22]) but these studies
likewise did not explore the raison d’être for such an architecture. We show
here that differential learning rates can result in increased separation between
competing action values, leading to a steady-state performance improvement
because of an interaction with probabilistic action selection and stochastic
We also implemented a meta-learning agent to achieve optimal action value
separation adaptively, based on an estimate of the average reward rate. This
agent always outperforms the unbiased agent at steady-state, but it is slower
than the rational agent to reach this higher level of performance. In situations
where reward probabilities are changing, this approach may delay the behav-
ioral response to the change. However, it is well known that “model-free” RL
models in general do not perform well in volatile situations such as serial re-
versal learning [8]. Thus, we view the current proposal as complementary to,
and compatible with (within a single RL system) meta-learning of other RL
parameters [10,33]. Examples of such meta-learning include models that adap-
tively regulate overall learning rate [1], the exploration/exploitation trade off
[7,19], and on-line state splitting [31]. A related idea is that of distributed
learning, in which many individual RL models – each with specific parameter
values drawn from a given distribution – are instantiated in parallel and then
compete for behavioral control [11,24]. The meta-learner proposed here can be
implemented in both these ways, but differs from other proposals in examining
normatively the effect of positive and negative prediction errors. In particular,
unlike meta-learners that modify alpha, beta, or gamma, our proposal deliberately
distorts value estimates. It relaxes the notion that reinforcement learners
should strive for accurate predictions; rather, they should strive for predictions
that are effective after the properties (imperfections, constraints) of e.g. the
action selection stage are taken into account.
The present results were obtained under specific circumstances, using a
so-called “Q-learner” and a softmax action selection rule, with a temperature
parameter equal to 0.3, experiencing static reward distributions. This setting was
chosen to demonstrate the main point in a simple setting, commonly in use
both as an RL testbed and in experiments. Nevertheless, the underlying insight
is quite general: in the presence of noisy observations and specific properties of
the decision stage (e.g. softmax), unbiased estimation of expected values may
not lead to optimal performance. Of course, more sophisticated models such
as a Bayesian learner with appropriate (meta-)parameters and priors [41], or a
risk-sensitive reinforcement learner [26] can in principle implement an optimal
solution to the reinforcement learning problems explored here using unbiased
estimates. However, these approaches are more computationally intensive to
implement, and less straightforward to relate to neural signals and neural ar-
chitectures. Q-learning and related models are in widespread use as models for
fitting behavior and neurophysiological data in neuroscience, providing much
evidence that biological learning mechanisms share at least some properties
with these models ([29, 25], but see [17]). Thus, the primary relevance of this
work lies in its application to the interpretation of neuroscience data; we dis-
cuss some examples in the next section.
The broad spectrum of more traditional model-free reinforcement learning
approaches such as the one taken here suggests a number of possible extensions
to the current proposal. For instance, it would be of interest to determine if
differential learning rates can be advantageous in temporal credit assignment
problems (i.e. those requiring a temporal-difference component, learning to
choose actions that are not themselves rewarded but lead to reward later; for
a rare experimental investigation of RL variables on such a task see [6]). We
tentatively suggest that because the reported performance differences arise
from the stochasticity in outcome distributions, it may be sufficient to
apply differential learning rates only to actions or states associated with actual
rewards. The resulting biased value estimates may then be propagated using
standard temporal-difference methods.
A number of experimental studies have linked the regulation of differential
learning rates to the neuromodulator dopamine. For instance, Parkinsonian
patients off medication (i.e. with reduced dopamine levels) update RL value
estimates more in response to negative outcomes compared to positive out-
comes; the reverse is the case when subjects are given L-DOPA treatment,
elevating DA (dopamine) levels [15]. A similar effect was found based on ge-
netic polymorphisms in DA receptors [14]. Thus, subjects with relatively low
DA levels behave like our “pessimistic” learner whereas those with high DA
levels are more “optimistic”. Our results therefore suggest that Parkinsonian
patients (off medication) would do well on tasks like our high-reward task, but
do poorly on low-reward tasks.
An important distinction is that between phasic and tonic dopamine, which
are thought to contribute to learning and response vigor, respectively. In particular,
tonic dopamine has been suggested to reflect the average reward rate
or “opportunity cost” [27]. Our proposal uses this same estimate for the adap-
tive scaling of learning rates. Thus, in probabilistic learning tasks such as the
ones used here, we would predict that under high reward rates, when tonic
dopamine is high, negative RPEs (reward prediction errors) have more impact
on learning than positive RPEs, because, as we show, this is adaptive in high
reward settings. This is a nontrivial prediction that can be tested with cur-
rent methods; it is critical, however, that the experimental design can separate
effects on learning rate from effects on the exploration-exploitation trade-off,
which has also been linked to tonic DA [19].
Dysregulation of DA levels has also been linked to various manifestations of
addiction, including those involving drugs of abuse [30] and problem gambling
[4]. The application of reinforcement learning models to such behaviors suggests
that a mis-estimation of choice values lies at the root of dysfunctional decision-making.
Such mis-estimation may arise for instance from a non-compensable
positive prediction error [30], or from inappropriate meta-learning of the adap-
tive learning rates proposed here. In this respect it is particularly interesting
to note that several studies have specifically attributed meta-learning roles to
frontal areas also linked to addiction [7,21, 36]. The applicability of the norma-
tive ideas developed here to these settings would benefit from an empirical test
of positive and negative learning rates in affected individuals, perhaps along
the lines of [3].
Our results also suggest a further approach to improving reinforcement
learning fits to behavioral and neural data. As noted previously, to improve on
the basic RL model with a single learning rate, some studies have used models
with distinct learning rates for positive and negative prediction errors [13],
requiring the estimation of two free parameters from the data. In contrast,
because our meta-learner sets its learning rates based on observed reward
probabilities, it can potentially account for a similar range of behaviors with
fewer parameters (one meta-parameter w instead of the two learning rates).
How could differential learning rates be implemented in the brain? The
experimental investigation of neural mechanisms for prediction-error based
learning/updating is currently an active area of research. An influential proposal
is that RPEs may have different effects on direct (D1) and
indirect (D2) pathways in the striatum [16,23]. In this view, RPEs
may initially be symmetrically encoded in the firing rate of DA
neurons, but the impact of these RPEs on basal ganglia synaptic weights
may be modulated independently depending on their direction (sign). One
possible mechanism for this may involve the modulation of presynaptic DA
terminals in the striatum by e.g. hippocampal inputs [18], an idea that can be
tested with current voltammetry methods. Alternatively, the coding of RPEs
by DA neurons may be asymmetric to begin with, due to a smaller dynamic
range in the negative direction from baseline [28]. More radical ideas propose
a complete separation of DA as only supporting learning from positive RPEs
[12]. In any case, these experimental findings demonstrate a striking level of
dissociation of learning pathways for negative and positive prediction errors,
suggestive of broad adaptive value. Our results suggest a normative explana-
tion for these observations, and some encouragement can perhaps be derived
from the demonstration that in (some) situations where rewards are rare, it
pays to be optimistic.
Acknowledgements This work originated at the Okinawa Computational Neuroscience
Course at the Okinawa Institute for Science and Technology (OIST), Japan. We are grateful
to the organizers for providing a stimulating learning environment.
References
1. Timothy E J Behrens, Mark W Woolrich, Mark E Walton, and Matthew F S Rush-
worth. Learning the value of information in an uncertain world. Nature Neuroscience,
10(9):1214–21, September 2007.
2. Ethan S Bromberg-Martin, Masayuki Matsumoto, and Okihide Hikosaka. Dopamine in
motivational control: rewarding, aversive, and alerting. Neuron, 68(5):815–34, December
3. James F. Cavanagh, Michael J. Frank, and John J B. Allen. Social stress reactivity
alters reward and punishment learning. Soc Cogn Affect Neurosci, 6(3):311–320, Jun
4. Henry W. Chase and Luke Clark. Gambling severity predicts midbrain response to
near-miss outcomes. J Neurosci, 30(18):6180–6187, May 2010.
5. Mathieu D’Acremont and Peter Bossaerts. Neurobiological studies of risk assessment:
a comparison of expected utility and mean-variance approaches. Cognitive, Affective &
Behavioral Neuroscience, 8(4):363–74, December 2008.
6. Nathaniel D. Daw, Samuel J. Gershman, Ben Seymour, Peter Dayan, and Raymond J.
Dolan. Model-based influences on humans’ choices and striatal prediction errors. Neu-
ron, 69(6):1204–1215, Mar 2011.
7. Nathaniel D. Daw, John P. O’Doherty, Peter Dayan, Ben Seymour, and Raymond J.
Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–
879, Jun 2006.
8. Peter Dayan and Yael Niv. Reinforcement learning: the good, the bad and the ugly.
Curr Opin Neurobiol, 18(2):185–196, Apr 2008.
9. Bradley B. Doll, W Jake Jacobs, Alan G. Sanfey, and Michael J. Frank. Instructional
control of reinforcement learning: a behavioral and neurocomputational investigation.
Brain Res, 1299:74–94, Nov 2009.
10. K Doya. Metalearning and neuromodulation. Neural Networks, 15(4-6):495–506, 2002.
11. Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple
model-based reinforcement learning. Neural Comput, 14(6):1347–1369, Jun 2002.
12. Christopher D. Fiorillo. Two dimensions of value: dopamine neurons represent reward
but not aversiveness. Science, 341(6145):546–549, Aug 2013.
13. M. J. Frank, A. A. Moustafa, H. M. Haughey, T. Curran, and K. E. Hutchison. Ge-
netic triple dissociation reveals multiple roles for dopamine in reinforcement learning.
Proceedings of the National Academy of Sciences, 104(41):16311–16316, 2007.
14. Michael J Frank, Bradley B Doll, Jen Oas-Terpstra, and Francisco Moreno. Prefrontal
and striatal dopaminergic genes predict individual differences in exploration and ex-
ploitation. Nature Neuroscience, 12(8):1062–8, August 2009.
15. Michael J. Frank, Lauren C Seeberger, and Randall C O’Reilly. By carrot or by stick:
cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–3, December
16. C. R. Gerfen, T. M. Engber, L. C. Mahan, Z. Susel, T. N. Chase, F. J. Monsma, Jr.,
and D. R. Sibley. D1 and D2 Dopamine Receptor-Regulated Gene Expression of
Striatonigral and Striatopallidal Neurons. Science, 250:1429–1432, December 1990.
17. Samuel J Gershman and Yael Niv. Learning latent structure: carving nature at its
joints. Current Opinion in Neurobiology, 20(2):251–6, April 2010.
18. Anthony A. Grace. Dopamine system dysregulation by the hippocampus: implica-
tions for the pathophysiology and treatment of schizophrenia. Neuropharmacology,
62(3):1342–1348, Mar 2012.
19. Mark D Humphries, Mehdi Khamassi, and Kevin Gurney. Dopaminergic Control of the
Exploration-Exploitation Trade-Off via the Basal Ganglia. Frontiers in Neuroscience,
6(February):9, January 2012.
20. Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under
risk. Econometrica: Journal of the Econometric Society, 47(2):263–292, 1979.
21. Mehdi Khamassi, Pierre Enel, Peter Ford Dominey, and Emmanuel Procyk. Medial
prefrontal cortex and the adaptive regulation of reinforcement learning parameters.
Prog Brain Res, 202:441–464, 2013.
22. Mehdi Khamassi, Stéphane Lallée, Pierre Enel, Emmanuel Procyk, and Peter F
Dominey. Robot cognitive control with a neurophysiologically inspired reinforcement
learning model. Frontiers in Neurorobotics, 5(July):1, January 2011.
23. Alexxai V Kravitz, Lynne D Tye, and Anatol C Kreitzer. Distinct roles for direct and
indirect pathway striatal neurons in reinforcement. Nature Neuroscience, pages 4–7,
April 2012.
24. Zeb Kurth-Nelson and A David Redish. Temporal-difference reinforcement learning
with distributed representations. PLoS One, 4(10):e7362, 2009.
25. Tiago V Maia and Michael J Frank. From reinforcement learning models to psychiatric
and neurological disorders. Nature Neuroscience, 14(2):154–62, February 2011.
26. Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine
Learning, 49:267–290, 2002.
27. Yael Niv, Nathaniel D Daw, Daphna Joel, and Peter Dayan. Tonic dopamine: opportu-
nity costs and the control of response vigor. Psychopharmacology, 191(3):507–20, April
28. Yael Niv, Michael O. Duff, and Peter Dayan. Dopamine, uncertainty and TD learning.
Behav Brain Funct, 1:6, May 2005.
29. John P O’Doherty, Alan Hampton, and Hackjin Kim. Model-based fMRI and its ap-
plication to reward learning and decision making. Annals of the New York Academy of
Sciences, 1104:35–53, May 2007.
30. A David Redish. Addiction as a computational process gone awry. Science,
306(5703):1944–1947, Dec 2004.
31. A David Redish, Steve Jensen, Adam Johnson, and Zeb Kurth-Nelson. Reconciling
reinforcement learning models with behavioral extinction and renewal: implications for
addiction, relapse, and problem gambling. Psychol Rev, 114(3):784–805, Jul 2007.
32. Wolfram Schultz. Behavioral theories and the neurophysiology of reward. Annual Re-
view of Psychology, 57:87–115, January 2006.
33. Nicolas Schweighofer and Kenji Doya. Meta-learning in reinforcement learning. Neural
Netw, 16(1):5–9, Jan 2003.
34. Tali Sharot. The optimism bias. Current Biology, 21(23):R941–5, December 2011.
35. Tali Sharot, Christoph W Korn, and Raymond J Dolan. How unrealistic optimism is
maintained in the face of reality. Nature Neuroscience, 14(11):1475–1479, October 2011.
36. Amitai Shenhav, Matthew M. Botvinick, and Jonathan D. Cohen. The expected value of
control: an integrative theory of anterior cingulate cortex function. Neuron, 79(2):217–
240, Jul 2013.
37. Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. January
38. R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press,
Cambridge Mass., September 1998.
39. Matthijs van der Meer, Zeb Kurth-Nelson, and A David Redish. Information processing
in decision-making systems. The Neuroscientist, 18(4):342–59, August 2012.
40. Christopher Watkins. Learning from delayed rewards. PhD thesis, May 1989.
41. Angela J. Yu. Adaptive behavior: humans act as Bayesian learners. Curr Biol,
17(22):R977–R980, Nov 2007.
... Besides this evidence, the potential adaptive value of positivity and confirmation biases has been stressed by several other authors. For instance, a few simulation studies suggest that, across a wide range of environments, virtual RL agents with a positivity bias [14], [15] or with a confirmation bias [16], [17] perform better than their counterparts. Therefore, confirmation bias seems to enhance individual decision-making rather than harm it. ...
... In short, there are no other agents to feed it counterfactual information. Cazé & van der Meer had previously shown that a positivity bias leads to poorer performance for a single agent in a rich environment [14]. But it was not obvious that a confirmation bias would also lead to poorer performance in groups of two or three agents. ...
Humans tend to give more weight to information confirming their beliefs than to information that disconfirms them. Nevertheless, this apparent irrationality has been shown to improve individual decision-making under uncertainty. However, little is known about this bias' impact on collective decision-making. Here, we investigate the conditions under which confirmation bias is beneficial or detrimental to collective decision-making. To do so, we develop a Collective Asymmetric Reinforcement Learning (CARL) model in which artificial agents observe others' actions and rewards, and update this information asymmetrically. We use agent-based simulations to study how confirmation bias affects collective performance on a two-armed bandit task, and how resource scarcity, group size and bias strength modulate this effect. We find that a confirmation bias benefits group learning across a wide range of resource-scarcity conditions. Moreover, we discover that, past a critical bias strength, resource abundance favors the emergence of two different performance regimes, one of which is suboptimal. In addition, we find that this regime bifurcation comes with polarization in small groups of agents. Overall, our results suggest the existence of an optimal, moderate level of confirmation bias for collective decision-making.
... This is partly due to its tight intertwining with learning and inference processes (Findling and Wyart, 2021), which makes it difficult to disentangle them. In humans facing stochastic decision-making tasks with non-stationary reward probabilities, choice variability has been investigated in terms of regulation of the learning rate in response to volatility (Behrens et al., 2007;Cazé and Van Der Meer, 2013). Importantly, if in a learning task, an animal's performance is seen to deteriorate, this can arguably be explained either by a decrease in its ability to learn and identify the best action, or by a reduced tendency to actually use what it has learnt to guide its action. ...
Full-text available
In uncertain environments in which resources fluctuate continuously, animals must permanently decide whether to exploit what they currently believe to be their best option, or instead explore potential alternatives in case better opportunities are in fact available. While such a trade-off has been extensively studied in pretrained animals facing non-stationary decision-making tasks, it is yet unknown how they progressively tune it while progressively learning the task structure during pretraining. Here, we compared the ability of different computational models to account for long-term changes in the behaviour of 24 rats while they learned to choose a rewarded lever in a three-armed bandit task across 24 days of pretraining. We found that the day-by-day evolution of rat performance and win-shift tendency revealed a progressive stabilization of the way they regulated the exploration-exploitation trade-off. We successfully captured these behavioural adaptations using a meta-learning model in which the exploration-exploitation trade-off is controlled by the animal’s average reward rate.
... When not constrained to belief revision in response to perceptions of self-efficacy, optimism bias appears to be robust to experimental designs and a reinforcement learning (RL) framework for understanding these biases has been advanced (Palminteri & Lebreton, 2022). This RL framework aligns with optimistic and confirmatory biases by suggesting that confirmatory and optimistic RL algorithms confer an adaptative response in the presence of resource poor and volatile environments (Cazé & van der Meer, 2013;Palminteri & Lebreton, 2022). Similarly, the analogue of a "psychological immune system" has been put forward to suggest that these cognitive biases arise to promote a psychological homeostasis in which beliefs about the self are asymmetrically influenced by information that confirms optimistic priors about oneself to maintain neutral and positive emotional states (Schaller & Park, 2011;Vaz et al., 2021). ...
Background and objectives: Understanding how individuals integrate new information to form beliefs under changing emotional conditions is crucial to describing decision-making processes. Previous research suggests that although most people demonstrate bias toward optimistic appraisals of new information when updating beliefs, individuals with dysphoric psychiatric conditions (e.g., major depression) do not demonstrate this same bias. Despite these findings, limited research has investigated the relationship between affective states and belief updating processes. Methods: We induced neutral and sad moods in participants and had them complete a belief-updating paradigm by estimating the likelihood of negative future events happening to them, viewing the actual likelihood, and then re-estimating their perceived likelihood. Results: We observed that individuals updated their beliefs more after receiving desirable information relative to undesirable information under neutral conditions. Further, we found that individuals did not demonstrate unrealistic optimism under negative affective conditions. Limitations: This study incorporated a population of university students under laboratory conditions and would benefit from replication and extension in clinical populations and naturalistic settings. Conclusions: These findings suggest that momentary fluctuations in mood affect how individuals integrate information to form beliefs.
... that, depending on the distribution of reward probabilities, asymmetric learning rates for gains and losses can be adaptive (Cazé & Van Der Meer, 2013). Accordingly, dual-learning-rate models often fit data from simple bandit tasks better than the basic algorithm. ...
Full-text available
It has recently been suggested that parameter estimates of computational models can be used to understand individual differences at the process level. One area of research in which this approach, called computational phenotyping, has taken hold is computational psychiatry. One requirement for successful computational phenotyping is that behavior and parameters are stable over time. Surprisingly, the test–retest reliability of behavior and model parameters remains unknown for most experimental tasks and models. The present study seeks to close this gap by investigating the test–retest reliability of canonical reinforcement learning models in the context of two often-used learning paradigms: a two-armed bandit and a reversal learning task. We tested independent cohorts for the two tasks ( N = 69 and N = 47) via an online testing platform with a between-test interval of five weeks. Whereas reliability was high for personality and cognitive measures (with ICCs ranging from .67 to .93), it was generally poor for the parameter estimates of the reinforcement learning models (with ICCs ranging from .02 to .52 for the bandit task and from .01 to .71 for the reversal learning task). Given that simulations indicated that our procedures could detect high test–retest reliability, this suggests that a significant proportion of the variability must be ascribed to the participants themselves. In support of that hypothesis, we show that mood (stress and happiness) can partly explain within-participant variability. Taken together, these results are critical for current practices in computational phenotyping and suggest that individual variability should be taken into account in the future development of the field.
... It is worth noting that similar attentional weights for the informative feature and non-informative feature 1 does not correspond to similar learning or decision weights for these features, because only the informative feature carries information about reward. One possible explanation for this bias towards non-informative dimension 1 was the asymmetric learning rates (Fig. 5C), which led participants to update their values to a lesser extent after a lack of reward, causing them to learn biased value representations (Palminteri and Lebreton, 2022;Katahira, 2018;Cazé and van der Meer, 2013). This would also explain why the choice history of non-informative feature 1 had a strong effect on participants' ongoing choices (Fig. 2B). ...
Real-world choice options have many features or attributes, whereas the reward outcome from those options only depends on a few features/attributes. Identifying and attending to such informative features while ignoring irrelevant information can speed up learning and improve decision making. Most previous studies on reward learning and decision making, however, use tasks in which only one feature predicts reward outcome. Therefore, it is unclear how we learn and make decisions in multi-dimensional environments where multiple features and conjunctions of features are predictive of reward, and more importantly, how selective attention contributes to these processes. Here, we examined human behavior during a three-dimensional learning task in which reward outcomes for different stimuli could be predicted based on a combination of an informative feature and the conjunction of the other two features, the informative conjunction. Using multiple approaches, we found that choice behavior and estimated reward probabilities were best described by a model that learned the predictive values of both the informative feature and the informative conjunction. Moreover, attention was controlled by the difference in these values in a cooperative manner such that attention depended on the integrated feature and conjunction values, and the resulting attention weights modulated the learning of both. Finally, attention modulated learning by increasing the learning rate on attended features and conjunctions but had little effect on decision making. Together, our results suggest that when learning in high-dimensional environments, humans direct their attention not only to selectively process reward-predictive attributes, but also to find parsimonious representations of the reward contingencies to achieve more efficient learning.
In a recent paper, Burton et al. claim that individuals update beliefs to a greater extent when learning an event is less likely compared to more likely than expected. Here, we investigate Burton’s et al.’s, findings. First, we show how Burton et al.’s data do not in fact support a belief update bias for neutral events. Next, in an attempt to replicate their findings, we collect a new data set employing the original belief update task design, but with neutral events. A belief update bias for neutral events is not observed. Finally, we highlight the statistical errors and confounds in Burton et al.’s design and analysis. This includes mis-specifying a reinforcement learning approach to model the data and failing to follow standard computational model fitting sanity checks such as parameter recovery, model comparison and out of sample prediction. Together, the results find little evidence for biased updating for neutral events.
Full-text available
We systematically misjudge our own performance in simple economic tasks. First, we generally overestimate our ability to make correct choices-a bias called overconfidence. Second, we are more confident in our choices when we seek gains than when we try to avoid losses-a bias we refer to as the valence-induced confidence bias. Strikingly, these two biases are also present in reinforcement-learning (RL) contexts, despite the fact that outcomes are provided trial-by-trial and could, in principle, be used to recalibrate confidence judgments online. How confidence biases emerge and are maintained in reinforcement-learning contexts is thus puzzling and still unaccounted for. To explain this paradox, we propose that confidence biases stem from learning biases, and test this hypothesis using data from multiple experiments, where we concomitantly assessed instrumental choices and confidence judgments, during learning and transfer phases. Our results first show that participants' choices in both tasks are best accounted for by a reinforcement-learning model featuring context-dependent learning and confirmatory updating. We then demonstrate that the complex, biased pattern of confidence judgments elicited during both tasks can be explained by an overweighting of the learned value of the chosen option in the computation of confidence judgments. We finally show that, consequently, the individual learning model parameters responsible for the learning biases-confirmatory updating and outcome context-dependency-are predictive of the individual metacognitive biases. We conclude suggesting that the metacognitive biases originate from fundamentally biased learning computations. (PsycInfo Database Record (c) 2023 APA, all rights reserved).
Full-text available
The basal ganglia (BG) contribute to reinforcement learning (RL) and decision making, but unlike artificial RL agents, it relies on complex circuitry and dynamic dopamine modulaton of opponent striatal pathways to do so. We develop the OpAL* model to assess the normative advantages of this circuitry. In OpAL*, learning induces opponent pathways to differentially emphasize the history of positive or negative outcomes for each action. Dynamic DA modulation then amplifies the pathway most tuned for the task environment. This efficient coding mechanism avoids a vexing explore-exploit tradeoff that plagues traditional RL models in sparse reward environments. OpAL* exhibits robust advantages over alternative models, particularly in environments with sparse reward and large action spaces. These advantages depend on opponent and nonlinear Hebbian plasticity mechanisms previously thought to be pathological. Finally, OpAL* captures risky choice patterns arising from DA and environmental manipulations across species, suggesting that they result from a normative biological mechanism.
Event-related potentials that follow feedback in reinforcement learning tasks have been proposed to reflect neural encoding of prediction errors. Prior research has shown that in the interval of 240-340 ms multiple different prediction error encodings appear to co-occur, including a value signal carrying signed quantitative prediction error and a valence signal merely carrying sign. The effects used to identify these two encoders, respectively a sign main effect and a sign × size interaction, do not reliably discriminate them. A full discrimination is made possible by comparing tasks in which the reinforcer available on a given trial is set to be either appetitive or aversive against tasks where a trial allows the possibility of either. This study presents a meta-analysis of reinforcement learning experiments, the majority of which presented the possibility of winning or losing money. Value and valence encodings were identified by conventional difference wave methodology but additionally by an analysis of their predicted behavior using a Bayesian analysis that incorporated nulls into the evidence for each encoder. The results suggest that a valence encoding, sensitive only to the available outcomes on the trial at hand precedes a later value encoding sensitive to the outcomes available in the wider experimental context. The implications of this for modeling computational processes of reinforcement learning in humans are discussed.
Full-text available
Converging evidence suggest that the medial prefrontal cortex (MPFC) is involved in feedback categorization, performance monitoring, and task monitoring, and may contribute to the online regulation of reinforcement learning (RL) parameters that would affect decision-making processes in the lateral prefrontal cortex (LPFC). Previous neurophysiological experiments have shown MPFC activities encoding error likelihood, uncertainty, reward volatility, as well as neural responses categorizing different types of feedback, for instance, distinguishing between choice errors and execution errors. Rushworth and colleagues have proposed that the involvement of MPFC in tracking the volatility of the task could contribute to the regulation of one of RL parameters called the learning rate. We extend this hypothesis by proposing that MPFC could contribute to the regulation of other RL parameters such as the exploration rate and default action values in case of task shifts. Here, we analyze the sensitivity to RL parameters of behavioral performance in two monkey decision-making tasks, one with a deterministic reward schedule and the other with a stochastic one. We show that there exist optimal parameter values specific to each of these tasks, that need to be found for optimal performance and that are usually hand-tuned in computational models. In contrast, automatic online regulation of these parameters using some heuristics can help producing a good, although non-optimal, behavioral performance in each task. We finally describe our computational model of MPFC-LPFC interaction used for online regulation of the exploration rate and its application to a human-robot interaction scenario. There, unexpected uncertainties are produced by the human introducing cued task changes or by cheating. The model enables the robot to autonomously learn to reset exploration in response to such uncertain cues and events. 
The combined results provide concrete evidence specifying how prefrontal cortical subregions may cooperate to regulate RL parameters. It also shows how such neurophysiologically inspired mechanisms can control advanced robots in the real world. Finally, the model's learning mechanisms that were challenged in the last robotic scenario provide testable predictions on the way monkeys may learn the structure of the task during the pretraining phase of the previous laboratory experiments.
Full-text available
Dopamine signaling is implicated in reinforcement learning, but the neural substrates targeted by dopamine are poorly understood. We bypassed dopamine signaling itself and tested how optogenetic activation of dopamine D1 or D2 receptor–expressing striatal projection neurons influenced reinforcement learning in mice. Stimulating D1 receptor–expressing neurons induced persistent reinforcement, whereas stimulating D2 receptor–expressing neurons induced transient punishment, indicating that activation of these circuits is sufficient to modify the probability of performing future actions.
Full-text available
Decisions result from an interaction between multiple functional systems acting in parallel to process information in very different ways, each with strengths and weaknesses. In this review, the authors address three action-selection components of decision-making: The Pavlovian system releases an action from a limited repertoire of potential actions, such as approaching learned stimuli. Like the Pavlovian system, the habit system is computationally fast but, unlike the Pavlovian system permits arbitrary stimulus-action pairings. These associations are a "forward'' mechanism; when a situation is recognized, the action is released. In contrast, the deliberative system is flexible but takes time to process. The deliberative system uses knowledge of the causal structure of the world to search into the future, planning actions to maximize expected rewards. Deliberation depends on the ability to imagine future possibilities, including novel situations, and it allows decisions to be taken without having previously experienced the options. Various anatomical structures have been identified that carry out the information processing of each of these systems: hippocampus constitutes a map of the world that can be used for searching/imagining the future; dorsal striatal neurons represent situation-action associations; and ventral striatum maintains value representations for all three systems. Each system presents vulnerabilities to pathologies that can manifest as psychiatric disorders. Understanding these systems and their relation to neuroanatomy opens up a deeper way to treat the structural problems underlying various disorders.
Whereas reward (appetitiveness) and aversiveness (punishment) have been distinguished as two discrete dimensions within psychology and behavior, physiological and computational models of their neural representation have treated them as opposite sides of a single continuous dimension of “value.” Here, I show that although dopamine neurons of the primate ventral midbrain are activated by evidence for reward and suppressed by evidence against reward, they are insensitive to aversiveness. This indicates that reward and aversiveness are represented independently as two dimensions, even by neurons that are closely related to motor function. Because theory and experiment support the existence of opponent neural representations for value, the present results imply four types of value-sensitive neurons corresponding to reward-ON (dopamine), reward-OFF, aversive-ON, and aversive-OFF.
The dorsal anterior cingulate cortex (dACC) has a near-ubiquitous presence in the neuroscience of cognitive control. It has been implicated in a diversity of functions, from reward processing and performance monitoring to the execution of control and action selection. Here, we propose that this diversity can be understood in terms of a single underlying function: allocation of control based on an evaluation of the expected value of control (EVC). We present a normative model of EVC that integrates three critical factors: the expected payoff from a controlled process, the amount of control that must be invested to achieve that payoff, and the cost in terms of cognitive effort. We propose that dACC integrates this information, using it to determine whether, where and how much control to allocate. We then consider how the EVC model can explain the diverse array of findings concerning dACC function.
Most reinforcement learning algorithms optimize the expected return of a Markov Decision Problem. Practice has taught us the lesson that this criterion is not always the most suitable because many applications require robust control strategies which also take into account the variance of the return. Classical control literature provides several techniques to deal with risk-sensitive optimization goals, like the so-called worst-case optimality criterion exclusively focusing on risk-avoiding policies or classical risk-sensitive control, which transforms the returns by exponential utility functions. While the first approach is typically too restrictive, the latter suffers from the absence of an obvious way to design a corresponding model-free reinforcement learning algorithm. Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy. Instead of transforming the return of the process, we transform the temporal differences during learning. While our approach reflects important properties of the classical exponential utility framework, we avoid its serious drawbacks for learning. Based on an extended set of optimality equations, we are able to formulate risk-sensitive versions of various well-known reinforcement learning algorithms which converge with probability one under the usual conditions.
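The idea of transforming temporal differences rather than returns can be sketched for a single value estimate. The abstract above does not state the transformation, so the piecewise-linear form below (positive errors scaled by 1 - kappa, negative errors by 1 + kappa) is an assumption chosen for illustration, as are the names `transform_td`, `risk_sensitive_update`, `run`, and `kappa`. Under this assumption the update is equivalent to using separate learning rates for positive and negative prediction errors, namely alpha * (1 - kappa) and alpha * (1 + kappa).

```python
def transform_td(delta, kappa):
    """Scale a temporal-difference error asymmetrically.

    Assumed form (not given in the abstract): positive errors are
    multiplied by (1 - kappa), negative errors by (1 + kappa),
    with -1 < kappa < 1. kappa > 0 weights bad surprises more.
    """
    if delta > 0:
        return (1.0 - kappa) * delta
    return (1.0 + kappa) * delta


def risk_sensitive_update(value, reward, alpha, kappa):
    """One update of a single value estimate using the transformed error."""
    delta = reward - value  # prediction error for a one-step (bandit) task
    return value + alpha * transform_td(delta, kappa)


def run(rewards, alpha=0.05, kappa=0.0, value=0.0):
    """Apply the update over a sequence of rewards; return the final estimate."""
    for r in rewards:
        value = risk_sensitive_update(value, r, alpha, kappa)
    return value


if __name__ == "__main__":
    # Hypothetical reward sequence: reward on every other trial (mean 0.5).
    rewards = [1.0, 0.0] * 500
    print("kappa = 0.0:", round(run(rewards, kappa=0.0), 3))
    print("kappa = 0.5:", round(run(rewards, kappa=0.5), 3))
```

With kappa = 0 the estimate settles near the mean reward. With kappa = 0.5 it settles near the point where the weighted positive and negative errors balance, which for this sequence is 0.25, so the learned value is biased downward in the way the asymmetric weighting implies.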