Biological Cybernetics manuscript No.
(will be inserted by the editor)
Adaptive properties of differential learning rates for
positive and negative outcomes
Romain D. Cazé · Matthijs A. A. van der Meer
Received: date / Accepted: date
Abstract The concept of the reward prediction error – the difference between
reward obtained and reward predicted – continues to be a focal point for much
theoretical and experimental work in psychology, cognitive science, and neuro-
science. Models that rely on reward prediction errors typically assume a single
learning rate for positive and negative prediction errors. However, behavioral
data indicate that better-than-expected and worse-than-expected outcomes of-
ten do not have symmetric impacts on learning and decision-making. Further-
more, distinct circuits within cortico-striatal loops appear to support learn-
ing from positive and negative prediction errors respectively. Such differential
learning rates would be expected to lead to biased reward predictions, and
therefore suboptimal choice performance. Contrary to this intuition, we show
that on static “bandit” choice tasks, differential learning rates can be adap-
tive. This occurs because asymmetric learning enables a better separation of
learned reward probabilities. We show analytically how the optimal learning
rate asymmetry depends on the reward distribution, and implement a bio-
logically plausible algorithm that adapts the balance of positive and negative
learning rates from experience. These results suggest specific adaptive advan-
tages for separate, differential learning rates in simple reinforcement learning
settings, and provide a novel, normative perspective on the interpretation of
associated neural data.

Keywords Reinforcement learning · Reward prediction error · Decision-making ·
Meta-learning · Basal ganglia

RC is supported by a Marie Curie initial training fellowship. MvdM is supported by the
Natural Sciences and Engineering Research Council of Canada (NSERC).

Romain Cazé
Department of Bioengineering
Imperial College
London, England
r.caze@imperial.ac.uk

Matthijs van der Meer
Department of Biology and Centre for Theoretical Neuroscience
University of Waterloo, Canada
mvdm@uwaterloo.ca
1 Introduction
A central element in the major theories of reinforcement learning is the reward
prediction error (RPE) which, in its simplest form, is the difference between
the amount of received and expected reward [38]. Thus, a positive RPE is gen-
erated when more is received than expected, and conversely, negative RPEs
occur when less is received than expected. Proposals based on RPEs can ex-
plain salient features of Pavlovian and instrumental learning [32] and generalize
to powerful algorithms that can learn about states and actions which are not
themselves rewarded, but result in reward later (e.g. temporal-difference re-
inforcement learning, [37]). Essentially, these algorithms attempt to estimate
from experience the expected reward resulting from states and actions, so that
the amount of reward obtained can be maximized.
RPEs appear to be encoded by the neural activity of several brain areas
known to support learning and motivated behavior, such as a population of
dopaminergic neurons in the ventral tegmental area (VTA) [2]. Projection
targets of these neurons encode action and state values [39], and reinforcement
learning model fits to behavioral and neural data can account for trial-by-trial
changes in both [29]. Taken together, these findings support the familiar notion
that circuits centered on the dopaminergic modulation of activity in cortico-
striatal loops implement a reinforcement learning system. Thus, reinforcement
learning models provide an explicit, computational account of the relationship
between RPEs, estimates of expected reward, and subsequent behavior. This
account is a powerful framework for understanding the neural basis of learning
and decision-making in mechanistic, biological detail.
Many reinforcement learning models implicitly assume that positive and
negative RPEs impact estimation of action and state values through a com-
mon gain factor or learning rate [38,29]. However, converging evidence in-
dicates that learning from positive and negative feedback, including positive
and negative RPEs, in fact relies on dissociable mechanisms in the brain. De-
pending on the specific setting, these distinct mechanisms are referred to as
approach/avoid, go/noGo, or direct/indirect pathways in the basal ganglia,
associated with predominantly D1 and D2-expressing projection neurons in
the striatum, respectively [16,23]. This neural separation of the direct and in-
direct pathways raises the possibility that distinct, differential learning rates
are associated with positive and negative RPEs.
In support of this idea, a number of studies have reported behavioral evi-
dence for differential learning rates in humans [15,13]. From a reinforcement
learning perspective, differential learning rates imply biased estimates of the
amount of expected reward, which in turn can lead to suboptimal decisions.
This “irrationality” notion is congruent with observations from psychology
and behavioral economics. For instance, a recent study examined the extent
to which subjects updated their estimates of the likelihood of various events
happening to them (e.g. get cancer, win the lottery) after being told the true
probabilities [35]. Strikingly, it was found that subjects tended to update their
estimates more if the odds were better (in their favour) than they thought.
Such differential updating could underlie the formation of a so-called “opti-
mism bias” [34] and has also been proposed as contributing to other biases
such as risk aversion in mean-variance learners [26,5].
Such behavioral evidence for asymmetric updating following positive (bet-
ter than expected) and negative (worse than expected) outcomes invites nor-
mative questions. Are these biases suboptimal in an absolute sense, but per-
haps the result of limited cognitive resources? Or, alternatively, are they op-
timized for situations different from those tested? Here, we aim to address these
questions quantitatively in simple reinforcement learning models. We show
that even in simple, static bandit tasks, agents with differential learning rates
can outperform unbiased agents. These results provide a different view on (1)
the interpretation of behavioral results where such biases are often cast as
irrational, and (2) neurophysiological data comparing neural responses associ-
ated with positive and negative prediction errors. Finally, this work suggests
simple and biologically motivated improvements to boost the performance of
commonly used reinforcement learning models.
2 Impact of differential learning rates on value estimation
To model a tractable decision problem where the effects of differential learning
rates for positive and negative RPEs can be explored, we implement agents
that learn the values of actions on probabilistically rewarded “bandit” tasks.
That is, there is only one state with one (this section) or two possible actions
(the next section), the expected value of which the agent learns incrementally
from probabilistic feedback according to a simple delta rule. Unlike the stan-
dard form of typical reinforcement learning algorithms such as Q-learning [40],
however, our agents employ different learning rates for positive and negative
prediction errors ∆Q_t = (r_t − Q_t):

$$Q_{t+1} = Q_t + \begin{cases} \alpha^+ \, \Delta Q_t & \text{if } \Delta Q_t \geq 0 \\ \alpha^- \, \Delta Q_t & \text{if } \Delta Q_t < 0 \end{cases}$$
Q_t, by convention, corresponds to the expected reward of taking a certain
action at time step t (the "Q-value"), and r_t is the actual reward received at time step
t. The usual state and action indices s, a are omitted because in this section,
we only consider one state and one action. Unlike typical "symmetric" update
rules, the existence of two learning rates α+ and α− can lead to biased Q-
values, as we will show below.
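To make the update concrete, here is a minimal Python sketch of the asymmetric delta rule for a single action with binary outcomes in {−1, 1}; the code and parameter values are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def update_q(q, r, alpha_pos, alpha_neg):
    """Asymmetric delta-rule update: a separate learning rate is applied
    depending on the sign of the reward prediction error."""
    delta = r - q                                  # reward prediction error
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return q + alpha * delta

# Example: an "optimistic" learner (alpha_pos > alpha_neg) on a single arm
# with p(reward) = 0.2 and outcomes in {-1, 1}.
rng = np.random.default_rng(0)
q = 0.0
for _ in range(800):
    r = 1.0 if rng.random() < 0.2 else -1.0
    q = update_q(q, r, alpha_pos=0.4, alpha_neg=0.1)
print(q)  # fluctuates around a biased ("optimistic") value, above the true mean of -0.6
```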
We first derive an analytical expression for the steady-state Q-value for a
single action with a probabilistic, binary outcome. Without loss of generality,
we set the outcomes to r_t = 1 or r_t = −1, with probabilities p and 1 − p,
respectively. Then, for sufficiently small α and Q_0 ∉ {−1, 1}, if r_t = 1 the
outcome is always superior to the predicted Q, whereas if r_t = −1 the outcome
is always inferior. Thus, the update rule for each trial is the following:

$$Q_{t+1} = Q_t + \begin{cases} \alpha^+ (1 - Q_t) & \text{with probability } p \\ \alpha^- (-1 - Q_t) & \text{with probability } 1 - p \end{cases}$$
At steady state, Q̂_{t+1} = Q̂_t = Q∗, where Q̂ represents the mean Q when
t → ∞. This yields:

$$Q^* = Q^* + p\,\alpha^+ (1 - Q^*) + (1 - p)\,\alpha^- (-1 - Q^*)$$

If we define x as the ratio between the learning rates for positive and neg-
ative prediction errors, x = α+/α−, we can rewrite the previous equation and
obtain Q∗:

$$Q^* = \frac{p x - (1 - p)}{p x + (1 - p)} \qquad (1)$$
A number of properties of this result are worth noting. When α− > α+, the
Q-values are "pessimistic", i.e. they are below the true value; whereas when
α+ = α−, the steady-state Q-value converges, as expected, to the true mean
value for this action (equal to 2p − 1; see Figure 1 where α+ = α−). Moreover,
when α+ > α−, the Q-values are "optimistic", i.e. they are above the true
value. This bias is illustrated in Figure 1 where α+ = 4α−. Note also the
complex dependence of the bias on the true mean value. The bias is smaller for
low true values than for high true values in the case of a pessimistic agent, while
the opposite holds for an optimistic agent.
To introduce our simulation testbed, we first sought to confirm the above
analytical results numerically. The analytical results predicted with high ac-
curacy the behavior of a modified Q-learning algorithm (5000 iterations of
800 trials each; error bars in Figure 1 show the variance across iterations).
Thus, differential learning rates for positive and negative RPEs lead to biased
estimates of true underlying values, in a heterogeneous manner that depends
on the true mean value as well as the learning rate asymmetry. In the next
section, we examine the impact of this distortion on performance in choice
settings (“two-armed bandits”).
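As a complement to the simulations reported above (which we did not reproduce exactly), a self-contained sketch along the following lines can be used to check Equation (1) numerically; all names and parameter choices here are ours.

```python
import numpy as np

def q_star(p, x):
    """Analytical steady-state mean Q from Equation (1): outcomes in {-1, 1},
    reward probability p, learning-rate ratio x = alpha_plus / alpha_minus."""
    return (p * x - (1 - p)) / (p * x + (1 - p))

def simulated_mean_q(p, alpha_pos, alpha_neg, n_trials=200_000, seed=1):
    """Long single-arm run of the asymmetric delta rule; the time average over
    the second half of the run approximates the steady-state mean."""
    rng = np.random.default_rng(seed)
    q, trace = 0.0, []
    for _ in range(n_trials):
        r = 1.0 if rng.random() < p else -1.0
        delta = r - q
        q += (alpha_pos if delta >= 0 else alpha_neg) * delta
        trace.append(q)
    return float(np.mean(trace[n_trials // 2:]))

for p in (0.1, 0.2, 0.8, 0.9):                      # true means: -0.8, -0.6, 0.6, 0.8
    analytic = q_star(p, x=4.0)                     # "optimistic" ratio alpha+ = 4 alpha-
    empirical = simulated_mean_q(p, alpha_pos=0.04, alpha_neg=0.01)
    print(f"p = {p:.1f}  Q* = {analytic:+.3f}  simulated ≈ {empirical:+.3f}")
```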
3 Impact of differential learning rates on performance
In this section, we consider the consequences of biased estimates resulting from
differential learning rates in simple choice situations. In particular, we simulate
the performance of three Q-learning agents on two different “two-armed ban-
dit” problems. The “rational” (R) agent has equal learning rates for negative
and positive prediction errors (α = 0.1); the optimistic (O) agent has a higher
Fig. 1 Differential learning rates result in biased estimates of true expected
values. Analytically derived Q-values (black filled circles, Q∗) for different true values
of Q: −0.8, −0.6, 0.6, 0.8 (grey dotted lines) and for different ratios of α+ and α−. Note
that the true Q-values are the means of probabilistic reward delivery schedules. Error bars,
computed using numerical simulations, show the variance of the estimated Q-values after
800 trials averaged over 5000 runs (the means converge to the analytically derived values).
When the learning rates for negative and positive prediction errors are equal, the derived
Q-values correspond to the true values, whereas they are distorted when the learning rates
are different. A "pessimistic" learner (α+/α− < 1) underestimates the true values, and an
"optimistic" learner (α+/α− > 1) overestimates them.
learning rate for positive prediction errors (α+ = 0.4) than for negative pre-
diction errors (α− = 0.1), and the pessimistic (P) agent has a higher learning
rate for negative (α− = 0.4) than for positive prediction errors (α+ = 0.1). All
agents use a standard softmax decision rule with fixed temperature β = 0.3
to decide which arm to choose given the Q-values; we employ this decision
rule here because it is the de facto standard in applications of reinforcement
learning models to behavioral and neural data. We discuss the applicability of
these results for the epsilon-greedy decision rule below.
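The softmax rule itself is not written out in the text; in its standard two-action form, with β interpreted as a temperature as stated above, it would read:

$$P(a_1) = \frac{e^{Q(a_1)/\beta}}{e^{Q(a_1)/\beta} + e^{Q(a_0)/\beta}} = \frac{1}{1 + e^{-\Delta Q/\beta}}, \qquad \Delta Q = Q(a_1) - Q(a_0).$$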
The three agents are tested on two versions of a difficult two-armed bandit
task, with large variance in the two outcomes but small differences in the
means [38]. For simplicity, we again consider binary outcome distributions
{−1, 1}; however, the results that follow are quite general (see Discussion). In
the first, "low-reward" task, the respective probabilities of r = 1 are 0.2 and
0.1; in the other, "high-reward" task, the two arms are rewarded with reward
probabilities of 0.9 and 0.8, respectively.
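For concreteness, the following Python sketch (our own, with illustrative settings rather than the authors' code) implements the three agents on the two tasks and should qualitatively reproduce the ordering seen in Figure 2a:

```python
import numpy as np

def softmax_choice(q, beta, rng):
    """Softmax action selection with temperature beta (assumed standard form)."""
    prefs = np.exp(q / beta)
    return rng.choice(len(q), p=prefs / prefs.sum())

def run_bandit(p_reward, alpha_pos, alpha_neg, n_trials=800, beta=0.3, seed=None):
    """One run of an n-armed bandit with outcomes in {-1, 1} and an asymmetric
    delta-rule learner; returns the fraction of trials on which the best arm was chosen."""
    rng = np.random.default_rng(seed)
    q = np.zeros(len(p_reward))
    best = int(np.argmax(p_reward))
    n_best = 0
    for _ in range(n_trials):
        a = softmax_choice(q, beta, rng)
        r = 1.0 if rng.random() < p_reward[a] else -1.0
        delta = r - q[a]
        q[a] += (alpha_pos if delta >= 0 else alpha_neg) * delta
        n_best += int(a == best)
    return n_best / n_trials

agents = {"rational": (0.1, 0.1), "optimist": (0.4, 0.1), "pessimist": (0.1, 0.4)}
tasks = {"low-reward": [0.1, 0.2], "high-reward": [0.8, 0.9]}
for task_name, p in tasks.items():
    for agent_name, (a_pos, a_neg) in agents.items():
        perf = np.mean([run_bandit(p, a_pos, a_neg, seed=s) for s in range(200)])
        print(f"{task_name:11s} {agent_name:9s} P(best arm) ≈ {perf:.2f}")
```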
When simulated on these two tasks, a striking pattern is apparent (Figure
2a): in the low-reward task, the optimistic agent learns to take the best action
significantly more often than the rational agent, which in turn performs better
than the pessimistic agent (left panel). The agents’ performance is reversed
in the high-reward task (right panel). To understand this pattern of results,
recall that reinforcement learners face a trade-off between exploitation and
exploration [38]: choosing to exploit the option with the highest estimated Q-
value does not guarantee that there is not an (insufficiently explored) better
option available. This problem is particularly pernicious for probabilistic re-
wards such as those in the current setting, where only noisy estimates of true
underlying values are available.
Fig. 2 Differential learning rates increase performance in specific tasks. (A) Mean
probability of choosing the best arm (averaged over 5000 runs) for the three agents, “ratio-
nal" (R, α+ = α−, blue line), "optimist" (O, α+ > α−, green line), and "pessimist" (P,
α+ < α−, red line). The left plot shows performance on the low-reward task (0.1 and 0.2
probability of reward for the two choices), the right plot performance on the high-reward
task (0.8 and 0.9 probability). Note that in the low-reward task the optimistic agent is the
best performer and the pessimistic agent the worst, but this pattern is reversed for the high-
reward task. (B) Probability of switching after 800 episodes for each agent. This probability
depends on the task: in the low-reward task the optimistic agent is the least likely to switch,
with the pattern reversed for the high-reward task.
To quantify differences in how the agents navigate this exploration versus
exploitation trade-off in the steady state, we estimated the probability that
an agent repeats the same choice between time tand t+ 1, after extended
learning (800 trials). For both tasks, we expect that agents will "stay" more
than "switch", owing to the difference in mean expected reward between the two choices.
However, as shown in Figure 2b, the different agents have a distinct probability
of switching, and this difference between agents depends on the task. Higher
probabilities of switching (“exploring”) are associated with lower performance
(compare with Figure 2a).
Why does this occur? As shown by Equation 1 and illustrated in Figure
1, differential learning rates result in biased value estimates on probabilistic
tasks such as the one simulated here. To understand the implications of this for
choice situations, the dependence of this bias on the true mean value is critical.
In the low-reward task, mean values are low, and therefore the distortion of
these values will be high for the "optimistic" agent (see Figure 1). Conversely,
in this task, distortion will be low for the "pessimistic" agent (Figure 1). The
implication of this is that distortion of the true values effectively increases
the contrast between the two choices, leading to (1) increased probability of
choosing the best option using a softmax rule, and (2) increased robustness
to random fluctuations in the outcomes; this latter effect does not rely on the
softmax action selection rule and can also occur for different action selection
rules such as ε-greedy.
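The effect of this increased contrast on choice can be read off directly from the softmax rule; the short sketch below (again our own, using the temperature β = 0.3 stated above) shows how the probability of picking the better of two options grows with the separation ∆Q between their Q-values:

```python
import numpy as np

def p_best(delta_q, beta=0.3):
    """Probability of choosing the higher-valued of two options under a
    two-option softmax with temperature beta."""
    return 1.0 / (1.0 + np.exp(-delta_q / beta))

for dq in (0.06, 0.2, 0.5):   # separations discussed in the text
    print(f"ΔQ = {dq:.2f} -> P(best) ≈ {p_best(dq):.2f}")
# prints roughly 0.55, 0.66, and 0.84, respectively
```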
More precisely, for the results in Figure 2, the rational agent tends to
approach the true mean values, with mean final estimates after 800 trials
approximately equal to Q1 = −0.63, Q0 = −0.84 in the low-reward task, and
approximately equal to Q1 = 0.78, Q0 = 0.51 in the high-reward task. These estimated Q-values
are close to the true Q-values, and thus the estimated ∆Q is close to the
true value of 0.2. By contrast, the estimated ∆Q at steady state for a suitably biased agent, due to
the heterogeneous distortion described in Section 2, is in excess of 0.5. It is
this distortion (separation) which enables higher steady-state performance for
the biased agent. Likewise, distortion in the opposite direction (compression)
is the reason for impaired performance. Specifically, in the low-reward task,
the pessimistic agent will have a smaller ∆Q than the rational agent, because
the pessimistic bias causes saturation as the Q-values approach −1, making
the two Q-values closer together than the true values, with ∆Q < 0.06. For the same reason,
the symmetric observation holds for an optimistic agent in the high-
reward task, where this agent has ∆Q < 0.04.
In addition to this effect on performance in the steady state, differential
learning rates likely also impact the early stages of learning, when the agent
takes its first few choices. This idea is illustrated by the following intuition:
in the high-reward task, an agent will have a tendency to over-exploit its first
choice, because it is likely that this first choice will be rewarded. In the low-
reward task, an agent will have the opposite tendency: it will over-explore its
environment because it is likely that neither arm will provide a reward for the
first few trials. To the extent that this effect contributes to performance, it
may be mitigated by appropriately differential learning rates.
These observations raise the obvious question: can we find optimal learning
rates which maximize ∆Q? Can we find an agent which maximizes this ∆Q
in both tasks? We explore these issues in the following section.
4 Derivation of optimally differential learning rates
From Equation 1, we can compute ∆Q∗ at steady state between the two
choices:

$$\Delta Q^* = \frac{2\,(p_1 q_0 - p_0 q_1)\,x}{p_1 p_0 x^2 + (p_1 q_0 + p_0 q_1)\,x + q_0 q_1}$$
where the indices correspond to the different choices (bandit arms); the
index of the arm with the highest probability of reward is 1, with 0 indicating
the other arm, and q_i = 1 − p_i is the probability of no-reward for arm i. We can
determine the x for which this rational function is maximal and thus find the
ratio for which ∆Q∗ is maximal:

$$x^* = \sqrt{\frac{q_0 q_1}{p_0 p_1}}$$
In the limit where p_0 → p_1, the optimal x tends to the ratio between
p(no reward) and p(reward). The results in the previous section indicate
that the best steady-state performance is achieved when ∆Q∗ is maximal;
thus, we can conclude that the best learning rates for positive (resp. negative)
prediction errors should be proportional to the rate of no-reward (resp. rate of
reward). Specifically, in the low-reward task the optimal x is equal to 6, whereas in
the high-reward task the optimal x = 1/6. Thus, the optimal x in a given task
is also the one which leads to the worst performance in the other task, where
the optimal x is the inverse. To solve this issue, we introduce an agent with
plastic learning rates (a meta-learner), adaptable to tasks with different reward
distributions.
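As a quick check of the values quoted above, substituting the two tasks' reward probabilities into the expression for x∗ gives:

$$x^*_{\text{low}} = \sqrt{\frac{0.9 \times 0.8}{0.1 \times 0.2}} = \sqrt{36} = 6, \qquad x^*_{\text{high}} = \sqrt{\frac{0.2 \times 0.1}{0.8 \times 0.9}} = \sqrt{\tfrac{1}{36}} = \frac{1}{6}.$$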
5 Meta-learning of optimally differential learning rates
We have shown in the previous section that behavior is optimal when the
learning rate for positive (resp. negative) prediction error corresponds to the
probability of no-reward (resp. reward) in a given task. Thus, here we study
an agent which adapts its learning rate for positive and negative prediction
errors. Given our results, we choose as targets α+ = w × p(no reward) and
α− = w × p(reward), with the reward and no-reward probabilities estimated
by a running average of reward history. The parameter w (set to 0.1 in Figure
3) replaces the standard learning rate parameter α and captures the sensitivity
of learning to the estimated reward rate. Thus, we define a meta-learning agent
which derives separate learning rates for positive and negative prediction errors
from a running estimate of the task reward rate.
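A minimal sketch of such a meta-learning agent is given below, using our reading of the rule above (α+ = w × p(no reward), α− = w × p(reward)) with w = 0.1, an initial estimate of 0.5, and the infinite-window count described in the Figure 3 caption; the code is illustrative, not the authors' implementation.

```python
import numpy as np

def run_meta_bandit(p_reward, w=0.1, n_trials=800, beta=0.3, seed=None):
    """Meta-learning agent: the two learning rates track a running estimate of
    the task's no-reward and reward rates, scaled by the meta-parameter w."""
    rng = np.random.default_rng(seed)
    q = np.zeros(len(p_reward))
    n_rewarded, n_total = 0, 0       # infinite-window running count of rewards
    p_hat = 0.5                      # initial reward-rate estimate
    best = int(np.argmax(p_reward))
    n_best = 0
    for _ in range(n_trials):
        alpha_pos = w * (1.0 - p_hat)    # learn more from wins when rewards are rare
        alpha_neg = w * p_hat            # learn more from losses when rewards are common
        prefs = np.exp(q / beta)
        a = rng.choice(len(q), p=prefs / prefs.sum())
        rewarded = rng.random() < p_reward[a]
        r = 1.0 if rewarded else -1.0
        delta = r - q[a]
        q[a] += (alpha_pos if delta >= 0 else alpha_neg) * delta
        n_total += 1
        n_rewarded += int(rewarded)
        p_hat = n_rewarded / n_total
        n_best += int(a == best)
    return n_best / n_trials

for task_name, p in {"low-reward": [0.1, 0.2], "high-reward": [0.8, 0.9]}.items():
    perf = np.mean([run_meta_bandit(p, seed=s) for s in range(200)])
    print(f"{task_name:11s} meta-learner P(best arm) ≈ {perf:.2f}")
```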
We compared the performance of this meta-learning agent to the rational
agent in both high- and low-reward tasks introduced in the previous section.
As shown in Figure 3, the meta agent outperforms the rational agent in both
tasks. In fact, the meta-agent slightly outperforms the purely pessimistic or
optimistic agent, meaning that in the low reward task this meta-learning agent
is optimistic whereas, in the high-reward task, the agent is pessimistic. As
expected, the two learning rates converge at steady state toward the mean
probability of reward or no-reward in a given task. For a large range of α
between 0.01 and 0.4, we obtained similar results (Figure 3). Thus, the meta-
learning agent performed best in both task settings by flexibly adapting its
learning rate based on reward history – it under- or overestimates the true
expected rewards to improve choice performance.
Fig. 3 Meta-learning of differential learning rates results in optimal performance
by the same agent across tasks. As in Figure 2, shown is the probability of choosing the
best option in the two different tasks (left: low reward, right: high reward) for two different
agents. In blue (α = 0.1) is the performance of the "rational" agent R, in teal and navy blue
is the performance when α = 0.01 and α = 0.4, respectively, and in violet is the agent with
two plastic learning rates N. This “meta-learning” agent outperforms the rational agent in
both tasks by finding appropriate, differential learning rates based on a running average of
rewards received. This estimate started with a value of 0.5 for both probabilities, and was
updated by keeping track of the number of rewarded and non-rewarded trials (i.e. an infinite
window for the running average, but similar results are obtained for any sufficiently reliable
estimation method).
The results in section 3 suggest that differential learning rates are most
useful in situations where competing choice values are close together. This ef-
fect is illustrated for the meta-learner in Figure 4a, in a task where the reward
probabilities are 0.75 and 0.25, i.e. where reward probabilities are separated by
0.5 rather than 0.1, as explored previously, and where no-reward and reward
probabilities are equal. In this scenario the advantage of differential learning
rates is negligible. A further setting of interest is the extension to more than
two choices: Figure 4b shows the performance of the meta-learner as compared
to fixed learning rate agents for two different three-armed bandits. In this set-
ting the meta-learner outperforms the rational agent by the same margin as
in the two-armed bandit case, demonstrating that the benefits of differential
learning rates are not restricted to binary choice.
6 Discussion
We simulated agents that make decisions based on learned expected values that
systematically deviate from the true outcome magnitudes. This bias, derived
from differential learning rates for positive and negative reward prediction
errors (RPEs), distinguishes our approach from typical reinforcement learning
models with a single learning rate that attempt to learn the true outcome
magnitudes. In contrast, in economics, divergence between objective values
and subjective utilities is foundational [20]. In line with this idea, our agents
can be said to learn something akin to subjective utilities, in the sense that
actions are based on a distorted (subjective) representation of the true values.
Surprisingly, however, we show that subjective or biased representations
based on the biologically motivated idea of differential processing of posi-
Fig. 4 The performance advantage of differential learning rate agents depends
on task structure. (A) Performance of the different agents (Meta-learner,
Rational, Optimist, and Pessimist) on a task where the probabilities of reward are 0.75 and 0.25 for
the two choices. (B) Performance of the agents on a "three-armed bandit" task, with
reward probabilities 0.1, 0.15, 0.2 in the low-reward task and 0.8, 0.85, 0.9 in the high-reward
task; in tasks with more than two choices, differential learning rates continue to
outperform the other agents.
tive and negative prediction errors can perform objectively better on simple
probabilistic learning tasks. This result suggests that the presence of separate
direct and indirect (or “Go”/“NoGo”, “approach”/“avoid”) pathways in the
nervous system enables adaptive value derived from distinct learning rates in
these pathways. Behavioral evidence for differential learning rates has been
reported in a number of studies [15,13,35], but to our knowledge this work is
the first to explore the implications of these results from a normative perspec-
tive. Similarly, a number of models have employed separate parameters for
learning from positive and negative outcomes (e.g. [9,22]), but these studies
likewise did not explore the raison d'être of such an architecture. We show
here that differential learning rates can result in increased separation between
competing action values, leading to a steady-state performance improvement
because of an interaction with probabilistic action selection and stochastic
rewards.
We also implemented a meta-learning agent to achieve optimal action value
separation adaptively, based on an estimate of the average reward rate. This
agent always outperforms the unbiased agent at steady-state, but it is slower
than the rational agent to reach this higher level of performance. In situations
where reward probabilities are changing, this approach may delay the behav-
ioral response to the change. However, it is well known that “model-free” RL
models in general do not perform well in volatile situations such as serial re-
versal learning [8]. Thus, we view the current proposal as complementary to,
and compatible with (within a single RL system) meta-learning of other RL
parameters [10,33]. Examples of such meta-learning include models that adap-
tively regulate overall learning rate [1], the exploration/exploitation trade off
[7,19], and on-line state splitting [31]. A related idea is that of distributed
learning, in which many individual RL models – each with specific parameter
values drawn from a given distribution – are instantiated in parallel and then
compete for behavioral control [11,24]. The meta-learner proposed here can be
implemented in both these ways, but differs from other proposals in examining
normatively the effect of positive and negative prediction errors. In particular,
unlike meta-learners that modify α, β, or γ, our proposal deliber-
ately distorts value estimates. It relaxes the notion that reinforcement learners
should strive for accurate predictions; rather, they should strive for predictions
that are effective once the properties (imperfections, constraints) of e.g. the
action selection stage are taken into account.
The present results were obtained under specific circumstances, using a
so-called “Q-learner” and a softmax action selection rule, with a temperature
parameter equal to 0.3, experiencing static reward distributions. This was
done to demonstrate the main point in a simple setting, commonly in use
both as an RL testbed and in experiments. Nevertheless, the underlying insight
is quite general: in the presence of noisy observations and specific properties of
the decision stage (e.g. softmax), unbiased estimation of expected values may
not lead to optimal performance. Of course, more sophisticated models such
as a Bayesian learner with appropriate (meta-)parameters and priors [41], or a
risk-sensitive reinforcement learner [26] can in principle implement an optimal
solution to the reinforcement learning problems explored here using unbiased
estimates. However, these approaches are more computationally intensive to
implement, and less straightforward to relate to neural signals and neural ar-
chitectures. Q-learning and related models are in widespread use as models for
fitting behavior and neurophysiological data in neuroscience, providing much
evidence that biological learning mechanisms share at least some properties
with these models ([29, 25], but see [17]). Thus, the primary relevance of this
work lies in its application to the interpretation of neuroscience data; we dis-
cuss some examples in the next section.
The broad spectrum of more traditional model-free reinforcement learning
approaches such as the one taken here suggests a number of possible extensions
to the current proposal. For instance, it would be of interest to determine if
differential learning rates can be advantageous in temporal credit assignment
problems (i.e. those requiring a temporal-difference component, learning to
choose actions that are not themselves rewarded but lead to reward later; for
a rare experimental investigation of RL variables on such a task see [6]). We
tentatively suggest that, because the reported performance differences arise
from the stochasticity in outcome distributions, it may be sufficient to
apply differential learning rates only to actions or states associated with actual
rewards. The resulting biased value estimates may then be propagated using
standard temporal-difference methods.
A number of experimental studies have linked the regulation of differential
learning rates to the neuromodulator dopamine. For instance, Parkinsonian
patients off medication (i.e. with reduced dopamine levels) update RL value
estimates more in response to negative outcomes compared to positive out-
comes; the reverse is the case when subjects are given L-DOPA treatment,
elevating DA (dopamine) levels [15]. A similar effect was found based on ge-
netic polymorphisms in DA receptors [14]. Thus, subjects with relatively low
DA levels behave like our “pessimistic” learner whereas those with high DA
levels are more “optimistic”. Our results therefore suggest that Parkinsonian
patients (off medication) would do well on tasks like our high-reward task, but
do poorly on low-reward tasks.
An important distinction is that between phasic and tonic dopamine, which
are thought to contribute to learning and response vigor, respectively. In par-
ticular, tonic dopamine has been suggested to reflect the average reward rate
or “opportunity cost” [27]. Our proposal uses this same estimate for the adap-
tive scaling of learning rates. Thus, in probabilistic learning tasks such as the
ones used here, we would predict that under high reward rates, when tonic
dopamine is high, negative RPEs (reward prediction errors) have more impact
on learning than positive RPEs, because, as we show, this is adaptive in high
reward settings. This is a nontrivial prediction that can be tested with cur-
rent methods; it is critical, however, that the experimental design can separate
effects on learning rate from effects on the exploration-exploitation trade-off,
which has also been linked to tonic DA [19].
Dysregulation of DA levels has also been linked to various manifestations of
addiction, including those involving drugs of abuse [30] and problem gambling
[4]. The application of reinforcement learning models to such behaviors sug-
gests that a mis-estimation of choice values lies at the root of dysfunctional decision-
making. Such mis-estimation may arise for instance from a non-compensable
positive prediction error [30], or from inappropriate meta-learning of the adap-
tive learning rates proposed here. In this respect it is particularly interesting
to note that several studies have specifically attributed meta-learning roles to
frontal areas also linked to addiction [7,21, 36]. The applicability of the norma-
tive ideas developed here to these settings would benefit from an empirical test
of positive and negative learning rates in affected individuals, perhaps along
the lines of [3].
Our results also suggest a further alternative to improving reinforcement
learning fits to behavioral and neural data. As noted previously, to improve on
the basic RL model with a single learning rate, some studies have used models
with distinct learning rates for positive and negative prediction errors [13],
requiring the estimation of two free parameters from the data. In contrast,
because our meta-learner sets its learning rates based on observed reward
probabilities, it can potentially account for a similar range of behaviors with
fewer parameters (one meta-parameter w instead of the two learning rates).
How could differential learning rates be implemented in the brain? The
experimental investigation of neural mechanisms for prediction-error based
learning/updating is currently an active area of research. An influential pro-
posal is that RPEs may have different effects on direct (D1) and
indirect (D2) pathways in the striatum [16,23]. In this view, RPEs may
initially be symmetrically encoded in the firing rate of DA
neurons, but the impact of these RPEs on basal ganglia synaptic weights
may be modulated independently depending on their direction (sign). One
possible mechanism for this may involve the modulation of presynaptic DA
terminals in the striatum by e.g. hippocampal inputs [18]; an idea that can be
tested with current voltammetry methods. Alternatively, the coding of RPEs
by DA neurons may be asymmetric to begin with, due to a smaller dynamic
range in the negative direction from baseline [28]. More radical ideas propose
a complete separation of DA as only supporting learning from positive RPEs
[12]. In any case, these experimental findings demonstrate a striking level of
dissociation of learning pathways for negative and positive prediction errors,
suggestive of broad adaptive value. Our results suggest a normative explana-
tion for these observations, and some encouragement can perhaps be derived
from the demonstration that in (some) situations where rewards are rare, it
pays to be optimistic.
Acknowledgements This work originated at the Okinawa Computational Neuroscience
Course at the Okinawa Institute of Science and Technology (OIST), Japan. We are grateful
to the organizers for providing a stimulating learning environment.
References
1. Timothy E J Behrens, Mark W Woolrich, Mark E Walton, and Matthew F S Rush-
worth. Learning the value of information in an uncertain world. Nature Neuroscience,
10(9):1214–21, September 2007.
2. Ethan S Bromberg-Martin, Masayuki Matsumoto, and Okihide Hikosaka. Dopamine in
motivational control: rewarding, aversive, and alerting. Neuron, 68(5):815–34, December
2010.
3. James F. Cavanagh, Michael J. Frank, and John J B. Allen. Social stress reactivity
alters reward and punishment learning. Soc Cogn Affect Neurosci, 6(3):311–320, Jun
2011.
4. Henry W. Chase and Luke Clark. Gambling severity predicts midbrain response to
near-miss outcomes. J Neurosci, 30(18):6180–6187, May 2010.
5. Mathieu D’Acremont and Peter Bossaerts. Neurobiological studies of risk assessment:
a comparison of expected utility and mean-variance approaches. Cognitive, Affective &
Behavioral Neuroscience, 8(4):363–74, December 2008.
6. Nathaniel D. Daw, Samuel J. Gershman, Ben Seymour, Peter Dayan, and Raymond J.
Dolan. Model-based influences on humans’ choices and striatal prediction errors. Neu-
ron, 69(6):1204–1215, Mar 2011.
7. Nathaniel D. Daw, John P. O’Doherty, Peter Dayan, Ben Seymour, and Raymond J.
Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–
879, Jun 2006.
8. Peter Dayan and Yael Niv. Reinforcement learning: the good, the bad and the ugly.
Curr Opin Neurobiol, 18(2):185–196, Apr 2008.
9. Bradley B. Doll, W Jake Jacobs, Alan G. Sanfey, and Michael J. Frank. Instructional
control of reinforcement learning: a behavioral and neurocomputational investigation.
Brain Res, 1299:74–94, Nov 2009.
10. K Doya. Metalearning and neuromodulation. Neural Networks, 15(4-6):495–506, 2002.
11. Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple
model-based reinforcement learning. Neural Comput, 14(6):1347–1369, Jun 2002.
12. Christopher D. Fiorillo. Two dimensions of value: dopamine neurons represent reward
but not aversiveness. Science, 341(6145):546–549, Aug 2013.
13. M. J. Frank, A. A. Moustafa, H. M. Haughey, T. Curran, and K. E. Hutchison. Ge-
netic triple dissociation reveals multiple roles for dopamine in reinforcement learning.
Proceedings of the National Academy of Sciences, 104(41):16311–16316, 2007.
14. Michael J Frank, Bradley B Doll, Jen Oas-Terpstra, and Francisco Moreno. Prefrontal
and striatal dopaminergic genes predict individual differences in exploration and ex-
ploitation. Nature Neuroscience, 12(8):1062–8, August 2009.
15. Michael J. Frank, Lauren C Seeberger, and Randall C O’reilly. By carrot or by stick:
cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–3, December
2004.
16. C. R. Gerfen, T. M. Engber, L. C. Mahan, Z. Susel, T. N. Chase, F. J. Monsma, Jr.,
and D. R. Sibley. D1 and D2 dopamine receptor-regulated gene expression of
striatonigral and striatopallidal neurons. Science, 250:1429–1432, December 1990.
17. Samuel J Gershman and Yael Niv. Learning latent structure: carving nature at its
joints. Current Opinion in Neurobiology, 20(2):251–6, April 2010.
18. Anthony A. Grace. Dopamine system dysregulation by the hippocampus: implica-
tions for the pathophysiology and treatment of schizophrenia. Neuropharmacology,
62(3):1342–1348, Mar 2012.
19. Mark D Humphries, Mehdi Khamassi, and Kevin Gurney. Dopaminergic Control of the
Exploration-Exploitation Trade-Off via the Basal Ganglia. Frontiers in Neuroscience,
6(February):9, January 2012.
20. Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under
risk. Econometrica: Journal of the Econometric Society, 47(2):263–292, 1979.
21. Mehdi Khamassi, Pierre Enel, Peter Ford Dominey, and Emmanuel Procyk. Medial
prefrontal cortex and the adaptive regulation of reinforcement learning parameters.
Prog Brain Res, 202:441–464, 2013.
22. Mehdi Khamassi, Stéphane Lallée, Pierre Enel, Emmanuel Procyk, and Peter F
Dominey. Robot cognitive control with a neurophysiologically inspired reinforcement
learning model. Frontiers in Neurorobotics, 5(July):1, January 2011.
23. Alexxai V Kravitz, Lynne D Tye, and Anatol C Kreitzer. Distinct roles for direct and
indirect pathway striatal neurons in reinforcement. Nature Neuroscience, pages 4–7,
April 2012.
24. Zeb Kurth-Nelson and A David Redish. Temporal-difference reinforcement learning
with distributed representations. PLoS One, 4(10):e7362, 2009.
25. Tiago V Maia and Michael J Frank. From reinforcement learning models to psychiatric
and neurological disorders. Nature Neuroscience, 14(2):154–62, February 2011.
26. Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine
Learning, 49:267–290, 2002.
27. Yael Niv, Nathaniel D Daw, Daphna Joel, and Peter Dayan. Tonic dopamine: opportu-
nity costs and the control of response vigor. Psychopharmacology, 191(3):507–20, April
2007.
28. Yael Niv, Michael O. Duff, and Peter Dayan. Dopamine, uncertainty and TD learning.
Behav Brain Funct, 1:6, May 2005.
29. John P O’Doherty, Alan Hampton, and Hackjin Kim. Model-based fMRI and its ap-
plication to reward learning and decision making. Annals of the New York Academy of
Sciences, 1104:35–53, May 2007.
30. A David Redish. Addiction as a computational process gone awry. Science,
306(5703):1944–1947, Dec 2004.
31. A David Redish, Steve Jensen, Adam Johnson, and Zeb Kurth-Nelson. Reconciling
reinforcement learning models with behavioral extinction and renewal: implications for
addiction, relapse, and problem gambling. Psychol Rev, 114(3):784–805, Jul 2007.
32. Wolfram Schultz. Behavioral theories and the neurophysiology of reward. Annual Re-
view of Psychology, 57:87–115, January 2006.
33. Nicolas Schweighofer and Kenji Doya. Meta-learning in reinforcement learning. Neural
Netw, 16(1):5–9, Jan 2003.
34. Tali Sharot. The optimism bias. Current Biology, 21(23):R941–5, December 2011.
35. Tali Sharot, Christoph W Korn, and Raymond J Dolan. How unrealistic optimism is
maintained in the face of reality. Nature Neuroscience, 14(11):1475–1479, October 2011.
36. Amitai Shenhav, Matthew M. Botvinick, and Jonathan D. Cohen. The expected value of
control: an integrative theory of anterior cingulate cortex function. Neuron, 79(2):217–
240, Jul 2013.
37. Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. PhD
thesis, University of Massachusetts Amherst, January 1984.
38. R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press,
Cambridge Mass., September 1998.
39. Matthijs van der Meer, Zeb Kurth-Nelson, and A. David Redish. Information processing
in decision-making systems. The Neuroscientist, 18(4):342–59, August 2012.
40. Christopher Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, May 1989.
41. Angela J. Yu. Adaptive behavior: humans act as bayesian learners. Curr Biol,
17(22):R977–R980, Nov 2007.
... However, their normative status has not been systematically assessed and compared yet. In fact, while some studies have investigated the optimality of asymmetric update, they have focused either on quite specific environments 14 , or feedback information regimes 15 and they have not compared it with gradual perseveration. The optimality of gradual perseveration has seldom been investigated, and in any case not in bandit tasks 16,17 . ...
... In this paper, we address this question by deriving a new variant of an evolutionary algorithm and simulating agents in several variants of two-armed bandit tasks, similar to those used in the lab to empirically study reinforcement learning in humans and animals. In addition to the crucial inclusion of gradual perseveration and the utilization of evolutionary simulations, our analysis differs from previous ones by our focus on partial feedback regimen and considering a broad range of task's outcome contingencies 14 . More specifically, across simulations we manipulated the task difficulty (difference in value between options), richness (the average value of the two options), how many times a given pair of options was presented and, finally, task volatility, in terms of reversal frequency and probability distributions. ...
... Across the stable environments (Figure 4), we observed that the positivity bias was systematical selected by our evolutionary algorithm in all the environments and all the reboots, with only one exception: we noted that + decreases with increasing richness of the environment, while − increases, suggesting a decreasing optimism as the average expected value increases to the point that in the richest environment (45%/95%), the update asymmetry favors a pessimism bias in 64% of the reboots ( Table 2). This results reflects the fact that is generally better to learn more from rare outcomes (i.e., in an environment in which positive outcomes are commonrich environmentthe agents should pay more attention to the negative ones 14 ). We also observed that reducing the learning period was associated with an increase of both the average learning rate and the positivity bias, which reflect the fact that in shorter task the agents must learn quickly and cannot wait to accumulate a lot of experience. ...
Preprint
Full-text available
The tendency of repeating past choices more often than expected from the history of outcomes has been repeatedly empirically observed in reinforcement learning experiments. It can be explained by at least two computational processes: asymmetric update and (gradual) choice perseveration. A recent meta-analysis showed that both mechanisms are detectable in human reinforcement learning. However, while their descriptive value seems to be well established, they have not been compared regarding their possible adaptive value. In this study, we address this gap by simulating reinforcement learning agents in a variety of environments with a new variant of an evolutionary algorithm. Our results show that positivity bias (in the form of asymmetric update) is evolutionary stable in many situations, while the emergence of gradual perseveration is less systematic and robust. Overall, our results illustrate that biases can be adaptive and selected by evolution, in an environment-specific manner.
... Besides this evidence, the potential adaptive value of positivity and confirmation biases has been stressed by several other authors. For instance, a few simulation studies suggest that, across a wide range of environments, virtual RL agents with a positivity bias [14,15] or with a confirmation bias [16,17] perform better than their counterparts. Therefore, confirmation bias seems to enhance individual decision-making rather than harm it. ...
... Further analysis (see S1 Text) reveals that small groups behave like optimistic agents: they inflate both options' Q-values. A previous study [14] demonstrated that such optimism is beneficial in poor environments but not in abundant ones, as Q-values' inflation saturates when approaching 1. For this reason, small groups of confirmatory agents end up with smaller Q-value gaps in rich environments, which negatively impacts their performance. ...
... In short, there are no other agents to feed it counterfactual information. Cazé & van der Meer had previously shown that a positivity bias leads to poorer performance for a single agent in a rich environment [14]. But it was not obvious that a confirmation bias would also lead to poorer performance in groups of two or three agents. ...
Article
Full-text available
Humans tend to give more weight to information confirming their beliefs than to information that disconfirms them. Nevertheless, this apparent irrationality has been shown to improve individual decision-making under uncertainty. However, little is known about this bias’ impact on decision-making in a social context. Here, we investigate the conditions under which confirmation bias is beneficial or detrimental to decision-making under social influence. To do so, we develop a Collective Asymmetric Reinforcement Learning (CARL) model in which artificial agents observe others’ actions and rewards, and update this information asymmetrically. We use agent-based simulations to study how confirmation bias affects collective performance on a two-armed bandit task, and how resource scarcity, group size and bias strength modulate this effect. We find that a confirmation bias benefits group learning across a wide range of resource-scarcity conditions. Moreover, we discover that, past a critical bias strength, resource abundance favors the emergence of two different performance regimes, one of which is suboptimal. In addition, we find that this regime bifurcation comes with polarization in small groups of agents. Overall, our results suggest the existence of an optimal, moderate level of confirmation bias for decision-making in a social context.
... Therefore, STL predicts that participants would increase their target number of pumps following the collection of a reward, and reduce their target number of pumps following an unwanted balloon burst. Crucially, STL implements separate learning rates for wins ( vwin ) and losses ( vloss ) to account for the distinct degrees of sensitivity to rewards and punishments (Cazé & van der Meer, 2013;Corr, 2004;Frank et al., 2007;Gray, 1975;Lefebvre et al., 2017;Niv et al., 2011;Sharot et al., 2011) and the differential neural mechanisms that implement approach and avoidance learning (Daw et al., 2002;O'Doherty et al., 2001;Schultz, 2016;Palminteri & Pessiglione, 2017;Seymour et al., 2007). ...
... These models were originally developed to reliably and meaningfully characterise learning during sequential decisions. Crucially, the model distinguishes learning from positive and negative feedback by estimating differential learning rates to account for the distinct sensitivity to rewards and punishments (Cazé & van der Meer, 2013;Corr, 2004;Frank et al., 2007;Gray, 1975;Lefebvre et al., 2017;Niv et al., 2011;Sharot et al., 2011) and the separate neural processes facilitating approach and avoidance learning (Daw et al., 2002;O'Doherty et al., 2001;Schultz, 2016;Palminteri & Pessiglione, 2017;Seymour et al., 2007). well-established phenomenon that learning rates increase under heightened levels of uncertainty Browning et al., 2015;. ...
Preprint
Do we preferentially learn from positive rather than negative decision outcomes? Previous studies indicated that such bias characterises learning during simple reward learning tasks. However, no research has yet confirmed whether learning bias is also present during sequential decision making under uncertainty. To fill this gap, we utilised a complex yet ecologically valid paradigm, the Balloon Analogue Risk Task (BART), which measures risk-taking propensity under uncertainty in everyday decision making. Comparing learning from positive and negative outcomes in the BART has been made possible by the Scaled Target Learning model, which characterises both risk-taking propensity and sensitivity to wins and losses. For the first time, we applied this model to a modified BART paradigm with different levels of perceived uncertainty. Crucially, our analyses revealed learning bias during high levels of uncertainty, under which condition bias was negatively tied to task performance. Furthermore, increased sensitivity to wins compared to losses was linked to more risk-seeking behaviour across all conditions, suggesting that learning bias could mediate risky behaviour. Overall, our results contribute to a more accurate characterisation of reward learning behaviour and suggest that learning bias arises when the level of perceived uncertainty surges.
... NeuroImage: Clinical 45 (2025) 103762 outcomes, respectively. Models that produce the learning rates for the signed PE allow an asymmetric effect of better or worse (than expected) outcome on learning (Cazé and van der Meer, 2013). Furthermore, the learning rates can be modeled separately for each trial type. ...
Article
Full-text available
Individuals engaging in problem drinking show impaired proactive pain avoidance. As successful pain avoidance is intrinsically rewarding, this impairment suggests reward deficiency, as hypothesized for those with alcohol and substance misuse. Nevertheless, how reward circuit dysfunctions impact avoidance learning and contribute to drinking behavior remains poorly understood. Here, we combined functional imaging and a probabilistic learning go/nogo task to examine the neural processes underlying proactive pain avoidance learning in 103 adult drinkers. We hypothesized that greater drinking severity would be associated with poorer avoidance learning and that the deficits would be accompanied by weakened activity and connectivity of the reward circuit. Our behavioral findings indeed showed a negative relationship between drinking severity and learning from successful pain avoidance. We identified hypoactivation of the posterior cingulate cortex (PCC), a brain region important in avoidance, as the neural correlate of lower learning rate in association with problem drinking. The reward circuit, including the medial orbitofrontal cortex, ventral tegmental area, and substantia nigra, also exhibited diminished activation and connectivity with the PCC with greater drinking severity and learning deficits. Finally, path modeling suggested a pathway in which problem drinking disengaged the reward circuit. The weakened circuit subsequently induced PCC hypoactivation, resulting in poorer pain avoidance learning. As the learning dysfunction worsened alcohol use, the pathway represents a self-perpetuating cycle of drinking and distress. Together, these findings substantiate a role of reward deficiency in problem drinkers’ compromised proactive avoidance, thus identifying a potential target for intervention aimed at mitigating harmful alcohol use.
... While such effects can lead to biases in certain contexts, they might have evolved to strengthen the robustness of learning and decision-making. One line of research argues that positivity and confirmation biases lead to increased differences in reward estimates between options [41][42][43] . For example, when the true underlying reward probability of a favorable option is 0.6, and the probability of the worse option is 0.4, such biases can increase the subjective difference between them so that, for example, the favorable option has a subjective probability of 0.7. ...
Article
Full-text available
Learning allows humans and other animals to make predictions about the environment that facilitate adaptive behavior. Casting learning as predictive inference can shed light on normative cognitive mechanisms that improve predictions under uncertainty. Drawing on normative learning models, we illustrate how learning should be adjusted to different sources of uncertainty, including perceptual uncertainty, risk, and uncertainty due to environmental changes. Such models explain many hallmarks of human learning in terms of specific statistical considerations that come into play when updating predictions under uncertainty. However, humans also display systematic learning biases that deviate from normative models, as studied in computational psychiatry. Some biases can be explained as normative inference conditioned on inaccurate prior assumptions about the environment, while others reflect approximations to Bayesian inference aimed at reducing cognitive demands. These biases offer insights into cognitive mechanisms underlying learning and how they might go awry in psychiatric illness.
... Behavioral data were analyzed in the assumption that each mouse continuously attributed an action value to each hole, as previously described [13,34,48]. Action values were determined according to the Differential Learning Rate Q-learning model (DLR-Q) [9,14,29,48], which can be summarized as follows: ...
Article
Full-text available
The large-conductance calcium-and voltage-activated potassium (BK) channels, encoded by the KCNMA1 gene, play important roles in neuronal function. Mutations in KCNMA1 have been found in patients with various neurodevel-opmental features, including intellectual disability, autism spectrum disorder (ASD), or attention deficit hyperactivity disorder (ADHD). Previous studies of KCNMA1 knockout mice have suggested altered activity patterns and behav-ioral flexibility, but it remained unclear whether these changes primarily affect immediate behavioral adaptation or longer-term learning processes. Using a 5-armed bandit task (5-ABT) and a novel Δrepeat rate analysis method that considers individual baseline choice tendencies, we investigated immediate trial-by-trial Win-Stay-Lose-Shift (WSLS) strategies and learning rates across multiple trials in KCNMA1 knockout (KCNMA1 −/−) mice. Three key findings emerged: (1) Unlike wildtype mice, which showed increased Δrepeat rates after rewards and decreased rates after losses, KCNMA1 −/− mice exhibited impaired WSLS behavior, (2) KCNMA1 −/− mice displayed shortened response intervals after unrewarded trials, and (3) despite these short-term behavioral impairments, their learning rates and task accuracy remained comparable to wildtype mice, with significantly shorter task completion times. These results suggest that BK channel dysfunction primarily alters immediate behavioral responses to outcomes in the next trial rather than affecting long-term learning capabilities. These findings and our analytical method may help identify behavioral phenotypes in animal models of both BK channel-related and other neurodevelopmental disorders.
... Our winning model is a two-learning-rate Rescorla-Wagner model (2lr-RW) (82), a simple variant of the RW model (12). In this model, two different learning rates (α+ and α-) are assigned to the positive and negative prediction errors, respectively (Equation 1); a small simulation of the behavioral consequences of this asymmetry follows the abstract below. ...
Preprint · Full-text available
Maladaptive responses to uncertainty, including excessive risk avoidance, are linked to a range of mental disorders. One expression of these is a pro-variance bias (PVB), wherein risk-seeking manifests as a preference for choosing options with higher variance/uncertainty. Here, using a magnitude learning task, we provide a behavioural and neural account of PVB in humans. We show that individual differences in PVB are captured by a computational model that includes asymmetric learning rates, allowing differential learning from positive prediction errors (PPEs) and negative prediction errors (NPEs). Using high-resolution 7T functional magnetic resonance imaging (fMRI), we identify distinct neural responses to PPEs and NPEs in value-sensitive regions including habenula (Hb), ventral tegmental area (VTA), nucleus accumbens (NAcc), and ventral medial prefrontal cortex (vmPFC). Prediction error signals in NAcc and vmPFC were boosted for high-variance options. NPE responses in NAcc were associated with a negative bias in learning rates, linked to a stronger negative Hb-VTA functional coupling during NPE encoding. A mediation analysis revealed that this coupling influenced NAcc responses to NPEs via an impact on learning rates. These findings implicate Hb-VTA coupling in the emergence of risk preferences during learning, with implications for psychopathology.
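The asymmetric-learning-rate account of pro-variance bias summarized above (and in the excerpt preceding this entry) can be illustrated with a small simulation; the parameter values and the two toy options below are assumptions, not the preprint's task or fitted model.

```python
import random

def asymmetric_update(v, outcome, a_pos=0.2, a_neg=0.1):
    """Delta-rule update with separate learning rates for positive and negative errors."""
    delta = outcome - v
    return v + (a_pos if delta > 0 else a_neg) * delta

v_safe, v_risky = 0.0, 0.0
for _ in range(5000):
    v_safe = asymmetric_update(v_safe, 5.0)                                       # always pays 5
    v_risky = asymmetric_update(v_risky, 10.0 if random.random() < 0.5 else 0.0)  # mean is also 5

print(round(v_safe, 2), round(v_risky, 2))
# v_safe converges to 5.0, while v_risky hovers near 10 * a_pos / (a_pos + a_neg) ~ 6.7,
# so the higher-variance option acquires the larger learned value (a pro-variance bias).
```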
Article
Recent evidence highlights that monetary rewards can increase the precision at which healthy human volunteers can detect small changes in the intensity of thermal noxious stimuli, contradicting the idea that rewards exert a broad inhibiting influence on pain perception. This effect was stronger with contingent rewards compared with noncontingent rewards, suggesting a successful learning process. In the present study, we implemented a model comparison approach that aimed to improve our understanding of the mechanisms that underlie thermal noxious discrimination in humans. In a between-subject design, 54 healthy human volunteers took part in a pain discrimination task with monetary rewards either contingent or noncontingent on successful discrimination of small changes in the intensity of painful heat stimulation. We used models from two traditions in decision-making research: perceptual decision-making and instrumental learning. Replicating the previous findings, only rewards contingent on behavior enhanced pain discrimination. Drift diffusion modelling revealed increased sensory signal strength and decreased response caution and nondecision times as mechanisms underlying this effect of contingent rewards on pain discrimination. In addition, reinforcement learning models indicated a temporal evolution of discriminative abilities, reflected by a trial-by-trial increase of perceived signal strength with contingent rewards but not with noncontingent rewards. Modelling of separate learning rates for positive and negative prediction errors indicated that this temporal evolution of discriminative abilities was driven by positive reward prediction errors. These results might indicate increased sensitivity towards better-than-expected outcomes in the temporal adaptation of pain discrimination abilities to a rewarding context in humans.
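For readers unfamiliar with the drift-diffusion parameters named in this abstract, below is a minimal, generic simulation of a drift-diffusion decision process; the mapping (drift rate as sensory signal strength, boundary separation as response caution, plus a nondecision time) follows standard usage, and all parameter values are illustrative assumptions rather than the study's fitted estimates.

```python
import random

def ddm_trial(drift=0.8, boundary=1.0, ndt=0.3, dt=0.001, noise=1.0):
    """Simulate one drift-diffusion trial; returns (response, reaction time in seconds)."""
    x, t = 0.0, 0.0
    while abs(x) < boundary:
        x += drift * dt + noise * random.gauss(0.0, dt ** 0.5)  # Euler-Maruyama step
        t += dt
    return ("correct" if x > 0 else "error", ndt + t)

# Stronger signal (higher drift) and less caution (lower boundary) yield faster decisions.
print(ddm_trial(drift=1.5, boundary=0.8))
```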
Article
Real-world choice options have many features or attributes, whereas the reward outcome from those options only depends on a few features or attributes. It has been shown that humans combine feature-based learning with more complex conjunction-based learning to tackle the challenges of learning in naturalistic reward environments. However, it remains unclear how different learning strategies interact to determine what features or conjunctions should be attended to and control choice behavior, and how subsequent attentional modulations influence future learning and choice. To address these questions, we examined the behavior of male and female human participants during a three-dimensional learning task in which reward outcomes for different stimuli could be predicted based on a combination of an informative feature and conjunction. Using multiple approaches, we found that both choice behavior and reward probabilities estimated by participants were most accurately described by attention-modulated models that learned the predictive values of both the informative feature and the informative conjunction. Specifically, in the reinforcement learning model that best fit choice data, attention was controlled by the difference in the integrated feature and conjunction values. The resulting attention weights modulated learning by increasing the learning rate on attended features and conjunctions. Critically, modulating decision-making by attention weights did not improve the fit of data, providing little evidence for direct attentional effects on choice. These results suggest that in multidimensional environments, humans direct their attention not only to selectively process reward-predictive attributes but also to find parsimonious representations of the reward contingencies for more efficient learning.
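As a rough schematic of attention-modulated learning of the kind described here, the sketch below lets a feature-based and a conjunction-based value estimate share a prediction error, with attention weights (a softmax over their current values) scaling each estimate's effective learning rate; the function names, softmax form, and constants are assumptions, not the authors' fitted model.

```python
import math

def softmax(xs, beta=3.0):
    m = max(xs)
    exps = [math.exp(beta * (x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_modulated_update(v_feature, v_conjunction, reward, alpha=0.3):
    """Toy update: attention is split between feature- and conjunction-based estimates
    in proportion to their current values, and it scales each effective learning rate."""
    w_feat, w_conj = softmax([v_feature, v_conjunction])
    v_feature += alpha * w_feat * (reward - v_feature)
    v_conjunction += alpha * w_conj * (reward - v_conjunction)
    return v_feature, v_conjunction

print(attention_modulated_update(0.4, 0.7, reward=1.0))
```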
Article · Full-text available
Converging evidence suggests that the medial prefrontal cortex (MPFC) is involved in feedback categorization, performance monitoring, and task monitoring, and may contribute to the online regulation of reinforcement learning (RL) parameters that would affect decision-making processes in the lateral prefrontal cortex (LPFC). Previous neurophysiological experiments have shown MPFC activities encoding error likelihood, uncertainty, and reward volatility, as well as neural responses categorizing different types of feedback, for instance, distinguishing between choice errors and execution errors. Rushworth and colleagues have proposed that the involvement of MPFC in tracking the volatility of the task could contribute to the regulation of one of the RL parameters, the learning rate. We extend this hypothesis by proposing that MPFC could contribute to the regulation of other RL parameters, such as the exploration rate and default action values in the case of task shifts. Here, we analyze the sensitivity to RL parameters of behavioral performance in two monkey decision-making tasks, one with a deterministic reward schedule and the other with a stochastic one. We show that there exist optimal parameter values specific to each of these tasks, which need to be found for optimal performance and which are usually hand-tuned in computational models. In contrast, automatic online regulation of these parameters using some heuristics can help produce good, although non-optimal, behavioral performance in each task. We finally describe our computational model of MPFC-LPFC interaction used for online regulation of the exploration rate and its application to a human-robot interaction scenario. There, unexpected uncertainties are produced by the human introducing cued task changes or by cheating. The model enables the robot to autonomously learn to reset exploration in response to such uncertain cues and events. The combined results provide concrete evidence specifying how prefrontal cortical subregions may cooperate to regulate RL parameters. It also shows how such neurophysiologically inspired mechanisms can control advanced robots in the real world. Finally, the model's learning mechanisms, which were challenged in the last robotic scenario, provide testable predictions on the way monkeys may learn the structure of the task during the pretraining phase of the previous laboratory experiments.
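As a deliberately simplified illustration of online regulation of an exploration parameter, the sketch below raises the softmax inverse temperature when recent outcomes match predictions and lowers it after surprising feedback; the heuristic, function names, and constants are assumptions and do not reproduce the authors' MPFC-LPFC model.

```python
import math
import random

def softmax_choice(q_values, beta):
    """Softmax action selection; beta is the inverse temperature (higher = more exploitative)."""
    m = max(q_values)
    prefs = [math.exp(beta * (q - m)) for q in q_values]
    r, acc = random.random() * sum(prefs), 0.0
    for action, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return action
    return len(prefs) - 1

def regulate_beta(beta, surprise, beta_min=0.5, beta_max=10.0, rate=0.1):
    """Toy meta-rule: large surprises (|prediction error|, assumed in [0, 1]) push beta
    toward exploration, small surprises push it toward exploitation."""
    target = beta_max - surprise * (beta_max - beta_min)
    return beta + rate * (min(max(target, beta_min), beta_max) - beta)

q = [0.2, 0.8]
beta = 1.0
action = softmax_choice(q, beta)
beta = regulate_beta(beta, surprise=abs(1.0 - q[action]))  # pretend the outcome was 1.0
print(action, round(beta, 2))
```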
Article · Full-text available
Dopamine signaling is implicated in reinforcement learning, but the neural substrates targeted by dopamine are poorly understood. We bypassed dopamine signaling itself and tested how optogenetic activation of dopamine D1 or D2 receptor–expressing striatal projection neurons influenced reinforcement learning in mice. Stimulating D1 receptor–expressing neurons induced persistent reinforcement, whereas stimulating D2 receptor–expressing neurons induced transient punishment, indicating that activation of these circuits is sufficient to modify the probability of performing future actions.
Article · Full-text available
Decisions result from an interaction between multiple functional systems acting in parallel to process information in very different ways, each with strengths and weaknesses. In this review, the authors address three action-selection components of decision-making: The Pavlovian system releases an action from a limited repertoire of potential actions, such as approaching learned stimuli. Like the Pavlovian system, the habit system is computationally fast but, unlike the Pavlovian system, permits arbitrary stimulus-action pairings. These associations are a "forward" mechanism; when a situation is recognized, the action is released. In contrast, the deliberative system is flexible but takes time to process. The deliberative system uses knowledge of the causal structure of the world to search into the future, planning actions to maximize expected rewards. Deliberation depends on the ability to imagine future possibilities, including novel situations, and it allows decisions to be taken without having previously experienced the options. Various anatomical structures have been identified that carry out the information processing of each of these systems: hippocampus constitutes a map of the world that can be used for searching/imagining the future; dorsal striatal neurons represent situation-action associations; and ventral striatum maintains value representations for all three systems. Each system presents vulnerabilities to pathologies that can manifest as psychiatric disorders. Understanding these systems and their relation to neuroanatomy opens up a deeper way to treat the structural problems underlying various disorders.
Article
Whereas reward (appetitiveness) and aversiveness (punishment) have been distinguished as two discrete dimensions within psychology and behavior, physiological and computational models of their neural representation have treated them as opposite sides of a single continuous dimension of “value.” Here, I show that although dopamine neurons of the primate ventral midbrain are activated by evidence for reward and suppressed by evidence against reward, they are insensitive to aversiveness. This indicates that reward and aversiveness are represented independently as two dimensions, even by neurons that are closely related to motor function. Because theory and experiment support the existence of opponent neural representations for value, the present results imply four types of value-sensitive neurons corresponding to reward-ON (dopamine), reward-OFF, aversive-ON, and aversive-OFF.
Article
The dorsal anterior cingulate cortex (dACC) has a near-ubiquitous presence in the neuroscience of cognitive control. It has been implicated in a diversity of functions, from reward processing and performance monitoring to the execution of control and action selection. Here, we propose that this diversity can be understood in terms of a single underlying function: allocation of control based on an evaluation of the expected value of control (EVC). We present a normative model of EVC that integrates three critical factors: the expected payoff from a controlled process, the amount of control that must be invested to achieve that payoff, and the cost in terms of cognitive effort. We propose that dACC integrates this information, using it to determine whether, where and how much control to allocate. We then consider how the EVC model can explain the diverse array of findings concerning dACC function.
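Schematically, the EVC account can be written as a cost-benefit optimization over candidate control signals. The rendering below is a common paraphrase of that idea under the stated assumptions, not a verbatim restatement of the article's equations.

```latex
% Expected value of control: expected payoff of the controlled process,
% minus the intrinsic cost of exerting that control signal.
\[
  \mathrm{EVC}(\mathrm{signal}, \mathrm{state})
    \;=\; \sum_{i} P(\mathrm{outcome}_i \mid \mathrm{signal}, \mathrm{state})\,
          V(\mathrm{outcome}_i)
    \;-\; \mathrm{Cost}(\mathrm{signal}),
\]
\[
  \mathrm{signal}^{*} \;=\; \arg\max_{\mathrm{signal}} \mathrm{EVC}(\mathrm{signal}, \mathrm{state}).
\]
```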
Article
Most reinforcement learning algorithms optimize the expected return of a Markov Decision Problem. Practice has taught us that this criterion is not always the most suitable, because many applications require robust control strategies that also take into account the variance of the return. Classical control literature provides several techniques to deal with risk-sensitive optimization goals, such as the so-called worst-case optimality criterion, which focuses exclusively on risk-avoiding policies, or classical risk-sensitive control, which transforms the returns by exponential utility functions. While the first approach is typically too restrictive, the latter suffers from the absence of an obvious way to design a corresponding model-free reinforcement learning algorithm. Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy. Instead of transforming the return of the process, we transform the temporal differences during learning. While our approach reflects important properties of the classical exponential utility framework, we avoid its serious drawbacks for learning. Based on an extended set of optimality equations, we are able to formulate risk-sensitive versions of various well-known reinforcement learning algorithms, which converge with probability one under the usual conditions.
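The core idea of transforming temporal differences rather than returns can be sketched as a Q-learning variant in which positive and negative TD errors are reweighted by a risk parameter κ in (-1, 1); the specific weighting below is one common formulation of this idea and is offered as an illustration under those assumptions, not a restatement of the article's own equations.

```python
def risk_sensitive_q_update(q, reward, q_next_max, kappa=0.5, alpha=0.1, gamma=0.95):
    """One risk-sensitive Q-learning step: for kappa > 0 the TD error is scaled down
    when positive and scaled up when negative, yielding risk-averse value estimates."""
    delta = reward + gamma * q_next_max - q            # ordinary TD error
    weight = (1.0 - kappa) if delta > 0 else (1.0 + kappa)
    return q + alpha * weight * delta

# kappa = 0 recovers standard Q-learning; kappa -> 1 approaches worst-case (risk-averse)
# behavior; kappa < 0 produces risk-seeking estimates.
print(risk_sensitive_q_update(q=0.0, reward=1.0, q_next_max=0.0))
```

Note the kinship with the differential learning rates discussed in the main article: asymmetric scaling of the TD error acts like distinct effective learning rates for positive and negative prediction errors.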