Biological Cybernetics manuscript No.
(will be inserted by the editor)
Adaptive properties of differential learning rates for
positive and negative outcomes
Romain D. Cazé · Matthijs A. A. van der Meer
Received: date / Accepted: date
Abstract The concept of the reward prediction error – the difference between
reward obtained and reward predicted – continues to be a focal point for much
theoretical and experimental work in psychology, cognitive science, and neuro-
science. Models that rely on reward prediction errors typically assume a single
learning rate for positive and negative prediction errors. However, behavioral
data indicate that better-than-expected and worse-than-expected outcomes of-
ten do not have symmetric impacts on learning and decision-making. Further-
more, distinct circuits within cortico-striatal loops appear to support learn-
ing from positive and negative prediction errors respectively. Such differential
learning rates would be expected to lead to biased reward predictions, and
therefore suboptimal choice performance. Contrary to this intuition, we show
that on static “bandit” choice tasks, differential learning rates can be adap-
tive. This occurs because asymmetric learning enables a better separation of
learned reward probabilities. We show analytically how the optimal learning
rate asymmetry depends on the reward distribution, and implement a bio-
logically plausible algorithm that adapts the balance of positive and negative
learning rates from experience. These results suggest specific adaptive advan-
tages for separate, differential learning rates in simple reinforcement learning
settings, and provide a novel, normative perspective on the interpretation of
associated neural data.

RC is supported by a Marie Curie initial training fellowship. MvdM is supported by the
Natural Sciences and Engineering Research Council of Canada (NSERC).

Romain Cazé
Department of Bioengineering
Imperial College
London, England
r.caze@imperial.ac.uk

Matthijs van der Meer
Department of Biology and Centre for Theoretical Neuroscience
University of Waterloo, Canada
mvdm@uwaterloo.ca
Keywords Reinforcement learning · Reward prediction error · Decision-making · Meta-learning · Basal ganglia
1 Introduction
A central element in the major theories of reinforcement learning is the reward
prediction error (RPE) which, in its simplest form, is the difference between
the amount of received and expected reward [38]. Thus, a positive RPE is gen-
erated when more is received than expected, and conversely, negative RPEs
occur when less is received than expected. Proposals based on RPEs can ex-
plain salient features of Pavlovian and instrumental learning [32] and generalize
to powerful algorithms that can learn about states and actions which are not
themselves rewarded, but result in reward later (e.g. temporal-difference re-
inforcement learning, [37]). Essentially, these algorithms attempt to estimate
from experience the expected reward resulting from states and actions, so that
the amount of reward obtained can be maximized.
RPEs appear to be encoded by the neural activity of several brain areas
known to support learning and motivated behavior, such as a population of
dopaminergic neurons in the ventral tegmental area (VTA) [2]. Projection
targets of these neurons encode action and state values [39], and reinforcement
learning model fits to behavioral and neural data can account for trial-by-trial
changes in both [29]. Taken together, these findings support the familiar notion
that circuits centered on the dopaminergic modulation of activity in cortico-
striatal loops implement a reinforcement learning system. Thus, reinforcement
learning models provide an explicit, computational account of the relationship
between RPEs, estimates of expected reward, and subsequent behavior. This
account is a powerful framework for understanding the neural basis of learning
and decision-making in mechanistic, biological detail.
Many reinforcement learning models implicitly assume that positive and
negative RPEs impact estimation of action and state values through a com-
mon gain factor or learning rate [38,29]. However, converging evidence in-
dicates that learning from positive and negative feedback, including positive
and negative RPEs, in fact relies on dissociable mechanisms in the brain. De-
pending on the specific setting, these distinct mechanisms are referred to as
approach/avoid, go/noGo, or direct/indirect pathways in the basal ganglia,
associated with predominantly D1 and D2-expressing projection neurons in
the striatum, respectively [16,23]. This neural separation of the direct and in-
direct pathways raises the possibility that distinct, differential learning rates
are associated with positive and negative RPEs.
In support of this idea, a number of studies have reported behavioral evi-
dence for differential learning rates in humans [15,13]. From a reinforcement
learning perspective, differential learning rates imply biased estimates of the
amount of expected reward, which in turn can lead to suboptimal decisions.
This “irrationality” notion is congruent with observations from psychology
and behavioral economics. For instance, a recent study examined the extent
to which subjects updated their estimates of the likelihood of various events
happening to them (e.g. get cancer, win the lottery) after being told the true
probabilities [35]. Strikingly, it was found that subjects tended to update their
estimates more if the odds were better (in their favour) than they thought.
Such differential updating could underlie the formation of a so-called “opti-
mism bias” [34] and has also been proposed as contributing to other biases
such as risk aversion in mean-variance learners [26,5].
Such behavioral evidence for asymmetric updating following positive (bet-
ter than expected) and negative (worse than expected) outcomes invites nor-
mative questions. Are these biases suboptimal in an absolute sense, but per-
haps the result of limited cognitive resources? Or, alternatively, are they optimized
for situations different from those tested? Here, we aim to address these
questions quantitatively in simple reinforcement learning models. We show
that even in simple, static bandit tasks, agents with differential learning rates
can outperform unbiased agents. These results provide a different view on (1)
the interpretation of behavioral results where such biases are often cast as
irrational, and (2) neurophysiological data comparing neural responses associ-
ated with positive and negative prediction errors. Finally, this work suggests
simple and biologically motivated improvements to boost the performance of
commonly used reinforcement learning models.
2 Impact of differential learning rates on value estimation
To model a tractable decision problem where the effects of differential learning
rates for positive and negative RPEs can be explored, we implement agents
that learn the values of actions on probabilistically rewarded “bandit” tasks.
That is, there is only one state with one (this section) or two possible actions
(the next section), the expected value of which the agent learns incrementally
from probabilistic feedback according to a simple delta rule. Unlike the stan-
dard form of typical reinforcement learning algorithms such as Q-learning [40],
however, our agents employ different learning rates for positive and negative
prediction errors, where ∆Q_t = (r_t − Q_t):

$$
Q_{t+1} = Q_t + \begin{cases} \alpha^+ \, \Delta Q_t & \text{if } \Delta Q_t \geq 0 \\ \alpha^- \, \Delta Q_t & \text{if } \Delta Q_t < 0 \end{cases}
$$
Q_t, by convention, corresponds to the expected reward of taking a certain
action at time step t ("Q-value"), and r_t is the actual reward received at time step
t. The usual state and action indices s, a are omitted because in this section
we only consider one state and one action. Unlike typical "symmetric" update
rules, the existence of two learning rates α+ and α− can lead to biased Q-values,
as we will show below.
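To make this update rule concrete, here is a minimal Python sketch (our own illustration, not the authors' code; names such as update_q, alpha_pos, and alpha_neg are assumptions) of the asymmetric delta rule for a single action:

```python
import random

def update_q(q, reward, alpha_pos, alpha_neg):
    """One asymmetric delta-rule update: a different gain is applied to
    positive and negative prediction errors."""
    delta = reward - q                                # prediction error, Delta Q_t
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return q + alpha * delta

# Example: a single action rewarded (+1) with probability p, otherwise -1.
p, q = 0.2, 0.0
for _ in range(800):
    reward = 1 if random.random() < p else -1
    q = update_q(q, reward, alpha_pos=0.4, alpha_neg=0.1)   # "optimistic" settings
print(q)   # fluctuates around the biased steady-state value, not the true mean 2p - 1
```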
We first derive an analytical expression for the steady-state Q-value for a
single action with a probabilistic, binary outcome. Without loss of generality,
we set the outcomes to r_t = 1 or r_t = −1, with probabilities p and 1 − p
respectively. Then, for sufficiently small α and Q_0 ∉ {−1, 1}, if r_t = 1 the
outcome is always superior to the predicted Q, whereas if r_t = −1 the outcome
is always inferior. Thus, the update rule for each trial is the following:

$$
Q_{t+1} = Q_t + \begin{cases} \alpha^+ (1 - Q_t) & \text{with probability } p \\ \alpha^- (-1 - Q_t) & \text{with probability } 1 - p \end{cases}
$$

At steady state, $\hat{Q}_{t+1} = \hat{Q}_t = Q_\infty$, where $\hat{Q}$ denotes the mean Q as $t \to \infty$. This yields:

$$
Q_\infty = Q_\infty + p\,\alpha^+ (1 - Q_\infty) + (1 - p)\,\alpha^- (-1 - Q_\infty)
$$

If we define x as the ratio of the learning rate for positive prediction errors to
that for negative prediction errors, x = α+/α−, we can rewrite the previous
equation and solve for Q∞:

$$
Q_\infty = \frac{px - (1 - p)}{px + (1 - p)} \tag{1}
$$
A number of properties of this result are worth noting. When α− > α+, the
Q-values are "pessimistic", i.e. they are below the true value; whereas when
α+ = α−, the steady-state Q-value converges, as expected, to the true mean
value for this action (equal to 2p − 1; see Figure 1 where α+ = α−). Moreover,
when α+ > α−, the Q-values are "optimistic", i.e. they are above the true
value. This bias is illustrated in Figure 1 where α+ = 4α−. Note also the
complex dependence of the bias on the true mean value. The bias is lower for
low Q-values than for high Q-values in the case of a pessimistic agent, while
the opposite holds for an optimistic agent.
To introduce our simulation testbed, we first sought to confirm the above
analytical results numerically. The analytical results predicted with high ac-
curacy the behavior of a modified Q-learning algorithm (5000 iterations of
800 trials each; error bars in Figure 1 show the variance across iterations).
Thus, differential learning rates for positive and negative RPEs lead to biased
estimates of true underlying values, in a heterogeneous manner that depends
on the true mean value as well as the learning rate asymmetry. In the next
section, we examine the impact of this distortion on performance in choice
settings (“two-armed bandits”).
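As an illustration of this kind of check, the following sketch (our own, using assumed parameter values, not the original simulation code) compares the steady-state prediction of Equation 1 with a simulated estimate:

```python
import random

def q_infinity(p, x):
    """Analytical steady-state Q-value from Equation 1 (x = alpha+ / alpha-)."""
    return (p * x - (1 - p)) / (p * x + (1 - p))

def simulate_mean_q(p, alpha_pos, alpha_neg, n_trials=800, n_runs=5000):
    """Mean Q-value after n_trials of asymmetric updating, averaged over n_runs."""
    total = 0.0
    for _ in range(n_runs):
        q = 0.0
        for _ in range(n_trials):
            delta = (1 if random.random() < p else -1) - q
            q += (alpha_pos if delta >= 0 else alpha_neg) * delta
        total += q
    return total / n_runs

p = 0.8                                # true value is 2p - 1 = 0.6
print(q_infinity(p, x=4))              # analytical "optimistic" value, approx. 0.88
print(simulate_mean_q(p, 0.04, 0.01))  # simulated mean, close to the analytical value
```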
3 Impact of differential learning rates on performance
In this section, we consider the consequences of biased estimates resulting from
differential learning rates in simple choice situations. In particular, we simulate
the performance of three Q-learning agents on two different “two-armed ban-
dit” problems. The “rational” (R) agent has equal learning rates for negative
and positive prediction errors, α = 0.1; the optimistic (O) agent has a higher
learning rate for positive prediction errors (α+ = 0.4) than for negative
prediction errors (α− = 0.1), and the pessimistic (P) agent has a higher learning
rate for negative (α− = 0.4) than for positive prediction errors (α+ = 0.1). All
agents use a standard softmax decision rule with fixed temperature β = 0.3
to decide which arm to choose given the Q-values; we employ this decision
rule here because it is the de facto standard in applications of reinforcement
learning models to behavioral and neural data. We discuss the applicability of
these results to the epsilon-greedy decision rule below.

Fig. 1 Differential learning rates result in biased estimates of true expected
values. Analytically derived Q-values (black filled circles, Q∞) for different true values
of Q: 0.8, 0.6, −0.6, −0.8 (grey dotted lines) and for different ratios of α+ and α−. Note
that the true Q-values are the means of probabilistic reward delivery schedules. Error bars,
computed using numerical simulations, show the variance of the estimated Q-values after
800 trials averaged over 5000 runs (the means converge to the analytically derived values).
When the learning rates for negative and positive prediction errors are equal, the derived
Q-values correspond to the true values, whereas they are distorted when the learning rates
are different. A "pessimistic" learner (α+/α− < 1) underestimates the true values, and an
"optimistic" learner (α+/α− > 1) overestimates them.
The three agents are tested on two versions of a difficult two-armed bandit
task, with large variance in the two outcomes but small differences in the
means [38]. For simplicity, we again consider binary outcome distributions
{−1, 1}; however, the results that follow are quite general (see Discussion). In
the first, "low-reward" task, the respective probabilities of r = 1 are 0.2 and
0.1; in the second, "high-reward" task, the two arms are rewarded with reward
probabilities of 0.9 and 0.8, respectively.
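A minimal sketch of such an agent on these tasks, assuming the same softmax rule with β = 0.3 (our own illustration; function names like softmax_choice and run_bandit are not from the paper):

```python
import math
import random

def softmax_choice(q_values, beta=0.3):
    """Softmax action selection over current Q-values (temperature beta)."""
    weights = [math.exp(q / beta) for q in q_values]
    r = random.random() * sum(weights)
    for action, w in enumerate(weights):
        r -= w
        if r <= 0:
            return action
    return len(q_values) - 1

def run_bandit(p_reward, alpha_pos, alpha_neg, n_trials=800):
    """One run of a probabilistic two-armed bandit; returns the final Q-values."""
    q = [0.0] * len(p_reward)
    for _ in range(n_trials):
        a = softmax_choice(q)
        reward = 1 if random.random() < p_reward[a] else -1
        delta = reward - q[a]
        q[a] += (alpha_pos if delta >= 0 else alpha_neg) * delta
    return q

# "Optimistic" agent on the low-reward task (reward probabilities 0.1 and 0.2).
print(run_bandit([0.1, 0.2], alpha_pos=0.4, alpha_neg=0.1))
```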
When simulated on these two tasks, a striking pattern is apparent (Figure
2a): in the low-reward task, the optimistic agent learns to take the best action
significantly more often than the rational agent, which in turn performs better
than the pessimistic agent (left panel). The agents’ performance is reversed
in the high-reward task (right panel). To understand this pattern of results,
recall that reinforcement learners face a trade-off between exploitation and
exploration [38]: choosing to exploit the option with the highest estimated Q-
value does not guarantee that there is not an (insufficiently explored) better
option available. This problem is particularly pernicious for probabilistic re-
wards such as those in the current setting, where only noisy estimates of true
underlying values are available.
Fig. 2 Differential learning rates increase performance in specific tasks. (A) Mean
probability of choosing the best arm (averaged over 5000 runs) for the three agents: "rational"
(R, α+ = α−, blue line), "optimist" (O, α+ > α−, green line), and "pessimist" (P,
α+ < α−, red line). The left plot shows performance on the low-reward task (0.1 and 0.2
probability of reward for the two choices), the right plot performance on the high-reward
task (0.8 and 0.9 probability). Note that in the low-reward task the optimistic agent is the
best performer and the pessimistic agent the worst, but this pattern is reversed for the high-
reward task. (B) Probability of switching after 800 episodes for each agent. This probability
depends on the task: in the low-reward task the optimistic agent is the least likely to switch,
with the pattern reversed for the high-reward task.
To quantify differences in how the agents navigate this exploration versus
exploitation trade-off in the steady state, we estimated the probability that
an agent repeats the same choice between times t and t + 1, after extended
learning (800 trials). For both tasks, we expect that agents will "stay" more
than "switch", owing to the difference in mean expected reward between the two choices.
However, as shown in Figure 2b, the different agents have a distinct probability
of switching, and this difference between agents depends on the task. Higher
probabilities of switching (“exploring”) are associated with lower performance
(compare with Figure 2a).
Why does this occur? As shown by Equation 1 and illustrated in Figure
1, differential learning rates result in biased value estimates on probabilistic
tasks such as the one simulated here. To understand the implications of this for
choice situations, the dependence of this bias on the true mean value is critical.
In the low-reward task, mean values are low, and therefore the distortion of
these values will be high for the “optimistic” agent (see Figure 1a). Conversely,
in this task, distortion will be low for the “pessimistic” agent (Figure 1b). The
implication of this is that distortion of the true values effectively increases
the contrast between the two choices, leading to (1) increased probability of
choosing the best option using a softmax rule, and (2) increased robustness
to random fluctuations in the outcomes; this latter effect does not rely on the
softmax action selection rule and can also occur for different action selection
rules such as ε-greedy.
More precisely, for the results in Figure 2, the rational agent tends to
approach the true mean values: its mean final estimates after 800 trials are
approximately Q1 = −0.63 and Q0 = −0.84 in the low-reward task, and
approximately Q1 = 0.78 and Q0 = −0.51 in the high-reward task. These
estimated Q-values are close to the true values, and thus the estimated ∆Q
is close to the true difference of 0.2. For the appropriately biased agent
(optimistic in the low-reward task, pessimistic in the high-reward task), by
contrast, the heterogeneous distortion described in the previous section yields
a steady-state estimate of ∆Q in excess of 0.5. It is this distortion (separation)
which enables higher steady-state performance for the biased agent. Likewise,
distortion in the opposite direction (compression) is the reason for impaired
performance. Specifically, in the low-reward task the pessimistic agent has a
smaller ∆Q than the rational agent, because the pessimistic bias causes
saturation as the Q-values approach −1, pushing the two Q-values closer together
than the true values (∆Q < 0.06). For the same reason, the symmetric
observation holds for an optimistic agent in the high-reward task, where this agent
has ∆Q < 0.04.
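Since Equation 1 applies to each arm separately (assuming each arm continues to be sampled), the direction of these effects can be illustrated with a short sketch of our own that evaluates the analytical ∆Q for each agent and task:

```python
def q_infinity(p, x):
    """Steady-state Q-value (Equation 1) for reward probability p and ratio x = alpha+ / alpha-."""
    return (p * x - (1 - p)) / (p * x + (1 - p))

tasks = {"low-reward": (0.1, 0.2), "high-reward": (0.8, 0.9)}
agents = {"rational": 1.0, "optimist": 4.0, "pessimist": 0.25}
for task, (p0, p1) in tasks.items():
    for agent, x in agents.items():
        delta_q = q_infinity(p1, x) - q_infinity(p0, x)
        print(task, agent, round(delta_q, 3))
# The optimist has the largest Delta-Q (separation) and the pessimist the smallest
# (compression) in the low-reward task; the pattern reverses in the high-reward task.
```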
In addition to this effect on performance in the steady state, differential
learning rates likely also impact the early stages of learning, when the agent
takes its first few choices. This idea is illustrated by the following intuition:
in the high-reward task, an agent will have a tendency to over-exploit its first
choice, because it is likely that this first choice will be rewarded. In the low-
reward task, an agent will have the opposite tendency: it will over-explore its
environment because it is likely that neither arm will provide a reward for the
first few trials. To the extent that this effect contributes to performance, it
may be mitigated by appropriately differential learning rates.
These observations raise two obvious questions: can we find optimal learning
rates that maximize ∆Q, and can we find a single agent that maximizes ∆Q
in both tasks? We explore these issues in the following section.
4 Derivation of optimally differential learning rates
From Equation 1, we can compute the steady-state difference ∆Q∞ between
the two choices:

$$
\Delta Q_\infty = \frac{2\,(p_1 q_0 - p_0 q_1)\,x}{p_1 p_0\, x^2 + (p_1 q_0 + p_0 q_1)\,x + q_0 q_1}
$$

where the indices correspond to the different choices (bandit arms); the index
of the arm with the highest probability of reward is 1, with 0 indicating
the other arm, and q_i = 1 − p_i is the probability of no-reward for arm i. We can
determine the x for which this rational function is maximal, and thus the
ratio for which ∆Q is maximal:

$$
x^* = \frac{\sqrt{q_0 q_1}}{\sqrt{p_0 p_1}}
$$

In the limit where p_0 → p_1, the optimal x* tends to the ratio between
p(no-reward) and p(reward). The results in the previous section indicate
that the best steady-state performance is achieved when ∆Q∞ is maximal;
thus, we can conclude that the learning rate for positive (resp. negative)
prediction errors should be proportional to the rate of no-reward (resp. rate of
reward). Specifically, in the low-reward task the optimal x* is equal to 6,
whereas in the high-reward task the optimal x* is 1/6. Thus, the x* that is
optimal in a given task also yields the worst performance in the other task,
where the optimal ratio is its inverse. To solve this issue, we introduce an agent
with plastic learning rates (a meta-learner) that can adapt to tasks with
different reward distributions.
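As a quick numerical sanity check (our own sketch, not part of the original derivation), the closed-form x* can be compared against a brute-force search over the ratio x:

```python
import math

def delta_q_infinity(p0, p1, x):
    """Steady-state Q-value difference between the two arms as a function of x."""
    q0, q1 = 1 - p0, 1 - p1
    return 2 * (p1 * q0 - p0 * q1) * x / (p1 * p0 * x**2 + (p1 * q0 + p0 * q1) * x + q0 * q1)

for p0, p1 in [(0.1, 0.2), (0.8, 0.9)]:          # low- and high-reward tasks
    x_closed = math.sqrt((1 - p0) * (1 - p1) / (p0 * p1))
    x_grid = [i / 100 for i in range(1, 2001)]   # brute-force search over x in (0, 20]
    x_best = max(x_grid, key=lambda x: delta_q_infinity(p0, p1, x))
    print(round(x_closed, 2), round(x_best, 2))  # approx. 6.0 and 6.0, then 0.17 and 0.17
```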
5 Meta-learning of optimally differential learning rates
We have shown in the previous section that behavior is optimal when the
learning rate for positive (resp. negative) prediction error corresponds to the
probability of no-reward (resp. reward) in a given task. Thus, here we study
an agent which adapts its learning rate for positive and negative prediction
errors. Given our results, we choose as targets α+ → w × p(no-reward) and
α− → w × p(reward), with the reward and no-reward probabilities estimated
by a running average of reward history. The parameter w (set to 0.1 in Figure
3) replaces the standard learning rate parameter α and captures the sensitivity
of learning to the estimated reward rate. Thus, we define a meta-learning agent
which derives separate learning rates for positive and negative prediction errors
from a running estimate of the task reward rate.
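A minimal sketch of such a meta-learning agent (our own illustration; the paper does not prescribe these variable names, and we use the count-based running estimate of the reward rate described in the Figure 3 caption, initialized at 0.5):

```python
import math
import random

def softmax_choice(q_values, beta=0.3):
    """Softmax action selection (same rule as in the earlier sketch)."""
    weights = [math.exp(q / beta) for q in q_values]
    r = random.random() * sum(weights)
    for action, w in enumerate(weights):
        r -= w
        if r <= 0:
            return action
    return len(q_values) - 1

def run_meta_learner(p_reward, w=0.1, n_trials=800):
    """Meta-learning agent: alpha+ tracks w * p(no-reward), alpha- tracks w * p(reward)."""
    q = [0.0] * len(p_reward)
    n_rewarded, n_total = 0, 0
    for _ in range(n_trials):
        p_hat = 0.5 if n_total == 0 else n_rewarded / n_total   # running reward-rate estimate
        alpha_pos = w * (1 - p_hat)   # learn more from positive RPEs when rewards are rare
        alpha_neg = w * p_hat         # learn more from negative RPEs when rewards are common
        a = softmax_choice(q)
        reward = 1 if random.random() < p_reward[a] else -1
        n_total += 1
        n_rewarded += 1 if reward == 1 else 0
        delta = reward - q[a]
        q[a] += (alpha_pos if delta >= 0 else alpha_neg) * delta
    return q

print(run_meta_learner([0.1, 0.2]))   # low-reward task: the agent becomes "optimistic"
```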
We compared the performance of this meta-learning agent to the rational
agent in both high- and low-reward tasks introduced in the previous section.
As shown in Figure 3, the meta agent outperforms the rational agent in both
tasks. In fact, the meta-learning agent slightly outperforms the purely
optimistic and pessimistic agents: in the low-reward task it becomes optimistic,
whereas in the high-reward task it becomes pessimistic. As expected, the two
learning rates converge at steady state toward their targets, which are
proportional to the probabilities of no-reward and reward in a given task. For a
large range of α between 0.01 and 0.4 we obtained similar results (Figure 3a).
Thus, the meta-learning agent performs best in both task settings by flexibly
adapting its learning rates based on reward history: it under- or overestimates
the true expected rewards in order to improve choice performance.
Fig. 3 Meta-learning of differential learning rates results in optimal performance
by the same agent across tasks. As in Figure 2, shown is the probability of choosing the
best option in the two different tasks (left: low reward, right: high reward) for two different
agents. In blue (α = 0.1) is the performance of the "rational" agent R, in teal and navy blue
is the performance when α = 0.01 and α = 0.4 respectively, and in violet is the agent with
two plastic learning rates N. This “meta-learning” agent outperforms the rational agent in
both tasks by finding appropriate, differential learning rates based on a running average of
rewards received. This estimate started with a value of 0.5 for both probabilities, and was
updated by keeping track of the number of rewarded and non-rewarded trials (i.e. an infinite
window for the running average, but similar results are obtained for any sufficiently reliable
estimation method).
The results in section 3 suggest that differential learning rates are most
useful in situations where competing choice values are close together. This ef-
fect is illustrated for the meta-learner in Figure 4b, in a task where the reward
probabilities are 0.75 and 0.25, i.e. where reward probabilities are separated by
0.5 rather than 0.1, as explored previously, and where no-reward and reward
probabilities are equal. In this scenario the advantage of differential learning
rates is negligible. A further setting of interest is the extension to more than
two choices: Figure 4c shows the performance of the meta-learner as compared
to fixed learning rate agents for two different three-armed bandits. In this set-
ting the meta-learner outperforms the rational agent by the same margin as
in the two-armed bandit case, demonstrating that the benefits of differential
learning rates are not restricted to binary choice.
6 Discussion
We simulated agents that make decisions based on learned expected values that
systematically deviate from the true outcome magnitudes. This bias, derived
from differential learning rates for positive and negative reward prediction
errors (RPEs), distinguishes our approach from typical reinforcement learning
models with a single learning rate that attempt to learn the true outcome
magnitudes. In contrast, in economics, divergence between objective values
and subjective utilities is foundational [20]. In line with this idea, our agents
can be said to learn something akin to subjective utilities, in the sense that
actions are based on a distorted (subjective) representation of the true values.
Fig. 4 The performance advantage of differential learning rate agents depends
on task structure. (A) Performance of the different agents (meta-learner, rational,
optimist, and pessimist) in a task where the probabilities of reward are 0.75 and 0.25 for
the two choices. (B) Performance of the agents in a "three-armed bandit" task, with reward
probabilities 0.1, 0.15, and 0.2 in the low-reward task and 0.8, 0.85, and 0.9 in the high-reward
task. In tasks with more than two choices, differential learning rates continue to outperform
the other agents.

Surprisingly, however, we show that subjective or biased representations
based on the biologically motivated idea of differential processing of positive
and negative prediction errors can perform objectively better on simple
probabilistic learning tasks. This result suggests that the presence of separate
direct and indirect (or “Go”/“NoGo”, “approach”/“avoid”) pathways in the
nervous system can confer adaptive value through distinct learning rates in
these pathways. Behavioral evidence for differential learning rates has been
reported in a number of studies [15,13,35], but to our knowledge this work is
the first to explore the implications of these results from a normative perspective.
Similarly, a number of models have employed separate parameters for
learning from positive and negative outcomes (e.g. [9,22]), but these studies
likewise did not explore the raison d'être for such an architecture. We show
here that differential learning rates can result in increased separation between
competing action values, leading to a steady-state performance improvement
because of an interaction with probabilistic action selection and stochastic
rewards.
We also implemented a meta-learning agent to achieve optimal action value
separation adaptively, based on an estimate of the average reward rate. This
agent always outperforms the unbiased agent at steady-state, but it is slower
than the rational agent to reach this higher level of performance. In situations
where reward probabilities are changing, this approach may delay the behav-
ioral response to the change. However, it is well known that “model-free” RL
models in general do not perform well in volatile situations such as serial re-
versal learning [8]. Thus, we view the current proposal as complementary to,
and compatible with (within a single RL system) meta-learning of other RL
parameters [10,33]. Examples of such meta-learning include models that adap-
tively regulate the overall learning rate [1], the exploration/exploitation trade-off
[7,19], and on-line state splitting [31]. A related idea is that of distributed
learning, in which many individual RL models – each with specific parameter
values drawn from a given distribution – are instantiated in parallel and then
compete for behavioral control [11,24]. The meta-learner proposed here can be
implemented in both these ways, but differs from other proposals in examining
normatively the effect of positive and negative prediction errors. In particular,
unlike meta-learners that modify α, β, or γ, our proposal deliberately
distorts value estimates. It relaxes the notion that a reinforcement learner
should strive for accurate predictions; rather, it should strive for predictions
that are effective once the properties (imperfections, constraints) of, e.g., the
action selection stage are taken into account.
The present results were obtained under specific circumstances, using a
so-called “Q-learner” and a softmax action selection rule, with a temperature
parameter equal to 0.3, experiencing static reward distributions. This setting
was chosen to demonstrate the main point in a simple way, in a setting commonly
used both as an RL testbed and in experiments. Nevertheless, the underlying insight
is quite general: in the presence of noisy observations and specific properties of
the decision stage (e.g. softmax), unbiased estimation of expected values may
not lead to optimal performance. Of course, more sophisticated models such
as a Bayesian learner with appropriate (meta-)parameters and priors [41], or a
risk-sensitive reinforcement learner [26] can in principle implement an optimal
solution to the reinforcement learning problems explored here using unbiased
estimates. However, these approaches are more computationally intensive to
implement, and less straightforward to relate to neural signals and neural ar-
chitectures. Q-learning and related models are in widespread use as models for
fitting behavior and neurophysiological data in neuroscience, providing much
evidence that biological learning mechanisms share at least some properties
with these models ([29, 25], but see [17]). Thus, the primary relevance of this
work lies in its application to the interpretation of neuroscience data; we dis-
cuss some examples in the next section.
The broad spectrum of more traditional model-free reinforcement learning
approaches such as the one taken here suggests a number of possible extensions
to the current proposal. For instance, it would be of interest to determine if
differential learning rates can be advantageous in temporal credit assignment
problems (i.e. those requiring a temporal-difference component, learning to
choose actions that are not themselves rewarded but lead to reward later; for
a rare experimental investigation of RL variables on such a task see [6]). We
tentatively suggest that, because the reported performance differences arise
from the stochasticity in outcome distributions, it may be sufficient to
apply differential learning rates only to actions or states associated with actual
rewards. The resulting biased value estimates may then be propagated using
standard temporal-difference methods.
A number of experimental studies have linked the regulation of differential
learning rates to the neuromodulator dopamine. For instance, Parkinsonian
patients off medication (i.e. with reduced dopamine levels) update RL value
estimates more in response to negative outcomes compared to positive out-
comes; the reverse is the case when subjects are given L-DOPA treatment,
elevating DA (dopamine) levels [15]. A similar effect was found based on ge-
netic polymorphisms in DA receptors [14]. Thus, subjects with relatively low
DA levels behave like our “pessimistic” learner whereas those with high DA
levels are more “optimistic”. Our results therefore suggest that Parkinsonian
patients (off medication) would do well on tasks like our high-reward task, but
do poorly on low-reward tasks.
An important distinction is that between phasic and tonic dopamine, which
are thought to contribute to learning and response vigor, respectively. In par-
ticular, tonic dopamine has been suggested to reflect the average reward rate
or “opportunity cost” [27]. Our proposal uses this same estimate for the adap-
tive scaling of learning rates. Thus, in probabilistic learning tasks such as the
ones used here, we would predict that under high reward rates, when tonic
dopamine is high, negative RPEs (reward prediction errors) have more impact
on learning than positive RPEs, because, as we show, this is adaptive in high
reward settings. This is a nontrivial prediction that can be tested with cur-
rent methods; it is critical, however, that the experimental design can separate
effects on learning rate from effects on the exploration-exploitation trade-off,
which has also been linked to tonic DA [19].
Dysregulation of DA levels has also been linked to various manifestations of
addiction, including those involving drugs of abuse [30] and problem gambling
[4]. The application of reinforcement learning models to such behaviors sug-
gests a mis-estimation of choice values lies at the root of dysfunctional decision-
making. Such mis-estimation may arise for instance from a non-compensable
positive prediction error [30], or from inappropriate meta-learning of the adap-
tive learning rates proposed here. In this respect it is particularly interesting
to note that several studies have specifically attributed meta-learning roles to
frontal areas also linked to addiction [7,21, 36]. The applicability of the norma-
tive ideas developed here to these settings would benefit from an empirical test
of positive and negative learning rates in affected individuals, perhaps along
the lines of [3].
Our results also suggest an alternative approach to improving reinforcement
learning fits to behavioral and neural data. As noted previously, to improve on
the basic RL model with a single learning rate, some studies have used models
with distinct learning rates for positive and negative prediction errors [13],
requiring the estimation of two free parameters from the data. In contrast,
because our meta-learner sets its learning rates based on observed reward
probabilities, it can potentially account for a similar range of behaviors with
fewer parameters (one meta-parameter w instead of the two learning rates).
How could differential learning rates be implemented in the brain? The
experimental investigation of neural mechanisms for prediction-error based
learning/updating is currently an active area of research. An influential
proposal is that RPEs may have different effects on the direct (D1) and
indirect (D2) pathways in the striatum [16,23]. In this view, RPEs may
initially be encoded symmetrically in the firing rate of DA
neurons, but the impact of these RPEs on basal ganglia synaptic weights
may be modulated independently depending on their direction (sign). One
possible mechanism for this may involve the modulation of presynaptic DA
terminals in the striatum by e.g. hippocampal inputs [18]; an idea that can be
tested with current voltammetry methods. Alternatively, the coding of RPEs
by DA neurons may be asymmetric to begin with, due to a smaller dynamic
range in the negative direction from baseline [28]. More radical proposals posit
a complete separation, with DA supporting learning only from positive RPEs
[12]. In any case, these experimental findings demonstrate a striking level of
dissociation of learning pathways for negative and positive prediction errors,
suggestive of broad adaptive value. Our results suggest a normative explana-
tion for these observations, and some encouragement can perhaps be derived
from the demonstration that in (some) situations where rewards are rare, it
pays to be optimistic.
Acknowledgements This work originated at the Okinawa Computational Neuroscience
Course at the Okinawa Institute for Science and Technology (OIST), Japan. We are grateful
to the organizers for providing a stimulating learning environment.
References
1. Timothy E J Behrens, Mark W Woolrich, Mark E Walton, and Matthew F S Rush-
worth. Learning the value of information in an uncertain world. Nature Neuroscience,
10(9):1214–21, September 2007.
2. Ethan S Bromberg-Martin, Masayuki Matsumoto, and Okihide Hikosaka. Dopamine in
motivational control: rewarding, aversive, and alerting. Neuron, 68(5):815–34, December
2010.
3. James F. Cavanagh, Michael J. Frank, and John J B. Allen. Social stress reactivity
alters reward and punishment learning. Soc Cogn Affect Neurosci, 6(3):311–320, Jun
2011.
4. Henry W. Chase and Luke Clark. Gambling severity predicts midbrain response to
near-miss outcomes. J Neurosci, 30(18):6180–6187, May 2010.
5. Mathieu D’Acremont and Peter Bossaerts. Neurobiological studies of risk assessment:
a comparison of expected utility and mean-variance approaches. Cognitive, Affective &
Behavioral Neuroscience, 8(4):363–74, December 2008.
6. Nathaniel D. Daw, Samuel J. Gershman, Ben Seymour, Peter Dayan, and Raymond J.
Dolan. Model-based influences on humans’ choices and striatal prediction errors. Neu-
ron, 69(6):1204–1215, Mar 2011.
7. Nathaniel D. Daw, John P. O’Doherty, Peter Dayan, Ben Seymour, and Raymond J.
Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–
879, Jun 2006.
8. Peter Dayan and Yael Niv. Reinforcement learning: the good, the bad and the ugly.
Curr Opin Neurobiol, 18(2):185–196, Apr 2008.
9. Bradley B. Doll, W Jake Jacobs, Alan G. Sanfey, and Michael J. Frank. Instructional
control of reinforcement learning: a behavioral and neurocomputational investigation.
Brain Res, 1299:74–94, Nov 2009.
10. K Doya. Metalearning and neuromodulation. Neural Networks, 15(4-6):495–506, 2002.
11. Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple
model-based reinforcement learning. Neural Comput, 14(6):1347–1369, Jun 2002.
12. Christopher D. Fiorillo. Two dimensions of value: dopamine neurons represent reward
but not aversiveness. Science, 341(6145):546–549, Aug 2013.
13. M. J. Frank, A. A. Moustafa, H. M. Haughey, T. Curran, and K. E. Hutchison. Ge-
netic triple dissociation reveals multiple roles for dopamine in reinforcement learning.
Proceedings of the National Academy of Sciences, 104(41):16311–16316, 2007.
14. Michael J Frank, Bradley B Doll, Jen Oas-Terpstra, and Francisco Moreno. Prefrontal
and striatal dopaminergic genes predict individual differences in exploration and ex-
ploitation. Nature Neuroscience, 12(8):1062–8, August 2009.
15. Michael J. Frank, Lauren C. Seeberger, and Randall C. O'Reilly. By carrot or by stick:
cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–3, December
2004.
16. C. R. Gerfen, T. M. Engber, L. C. Mahan, Z. Susel, T. N. Chase, F. J. Monsma, Jr.,
and D. R. Sibley. D1 and D2 Dopamine Receptor-Regulated Gene Expression of
Striatonigral and Striatopallidal Neurons. Science, 250:1429–1432, December 1990.
17. Samuel J Gershman and Yael Niv. Learning latent structure: carving nature at its
joints. Current Opinion in Neurobiology, 20(2):251–6, April 2010.
18. Anthony A. Grace. Dopamine system dysregulation by the hippocampus: implica-
tions for the pathophysiology and treatment of schizophrenia. Neuropharmacology,
62(3):1342–1348, Mar 2012.
19. Mark D Humphries, Mehdi Khamassi, and Kevin Gurney. Dopaminergic Control of the
Exploration-Exploitation Trade-Off via the Basal Ganglia. Frontiers in Neuroscience,
6(February):9, January 2012.
20. Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under
risk. Econometrica: Journal of the Econometric Society, 47(2):263–292, 1979.
21. Mehdi Khamassi, Pierre Enel, Peter Ford Dominey, and Emmanuel Procyk. Medial
prefrontal cortex and the adaptive regulation of reinforcement learning parameters.
Prog Brain Res, 202:441–464, 2013.
22. Mehdi Khamassi, Stéphane Lallée, Pierre Enel, Emmanuel Procyk, and Peter F
Dominey. Robot cognitive control with a neurophysiologically inspired reinforcement
learning model. Frontiers in Neurorobotics, 5(July):1, January 2011.
23. Alexxai V Kravitz, Lynne D Tye, and Anatol C Kreitzer. Distinct roles for direct and
indirect pathway striatal neurons in reinforcement. Nature Neuroscience, pages 4–7,
April 2012.
24. Zeb Kurth-Nelson and A David Redish. Temporal-difference reinforcement learning
with distributed representations. PLoS One, 4(10):e7362, 2009.
25. Tiago V Maia and Michael J Frank. From reinforcement learning models to psychiatric
and neurological disorders. Nature Neuroscience, 14(2):154–62, February 2011.
26. Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine
Learning, 49:267–290, 2002.
27. Yael Niv, Nathaniel D Daw, Daphna Joel, and Peter Dayan. Tonic dopamine: opportu-
nity costs and the control of response vigor. Psychopharmacology, 191(3):507–20, April
2007.
28. Yael Niv, Michael O. Duff, and Peter Dayan. Dopamine, uncertainty and TD learning.
Behav Brain Funct, 1:6, May 2005.
29. John P O’Doherty, Alan Hampton, and Hackjin Kim. Model-based fMRI and its ap-
plication to reward learning and decision making. Annals of the New York Academy of
Sciences, 1104:35–53, May 2007.
30. A David Redish. Addiction as a computational process gone awry. Science,
306(5703):1944–1947, Dec 2004.
31. A David Redish, Steve Jensen, Adam Johnson, and Zeb Kurth-Nelson. Reconciling
reinforcement learning models with behavioral extinction and renewal: implications for
addiction, relapse, and problem gambling. Psychol Rev, 114(3):784–805, Jul 2007.
32. Wolfram Schultz. Behavioral theories and the neurophysiology of reward. Annual Re-
view of Psychology, 57:87–115, January 2006.
33. Nicolas Schweighofer and Kenji Doya. Meta-learning in reinforcement learning. Neural
Netw, 16(1):5–9, Jan 2003.
34. Tali Sharot. The optimism bias. Current Biology, 21(23):R941–5, December 2011.
35. Tali Sharot, Christoph W Korn, and Raymond J Dolan. How unrealistic optimism is
maintained in the face of reality. Nature Neuroscience, 14(11):1475–1479, October 2011.
36. Amitai Shenhav, Matthew M. Botvinick, and Jonathan D. Cohen. The expected value of
control: an integrative theory of anterior cingulate cortex function. Neuron, 79(2):217–
240, Jul 2013.
37. Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. January
1984.
38. R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press,
Cambridge Mass., September 1998.
39. Matthijs van der Meer, Zeb Kurth-Nelson, and A. David Redish. Information processing
in decision-making systems. The Neuroscientist, 18(4):342–59, August 2012.
40. Christopher Watkins. Learning from delayed rewards. PhD thesis, May 1989.
41. Angela J. Yu. Adaptive behavior: humans act as bayesian learners. Curr Biol,
17(22):R977–R980, Nov 2007.