
Biological Cybernetics manuscript No.

(will be inserted by the editor)

Adaptive properties of differential learning rates for positive and negative outcomes

Romain D. Cazé · Matthijs A. A. van der Meer

Received: date / Accepted: date

Abstract The concept of the reward prediction error – the difference between reward obtained and reward predicted – continues to be a focal point for much theoretical and experimental work in psychology, cognitive science, and neuroscience. Models that rely on reward prediction errors typically assume a single learning rate for positive and negative prediction errors. However, behavioral data indicate that better-than-expected and worse-than-expected outcomes often do not have symmetric impacts on learning and decision-making. Furthermore, distinct circuits within cortico-striatal loops appear to support learning from positive and negative prediction errors, respectively. Such differential learning rates would be expected to lead to biased reward predictions, and therefore suboptimal choice performance. Contrary to this intuition, we show that on static “bandit” choice tasks, differential learning rates can be adaptive. This occurs because asymmetric learning enables a better separation of learned reward probabilities. We show analytically how the optimal learning rate asymmetry depends on the reward distribution, and implement a biologically plausible algorithm that adapts the balance of positive and negative learning rates from experience. These results suggest specific adaptive advantages for separate, differential learning rates in simple reinforcement learning settings, and provide a novel, normative perspective on the interpretation of associated neural data.

RC is supported by a Marie Curie initial training fellowship. MvdM is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Romain Cazé
Department of Bioengineering
Imperial College
London, England
r.caze@imperial.ac.uk

Matthijs van der Meer
Department of Biology and Centre for Theoretical Neuroscience
University of Waterloo, Canada
mvdm@uwaterloo.ca

Keywords Reinforcement learning · Reward prediction error · Decision-making · Meta-learning · Basal ganglia

1 Introduction

A central element in the major theories of reinforcement learning is the reward prediction error (RPE), which in its simplest form is the difference between the amount of received and expected reward [38]. Thus, a positive RPE is generated when more is received than expected and, conversely, negative RPEs occur when less is received than expected. Proposals based on RPEs can explain salient features of Pavlovian and instrumental learning [32] and generalize to powerful algorithms that can learn about states and actions which are not themselves rewarded, but result in reward later (e.g. temporal-difference reinforcement learning, [37]). Essentially, these algorithms attempt to estimate from experience the expected reward resulting from states and actions, so that the amount of reward obtained can be maximized.

RPEs appear to be encoded by the neural activity of several brain areas known to support learning and motivated behavior, such as a population of dopaminergic neurons in the ventral tegmental area (VTA) [2]. Projection targets of these neurons encode action and state values [39], and reinforcement learning model fits to behavioral and neural data can account for trial-by-trial changes in both [29]. Taken together, these findings support the familiar notion that circuits centered on the dopaminergic modulation of activity in cortico-striatal loops implement a reinforcement learning system. Thus, reinforcement learning models provide an explicit, computational account of the relationship between RPEs, estimates of expected reward, and subsequent behavior. This account is a powerful framework for understanding the neural basis of learning and decision-making in mechanistic, biological detail.

Many reinforcement learning models implicitly assume that positive and negative RPEs impact the estimation of action and state values through a common gain factor or learning rate [38,29]. However, converging evidence indicates that learning from positive and negative feedback, including positive and negative RPEs, in fact relies on dissociable mechanisms in the brain. Depending on the specific setting, these distinct mechanisms are referred to as approach/avoid, Go/NoGo, or direct/indirect pathways in the basal ganglia, associated with predominantly D1- and D2-expressing projection neurons in the striatum, respectively [16,23]. This neural separation of the direct and indirect pathways raises the possibility that distinct, differential learning rates are associated with positive and negative RPEs.

In support of this idea, a number of studies have reported behavioral evidence for differential learning rates in humans [15,13]. From a reinforcement learning perspective, differential learning rates imply biased estimates of the amount of expected reward, which in turn can lead to suboptimal decisions. This “irrationality” notion is congruent with observations from psychology and behavioral economics. For instance, a recent study examined the extent to which subjects updated their estimates of the likelihood of various events happening to them (e.g. getting cancer, winning the lottery) after being told the true probabilities [35]. Strikingly, it was found that subjects tended to update their estimates more if the odds were better (in their favour) than they thought. Such differential updating could underlie the formation of a so-called “optimism bias” [34] and has also been proposed as contributing to other biases such as risk aversion in mean-variance learners [26,5].

Such behavioral evidence for asymmetric updating following positive (better than expected) and negative (worse than expected) outcomes invites normative questions. Are these biases suboptimal in an absolute sense, but perhaps the result of limited cognitive resources? Or, alternatively, are they optimized for situations different from those tested? Here, we aim to address these questions quantitatively in simple reinforcement learning models. We show that even in simple, static bandit tasks, agents with differential learning rates can outperform unbiased agents. These results provide a different view on (1) the interpretation of behavioral results, where such biases are often cast as irrational, and (2) neurophysiological data comparing neural responses associated with positive and negative prediction errors. Finally, this work suggests simple and biologically motivated improvements to boost the performance of commonly used reinforcement learning models.

2 Impact of differential learning rates on value estimation

To model a tractable decision problem where the effects of differential learning rates for positive and negative RPEs can be explored, we implement agents that learn the values of actions on probabilistically rewarded “bandit” tasks. That is, there is only one state with one (this section) or two possible actions (the next section), the expected value of which the agent learns incrementally from probabilistic feedback according to a simple delta rule. Unlike the standard form of typical reinforcement learning algorithms such as Q-learning [40], however, our agents employ different learning rates for positive and negative prediction errors ∆Qt = (rt − Qt):

Qt+1 = Qt + α+ ∆Qt   if ∆Qt ≥ 0
Qt+1 = Qt + α− ∆Qt   if ∆Qt < 0

Qt, by convention, corresponds to the expected reward of taking a certain action at time step t (“Q-value”), and rt is the actual reward received at time step t. The usual state and action indices s, a are omitted because in this section, we only consider one state and one action. Unlike typical “symmetric” update rules, the existence of two learning rates α+ and α− can lead to biased Q-values, as we will show below.
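The asymmetric update rule above can be sketched in a few lines of Python. This is a minimal illustration of the rule; the function name and the binary reward convention rt ∈ {−1, 1} follow the setup used below in the text:

```python
def update_q(q, r, alpha_pos, alpha_neg):
    """One delta-rule update with differential learning rates.

    dq = r - q is the reward prediction error; alpha_pos is applied
    to positive RPEs (dq >= 0) and alpha_neg to negative RPEs.
    """
    dq = r - q
    alpha = alpha_pos if dq >= 0 else alpha_neg
    return q + alpha * dq
```

For example, starting from Q = 0 with an “optimistic” setting (α+ = 0.4, α− = 0.1), a reward of 1 moves Q to 0.4, while a reward of −1 only moves it to −0.1.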

We first derive an analytical expression for the steady-state Q-value for a single action with a probabilistic, binary outcome. Without loss of generality, we set the outcomes to rt = 1 and rt = −1, with probabilities p and 1 − p respectively. Then, for sufficiently small α and Q0 ∉ {−1, 1}, if rt = 1 the outcome is always superior to the predicted Q, whereas if rt = −1 the outcome is always inferior. Thus, the update rule for each trial is the following:

Qt+1 = Qt + α+ (1 − Qt)    with probability p
Qt+1 = Qt + α− (−1 − Qt)   with probability 1 − p

At steady state, Q̂t+1 = Q̂t = Q∞, where Q̂ represents the mean Q when t → ∞. This yields:

Q∞ = Q∞ + p α+ (1 − Q∞) + (1 − p) α− (−1 − Q∞)

If we define x as the ratio between the learning rates for positive and negative prediction errors, x = α+/α−, we can rewrite the previous equation and obtain Q∞:

Q∞ = (px − (1 − p)) / (px + (1 − p))    (1)

A number of properties of this result are worth noting. When α+ = α−, the steady-state Q-value converges, as expected, to the true mean value for this action (equal to 2p − 1; see Figure 1 where α+ = α−). When α− > α+, the Q-values are “pessimistic”, i.e. they are below the true value; conversely, when α+ > α−, the Q-values are “optimistic”, i.e. they are above the true value. This bias is illustrated in Figure 1 where α+ = 4α−. Note also the complex dependence of the bias on the true mean value: the bias is lower for low Q-values than for high Q-values in the case of a pessimistic agent, while the opposite holds for an optimistic agent.
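Equation 1 can be checked numerically. The sketch below (helper names are our own) compares the analytical steady-state value against the time-averaged Q of a long simulated run with binary rewards:

```python
import random

def q_infinity(p, x):
    """Analytical steady-state Q-value (Eq. 1), with x = alpha+/alpha-."""
    return (p * x - (1 - p)) / (p * x + (1 - p))

def mean_steady_q(p, a_pos, a_neg, n=100_000, burn=10_000, seed=1):
    """Time-averaged Q after a burn-in period, for binary rewards {-1, 1}."""
    rng = random.Random(seed)
    q, total = 0.0, 0.0
    for t in range(n):
        r = 1 if rng.random() < p else -1
        dq = r - q
        q += (a_pos if dq >= 0 else a_neg) * dq
        if t >= burn:
            total += q
    return total / (n - burn)

# "Optimistic" agent (alpha+ = 4 alpha-) on an arm with p = 0.8:
# the true value is 2p - 1 = 0.6, but Eq. 1 gives about 0.88
analytic = q_infinity(0.8, 4.0)
simulated = mean_steady_q(0.8, 0.4, 0.1)
```

With equal learning rates (x = 1), q_infinity(p, 1.0) reduces to 2p − 1, the unbiased true mean.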

To introduce our simulation testbed, we first sought to confirm the above analytical results numerically. The analytical results predicted with high accuracy the behavior of a modified Q-learning algorithm (5000 iterations of 800 trials each; error bars in Figure 1 show the variance across iterations). Thus, differential learning rates for positive and negative RPEs lead to biased estimates of true underlying values, in a heterogeneous manner that depends on the true mean value as well as the learning rate asymmetry. In the next section, we examine the impact of this distortion on performance in choice settings (“two-armed bandits”).

Fig. 1 Differential learning rates result in biased estimates of true expected values. Analytically derived Q-values (black filled circles, Q∞) for different true values of Q: 0.8, 0.6, −0.6, −0.8 (grey dotted lines) and for different ratios of α+ and α−. Note that the true Q-values are the means of probabilistic reward delivery schedules. Error bars, computed using numerical simulations, show the variance of the estimated Q-values after 800 trials averaged over 5000 runs (the means converge to the analytically derived values). When the learning rates for negative and positive prediction errors are equal, the derived Q-values correspond to the true values, whereas they are distorted when the learning rates are different. A “pessimistic” learner (α+/α− < 1) underestimates the true values, and an “optimistic” learner (α+/α− > 1) overestimates them.

3 Impact of differential learning rates on performance

In this section, we consider the consequences of biased estimates resulting from differential learning rates in simple choice situations. In particular, we simulate the performance of three Q-learning agents on two different “two-armed bandit” problems. The “rational” (R) agent has equal learning rates for negative and positive prediction errors, α = 0.1; the optimistic (O) agent has a higher learning rate for positive prediction errors (α+ = 0.4) than for negative prediction errors (α− = 0.1); and the pessimistic (P) agent has a higher learning rate for negative (α− = 0.4) than for positive prediction errors (α+ = 0.1). All agents use a standard softmax decision rule with fixed temperature β = 0.3 to decide which arm to choose given the Q-values; we employ this decision rule here because it is the de facto standard in applications of reinforcement learning models to behavioral and neural data. We discuss the applicability of these results for the epsilon-greedy decision rule below.

The three agents are tested on two versions of a difficult two-armed bandit task, with large variance in the two outcomes but small differences in the means [38]. For simplicity, we again consider binary outcome distributions {−1, 1}; however, the results that follow are quite general (see Discussion). In the first, “low-reward” task, the respective probabilities of r = 1 are 0.2 and 0.1; in the other, “high-reward” task, the two arms are rewarded with probabilities of 0.9 and 0.8 respectively.
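The two tasks and three agents can be reproduced with a short simulation. The sketch below is our own illustrative code, not the authors’ implementation; β is used as a softmax temperature as in the text, and the function returns the fraction of trials on which the best arm was chosen:

```python
import math
import random

def softmax_choice(q, beta, rng):
    """Sample an arm with probability proportional to exp(Q / beta)."""
    prefs = [math.exp(v / beta) for v in q]
    u, acc = rng.random() * sum(prefs), 0.0
    for i, pref in enumerate(prefs):
        acc += pref
        if u <= acc:
            return i
    return len(q) - 1

def run_agent(p_reward, a_pos, a_neg, n_trials=800, beta=0.3, seed=0):
    """Fraction of trials on which the best arm was chosen."""
    rng = random.Random(seed)
    q = [0.0] * len(p_reward)
    best = max(range(len(p_reward)), key=lambda i: p_reward[i])
    n_best = 0
    for _ in range(n_trials):
        arm = softmax_choice(q, beta, rng)
        n_best += (arm == best)
        r = 1 if rng.random() < p_reward[arm] else -1
        dq = r - q[arm]
        q[arm] += (a_pos if dq >= 0 else a_neg) * dq
    return n_best / n_trials

# The three agents on the low-reward task [0.2, 0.1]:
# rational:   run_agent([0.2, 0.1], 0.1, 0.1)
# optimist:   run_agent([0.2, 0.1], 0.4, 0.1)
# pessimist:  run_agent([0.2, 0.1], 0.1, 0.4)
```

Averaging over many seeds reproduces the qualitative pattern described next: the optimist leads in the low-reward task and the pessimist in the high-reward task.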

When simulated on these two tasks, a striking pattern is apparent (Figure 2a): in the low-reward task, the optimistic agent learns to take the best action significantly more often than the rational agent, which in turn performs better than the pessimistic agent (left panel). The agents’ performance is reversed in the high-reward task (right panel). To understand this pattern of results, recall that reinforcement learners face a trade-off between exploitation and exploration [38]: choosing to exploit the option with the highest estimated Q-value does not guarantee that there is not an (insufficiently explored) better option available. This problem is particularly pernicious for probabilistic rewards such as those in the current setting, where only noisy estimates of the true underlying values are available.

Fig. 2 Differential learning rates increase performance in specific tasks. (A) Mean probability of choosing the best arm (averaged over 5000 runs) for the three agents: “rational” (R, α+ = α−, blue line), “optimist” (O, α+ > α−, green line), and “pessimist” (P, α+ < α−, red line). The left plot shows performance on the low-reward task (0.1 and 0.2 probability of reward for the two choices), the right plot performance on the high-reward task (0.8 and 0.9 probability). Note that in the low-reward task the optimistic agent is the best performer and the pessimistic agent the worst, but this pattern is reversed for the high-reward task. (B) Probability of switching after 800 episodes for each agent. This probability depends on the task: in the low-reward task the optimistic agent is the least likely to switch, with the pattern reversed for the high-reward task.

To quantify differences in how the agents navigate this exploration versus exploitation trade-off in the steady state, we estimated the probability that an agent repeats the same choice between time t and t + 1, after extended learning (800 trials). For both tasks, we expect that agents will “stay” more than “switch”, owing to the difference in mean expected reward between the two choices. However, as shown in Figure 2b, the agents have distinct probabilities of switching, and this difference between agents depends on the task. Higher probabilities of switching (“exploring”) are associated with lower performance (compare with Figure 2a).

Why does this occur? As shown by Equation 1 and illustrated in Figure 1, differential learning rates result in biased value estimates on probabilistic tasks such as the ones simulated here. To understand the implications of this for choice situations, the dependence of this bias on the true mean value is critical. In the low-reward task, mean values are low, and therefore the distortion of these values will be high for the “optimistic” agent (see Figure 1). Conversely, in this task, distortion will be low for the “pessimistic” agent (Figure 1). The implication is that distortion of the true values effectively increases the contrast between the two choices, leading to (1) an increased probability of choosing the best option using a softmax rule, and (2) increased robustness to random fluctuations in the outcomes; this latter effect does not rely on the softmax action selection rule and can also occur for different action selection rules such as ε-greedy.
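The first effect can be made concrete: with two arms and a softmax at temperature β = 0.3, the probability of picking the better arm depends only on the Q-value separation ∆Q. A one-line check (the specific ∆Q inputs below are illustrative; 0.2 is the true separation in both tasks and 0.5 a widened, “distorted” one):

```python
import math

def p_best(delta_q, beta=0.3):
    """Softmax probability of choosing the higher-valued of two arms."""
    return 1.0 / (1.0 + math.exp(-delta_q / beta))

p_true = p_best(0.2)      # the undistorted separation, about 0.66
p_widened = p_best(0.5)   # a widened separation, about 0.84
```

Widening the separation from 0.2 to 0.5 thus raises the per-trial probability of exploiting the best arm substantially, without changing the decision rule itself.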

More precisely, for the results in Figure 2, the rational agent tends to approach the true mean values: in the low-reward task, the mean final estimates after 800 trials are approximately Q1 = −0.63 and Q0 = −0.84, and in the high-reward task approximately Q1 = 0.78 and Q0 = 0.51. These estimated Q-values are close to the true Q-values, and thus ∆Q is close to the true value of 0.2. By contrast, the estimated ∆Q at steady state for a suitably biased agent, due to the heterogeneous distortion observed in Section 2, is in excess of 0.5. It is this distortion (separation) which enables higher steady-state performance for the biased agent. Likewise, distortion in the opposite direction (compression) is the reason for impaired performance. Specifically, in the low-reward task, the pessimistic agent will have a smaller ∆Q than the rational agent, because the pessimistic bias causes saturation as the Q-values approach −1, making the two Q-values closer than the true values: ∆Q < 0.06. For the same reason, the symmetric observation holds for an optimistic agent in the high-reward task, where this agent has ∆Q < 0.04.
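Plugging the three agents’ learning-rate ratios into Equation 1 makes this separation and compression explicit. Note these analytic steady-state values assume both arms keep being updated, ignoring the sampling effects of softmax choice, so they differ somewhat from the simulated estimates quoted above:

```python
def q_inf(p, x):
    """Steady-state Q-value from Eq. 1, with x = alpha+/alpha-."""
    return (p * x - (1 - p)) / (p * x + (1 - p))

def separation(p1, p0, x):
    """Steady-state Q-value separation between the two arms."""
    return q_inf(p1, x) - q_inf(p0, x)

# x = 4 for the optimist (O), 1 for the rational agent (R),
# 0.25 for the pessimist (P)
low = {name: separation(0.2, 0.1, x)
       for name, x in [("O", 4.0), ("R", 1.0), ("P", 0.25)]}
high = {name: separation(0.9, 0.8, x)
        for name, x in [("O", 4.0), ("R", 1.0), ("P", 0.25)]}
```

In the low-reward task the optimist’s separation is the widest and the pessimist’s the narrowest, with the ordering reversed in the high-reward task, matching the performance pattern in Figure 2.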

In addition to this effect on performance in the steady state, differential learning rates likely also impact the early stages of learning, when the agent makes its first few choices. This idea is illustrated by the following intuition: in the high-reward task, an agent will have a tendency to over-exploit its first choice, because it is likely that this first choice will be rewarded. In the low-reward task, an agent will have the opposite tendency: it will over-explore its environment, because it is likely that neither arm will provide a reward for the first few trials. To the extent that this effect contributes to performance, it may be mitigated by appropriately differential learning rates.

These observations raise an obvious question: can we find optimal learning rates which maximize ∆Q? And can we find a single agent which maximizes this ∆Q in both tasks? We explore these issues in the following section.

4 Derivation of optimally differential learning rates

From Equation 1, we can compute ∆Q∞ at steady state between the two choices:

∆Q∞ = 2(p1q0 − p0q1)x / (p1p0x² + (p1q0 + p0q1)x + q0q1)

where the indices correspond to the different choices (bandit arms); the index of the arm with the highest probability of reward is 1, with 0 indicating the other arm, and qi = 1 − pi is the probability of no-reward for arm i. We can determine the x for which this rational function is maximal and thus find the ratio for which ∆Q∞ is maximal:

x∗ = √(q0q1) / √(p0p1)

In the limit where p0 → p1, the optimal x∗ tends to the ratio between p(no-reward) and p(reward). The results in the previous section indicate that the best steady-state performance is achieved when ∆Q∞ is maximal; thus, we can conclude that the best learning rate for positive (resp. negative) prediction errors should be proportional to the rate of no-reward (resp. the rate of reward). Specifically, in the low-reward task the optimal ratio is x∗ = 6, whereas in the high-reward task it is x∗ = 1/6. Thus, the optimal x∗ in a given task is also the one which leads to the worst performance in the other task, where the optimal ratio is its inverse. To solve this issue, we introduce an agent with plastic learning rates (a meta-learner), adaptable to tasks with different reward distributions.
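The optimal ratio can be computed directly; a one-line check of the expression above on the two tasks:

```python
import math

def x_star(p0, p1):
    """alpha+/alpha- ratio maximizing the steady-state separation."""
    q0, q1 = 1 - p0, 1 - p1
    return math.sqrt(q0 * q1) / math.sqrt(p0 * p1)

x_star(0.1, 0.2)   # low-reward task:  6
x_star(0.8, 0.9)   # high-reward task: 1/6
```

The two optima are exact inverses of each other, which is why a single fixed asymmetry cannot be optimal in both tasks.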

5 Meta-learning of optimally differential learning rates

We have shown in the previous section that behavior is optimal when the learning rate for positive (resp. negative) prediction errors corresponds to the probability of no-reward (resp. reward) in a given task. Thus, here we study an agent which adapts its learning rates for positive and negative prediction errors. Given our results, we choose as targets α+ → w × p(no-reward) and α− → w × p(reward), with the reward and no-reward probabilities estimated by a running average of reward history. The parameter w (set to 0.1 in Figure 3) replaces the standard learning rate parameter α and captures the sensitivity of learning to the estimated reward rate. Thus, we define a meta-learning agent which derives separate learning rates for positive and negative prediction errors from a running estimate of the task reward rate.
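A minimal sketch of such a meta-learning agent, assuming softmax action selection with β = 0.3 and a simple rewarded/unrewarded count for the running estimate (smoothed so that both probabilities start at 0.5, as in Figure 3). All names are our own, and the counting estimator stands in for whichever sufficiently reliable estimator is used:

```python
import math
import random

def run_meta_agent(p_reward, w=0.1, beta=0.3, n_trials=800, seed=0):
    """Meta-learner: alpha+ tracks w*p(no-reward), alpha- tracks w*p(reward)."""
    rng = random.Random(seed)
    q = [0.0] * len(p_reward)
    best = max(range(len(p_reward)), key=lambda i: p_reward[i])
    n_rewarded, n_total, n_best = 0, 0, 0
    for _ in range(n_trials):
        # running estimate of the reward rate, initialised at 0.5
        p_hat = (n_rewarded + 1) / (n_total + 2)
        a_pos, a_neg = w * (1 - p_hat), w * p_hat
        # softmax choice over the current Q-values
        prefs = [math.exp(v / beta) for v in q]
        u, acc, choice = rng.random() * sum(prefs), 0.0, len(q) - 1
        for i, pref in enumerate(prefs):
            acc += pref
            if u <= acc:
                choice = i
                break
        n_best += (choice == best)
        rewarded = rng.random() < p_reward[choice]
        n_rewarded += rewarded
        n_total += 1
        r = 1 if rewarded else -1
        dq = r - q[choice]
        q[choice] += (a_pos if dq >= 0 else a_neg) * dq
    return n_best / n_trials
```

The same agent, with the same parameter w, becomes effectively optimistic in the low-reward task (p̂ low, so α+ > α−) and pessimistic in the high-reward task (p̂ high, so α− > α+).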

We compared the performance of this meta-learning agent to the rational agent in both the high- and low-reward tasks introduced in the previous section. As shown in Figure 3, the meta-learning agent outperforms the rational agent in both tasks. In fact, it slightly outperforms even the purely pessimistic or optimistic agents, meaning that in the low-reward task this meta-learning agent is optimistic whereas in the high-reward task it is pessimistic. As expected, the two learning rates converge at steady state toward the mean probability of no-reward and reward in a given task. For a large range of α between 0.01 and 0.4 we obtained similar results (Figure 3). Thus, the meta-learning agent performed best in both task settings by flexibly adapting its learning rates based on reward history – it under- or overestimates the true expected rewards to improve choice performance.

Fig. 3 Meta-learning of differential learning rates results in optimal performance by the same agent across tasks. As in Figure 2, shown is the probability of choosing the best option in the two different tasks (left: low reward, right: high reward) for the different agents. In blue (α = 0.1) is the performance of the “rational” agent R, in teal and navy blue the performance when α = 0.01 and α = 0.4 respectively, and in violet the agent with two plastic learning rates N. This “meta-learning” agent outperforms the rational agent in both tasks by finding appropriate, differential learning rates based on a running average of rewards received. This estimate started with a value of 0.5 for both probabilities, and was updated by keeping track of the number of rewarded and non-rewarded trials (i.e. an infinite window for the running average; similar results are obtained for any sufficiently reliable estimation method).

The results in Section 3 suggest that differential learning rates are most useful in situations where competing choice values are close together. This effect is illustrated for the meta-learner in Figure 4a, in a task where the reward probabilities are 0.75 and 0.25, i.e. where the reward probabilities are separated by 0.5 rather than 0.1 as explored previously, and where the overall no-reward and reward probabilities are equal. In this scenario the advantage of differential learning rates is negligible. A further setting of interest is the extension to more than two choices: Figure 4b shows the performance of the meta-learner as compared to fixed learning rate agents for two different three-armed bandits. In this setting the meta-learner outperforms the rational agent by the same margin as in the two-armed bandit case, demonstrating that the benefits of differential learning rates are not restricted to binary choice.

6 Discussion

We simulated agents that make decisions based on learned expected values that systematically deviate from the true outcome magnitudes. This bias, derived from differential learning rates for positive and negative reward prediction errors (RPEs), distinguishes our approach from typical reinforcement learning models with a single learning rate, which attempt to learn the true outcome magnitudes. In contrast, in economics, divergence between objective values and subjective utilities is foundational [20]. In line with this idea, our agents can be said to learn something akin to subjective utilities, in the sense that actions are based on a distorted (subjective) representation of the true values.

Fig. 4 The performance advantage of differential learning rate agents depends on task structure. (A) Performance of the different agents (meta-learner, rational, optimist, and pessimist) when the probabilities of reward are 0.75 and 0.25 for the two choices. (B) Performance of the agents in a “three-armed bandit” task, with reward probabilities 0.1, 0.15, 0.2 in the low-reward task and 0.8, 0.85, 0.9 in the high-reward task; differential learning rates continue to outperform in tasks with more than two choices.

Surprisingly, however, we show that subjective or biased representations based on the biologically motivated idea of differential processing of positive and negative prediction errors can perform objectively better on simple probabilistic learning tasks. This result suggests that the presence of separate direct and indirect (or “Go”/“NoGo”, “approach”/“avoid”) pathways in the nervous system enables adaptive value derived from distinct learning rates in these pathways. Behavioral evidence for differential learning rates has been reported in a number of studies [15,13,35], but to our knowledge this work is the first to explore the implications of these results from a normative perspective. Similarly, a number of models have employed separate parameters for learning from positive and negative outcomes (e.g. [9,22]), but these studies likewise did not explore the raison d’être for such an architecture. We show here that differential learning rates can result in increased separation between competing action values, leading to a steady-state performance improvement because of an interaction with probabilistic action selection and stochastic rewards.

We also implemented a meta-learning agent to achieve optimal action value separation adaptively, based on an estimate of the average reward rate. This agent always outperforms the unbiased agent at steady state, but it is slower than the rational agent to reach this higher level of performance. In situations where reward probabilities are changing, this approach may delay the behavioral response to the change. However, it is well known that “model-free” RL models in general do not perform well in volatile situations such as serial reversal learning [8]. Thus, we view the current proposal as complementary to, and compatible with (within a single RL system), meta-learning of other RL parameters [10,33]. Examples of such meta-learning include models that adaptively regulate the overall learning rate [1], the exploration/exploitation trade-off [7,19], and on-line state splitting [31]. A related idea is that of distributed learning, in which many individual RL models – each with specific parameter values drawn from a given distribution – are instantiated in parallel and then compete for behavioral control [11,24]. The meta-learner proposed here can be implemented in both these ways, but differs from other proposals in examining normatively the effect of positive and negative prediction errors. In particular, unlike meta-learners that modify α, β, or γ, our proposal deliberately distorts value estimates. It relaxes the notion that a reinforcement learner should strive for accurate predictions; rather, it should strive for predictions that are effective after the properties (imperfections, constraints) of e.g. the action selection stage are taken into account.

The present results were obtained under specific circumstances: using a so-called “Q-learner” and a softmax action selection rule, with a temperature parameter equal to 0.3, experiencing static reward distributions. This was chosen to demonstrate the main point in a simple setting, commonly in use both as an RL testbed and in experiments. Nevertheless, the underlying insight is quite general: in the presence of noisy observations and specific properties of the decision stage (e.g. softmax), unbiased estimation of expected values may not lead to optimal performance. Of course, more sophisticated models, such as a Bayesian learner with appropriate (meta-)parameters and priors [41] or a risk-sensitive reinforcement learner [26], can in principle implement an optimal solution to the reinforcement learning problems explored here using unbiased estimates. However, these approaches are more computationally intensive to implement, and less straightforward to relate to neural signals and neural architectures. Q-learning and related models are in widespread use as models for fitting behavior and neurophysiological data in neuroscience, providing much evidence that biological learning mechanisms share at least some properties with these models ([29,25], but see [17]). Thus, the primary relevance of this work lies in its application to the interpretation of neuroscience data; we discuss some examples below.

The broad spectrum of more traditional model-free reinforcement learning approaches, such as the one taken here, suggests a number of possible extensions to the current proposal. For instance, it would be of interest to determine if differential learning rates can be advantageous in temporal credit assignment problems (i.e. those requiring a temporal-difference component, learning to choose actions that are not themselves rewarded but lead to reward later; for a rare experimental investigation of RL variables on such a task see [6]). We tentatively suggest that, because the reported performance differences arise from the stochasticity in outcome distributions, it may be sufficient to apply differential learning rates only to actions or states associated with actual rewards. The resulting biased value estimates may then be propagated using standard temporal-difference methods.

A number of experimental studies have linked the regulation of differential learning rates to the neuromodulator dopamine. For instance, Parkinsonian patients off medication (i.e. with reduced dopamine levels) update RL value estimates more in response to negative outcomes compared to positive outcomes; the reverse is the case when subjects are given L-DOPA treatment, elevating DA (dopamine) levels [15]. A similar effect was found based on genetic polymorphisms in DA receptors [14]. Thus, subjects with relatively low DA levels behave like our “pessimistic” learner, whereas those with high DA levels are more “optimistic”. Our results therefore suggest that Parkinsonian patients (off medication) would do well on tasks like our high-reward task, but poorly on low-reward tasks.

An important distinction is that between phasic and tonic dopamine, which are thought to contribute to learning and response vigor, respectively. In particular, tonic dopamine has been suggested to reflect the average reward rate or “opportunity cost” [27]. Our proposal uses this same estimate for the adaptive scaling of learning rates. Thus, in probabilistic learning tasks such as the ones used here, we would predict that under high reward rates, when tonic dopamine is high, negative RPEs have more impact on learning than positive RPEs because, as we show, this is adaptive in high-reward settings. This is a nontrivial prediction that can be tested with current methods; it is critical, however, that the experimental design can separate effects on learning rate from effects on the exploration-exploitation trade-off, which has also been linked to tonic DA [19].

Dysregulation of DA levels has also been linked to various manifestations of addiction, including those involving drugs of abuse [30] and problem gambling [4]. The application of reinforcement learning models to such behaviors suggests that a mis-estimation of choice values lies at the root of dysfunctional decision-making. Such mis-estimation may arise, for instance, from a non-compensable positive prediction error [30], or from inappropriate meta-learning of the adaptive learning rates proposed here. In this respect it is particularly interesting to note that several studies have specifically attributed meta-learning roles to frontal areas also linked to addiction [7,21,36]. The applicability of the normative ideas developed here to these settings would benefit from an empirical test of positive and negative learning rates in affected individuals, perhaps along the lines of [3].

Our results also suggest a further alternative for improving reinforcement learning fits to behavioral and neural data. As noted previously, to improve on the basic RL model with a single learning rate, some studies have used models with distinct learning rates for positive and negative prediction errors [13], requiring the estimation of two free parameters from the data. In contrast, because our meta-learner sets its learning rates based on observed reward probabilities, it can potentially account for a similar range of behaviors with fewer parameters (one meta-parameter w instead of the two learning rates).

How could differential learning rates be implemented in the brain? The experimental investigation of neural mechanisms for prediction-error-based learning and updating is currently an active area of research. An influential proposal is that RPEs may have different effects on direct (D1) and indirect (D2) pathways in the striatum [16,23]. In this view, RPEs may initially be encoded symmetrically in the firing rate of DA neurons, but the impact of these RPEs on basal ganglia synaptic weights may be modulated independently depending on their direction (sign). One possible mechanism for this may involve the modulation of presynaptic DA terminals in the striatum by e.g. hippocampal inputs [18], an idea that can be tested with current voltammetry methods. Alternatively, the coding of RPEs by DA neurons may be asymmetric to begin with, due to a smaller dynamic range in the negative direction from baseline [28]. More radical ideas propose a complete separation, with DA supporting learning only from positive RPEs [12]. In any case, these experimental findings demonstrate a striking level of dissociation between learning pathways for negative and positive prediction errors, suggestive of broad adaptive value. Our results suggest a normative explanation for these observations, and some encouragement can perhaps be derived from the demonstration that in (some) situations where rewards are rare, it pays to be optimistic.
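One hedged reading of the pathway-level proposal [16,23] can be sketched as follows (our illustration, not a model from the cited work): a single, symmetrically coded RPE reaches both pathways, but each pathway's weight change is gated by the RPE's sign, so differential learning rates arise downstream of a symmetric dopamine signal.

```python
def opponent_update(G, N, delta, alpha_pos=0.1, alpha_neg=0.1):
    """Sign-gated plasticity on opponent striatal pathways (illustrative).

    G: direct-pathway (D1, "Go") weight, strengthened by positive RPEs.
    N: indirect-pathway (D2, "NoGo") weight, strengthened by negative RPEs.
    The same RPE delta reaches both pathways, but only one updates,
    depending on the sign of delta.
    """
    if delta > 0:
        G = G + alpha_pos * delta
    else:
        N = N + alpha_neg * (-delta)
    return G, N

# A positive RPE strengthens Go, a negative RPE strengthens NoGo;
# a net action propensity could be read out as G - N.
G, N = 0.0, 0.0
G, N = opponent_update(G, N, +0.5)   # G increases, N unchanged
G, N = opponent_update(G, N, -0.5)   # N increases, G unchanged
```

Independent modulation of `alpha_pos` and `alpha_neg` in such a scheme, for instance by tonic DA, would yield exactly the asymmetric learning analyzed in this paper.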

Acknowledgements This work originated at the Okinawa Computational Neuroscience Course at the Okinawa Institute of Science and Technology (OIST), Japan. We are grateful to the organizers for providing a stimulating learning environment.

References

1. Timothy E. J. Behrens, Mark W. Woolrich, Mark E. Walton, and Matthew F. S. Rushworth. Learning the value of information in an uncertain world. Nature Neuroscience, 10(9):1214–1221, September 2007.
2. Ethan S. Bromberg-Martin, Masayuki Matsumoto, and Okihide Hikosaka. Dopamine in motivational control: rewarding, aversive, and alerting. Neuron, 68(5):815–834, December 2010.
3. James F. Cavanagh, Michael J. Frank, and John J. B. Allen. Social stress reactivity alters reward and punishment learning. Soc Cogn Affect Neurosci, 6(3):311–320, June 2011.
4. Henry W. Chase and Luke Clark. Gambling severity predicts midbrain response to near-miss outcomes. J Neurosci, 30(18):6180–6187, May 2010.
5. Mathieu D'Acremont and Peter Bossaerts. Neurobiological studies of risk assessment: a comparison of expected utility and mean-variance approaches. Cognitive, Affective & Behavioral Neuroscience, 8(4):363–374, December 2008.
6. Nathaniel D. Daw, Samuel J. Gershman, Ben Seymour, Peter Dayan, and Raymond J. Dolan. Model-based influences on humans' choices and striatal prediction errors. Neuron, 69(6):1204–1215, March 2011.
7. Nathaniel D. Daw, John P. O'Doherty, Peter Dayan, Ben Seymour, and Raymond J. Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879, June 2006.
8. Peter Dayan and Yael Niv. Reinforcement learning: the good, the bad and the ugly. Curr Opin Neurobiol, 18(2):185–196, April 2008.
9. Bradley B. Doll, W. Jake Jacobs, Alan G. Sanfey, and Michael J. Frank. Instructional control of reinforcement learning: a behavioral and neurocomputational investigation. Brain Res, 1299:74–94, November 2009.
10. Kenji Doya. Metalearning and neuromodulation. Neural Networks, 15(4-6):495–506, 2002.
11. Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple model-based reinforcement learning. Neural Comput, 14(6):1347–1369, June 2002.
12. Christopher D. Fiorillo. Two dimensions of value: dopamine neurons represent reward but not aversiveness. Science, 341(6145):546–549, August 2013.
13. M. J. Frank, A. A. Moustafa, H. M. Haughey, T. Curran, and K. E. Hutchison. Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proceedings of the National Academy of Sciences, 104(41):16311–16316, 2007.
14. Michael J. Frank, Bradley B. Doll, Jen Oas-Terpstra, and Francisco Moreno. Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation. Nature Neuroscience, 12(8):1062–1068, August 2009.
15. Michael J. Frank, Lauren C. Seeberger, and Randall C. O'Reilly. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–1943, December 2004.
16. C. R. Gerfen, T. M. Engber, L. C. Mahan, Z. Susel, T. N. Chase, F. J. Monsma, Jr., and D. R. Sibley. D1 and D2 dopamine receptor-regulated gene expression of striatonigral and striatopallidal neurons. Science, 250:1429–1432, December 1990.
17. Samuel J. Gershman and Yael Niv. Learning latent structure: carving nature at its joints. Current Opinion in Neurobiology, 20(2):251–256, April 2010.
18. Anthony A. Grace. Dopamine system dysregulation by the hippocampus: implications for the pathophysiology and treatment of schizophrenia. Neuropharmacology, 62(3):1342–1348, March 2012.
19. Mark D. Humphries, Mehdi Khamassi, and Kevin Gurney. Dopaminergic control of the exploration-exploitation trade-off via the basal ganglia. Frontiers in Neuroscience, 6:9, January 2012.
20. Daniel Kahneman and Amos Tversky. Prospect theory: an analysis of decision under risk. Econometrica, 47(2):263–292, 1979.
21. Mehdi Khamassi, Pierre Enel, Peter Ford Dominey, and Emmanuel Procyk. Medial prefrontal cortex and the adaptive regulation of reinforcement learning parameters. Prog Brain Res, 202:441–464, 2013.
22. Mehdi Khamassi, Stéphane Lallée, Pierre Enel, Emmanuel Procyk, and Peter F. Dominey. Robot cognitive control with a neurophysiologically inspired reinforcement learning model. Frontiers in Neurorobotics, 5:1, 2011.
23. Alexxai V. Kravitz, Lynne D. Tye, and Anatol C. Kreitzer. Distinct roles for direct and indirect pathway striatal neurons in reinforcement. Nature Neuroscience, 15(6):816–818, April 2012.
24. Zeb Kurth-Nelson and A. David Redish. Temporal-difference reinforcement learning with distributed representations. PLoS One, 4(10):e7362, 2009.
25. Tiago V. Maia and Michael J. Frank. From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience, 14(2):154–162, February 2011.
26. Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine Learning, 49:267–290, 2002.
27. Yael Niv, Nathaniel D. Daw, Daphna Joel, and Peter Dayan. Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology, 191(3):507–520, April 2007.
28. Yael Niv, Michael O. Duff, and Peter Dayan. Dopamine, uncertainty and TD learning. Behav Brain Funct, 1:6, May 2005.
29. John P. O'Doherty, Alan Hampton, and Hackjin Kim. Model-based fMRI and its application to reward learning and decision making. Annals of the New York Academy of Sciences, 1104:35–53, May 2007.
30. A. David Redish. Addiction as a computational process gone awry. Science, 306(5703):1944–1947, December 2004.
31. A. David Redish, Steve Jensen, Adam Johnson, and Zeb Kurth-Nelson. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychol Rev, 114(3):784–805, July 2007.
32. Wolfram Schultz. Behavioral theories and the neurophysiology of reward. Annual Review of Psychology, 57:87–115, January 2006.
33. Nicolas Schweighofer and Kenji Doya. Meta-learning in reinforcement learning. Neural Netw, 16(1):5–9, January 2003.
34. Tali Sharot. The optimism bias. Current Biology, 21(23):R941–R945, December 2011.
35. Tali Sharot, Christoph W. Korn, and Raymond J. Dolan. How unrealistic optimism is maintained in the face of reality. Nature Neuroscience, 14(11):1475–1479, October 2011.
36. Amitai Shenhav, Matthew M. Botvinick, and Jonathan D. Cohen. The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron, 79(2):217–240, July 2013.
37. Richard S. Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts Amherst, 1984.
38. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
39. Matthijs van der Meer, Zeb Kurth-Nelson, and A. David Redish. Information processing in decision-making systems. The Neuroscientist, 18(4):342–359, August 2012.
40. Christopher Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, May 1989.
41. Angela J. Yu. Adaptive behavior: humans act as Bayesian learners. Curr Biol, 17(22):R977–R980, November 2007.