Affective-Associative Two-Process theory: A neural network investigation of adaptive behaviour in differential outcomes training

Robert Lowe¹,² and Erik Billing¹

¹University of Skövde, Sweden
²University of Gothenburg, Sweden

Corresponding author: Robert Lowe, University of Skövde, Skövde 541 40, Sweden. Email: robert.lowe@his.se
Abstract
In this article we present a novel neural network implementation of Associative Two-Process (ATP) theory based on an
Actor–Critic-like architecture. Our implementation emphasizes the affective components of differential reward magnitude
and reward omission expectation and thus we model Affective-Associative Two-Process theory (Aff-ATP). ATP has been
used to explain the findings of differential outcomes training (DOT) procedures, which emphasize learning differentially
valuated outcomes for cueing actions previously associated with those outcomes. ATP hypothesizes the existence of a
‘prospective’ memory route through which outcome expectations can be brought to bear on decision making and can even
substitute for decision making based on the ‘retrospective’ inputs of standard working memory. While DOT procedures
are well recognized in the animal learning literature they have not previously been computationally modelled. The model
presented in this article helps clarify the role of ATP computationally through the capturing of empirical data based on
DOT. Our Aff-ATP model illuminates the different roles that prospective and retrospective memory can have in decision
making (combining inputs to action selection functions). In specific cases, the model’s prospective route allows for adap-
tive switching (correct action selection prior to learning) following changes in the stimulus–response–outcome
contingencies.
Keywords
Actor–Critic, reward learning, animal models, Associative Two-Process theory, neo-behaviourism, decision making
1 Introduction
1.1 Background
In recent years, there has been a growth of interest in
investigating mechanisms of learning, memory and
adaptive behaviour in human subjects by use of more
or less traditional behaviourist methods. Such investi-
gations have covered areas of individual learning and
decision making (Estévez, 2005; Esteban, Plaza, López-Crespo, Vivas & Estévez, 2014) as well as paradigms concerned with social decision making and behaviour (e.g. Sebanz, Knoblich & Prinz, 2005; Atmaca, 2011; Lowe, Almér, Lindblad, Gander, Michael & Vesper, 2016). This ‘minimalist’ metho-
dological perspective offers a controlled approach for
studying processes that may underlie complex cognitive
behaviour. Staddon (2014) has posited the importance
of a neo-behaviourist approach, which places a strong
focus on the influence of hidden or internal variables
on adaptive responding and analysis thereof. Internal
states are ‘not just desirable [but] essential to the
theoretical understanding of any historical system. The
behaviour of no organism above a bacterium can be
understood without this concept’ (Staddon, 2014, p.
176). The relationship between internal, or hidden,
states has often been described in terms of associative
links between outcome expectancies and behavioural
choices (e.g. de Wit & Dickinson, 2009). Minimalist
psychological models of adaptive behaviour have also
used such ‘neo-behaviourist’ terminology to explain the
role of internal states. Miceli & Castelfranchi (2014),
for example, have referred to an S→A→R mode of associative processing, where A stands for ‘Affect’. In this
case the links between affect and stimuli are hypothe-
sized as being bi-directional. An adaptive benefit of this
bi-directionality may be that affective states can have a
role in the selective attention to external stimuli, suppressing attention to those stimuli incongruent with the
present affective state.
The aim of the work presented in this article is to
provide a neo-behaviourist model of adaptive beha-
viour based on a particular two-process theory of learning, the empirical evidence for which indicates the existence of decision-mediating internal states.
1.2 The two-process theory of learning
A legacy of behaviourism is the view of learning as
occurring according to two distinct processes. First,
there is Pavlovian (or classical) conditioning, where
associations made between stimuli and outcomes are
not contingent upon behavioural (instrumental) inter-
vention. Second, there is instrumental (or operant) con-
ditioning where such stimulus–outcome associations
are contingent upon behaviour. Two-process theory
has emphasized the inter-dependency of these two pur-
portedly distinct processes (cf. Overmier & Lawry,
1979). Underlying two-process theories is the use of a
three-term contingency of instrumental learning: S–R–
O, where S = stimulus, R = response, and O = rein-
forcing outcome, and where the Pavlovian process
(through S–O associations) is embedded within the
instrumental process.
Historically, versions of two-process theory have
emphasized the energizing or motivational component
(cf. Mowrer, 1947) rather than directional mediational
control of responding. A perspective on two-process
conceptual models entailing mediational control was
later put forward (cf. Trapold, 1970; Trapold &
Overmier, 1972). This, Associative Two-Process theory
(ATP) has since provided a theoretical foundation for
explaining the results of transfer-of-control paradigms
in which outcomes (e.g. reinforcers) are qualitatively
differentiated in relation to the cueing stimulus and/or
response antecedent to them.
ATP (Figure 1) views two-process learning as consisting
of S–R (so-called retrospective) and S–E–R (so-called
prospective) routes where E represents an expectation
of an outcome (or mediator). ATP indicates that the
prospective route is formed according to a split between
S–E and E–R associations (Overmier & Lawry, 1979)
and that this split in learning is testable according to
transfer-of-control paradigms evidencing ATP. The ret-
rospective route is thus-named owing to the require-
ment of maintaining in memory an eliciting stimulus
over an inter-stimulus interval until such a time as a
response is pertinent. The prospective route is thus-
named owing to responding being mediated by an
expectation of a future outcome (E). ATP has remained
popular to the present day and is considered the leading
explanation for certain transfer-of-control effects (see
Urcuioli, 2005, for a review) as well as providing the
theoretical underpinnings to a stimulus classification by
outcomes perspectives on learning and adaptive deci-
sion making (Urcuioli, 2013).
Fundamental to adaptive behaviour in ATP theory
is that learning by differential outcomes expectations
provides an extra source of information by which to
mediate responding. Thereby, construction of S–E and
E–R differential associative contingencies (Figure 1D)
may (i) substitute for or (ii) overshadow, information (or
lack thereof) obtained by learned S–R contingencies
(Urcuioli, 1990). The S–E–R prospective route, in addi-
tion, may facilitate learning and decision making/
responding, based on the S–R retrospective route (i.e.
where the S–R contingencies are incompletely learned).
The use of differential outcomes training (DOT) proce-
dures have been used to identify the existence of ATP ret-
rospective and prospective processing routes. Such
procedures have also been used to investigate the so-called
differential outcomes effect (DOE; cf. Trapold, 1970). The DOE simply states that learning differential S–R pairs (e.g. S1–R1 and S2–R2) reinforced by an outcome O occurs more rapidly over presentations/trials when outcomes are differentiated per pair (e.g. S1–R1→O1 and S2–R2→O2) than when the outcomes are the same for
the pairs (compare Figure 1C with Figure 1D).
Figure 1. Associative Two-Process theory and its underpinnings, adapted from Peterson & Trapold (1980). (A) Standard behaviourist perspective where S–R associations are learned via the outcome expectation representation (value). (B) Outcome expectation can cue responding following S–E, E–R learning. (C) Non-differential outcomes expectations do not provide extra information for cueing responses. (D) Differential outcomes expectations provide additional information to stimuli for cueing responses. Here λ denotes non-differential outcomes; λ1 and λ2 are differential.

1.3 Two-process theory and reinforcement learning

In a variety of experimental investigations using DOT, many types of differential outcomes have been used, including explicitly and non-explicitly rewarding types, using as experimental subjects non-human animals (Peterson & Trapold, 1980; Urcuioli, 1990) and infant and adult humans (cf. Maki, Overmier, Delos & Gutmann, 1995; Estévez & Fuentes, 2001; Estévez, Overmier & Fuentes, 2003; Estévez, 2005; Martínez et al., 2012; Esteban et al., 2014). Differential reward-
based outcomes can take many forms:
1. type (e.g. food versus water);
2. quantity (magnitude/amount of reinforcement);
3. probability (every episode presentation versus prob-
abilistic presentation);
4. delay and duration of presentation.
This list of reinforcement dimensions (cf. Friedrich
& Zentall, 2011) can be compared with those identified
by Doya (2008) who suggests that in computational
models ‘the value of a reward given by an action at a
state is a function of reward amount, delay and prob-
ability.’ (p. 410). This is in contrast with reinforcement
learning (RL) algorithms that conflate reinforcement
information into a single dimension in the tradition of
classical RL constructions, e.g. Actor–Critic, SARSA.
A useful conceptual comparison can be drawn between
the two-process and, specifically, ATP framework, and
the Actor–Critic architecture. In the latter, a reward
valuation of a state (Sutton & Barto, 1998) or temporal
stimulus representation (Suri & Schultz, 1998) provides
the stimulus–expectation computation (the Critic), or
S–E association. An independent S–R processing route
whose associative links are updated by the rewarding
outcome (prediction errors) provides the basis of the
Actor. Actor–Critic architectures, however, lack direct
mediation of responses (actions) by the outcome expec-
tations. There is no S–E–R route.
Kruse & Overmier (1982), see also Overmier &
Lawry (1979), through their Associative Mediational
Theory (AMT) accounted for how differential prob-
abilistic rewarding outcomes can, in turn, differentially
mediate response control. In this case, such differential
outcome expectancies are viewed as learned cues that
substitute for external stimuli. For example, E1 can
represent one reward probability expectation, and E2 a
different reward probability expectation. Such E values,
as inputs to response options, may also induce energiz-
ing effects (Overmier & Lawry, 1979): this type of ener-
gizing has been labelled ‘affective’ by the authors.
However, differential outcomes based on rewards (and
punishers) have also been viewed as means by which
different emotions may be elicited (Rolls, 1999, 2013).
Reward receipt induces positive emotions (affective
states) such as excitement or joy, while omission of
expected reward induces negatively valenced emotions
such as frustration. On this basis, we use the term
‘affect’ in the broad sense of describing both reward (or
punishment)-relevant properties of energizing and
(appraisal-relevant) expectancies that may elicit or cue
behavioural responses.
1.4 Breakdown
In this paper we provide an RL neural network imple-
mentation of Affective-Associative Two-Process theory
(Aff-ATP). The aim of this model is to shed light, com-
putationally, on the relationship between the hypothe-
sized retrospective and prospective memory routes as
they mediate decision making. It is also to demonstrate
the plausibility and requisite computational elements of
Aff-ATP’s implementation. In Section 2, we describe
the Aff-ATP model and discuss its theoretical deriva-
tion. In Section 3, we discuss the methodology of the
experiments (carried out on pigeons) that are simulated
to validate our model. In Section 4, we provide results
of this validation test and clarify the function of differ-
ent components of the model. Finally, in Section 5 we
provide a discussion.
2 The Aff-ATP model: a neural network
implementation
It has been extensively demonstrated (Urcuioli, 2005)
that the ATP theory (Trapold, 1970) (see also Figure 1)
can account for, and predict, learning and decision
making patterns when contrasting differential out-
comes to non-differential (or common) outcomes train-
ing procedures. From a computational perspective the
mechanism is less clear.
It has been empirically validated that DOT effects
occur in relation to multiple outcome dimensions (cf.
Urcuioli, 1990). Such dimensions, as they pertain to
reinforcing outcomes, were alluded to in the introduc-
tion: type, quantity/magnitude, probability (of outcome), and timing (delay and duration of outcome). A relevant
model that implicitly captures two such dimensions,
magnitude and (omission) probability, and that is
explicable both in neural networks and animal learning
terms, exists via Balkenius & Morén (2001) (see also Morén, 2002). This model has previously been used to
account for animal learning data based on reward pre-
sentation and omission (of expected reward) mediation
of behaviour. It thereby offers a hint at how our Aff-ATP model may be implemented such that it captures related DOT data.
The Balkenius & Morén (2001) model has several variants (Balkenius & Morén, 2001; Morén, 2002; Balkenius, Morén & Winberg, 2009; Billing & Balkenius, 2014); see Figure 2 for a minimalist depiction of the version of Morén (2002).
The Morén (2002) model was tested extensively on a
number of standard animal learning paradigms includ-
ing (i) acquisition–extinction–reacquisition, (ii) block-
ing, (iii) conditioned inhibition, demonstrating typical
animal learning behavioural effects, e.g. accelerated re-
learning following the reacquisition phase in (i). The
adapted learning model was also utilized on a robot
(head) permitting an orienting response to visually and
emotionally salient stimuli (Balkenius et al., 2009). It
should be emphasized that, from an animal learning
perspective, the network was not explicitly designed
accommodate differential outcomes scheduling. From
a machine learning perspective, it was not embedded
within an Actor–Critic network nor implemented
within a TD learning framework: it adapts the
Rescorla & Wagner (1972) rule. Nevertheless, its multi-
dimensional reinforcement computations implicitly
provide a type of differential outcomes Critic imple-
menting differential S–E associations as part of our
Aff-ATP model. Below the Morén (2002) model is our own differential outcomes Critic (Figure 3) that simi-
larly enables computations across the dimensions of
(omission) probability and magnitude but, since it
exploits a TD implementation, it also computes delay
of reinforcement.
The Aff-ATP model implementation that we pro-
pose in the next section, combines a differential out-
comes computing Critic with an Actor that enables two
forms of learning: (i) retrospective, which we liken to a
standard Actor–Critic approach¹ in that action is cued
by (looks back to) the initial stimulus; (ii) prospective,
action is cued by (looks forward to) dimensions of out-
come expectations (value) supplied by the Critic.
Retrospection versus prospection effects have been
much emphasized (cf. Wasserman & Zentall, 2006).
2.1 The Aff-ATP neural network
We present a neural network architecture that, in com-
puting the three critical dimensions of reinforcement
identified by Doya (2008) (i) magnitude, (ii) probabil-
ity, (iii) (delay) timing, of reinforcement, provides an
important theoretical and computational model of the
ATP theory, i.e. Aff-ATP theory. We divide our
description of the Aff-ATP model into Critic and
Actor components.
2.2 Aff-ATP Critic: differentiating omission
probability
The Differential outcomes Critic depicted in Figure 3
has the ability to differentiate between stimuli across
the dimension of omission probability. This constitutes
an extension of the Critic architecture of Balkenius &
More
´n (2001) depicted in Figure 2. The Critic com-
prises two TD(0) sub-networks, for learning magnitude
and omission probability values, where omission value
is learned as a function of reward (magnitude) value.
The error node of the omission Critic (red node in
Figure 3) is updated when rewarding input (from the
target), learned to be expected in the magnitude Critic,
fails to arrive, yielding a discounted (positive) prediction error of omission. This omission prediction error is initially zero when reward does arrive (the negative prediction error is correctly predicted). With learning, when reward is
not forthcoming, the higher the reward expectancy
(computed as magnitude), the greater will be the omis-
sion prediction error (positive) used to update the omis-
sion probability value representation. When predicted
reward omission does not occur (the reward is pre-
sented) a negative prediction error is generated which
reduces the omission probability value. Updating the
value function entails using a discrete time prediction
error for the magnitude Critic (see Equation (4)) and
adaptation thereof for the omission Critic (Equation
(5)). The learned value functions for the two sub-
networks assume the profile of typical TD discounted
delay where maximum values indicate the precise time
of expected magnitude and omission probability, in
(0,1). As for Figure 2, expectation of omission inhibits
the magnitude expectation output (to the Actor), a
relayed value whose activation does not interfere with
magnitude learning. This means that as a result of a sti-
mulus input the learned stimulus-expectation of omis-
sion will provide the dominant output (to the Actor) as
a function of its strength (expected probability of omis-
sion).

Figure 2. The Morén and Balkenius model. S = stimulus number. Non-arrowed connections are either excitatory or inhibitory.

Figure 3. Differential outcomes Critic.

Equations (1)–(5) implement the Critic:

$$V_e(t) = f_0\!\left(\sum_{n=1}^{N}\sum_{s=1}^{S} \omega_{ens}(t)\,\phi_{ns}(t)\right) \quad (1)$$

$$\omega_e(t) = \omega_e(t-1) + \beta_e\,\delta_e\,\phi_{ns}(t-1) \quad (2)$$

$$f_0(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \le x < 1 \\ 1, & x \ge 1 \end{cases} \quad (3)$$
where $V_e(t)$ is the learned value function (expectation); $\omega_e(t)$ is the value function (S–E) update rule; $e \in \{m, o\}$ is an index denoting the magnitude or omission Critic value functions, respectively; $n$ indexes the discrete stimulus trace representations in $(1, N)$, where $N = 10$; $t$ is time in $(1, T)$, where $T = 10$; $s$ indexes the different stimuli in $(1, S)$, where $S = 2$ here; $\beta_e$ is a learning rate in $[0, 1)$; $\delta_e$ is the prediction error term, utilized here as $\delta_e \in (-1, 1)$ but where $[\delta_m]^+$, i.e. a non-negative input, is used when $e = m$; $\Phi$ is the input stimulus vector of size $N \times S$ for each timestep $t$, the so-called compound stimulus (see Suri, 2002). The 10 timesteps
constitute the duration of each learning trial, reset to 1
at the start of each trial (stimulus onset). Each vector
of the compound stimulus has a single unit set to 1 and
all others set to 0. A given vector represents the trace
delay of a phasically (short duration) presented stimu-
lus across the inter-stimulus interval (time between sti-
mulus and reward presentations). This means a unique
vector represents a time step following offset of the
stimulus presentation whose unity value provides a pre-
synaptic component of the two-factor learning rule (the
other ‘factor’ being the prediction error). It must be
noted that the stimulus representation is not the exter-
nal (conditioned) stimulus per se. We have
$$\delta_m(t) = \lambda(t-1) + \gamma V_m(t) - V_m(t-1) \quad (4)$$

$$\delta_o(t) = -\delta_m(t) + \gamma V_o(t) - V_o(t-1) \quad (5)$$
where $\delta_m(t)$ and $\delta_o(t)$ represent prediction errors used to update the magnitude and omission Critics, respectively, and to better approximate their value functions; $\lambda(t)$ is the reward signal in $(0, 1)$; $\gamma = 0.99$. This formalizes the Critic insofar as it computes omission probabilities relative to a fixed reward magnitude. Computationally, the Critic is equivalent to a standard TD(0) algorithm since in a standard Critic the value calculation is a function of magnitude and probability of presentation, i.e. $[1 - V_o]$, and where magnitude = 1, $V_o$ represents the probability of omission. The omission
prediction error need only change its sign to maintain
error values for learning. Dividing the standard Critic
according to these two dimensions provides the possi-
bility to exploit the differential value informational
content of the Critic in the Actor for purposes of action
selection. An example of the computation of the neural
network implementation of this Critic can be found in
Figure 4.
Figure 4. Differential outcomes computations. Computations of the Critic following learning of the value of a particular stimulus
(0.5 omission probability of reinforcement of magnitude value 1.0). Left: Pre-reinforcement step. At the time step prior to reinforcer
presentation. Right: Reinforcement step. At the timestep of reinforcer presentation. Darker nodes represent greater activation;
black = full activation of 1, white = zero activation.
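To make the Critic computation concrete, the following is a minimal Python sketch (not the authors' code) of equations (1)–(5): two TD(0) sub-networks over the compound stimulus trace, one learning reward magnitude and one learning omission probability. N, S, T and γ follow the text; the learning rate β = 0.1, the reward timestep, the sign convention assumed for equation (5) and the final probe are illustrative assumptions.

```python
# Minimal sketch of the differential outcomes Critic, equations (1)-(5).
import numpy as np

N, S, T = 10, 2, 10            # trace units, stimuli, timesteps per trial
gamma, beta = 0.99, 0.1        # discount factor (text) and learning rate (assumed)

w_m = np.zeros((N, S))         # magnitude value weights (eq. (2), e = m)
w_o = np.zeros((N, S))         # omission value weights (eq. (2), e = o)

def clip01(x):
    """f0 of eq. (3): clip to [0, 1]."""
    return float(np.clip(x, 0.0, 1.0))

def phi(t, s):
    """Compound stimulus: one trace unit active per timestep after onset of s."""
    p = np.zeros((N, S))
    p[t - 1, s] = 1.0
    return p

def run_trial(s, reward_time=8, magnitude=1.0, omitted=False):
    """One trial of TD(0) updates for the magnitude and omission sub-networks."""
    global w_m, w_o
    Vm_prev = Vo_prev = lam_prev = 0.0
    for t in range(1, T + 1):
        p = phi(t, s)
        Vm = clip01(np.sum(w_m * p))                # eq. (1), e = m
        Vo = clip01(np.sum(w_o * p))                # eq. (1), e = o
        if t > 1:
            d_m = lam_prev + gamma * Vm - Vm_prev   # eq. (4)
            d_o = -d_m + gamma * Vo - Vo_prev       # eq. (5): sign-flipped error (assumed)
            p_prev = phi(t - 1, s)
            w_m += beta * max(d_m, 0.0) * p_prev    # [d_m]+ : magnitude is not unlearned
            w_o += beta * d_o * p_prev
        lam_prev = magnitude if (t == reward_time and not omitted) else 0.0
        Vm_prev, Vo_prev = Vm, Vo
    pre = phi(reward_time - 1, s)                   # pre-reinforcement step (cf. Figure 4)
    return clip01(np.sum(w_m * pre)), clip01(np.sum(w_o * pre))

# Train on a stimulus whose reward (magnitude 1.0) is omitted half the time:
# the omission value tends towards the omission rate (roughly 0.5).
rng = np.random.default_rng(0)
for _ in range(300):
    run_trial(s=0, omitted=rng.random() < 0.5)
print(run_trial(s=0, omitted=False))
```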
2.3 Aff-ATP Actor
The Actor of our Aff-ATP model is divided into retro-
spective and prospective components where the latter
uses inputs relayed from the Critic to cue actions unlike
the former component, which, similar to standard
Actor–Critic architectures, implements a policy sepa-
rate from the Critic.
2.3.1 Retrospective Actor. Retrospection concerns main-
taining in memory over a delay the stimulus associated
with a particular response. Equations (6)–(8) describe
the retrospective actor:
$$s_{rj}(t) = \sum_{s \in S}\left(c_{js}(t)\,S_s(t) - \zeta\,C_{rj}(t)\right) + \xi \quad (6)$$

where $s_{rj}(t)$ are the Retrospective Actor output units to the action layer and $j \in \{1, 2\}$; $c_{js}(t)$ is a connection weight between $s_{rj}$ and the stimulus representation $S_s \in \{0, 1\}$, with $s \in \{1, 2\}$ indexing stimuli; $\zeta = 5$ is a constant that ensures a strong inhibitory effect on $s_{rj}(t)$ by $C_{rj}(t)$, which is the output of the Prospective Actor. Finally, $\xi \in 0.1\,[0, 1]$ is a noise term. We have
$$e_{sjs}(t) = \begin{cases} 1, & t = 1 \\ (1 - n)\,e_{sjs}(t-1), & \text{otherwise} \end{cases} \quad (7)$$

where $e_{sjs}(t)$ is the eligibility trace connecting responses $j$ to stimuli $s$, used to update the $s_{rj}(t)$ weights $c_{js}(t)$; $1 - n$ is the degree of eligibility per time step following $S_s$ presentation, where $S_s$ is presented at $t = 1$ and $n = 0.1$. We have
$$c_{js}(t) = f_0\!\left(c_{js}(t-1) + \rho_1\,s_{rj}(t)\,\lambda(t)\,e_{sjs}(t) - \rho_2\,c_{js}(t-1)\right) \quad (8)$$

where $\rho_1 = 1$, ensuring so-called one-shot learning, and $\rho_2 = 0.0005$, which ensures a slow decay in the absence of reinforcement. This decay is necessary, in the absence of negative prediction errors from the Critic, to permit unlearning.
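A minimal sketch (one illustrative trial, not the authors' implementation) of the Retrospective Actor of equations (6)–(8): stimulus–response weights updated by a reward-gated eligibility trace, with output inhibited by the Prospective Actor. The parameters ζ = 5, n = 0.1, ρ1 = 1 and ρ2 = 0.0005 follow the text; the trial timing, the winner-take-all read-out and the reward rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_resp, n_stim = 2, 2
zeta, n_decay, rho1, rho2 = 5.0, 0.1, 1.0, 0.0005
c = np.zeros((n_resp, n_stim))            # retrospective S-R weights c_js

def retro_output(stim, C_r):
    """Eq. (6): stimulus-driven response drive, inhibited by the Prospective
    Actor output C_r (scaled by zeta) and perturbed by a small noise term xi."""
    xi = 0.1 * rng.random(n_resp)
    return c @ stim - zeta * C_r + xi

def eligibility(t):
    """Eq. (7): trace equals 1 at stimulus onset (t = 1), then decays by (1 - n)."""
    return (1.0 - n_decay) ** (t - 1)

# One illustrative trial: stimulus 1 presented at t = 1, response read out at t = 8.
stim = np.array([1.0, 0.0])
s_r = retro_output(stim, C_r=np.zeros(n_resp))
j = int(np.argmax(s_r))                   # winner-take-all response (assumed read-out)
lam = 1.0 if j == 0 else 0.0              # reward only the 'correct' response (assumed)
e = eligibility(8) * stim                 # eligibility of the presented stimulus
# Eq. (8): reward-gated one-shot learning with slow decay (rho2) for unlearning.
c[j] = np.clip(c[j] + rho1 * s_r[j] * lam * e - rho2 * c[j], 0.0, 1.0)
print(c)
```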
2.3.2 Prospective Actor. Prospection concerns anticipat-
ing an outcome over a delay. The Prospective Actor
directly processes, and transforms, inputs from the
Critic’s two value dimensions. The output of the
Prospective Actor mediates Action selection through
(a) facilitating, (b) substituting for, (c) overshadowing/
suppressing, the influence of the Retrospective Actor.
There are two key aspects of how the Prospective Actor
interacts with the Retrospective Actor for mediating
Action Selection. (1) Output of the Critic’s Value dimensions (Magnitude, Omission probability) is sent to an ‘Affective Classifier Network’. This network provides an approximate classification of Critic outputs into Pessimistic and Optimistic valuations of the given stimulus, permitting differential affective expectation. (2) Output of the Affective Classifier Network is sent to a ‘Prospective Response Classifier Network’. At the input
layer this network may have more than one node acti-
vated as a result of an XOR-affective classified output
node projecting to more than one Prospective input
node. Input layer nodes relay activation to correspond-
ingly indexed nodes at an output layer. Through inhibi-
tion of nodes in the output layer by nodes at the input
layer (opposite to that of the relaying input node) an
XOR classification is again approximated. In this way,
the Prospective Actor only mediates Action Selection if:
(a) a stimulus has a dominant affective classification,
(b) an affective classification has a dominant response
classification. Note, it is necessary for XOR classifica-
tion to be only approximate so as not to prohibit learn-
ing where Critic-relayed outputs are less than one.
Equations (9)–(15) describe the Prospective Actor.
Equations (9)–(12) describe the Affective Classifier
Network, while equations (13) and (14) describe the
Prospective Response Classifier Network (see Figure 4).
We have
$$E_{pes}(t) = V_o(t) \quad (9)$$

$$E_{opt}(t) = f_0\!\left(V_m(t) - E_{pes}(t)\right) \quad (10)$$
where Epes is the omission probability relay node and
constitutes a ‘pessimistic’ affective representation (i.e. a
reward omission probability value) while Eopt is the
magnitude relay output of the Critic and constitutes an
‘optimistic’ affective representation (i.e. a reward acqui-
sition probability value). These nodes interface Critic
and Actor. We have
$$C_c(t) = f_0\!\left(E_c(t) - X(t)\right) \quad (11)$$

where $c \in \{pes, opt\}$ and $C_c$ are nodes whose values pass to the Prospective Response Classifier Network first layer $e_{rj}(t)$, see Equation (13). We have

$$X(t) = \left(E_{pes}(t) + E_{opt}(t)\right) - \left|E_{pes}(t) - E_{opt}(t)\right| \quad (12)$$
which implements an XOR-like filter mechanism of the
Affective Classifier Network (central node in the net-
work): one and only one input from the Critic permits
Cpes or Copt node activation.² Here X(t) is a node that
inhibits the output of Cc(t) nodes (Equation (11)) as a
function of their difference in value: the greater the dif-
ference, the lesser the inhibition. In the absence of this
mechanism, if two strong Cc(t) activations are permit-
ted, during learning, weights from both nodes to acti-
vated erj(t) nodes will be strengthened. This will lead to
ambiguous affective control over prospective responses,
for each stimulus–reinforcement pair. We have
$$e_{rj}(t) = v_{pes,j}\,C_{pes}(t) + v_{opt,j}\,C_{opt}(t) \quad (13)$$
where $e_{rj}(t)$ receive inputs from $C_{pes}(t)$ and $C_{opt}(t)$ weighted by $v_{cj}$, which connect $C_c$, where $c \in \{pes, opt\}$, and $e_{rj}$, where $j \in \{1, 2\}$. We have

$$C_{rj}(t) = e_{rj}(t) - e_{rj'}(t) \quad (14)$$

where $C_{rj}(t)$ provides output to the action layer and inhibits the $s_{rj}(t)$ vector, where $j \in \{1, 2\}$ and $j' \neq j$.
Inputs from corresponding erj(t) nodes are mutually
inhibitory producing an XOR-like gate for the
Prospective Response Classifier Network. In the
absence of the Crj nodes, erj nodes could activate multiple action nodes: ambiguous action selection. We have
$$v_{cj}(t) = f_0\!\left(v_{cj}(t-1) + \rho_1\,C_c(t)\,\lambda(t)\,u_j(t) - \rho_2\,v_{cj}(t-1)\right) \quad (15)$$

where $v(t)$ is the weights vector for $C_c(t)$ ($c \in \{pes, opt\}$) of size 4 projecting to $e_{rj}(t)$. As for the $C$ weights, the values of these weights decay in the absence of reinforcement (term: $\rho_2\,v_{cj}(t-1)$). The
three-factor rule linking Cc(t) to action units (gated by
the reinforcement signal) is non-Hebbian though action
units may project back to the ‘pre-action’ (erj(t)) units
to no functional detriment. Such re-entrance might be
useful for entraining a particular action over an indefi-
nite time period.
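As a concrete reading of equations (9)–(14), the two XOR-like stages of the Prospective Actor can be sketched as follows; the weight matrix v and the probe values at the end are assumed for illustration, and the weight update of equation (15) is omitted.

```python
import numpy as np

def f0(x):
    """Eq. (3): clip to [0, 1]."""
    return np.clip(x, 0.0, 1.0)

def affective_classifier(V_m, V_o):
    """Eqs. (9)-(12): relay Critic values to pessimistic/optimistic nodes and
    apply the filter X = (E_pes + E_opt) - |E_pes - E_opt| = 2*min(E_pes, E_opt),
    which suppresses both nodes unless one expectation clearly dominates."""
    E_pes = V_o                                   # eq. (9)
    E_opt = f0(V_m - E_pes)                       # eq. (10)
    X = (E_pes + E_opt) - abs(E_pes - E_opt)      # eq. (12)
    return f0(E_pes - X), f0(E_opt - X)           # eq. (11): C_pes, C_opt

def prospective_response(C_pes, C_opt, v):
    """Eqs. (13)-(14): weighted drive to the two response nodes, with mutual
    inhibition forming the second XOR-like gate. v[c, j] is the weight from
    affective class c (0 = pes, 1 = opt) to response j."""
    e_r = v[0] * C_pes + v[1] * C_opt             # eq. (13)
    return np.array([e_r[0] - e_r[1],             # eq. (14): C_r1
                     e_r[1] - e_r[0]])            # eq. (14): C_r2

# Assumed illustration: a stimulus whose reward is never omitted produces an
# 'optimistic' classification, which (with these learned weights) cues response 1.
v = np.array([[0.0, 1.0],    # pessimistic expectation cues response 2
              [1.0, 0.0]])   # optimistic expectation cues response 1
C_pes, C_opt = affective_classifier(V_m=1.0, V_o=0.0)
print(prospective_response(C_pes, C_opt, v))      # response 1 dominates
```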
2.3.3 Action layer. The action layer receives competing
inputs from retrospective and prospective actors:
$$a_j(t) = \zeta\,C_{rj}(t) + s_{rj}(t) \quad (16)$$

where the activity of $a_j(t)$ is transformed into binary values ($u_j(t)$) according to the winner of a stochastic action selection algorithm ($\epsilon = 0.05$ probability of sub-optimal action selection); $\zeta = 5$ is an amplification constant.
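A small sketch of the action layer of equation (16) and the stochastic read-out described above; treating 'sub-optimal selection with probability ε' as a flip to the other of the two responses is our assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
zeta, eps = 5.0, 0.05

def select_action(C_r, s_r):
    """Eq. (16): amplified prospective drive plus retrospective drive, winner
    taken, with probability eps of selecting the sub-optimal (other) response."""
    a = zeta * C_r + s_r
    j = int(np.argmax(a))
    if rng.random() < eps:
        j = 1 - j
    u = np.zeros_like(a)
    u[j] = 1.0                # binary action vector u_j(t)
    return u

print(select_action(C_r=np.array([0.8, -0.8]), s_r=np.array([0.1, 0.2])))
```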
2.4 The complete Aff-ATP network

Figure 5. Actor neural network architecture. See the main text for further description.

Figure 6. The complete Aff-ATP network. See the main text for a detailed description. From the perspective of ATP theory, the neural network architecture can be divided into three main components: (1) a Critic, which implements S–E associations; (2) a Prospective Actor, which implements E–R associations, interfacing S–E and E–R according to its two XOR networks (above and below the λ-gated plastic connections); (3) a Retrospective Actor, which implements S–R associations.

The full instantiation of the architecture (Figure 6) depicts how the Critic (1), Prospective Actor (2) and Retrospective Actor (3) components interface. Here, stimuli valuations are computationally differentiated by reward omission probability, e.g. one stimulus may be omitted 0.5 of the time, a different stimulus at 0.0 probability. However, we have also extended this network to differentially valuate stimuli by reward magnitude. Differential Magnitude here comprises a simple memory model, which compares the value associated with the currently presented stimulus to that remembered for the alternative stimulus. It then projects directly to the Omission relay node (Epes, in dark blue) which, through inhibiting the Magnitude relay node (Eopt), now computes the difference in magnitude valuations of the two stimuli. This requires relaxing the constraint that reinforcement presence (magnitude) cannot be
unlearned since a (reduced) shift in magnitude will not
be possible in the absence of an unlearning rate (see
Section 5). Thus, a small unlearning rate is introduced.
The equations for the memory model (differential mag-
nitude comparator) are as follows:
$$V'_m(t) = V_m(t) \quad (17)$$

where $V'_m(t)$ is a memory for the magnitude node of the stimulus $s$ that was not the most recently presented. We have

$$d(t) = f''\!\left(V'_m(t) - V_{ms}(t)\right)\phi_{ns}(t) \quad (18)$$

where the value of $V'_m(t)$ is compared with the magnitude valuation $V_{ms}(t)$ of the most recently presented (alternative) stimulus and then multiplied (gated) by $\phi_{ns}(t)$ for the stimulus trace $n$ and stimulus $s$, and where $V'_m(t) \neq V_{ms}(t)$ (see the supplementary material for further description of the memory component). We have

$$f''(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases} \quad (19)$$

$$E_{pes}(t) = f_0\!\left(d(t) + V_o(t)\right) \quad (20)$$
The extent to which the magnitude valuation is lower than that of the alternative is relayed to the Epes node. It can be
seen as a value of ‘relative’, or ‘comparative’, omission
(lower magnitude value). Note, we use this Epes formu-
lation in the simulation of the experiments described in
the next section.
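A minimal sketch of the differential magnitude comparator of equations (17)–(20); the probe values (magnitudes 1.0 and 0.2) are those of Simulation 2, while the unit trace value is an assumption.

```python
import numpy as np

def f0(x):
    """Eq. (3): clip to [0, 1]."""
    return float(np.clip(x, 0.0, 1.0))

def f_rect(x):
    """f'' of eq. (19): rectification."""
    return max(x, 0.0)

def E_pes(V_m_remembered, V_m_current, V_o_current, trace=1.0):
    """Eqs. (17)-(20): the remembered magnitude of the alternative stimulus is
    compared with that of the presented stimulus; any shortfall ('relative
    omission') is added to the pessimistic relay node on top of true omission."""
    d = f_rect(V_m_remembered - V_m_current) * trace   # eq. (18), gated by the trace
    return f0(d + V_o_current)                         # eq. (20)

# Presented stimulus valued at 0.2 versus a remembered alternative of 1.0 and no
# true omission: the comparator alone drives the pessimistic classification.
print(E_pes(V_m_remembered=1.0, V_m_current=0.2, V_o_current=0.0))  # 0.8
```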
3 Methodology: experimental set-up
Urcuioli (2005) described a number of differential out-
comes procedures which provide compelling beha-
vioural evidence for the existence of an ATP. These
procedures entail the manipulation of the S (stimulus),
R (response) and O (outcome) contingencies according
to instrumental and Pavlovian stages of learning. The
rewarding outcome (O) in the instrumental condition-
ing stage(s) is contingent upon R following the presen-
tation of S. In the Pavlovian stage, however, the
organism is a passive recipient of O following the pre-
sentation of S.
The study of Peterson & Trapold (1980) represents
one of the seminal investigations of the ATP in animal
(pigeon) learning. The experiment, a delayed two-choice
conditional discrimination task, was devised in order to
test an outcome-expectancy mediation theory. This the-
ory is equivalent to ATP theory where two separable
associative links are posited (S–E and E–R) affording a
degree of independence between S and R. Manipulation
of learning contingencies that identify the existence of
these separable links within the prospective learning
path was a major motivation of the research. For each
of four conditions, the steps of the trial set-up are as
follows.
Step 1. Instrumental (S–R) differential outcomes
learning.
Step 2. Pavlovian reversal of S–O learning.
Step 3. Instrumental (S–R) transfer.
For Steps 1 and 3, there are a number of sequential
stages in the experimental procedure. These are out-
lined in Figure 7.
In Step 2, Stage 2, the response possibilities are dis-
abled (side panels darkened). The pigeons were investi-
gated in an operant conditioning chamber with (i) three
translucent response keys illuminated by out-of-sight
projectors, (ii) a food hopper, (iii) a houselight. Steps 1
and 2 are schematized in Figure 8. In Step 1 the pigeon
is required to learn that stimulus 1 requires response 1
in order to receive the reinforcer (reward) that is out-
come 1. Response 2 following stimulus 1 yields no
reward. Stimulus 2 requires response 2 in order to yield
outcome (reward) 2. ATP theorizes that an expectancy
for outcome 1 develops following stimulus 1 (S1→E1) and similarly an expectancy for outcome 2 develops following stimulus 2 (S2→E2). The two expectancies then cue differential responses according to learning (E1→R1 and E2→R2, for the respective stimuli). In
Step 2 the stimulus–outcome contingencies from Step 1
were either the same, or opposite in a Pavlovian
stimulus–outcome learning step, i.e. one in which
responses from Step 1 were no longer possible. Step 3
(Figure 9) illustrates the procedure for the key instru-
mental transfer phase. This phase is composed of four
groups organized across the two Pavlovian conditions
(same, opposite) of Step 2 and across two instrumental
response-outcome contingencies in relation to Step 1
(original, reversal). This phase, thereby, tests all possi-
ble relevant associative relations regarding S–E and E–R as they may influence response control and provides an excellent framework for evaluating ATP theory.

Figure 7. Visualization of the Peterson & Trapold (1980) experiment 1 trial set-up for Steps 1 and 3. In a given trial, the subject (pigeon) undergoes the following stages: Stage 0, the pigeon initiates the trial by pecking on the centrally illuminated panel (of three panels) in the operant conditioning chamber; Stage 1, the pigeon is then (0 second delay) presented, in each trial, with either a red (S1) or green (S2) light in the central panel for 5 seconds; this represents the stimulus; Stage 2, after a 2 second delay (trace conditioning) the pigeon can peck either the left or right coloured panel (response); Stage 3, the pigeon receives either a grain or light outcome contingent upon a stimulus-specific correct response.
The experimental procedure followed by Peterson &
Trapold (1980) can be summarized, with respect to
variables of interest to our purely neural network repli-
cation study, in the table in Figure 10. In this table, we
list the simulation variables we used to replicate the
experimental set-up and similarly follow a trace condi-
tioning S–R set-up.
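To make the contingency manipulations concrete, the three steps might be encoded as follows; this is our own illustrative encoding of the same/opposite and original/reversal conditions, not the authors' simulation code or the exact parameter values of Figure 10.

```python
# Step 1 instrumental DOT: (stimulus, response) -> outcome; None = no reward.
STEP1 = {
    ('S1', 'R1'): 'O1', ('S2', 'R2'): 'O2',
    ('S1', 'R2'): None, ('S2', 'R1'): None,
}

def step2(pavlovian):
    """Step 2, Pavlovian stage: stimulus -> outcome, responses disabled."""
    if pavlovian == 'same':
        return {'S1': 'O1', 'S2': 'O2'}
    return {'S1': 'O2', 'S2': 'O1'}          # 'opposite'

def step3(instrumental):
    """Step 3, instrumental transfer: which response is rewarded per stimulus."""
    if instrumental == 'original':
        return {'S1': 'R1', 'S2': 'R2'}
    return {'S1': 'R2', 'S2': 'R1'}          # 'reversal'

# The four transfer conditions evaluated in Figure 9 and Section 4.
for pav in ('same', 'opposite'):
    for instr in ('original', 'reversal'):
        print((pav, instr), 'Step 2:', step2(pav), 'Step 3:', step3(instr))
```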
4 Results
The results section breaks down as follows.
1. Main comparative results: We compare our model’s
performance in simulation with the empirical find-
ings of Peterson & Trapold (1980) in relation to cor-
rect action response to each presented stimulus over
blocks of trials in the four conditions in Step 3
(Figure 9(1)–(4) in blue font).
2. ATP learning analysis: We evaluate the main com-
parative results with respect to S–E, E–R, S–R con-
nections and their inter-relation during learning.
3. Affective classifier lesioning analysis: We lesion the
Affective Classification Network of the Prospective
Actor to evaluate the role and significance of this
component of the network.
4.1 Main comparative results
The ATP-theoretics (Figure 9, blue font) predict the
results obtained by Peterson & Trapold (1980) (Figure
11) including an interesting non-monotonic effect in
condition ⟨same, original⟩, which shows initial correct performance in Step 3 followed by a decline in, and a
later recovery of, performance. We classify our simula-
tion studies as follows.
Simulation 1. Omission probability: ‘low’ = 0.0, ‘high’
= 0.8; magnitude = 1.0 for all reinforcing responses.
Simulation 2. Magnitude: ‘low’ = 0.2, ‘high’ = 1.0;
omission probability = 0.0 for all reinforcing
responses.
Figure 12 shows the performance of the network for
Simulation 1. Figure 13 shows performance of the net-
work for Simulation 2. Qualitatively similar trends are
observed with respect to Peterson & Trapold (1980)
(Figure 11). The non-monotonic behavioural learning
curve in condition ⟨same, original⟩ is also found in both the omission probability and magnitude simulations.
It is important to note that the original study
of Peterson & Trapold (1980) compared differential
outcomes of type whereas we have shown in our simu-
lations that a comparable profile of learning in a
transfer-of-control step can be found where differential
outcomes concern either magnitude, or omission prob-
ability. We contend that all differential stimuli out-
comes, including type, are associated with a valuation:
they yield a more or less desirable ‘correct’ result. We
also suggest that all differential stimuli outcomes
(including of type) have comparative differences in
valuations. A light presentation outcome might be val-
uated comparatively poorly with respect to a food pre-
sentation outcome (relative omission). The extent to
which such value-based outcomes are also associated
with the sensory properties of the outcomes we con-
sider in Section 5.
4.2 ATP analysis
Figures 14 and 15 provide detailed visualizations of the
learning of the ATP prospective route associations
underlying these results (S–E and E–R connections).
As theorized by Peterson & Trapold (1980) in the case
of ⟨same, original⟩ (see Figure 9, contingency (4)), in
the instrumental transfer step, the network makes the
correct, i.e. rewarding, responses but for the ‘wrong’
reasons. In Step 1 the network learned the
E1(E2)→R1(R2) contingency, while in Step 2 it main-
tained the S1(S2)!E1(E2) contingency. In Step 3 the
network, thus, obtains the outcome (O2) by cueing R1
following an E1 expectancy. Thus, two wrongs make a
right. Since the outcomes are reversed, however, the
Figure 8. Steps 1 and 2 of the experimental set-up (adapted
from Peterson & Trapold (1980)). Here S1, S2 = stimuli, R1, R2
= responses, O1, O2 = outcomes, E1, E2 = outcome
expectancies, f= no outcome. Red font indicates rewarding
contingencies and associations.
Lowe and Billing 9
expectations gradually switch so that S1(S2)→E2(E1) is learned, which now cues the non-rewarding response but is gradually re-learned so that E1(E2)→R2(R1). No such problem occurs for ⟨opposite, reversal⟩ since the correct associations have already been made in Steps 1 and 2. For the other two conditions, at the beginning of Step 3 the correct expectations S1(S2)→E2(E1) (⟨opposite, original⟩), or correct responses but incorrect expectations (⟨same, reversal⟩), lead to completely incorrect responses on
early trials and through re-learning monotonic
improvement thereafter.
Figure 9. Step 3 of the experimental set-up (adapted from Peterson & Trapold (1980)). In F, the associations required to be learned
are shown. In G, the crossed arrows indicate those new associations which are required. The blue font prospective paths provide
the four conditions to be evaluated.
4.3 Affective classifier lesioning analysis
We carried out another two simulations to demon-
strate: (i) that the network is robust to non-differential
outcomes conditioning; (ii) that the correct learning
profile (relevant to Peterson & Trapold (1980)) requires
the Affective Classification Network.

Figure 10. Experimental set-up of Peterson & Trapold (1980) versus our simulation set-up interpretation.

Figure 11. Peterson & Trapold (1980) results over four experimental conditions. Results obtained over blocks of learning trials: Steps 1 (left) and 3 (right) shown. The results (right) correspond to the four experimental contingencies theorized in Figure 9, blue font. Reprinted with permission.

Figure 12. Simulation 1, Step 3: differential omission probability outcomes. Mean correct responses (20 independent runs): omission probability = 0.8 for one reinforced S–R association, 1.0 for the alternative reinforced S–R association, 0 otherwise; magnitude = 1.0.

In the case of (i), the network still (eventually) learns to produce correct performance in the different conditions (asymptotically converging to approximately 100% correct
choices) albeit with different learning profiles to those
exhibited in the differential outcomes conditions. In the
case of (ii), we lesioned the ‘XOR’ node of the classifi-
cation network that had the role of preventing both
pessimistic and optimistic affective classifications of a
given stimulus from influencing response choice. We summarize the simulations as follows.
Simulation 3. Magnitude = 1.0, omission probability
= 0.0 for all reinforcing responses.
Simulation 4. Omission probability: ‘intermediate’ =
0.5, ‘high’ = 1.0; magnitude = 1.0 for all reinforcing
responses but where the ‘XOR’ node is lesioned.
Note, we originally tested the network on the same,
lower omission probability rate (0.5) as given for
Simulation 3, as a test of robustness, which showed
qualitatively similar results to those in Figures 12
and 13; see the supplementary material.
Figure 16, Simulation 3, shows the effects of a non-
differential outcomes scenario on learning and decision
making performance in Step 3 and produces different
profiles of learning for the different conditions but also
shows that the network eventually produces correct
performance in all conditions. One important differ-
ence here compared to the differential outcomes trained
network (Figures 12 and 13) is the performance of the
network in the ⟨opposite, reversal⟩ condition. The
performance of the non-differential outcomes trained
network in this condition is initially very poor in the
instrumental transfer step. This is because unlike for
the differential outcomes trained network, outcome
expectations cannot be used for forming S–E and E–R
associations in Steps 1 and 2 from which to construct a
functioning prospective route in Step 3 given the
absence of experience of the correct S–R contingencies.
Thus, the network has to rely on the retrospective route
and learns the S–R contingencies from scratch.
In Figure 17 we see the results of Simulation 4 where
the XOR node is lesioned. The generated profile of
learning in Step 3 is closely related to that found in
Simulation 3, i.e. a non-differential outcomes learning profile.

Figure 13. Simulation 2, Step 3: differential magnitude outcomes. Mean correct responses (20 independent runs): magnitude = 0.2 for one reinforced S–R association, 1.0 for the alternative reinforced S–R association, 0 otherwise; omission probability = 0.0.

Figure 14. ⟨opposite, reversal⟩ associative learning. A single run over trial blocks: omission probability = 0.8 or 0.0 for differential outcomes; magnitude = 1.0 for all reinforcing outcomes. S–Em (magnitude value) and S–Eo (omission value) plots: S1 valuations = black plots, S2 valuations = grey plots; Em–R and Eo–R plots: E–R1 = grey, E–R2 = black; correct percentage plot: mean correct = grey plots, mean response 1 = black dashed line, mean response 2 = red dashed line.

Figure 15. ⟨same, original⟩ associative learning. A single run over blocks: omission probability = 0.8 or 0.0 for differential outcomes; magnitude = 1.0 for all reinforcing outcomes. S–Em (magnitude value) and S–Eo (omission value) plots: S1 valuations = grey plots, S2 valuations = black plots; Em–R and Eo–R plots: E–R1 = grey, E–R2 = black; correct percentage plot: mean correct = grey plots, mean response 1 = black dashed line, mean response 2 = red dashed line.

Figure 16. Simulation 3, Step 3: non-differential outcomes. Means of 20 independent runs: magnitude = 1.0 for one reinforced S–R association, 1.0 for the alternative reinforced S–R association, 0 otherwise; omission probability = 0.0.

However, here we find observably slower learning: many more blocks of trials are required for performance to converge to ~100% correct response
performance. The profile of learning can be explained
by the fact that the outputs from Epes and Eopt to Cpes
and Copt, respectively, are no longer XOR-filtered when
the S associated with 0.5 reinforcement omission prob-
ability is presented. This means that the weights
between Cpes/Copt and both er1 and er2 are updated with successive S1 and S2 presentations. Now, any time either S1 (correct R → reward omission probability 0.0) or S2 (correct R → reward omission probability 0.5) is presented, both er1 and er2 will be activated (Eopt is only partially inhibited by Epes when omission probability =
0.5). The second XOR mechanism, mutual inhibition
of erj outputs from the Prospective Response Classifier Network, thereafter ensures that the multiple inputs to Crj produce values tending towards zero. This means
that, as for Simulation 3, learning is dependent upon
the Retrospective Actor, i.e. learning of the S–R con-
tingencies, as failure to differentially Affectively
Classify (Cpes=Copt) the stimuli means that no informa-
tion from the Prospective route can be used for learn-
ing purposes. Finally, the slow learning rate can be
explained by the interference to correct performance
that learning the associated weights (that would be pre-
vented by the Affective Classifier network XOR
mechanism) causes. Since the second XOR mechanism
entails erj!Cri. inhibition proportionate to the inputs
S1 and S2, during learning, slightly different degrees of
inhibition at the output nodes (Cri) of the Prospective
Actor will be generated. Finally, a number of robust-
ness tests of the network were carried out where differ-
ent values of reinforcer magnitude and omission
probability were used in order to see if the network
would, over trials, produce approximately 100% cor-
rect choices. Space precludes detailed analysis here: we
found the network to be robust and results are pro-
vided in the supplementary material.
4.4 Results interpretation
In summary, this section has highlighted the functional-
ity of the Prospective Actor with respect to E–R media-
tion of response choice. The results here have provided
two purposes: (i) to demonstrate the validity of the Aff-
ATP neural network model, which is demonstrated by
the results shown for Simulation 1 and Simulation 2
where the profiles of learning were observably repli-
cated; (ii) to illuminate the existence and role of an
Affective Classifier Network. In relation to (i), whilst
the original (Peterson & Trapold, 1980) experiment
concerned differential outcomes of grain, and a tone,
presented following ‘correct’ choices by the pigeon, in
our simulations we sought to demonstrate that a core
reward-based network could account for the profile of
learning theorized and observed by Peterson & Trapold
(1980). In relation to (ii), lesioning of the Affective
Classifier Network disrupted the learning profiles
observed in the intact model. The learning profiles,
instead, resembled those of the network when presented
with non-differential outcomes based training. This, in
effect, transformed the model into a standard Actor–
Critic neural network architecture.³
We found qualitatively different results regarding
the learning profiles of Simulation 1/Simulation 2 ver-
sus those of Simulation 3/Simulation 4 where the for-
mer simulations show DOT in an intact model, and the
latter simulations show either non-DOT or a lesioned
network. On the basis of this, it is reasonable to ques-
tion whether DOT and/or an intact network with
Affective Classifier Network is so adaptive: after all,
some S–R–O contingencies allowed for more efficient
performance in the intact model with DOT and other
contingencies induced inferior performance. We now
discuss situations in which for some contingencies the
ATP mechanism might be of adaptive advantage.
From Figure 9, it is the case of contingency (4),
Opposite/Reversal, that allows for adaptive switching
in the intact model with DOT and not for the cases of
either no-DOT or Affect Classifier Network lesioning
(both cases, we have argued, provide a sort of standard Actor–Critic network computation). The network,
in this contingency, by Step 3 has: (a) already learned
Opposite S–E connections in the Pavlovian step
(S1→E2, S2→E1) to those learned in Step 1 (the instru-
mental step); (b) not learned S–R connections reversed
with respect to the Step 1 instrumental step. However,
the network, through its S–E and E–R previously
learned connections substitutes for the lack of S–R
knowledge through this prospective route. The
Affective Classifier lesioned network, or the intact net-
work subject to non-DOT, cannot draw upon the addi-
tional information provided by the prospective route
that uses differential expectations to guide response
choice. Instead, it has to learn from scratch the S–R
connections; see again Figures 16 and 17.

Figure 17. Simulation 4, Step 3: XOR1 node of the Affective Classifier Network lesioned. Means of 20 independent runs: omission probability = 0.5 for one reinforced S–R association, 1.0 for the alternative reinforced S–R association, 0 otherwise; magnitude = 1.0.

In a more natural setting, we might liken the changed contingen-
cies to an animal observing outcomes in self or other in
relation to a particular stimulus/event context where
behavioural possibilities are constrained. An ability to
adaptively switch behaviour in line with the observably
changed S–O contingencies by generalizing one’s
responses to the outcomes they have previously given
rise to, could be of adaptive advantage in the absence
of other S–R knowledge (something like task rules).
The Affective Classifier Network, we suggest, allows
for this, and its utility is described at length in Lowe, Almér, Lindblad, et al. (2016).
On the other hand, in the case of contingency (1),
Same/Original, the S–R contingencies (task rules) that
lead to a rewarding outcome are the same in Step 3 as
they are in Step 1. However, the outcomes are not the
same. So, in the non-DOT (and lesioned) conditions,
the network without new learning produces the correct
responses but, as in the case of the Affective Classifier
lesioned network, without the information regarding
the type of outcome that exists. For DOT trained intact
networks (see Figures 12 and 13) the network also, ini-
tially, chooses the correct responses, but again for the
‘wrong’ reasons. However, having knowledge of the dif-
ferential outcomes, the network starts to relearn the S–
E contingencies (see Figure 15). Since switching the S–
E contingencies in this case leads to incorrect responses
via the E–R links, the network has to also relearn the
E–R contingencies, and this accounts for the transiently
poor performance of the network. Nevertheless, it
might be argued that it is critical for the network to
retain the knowledge of the differentiated outcomes
and bring to bear that knowledge on response choice. If
outcomes are differentiated by type, for example, as
they are in the Peterson & Trapold (1980) experimental
set-up, the animal/human might tend to act in a way
that yields a generic reinforcer rather than a preferred
reinforcer. If an organism needs to replenish its energy
stores, it is better that it behaves in a way that promotes
obtaining an energy-rich food source rather than so as
to obtain water, for example.
In relation to contingency (2), Opposite/Original,
the performance in Simulation 1/Simulation 2 versus
Simulation 3/Simulation 4 is orthogonal. As for contin-
gency (1), the S–R contingencies (task rules) that lead
to a rewarding outcome in Step 3 are the same as in
Step 1. Thus, Simulations 3 and 4 produce the correct
results without new learning in Step 3 for the same rea-
son as described for contingency (1). In Simulations 1
and 2, the changed Pavlovian (S–E) associations
S1→E2 and S2→E1 lead to a prospective cueing (E–R)
of now incorrect responses. Hence, response choices
are the opposite of what they should be at the begin-
ning of Step 3 and require re-learning. Whether such a
combination of changed contingencies is such a natu-
rally occurring phenomenon is debatable, however. A
change in observed S–O contingency might, for
example, imply an inadequately perceived/apprehended
relationship between stimulus context (involving many
stimuli) and outcome. Learning the correct context
could then allow for obtaining outcomes previously
acquired with a particular response. We posit that it
might be less naturally occurring for responses to be
required to be changed in order to achieve the previ-
ously reliably occurring outcome. Such dramatic shifts
in contingencies, requiring new S–E and E–R learning,
might more typically be resolved by cognitive mechan-
isms that apprehend a particular contextual basis for
the changed contingencies and thereby suppress any
generalizations they make regarding E–R associations.
Finally, in all simulations, contingency (3), Same/
Reversal, networks had to re-learn from scratch connec-
tions enabling successful response choice. The Affective
Classifier Network was unable to imbue adaptive switch-
ing in this case but did not detract from performance.
It is worth emphasizing that all networks, including
those subjected to a robustness analysis of different val-
ues (see the supplementary material), were able to even-
tually learn the correct S–R contingencies (‘task rules’)
given enough trials.
5 Discussion
In this article, we have introduced a neo-behaviourist
neural network model of ATP theory: the Aff-ATP
neural network model. The modelling approach imple-
ments a Markov decision process (MDP) using tem-
poral difference TD(0) learning within a framework
that can be likened to an Actor–Critic approach but
that has some similarities to SARSA.
We carried out a number of tests of our model using
a simulation of the experimental set-up of Peterson &
Trapold (1980). This set-up covers a number of associa-
tive contingencies related to differential outcomes.
Simulation of this set-up allowed us to produce a sys-
tematic evaluation of our model to compare with the
empirical findings. We summarize our findings as
follows.
1. The model replicates the data of Peterson & Trapold
(1980) (experiment 1) validating it as an ATP model
and demonstrating that reward/affect-based compu-
tation may ground differential outcomes trained
behaviour even when only one outcome is of a pri-
mary reward-based nature.
2. A mechanism for interfacing (prospective versus ret-
rospective) response control was described: (a) use of
an Affective Classifier Network; (b) use of a
Prospective Response Classifier Network. The effec-
tive XOR classified outputs of both (a) and (b) are
then used for the Prospective Actor to suppress
(overshadow), or substitute for, Retrospective Actor
output for response control mediation. This model
was then evaluated using a lesioning approach and
also a robustness analysis.
Whilst the empirical study used for model validation
here has concerned pigeons as subjects, DOT has been
implicated in accelerated learning in both non-human
animals (pigeons, rats) and in human infants (cf. Maki
et al., 1995; Estévez & Fuentes, 2001; Estévez et al., 2003) and adults (cf. Martínez et al., 2012). Teaching
procedures that associate correct responses with differ-
entiated outcomes may provide a mnemonic advantage.
Understanding how exactly retrospective and prospec-
tive informational aspects interact, e.g. where different/
new associations are required to be made, might draw
on analysis of the profiles of learning performance gen-
erated by different differential outcomes procedures,
e.g. that exploit more or less differentiation of outcomes
along a single dimension of reinforcement.
The differential-outcomes-based literature is replete
with references to the retrospective (stimulus–response)
and prospective (expectancy–response) nature of ani-
mal and human learning performance. However, it
might be questioned as to what actually is the ‘prospec-
tive’ nature of outcome expectancy cueing of responses.
It has been pointed out (Wasserman & Zentall, 2006)
that cueing does not require anticipation. If the expec-
tancy outcome comes to cue responding as the ATP
account posits, is there a need for a ‘prospective’
account, and what should such an account consist of?
One possibility, as modelled but not explicitly tested in
the Aff-ATP model described in this article is that
where differential outcomes learning is based on RL
anatomy in the brain, e.g. via the interplay between
amygdala and orbitofrontal cortex (cf. Rolls, 1999;
Balkenius & Morén, 2001), temporally discounted
valuations may be made. Such temporal difference
(TD) learning (Sutton & Barto, 1998) has been more
strongly linked to the basal ganglia (cf. Suri & Schultz,
1998; Schultz, 2007) though extensive work has demon-
strated similar profiles of RL activity in orbitofrontal
cortex and amygdala (cf. Schoenbaum, Saddoris &
Stalnaker, 2007): the prefrontal cortex has also been
implicated in learning both reward and reward omis-
sion probabilities (Watanabe, Hikosaka, Sakagami &
Shirakawa, 2007; Kennerley & Walton, 2011); the basolateral amygdala has been implicated in learning differ-
ential outcomes (cf. Savage & Ramos, 2009). TD learn-
ing algorithms such as SARSA utilize a value function
gradient to assign appropriate credit to non-rewarding
states that are antecedent (in space and time) to an
explicitly reinforcing (‘goal’) state. In implementing our
ATP model in a TD learning framework, we provide a
third dimension for differential (reinforcing) outcomes:
discounted delay valuations occurring across the time
interval between stimulus (or contextual cue) presenta-
tion and response leading to the outcome. Where two
outcomes can be contrasted, discounted delay
valuations thereby provide prospective information
regarding which of two outcomes is likely to manifest
first. Such information is implicitly encoded in stan-
dard RL algorithms: a scalar value provides informa-
tion about delay, reinforcer magnitude, and
presentation probability (cf. Doya, 2008). By formulat-
ing the RL algorithm (e.g. the Critic of an Actor–
Critic) in such a way as to exploit these individual
dimensions, more informed online decision making and
learning will result.
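To make this point concrete, the following minimal sketch (our illustration only, not the Aff-ATP implementation itself; the state naming, discount factor and learning rate are assumptions made for exposition) shows how tabular TD(0) learning assigns a higher discounted value to a cue whose outcome arrives after a shorter delay, so that a comparison of learned cue values carries implicit information about which outcome will manifest first.

```python
# Minimal tabular TD(0) sketch (illustrative only; not the Aff-ATP model itself).
# Two cues lead to the same reward magnitude but after different delays, so the
# learned (discounted) cue values end up ordering the outcomes by delay.
GAMMA = 0.9   # discount factor (assumed value)
ALPHA = 0.1   # learning rate (assumed value)

V = {}  # state-value table

def build_chain(cue, delay):
    """States: cue -> 'delay' intervening steps -> terminal reward state."""
    return [cue] + [f"{cue}_wait{i}" for i in range(delay)] + [f"{cue}_reward"]

def td0_episode(chain, reward=1.0):
    for s, s_next in zip(chain[:-1], chain[1:]):
        terminal = (s_next == chain[-1])
        r = reward if terminal else 0.0
        v_next = 0.0 if terminal else V.get(s_next, 0.0)
        delta = r + GAMMA * v_next - V.get(s, 0.0)   # TD(0) prediction error
        V[s] = V.get(s, 0.0) + ALPHA * delta

short_delay = build_chain("cue_A", delay=1)   # outcome after one intervening step
long_delay = build_chain("cue_B", delay=4)    # outcome after four intervening steps

for _ in range(1000):
    td0_episode(short_delay)
    td0_episode(long_delay)

# V["cue_A"] > V["cue_B"]: equal-magnitude outcomes, but the discounted
# valuations implicitly signal which outcome will manifest first.
print(V["cue_A"], V["cue_B"])
```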
In machine learning terms, posing the RL approach
in terms of an MDP utilizing TD information has
applications for robotics. Robots navigating through a knowable state space optimally choose actions (responses) based on present state information. In environments where multiple reinforcing states exist, robots that base online decisions on state space representations of multiple dimensions of the reinforcing outcomes, e.g. magnitude, omission/presentation probability and delay, will be at an adaptive advantage. This is made clearer with
an example: a scalar state valuation of 0.5 might signal:
magnitude = 1.0, presentation probability = 0.5; it
might also signal magnitude = 0.5, presentation prob-
ability = 1.0. A Critic that uses two state valuations
discriminates this information. This allows such agents
to: (i) search for stimuli associated with higher value
for the given task/goal; (ii) in the absence of finding a
preferred stimulus, produce responses that maximize
reward, i.e. through overcoming obstacles that may be
responsible for the lower reward yield.
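The following sketch illustrates this disambiguation (again only as an illustration under assumed parameters, with simple running-average estimators standing in for the model's Critic equations): a scalar Critic assigns roughly the same value, 0.5, to a large-but-unreliable and a small-but-certain outcome, whereas a Critic that tracks magnitude and presentation probability separately distinguishes the two.

```python
# Illustrative contrast between a scalar Critic and a two-dimensional Critic
# (simple running averages are assumed in place of the model's Critic equations).
import random

class ScalarCritic:
    def __init__(self, alpha=0.1):
        self.value, self.alpha = 0.0, alpha
    def update(self, reward):
        # Converges towards expected reward = magnitude x presentation probability.
        self.value += self.alpha * (reward - self.value)

class TwoDimensionalCritic:
    def __init__(self, alpha=0.1):
        self.magnitude, self.presentation_prob, self.alpha = 0.0, 0.0, alpha
    def update(self, reward):
        delivered = reward > 0.0
        self.presentation_prob += self.alpha * (float(delivered) - self.presentation_prob)
        if delivered:  # magnitude is only estimated on trials on which reward occurs
            self.magnitude += self.alpha * (reward - self.magnitude)

def run(magnitude, probability, trials=2000, seed=1):
    random.seed(seed)
    scalar, two_dim = ScalarCritic(), TwoDimensionalCritic()
    for _ in range(trials):
        r = magnitude if random.random() < probability else 0.0
        scalar.update(r)
        two_dim.update(r)
    return (round(scalar.value, 2),
            (round(two_dim.magnitude, 2), round(two_dim.presentation_prob, 2)))

# Both stimuli are valued near 0.5 by the scalar Critic, but only the
# two-dimensional Critic reveals large-but-unreliable versus small-but-certain.
print(run(magnitude=1.0, probability=0.5))
print(run(magnitude=0.5, probability=1.0))
```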
Regarding the Affective Classification approach uti-
lized in our Aff-ATP model, it should again be stressed
that Urcuioli (2005, 2013) has noted the stimulus classi-
fication by outcomes that occurs as a result of DOT. In
relation specifically to affective (rewarding, or punish-
ing) outcome contingencies, the somatic marker theory
of Damasio (1999) alludes to a possible affective-classi-
fication-of-stimuli perspective. Somatic markers are
affective states that can simplify action selection under
conditions of uncertainty (e.g. when task rules are not
well understood). From the stimulus-classification-by-expected-outcomes perspective, the affective states (reward and omission) generated where omission probability is greater than zero classify particular stimuli, and these differential affective states are then able to differentially cue choice responses. This particular property
of these differential affective states, similar to somatic
markers, is of most relevance when multiple response
options can be associated with each affective state (cf.
Urcuioli, 2013).
A potential objection to the generality of our neo-
behaviourist modelling approach concerns scaling the
model to more complex scenarios. Scaling the network
to account for more than two response options, for
example, simply entails adding more nodes in the er
and Cr‘layers’ of the Prospective Actor, as well as nodes
for the sr layer of the Retrospective Actor. This has
implications for our XOR labelling of the Prospective
Response Classifier Network. A winner-take-all
mechanism should apply here, whereas more-than-two-input problems are typically considered XOR-classified according to the odd-parity checker interpretation: three activated er units would produce a Cr (and action) output, whereas we only ever want a single activated er unit to activate the Cr layer. An alternative interpretation of more-than-two-input XOR problems is the one-hot checker, which is consistent with the winner-take-all perspective.
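The following minimal sketch illustrates the intended one-hot (winner-take-all) gating (the thresholding scheme is an assumption made for exposition, not the published network equations; the layer names er and Cr follow the terminology used above): a response signal is passed to the Cr layer only when exactly one er unit is active, which reduces to XOR for two inputs and, unlike the odd-parity reading, never fires for three simultaneously active units.

```python
# Illustrative one-hot gate (assumed thresholding; not the published network equations).
# For two inputs this reduces to XOR; for more inputs it enforces the winner-take-all
# requirement that exactly one er unit may drive the Cr layer.
from typing import List, Sequence

def one_hot_gate(er_activations: Sequence[float], threshold: float = 0.5) -> List[float]:
    """Pass the er pattern to the Cr layer only if exactly one unit exceeds threshold."""
    active = [a > threshold for a in er_activations]
    if sum(active) == 1:
        return [float(flag) for flag in active]   # exactly one unit: cue that response
    return [0.0] * len(er_activations)            # none, or more than one: no prospective cue

print(one_hot_gate([0.9, 0.1]))        # [1.0, 0.0]  XOR-like: one active input
print(one_hot_gate([0.9, 0.8]))        # [0.0, 0.0]  both active: no output
print(one_hot_gate([0.9, 0.2, 0.1]))   # [1.0, 0.0, 0.0]  scales beyond two responses
print(one_hot_gate([0.9, 0.8, 0.7]))   # [0.0, 0.0, 0.0]  an odd-parity reading would wrongly fire
```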
The Affective Classifier Network, on the other hand,
should only have two nodes. This is consistent with the
Rolls (1999) and Mowrer (1947) perspectives on emo-
tions as being elicited in relation to expectation/realiza-
tion of reward acquisition (‘optimistic expectation’) and
reward frustration/omission (‘pessimistic expectation’).
By contrast, punishment (omission, acquisition)
provides another critical reinforcer-based dimension of
emotion elicitation (Rolls, 1999). However, it has been
suggested that punishment valuation is implemented
separately in the brain and its computation could thus
use a separate Critic (Daw, Kakade & Dayan, 2002;
Boureau & Dayan, 2010; Lowe & Ziemke, 2013). The
dependency of the two-node output of the Affective Classifier Network on an explicit XOR mechanism (X node) was made apparent by the lesioning simulation (Simulation 4), which showed that, without this mechanism, the profile of differential outcomes results in the transfer-of-control step could not be replicated.
The differential outcomes used in the Peterson &
Trapold (1980) experiment concerned food (primary
reward) and a light source, i.e. differential by type. Our
model, on the other hand, varied outcomes by either
omission probability or magnitude of primary reward. It
may be the case that outcomes are/can be differentiated
according to the degree of difference in their sensory
properties. However, we contend that at the root of all
outcomes learning is an affective valuation of the differ-
ent outcome stimuli given by the ‘External Reinforcer’
signal. We have explicitly captured relative magnitude
differences with the model, and omission probability dif-
ferences. Value-based circuitry associated with affective states, such as the amygdala, orbitofrontal cortex and lateral prefrontal cortex, has been implicated both in differential outcomes learning performance (Watanabe et al., 2007; Savage & Ramos, 2009) and in relation to emotion
appraisal and elicitation (Rolls, 1999, 2013; Schoenbaum
et al., 2007). On the other hand, the ‘External Reinforcer’
input to our Critic may, more generally, be viewed as a
target (outcome) signal that need not be of primary
rewarding value and that varies according to magnitude
or salience indicating the ‘surprisingness’ of the outcome
(Redgrave, Prescott & Gurney, 1999). The sensory prop-
erties of such differential outcomes may also be repre-
sented/anticipated in non-value-based circuitry. Such representations could amplify (or not) the external target
signal input to the value network (Critic), but be attended
to only in relation to their value–relevance (Grossberg &
Seidman, 2006).
Regarding the learning mechanisms of the model, we used a magnitude function with a low learning rate for negative prediction errors (unlearning the magnitude reward dimension) relative to that for positive prediction errors (learning the magnitude reward dimension). In fact, for the omission probability differential outcomes simulations, magnitude unlearning was not permitted at all. This zero-unlearning-of-magnitude
mechanism was also used by Balkenius & Morén (2001)
and Balkenius et al. (2009) upon which our Critic com-
ponent was based. The rationale behind this design choice concerns the difficulty the amygdala ('magnitude Critic') has in permitting reversal learning in the absence, e.g. through lesioning, of the orbitofrontal cortex ('omission Critic'). Learning of the context in which reward
omission takes place is undertaken by the orbitofrontal
cortex, which is then said to inhibit the influence on
behaviour of the amygdala. The knowledge of the
reward value of the stimulus, per se, is not extinguished,
only the context-dependent behavioural response to it
(Rolls, 1992). Balkenius & Morén (2001) note: 'The rationale behind this is that once learned, a reaction [...]
is so expensive to retest that it pays to assume this nega-
tive association is valid everywhere unless definite proof
otherwise has been established’ (p. 87). The model of
Balkenius & Morén (2001), in not permitting unlearning
of the reward magnitude valuation of a given stimulus,
was able to account for the ‘savings effect’, i.e. faster
reacquisition of a rewarded behavioural response than
initial acquisition, owing to the maintenance in memory
of the context-independent reward magnitude value
(amygdala) and the fast learning/unlearning rate of the
omission (orbitofrontal cortex) component. Our model
eases the constraint on unlearning magnitude value by
allowing for slow unlearning, and is thereby still able to
account for the savings effect (Lowe, Sandamirskaya &
Billing, 2014) as well as the data simulated and pre-
sented in the present article.
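A minimal sketch of this asymmetric update rule (illustrative parameter values only; not the parameterization used in the reported simulations) indicates how slow, or zero, unlearning of the magnitude component combined with fast learning of the omission component preserves the acquired magnitude value through extinction, which is what yields the savings effect on reacquisition.

```python
# Illustrative asymmetric update of the magnitude ('amygdala') component, alongside a
# fast omission ('orbitofrontal') component (assumed learning rates; illustration only).
ALPHA_ACQUIRE = 0.2     # learning rate for positive prediction errors (magnitude)
ALPHA_UNLEARN = 0.005   # much lower rate for negative prediction errors (0 disables unlearning)
ALPHA_OMISSION = 0.2    # fast learning/unlearning for the omission expectation

def update_magnitude(M, delta):
    """Fast acquisition, slow (or no) extinction of the stored reward magnitude."""
    rate = ALPHA_ACQUIRE if delta > 0 else ALPHA_UNLEARN
    return M + rate * delta

def update_omission(O, omission_error):
    """Omission expectation tracks the current context quickly in both directions."""
    return O + ALPHA_OMISSION * omission_error

M, O = 0.0, 0.0
for _ in range(50):                    # acquisition: reward delivered on every trial
    M = update_magnitude(M, 1.0 - M)
for _ in range(50):                    # extinction: reward omitted on every trial
    M = update_magnitude(M, 0.0 - M)
    O = update_omission(O, 1.0 - O)
print(round(M, 2), round(O, 2))        # M is largely retained; O (omission expectation) is high

# Reacquisition is fast (the 'savings effect'): M was never fully unlearned, so behaviour
# recovers as soon as the quickly relearned omission expectation is reduced again.
```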
Another aspect of learning concerns the use of direct reinforcement signals with decaying values, rather than prediction errors, to update the S–R and E–R connections (without negative prediction error signals, unlearning would not otherwise be possible). The traditional method
for updating Actor-based connections is through pre-
diction error, thus linking the workings of the Critic to
the learning in the Actor. Functionally, it may also be
desirable to utilize prediction errors of the Critic. Suri
& Schultz (1998), for example, found that providing a
reinforcement signal, rather than a prediction error sig-
nal, to an Actor–Critic network lessened the ability of
the network to learn sequences of stimulus–action
pairs. We have followed a prediction error updating-of-
Actor approach in recent work (Lowe, Almér, Billing,
Sandamirskaya & Balkenius, 2016). However, whilst
our latest model is simpler in terms of nodes, concep-
tually, we feel the present network highlights the func-
tionality required of an ATP network for DOT to be
exploited. That is, it should be able to: (i) classify sti-
muli by affective outcome, (ii) select one action only
based on the outputs of the affective classification.
Effectively, there should be XOR-based outputs from
the Prospective Actor so that: (i) a stimulus is only clas-
sified by affective outcome if it imbues a pessimistic or
an optimistic expectation of the outcome but not both
(or neither); (ii) an affective classification should only
mediate prospective response control (in place of retro-
spective mediation) if one and not more (or fewer)
responses are influenced. In our more recent model
(Lowe, Almér, Billing, et al., 2016) (see also Lowe et
al., 2014), we exploit the competition of continuous
time neural dynamics to implicitly implement this
XOR-gated effect. Neural field modelling (Lowe et al.,
2014) allows for scaling of the model for multiple
actions over a continuous stimulus–action space, which
can be likened to a function approximation approach
to continuous state–action space modelling in RL (cf.
Frémaux, Sprekeler & Gerstner, 2013). Harnessing neural dynamics allows for more plausible mechanisms,
particularly with respect to ‘maintaining in memory sti-
muli information',4 i.e. using neural-stability dynamics.
This overall modelling approach has the potential to
provide: (i) new and empirically testable hypotheses
concerning the learning profiles of non-DOT procedures, and also
with respect to the varied values of stimuli (see the sup-
plementary material); (ii) a robust sensorimotor inter-
face for online real-time systems (e.g. robots) carrying
out navigation or sequential tasks. In future work, we
hope to develop both aspects to facilitate the model’s
integration into online real-time learning systems.
Funding
The author(s) disclosed receipt of the following financial sup-
port for the research, authorship, and/or publication of this
article: This work was funded through the 7th framework of
the EU (grant number 270247; NeuralDynamics project).
Notes
1. Our Retrospective Actor might be compared with an
actor-only method, however, since it is not updated by
the Critic (prediction error) but rather directly by the
reinforcer.
2. This is only approximately true as referred to previously.
3. Albeit, the learned connection between stimuli and
responses were not updated by prediction errors but by
the reinforcer signals; see Section 5.
4. In the neural network model presented in this article, this
‘retrospective’ information is artificially ‘held’ in memory
until the response is selected at the appropriate time.
References
Atmaca, S., Sebanz, N., & Knoblich, G. (2011). The joint
Flanker effect: Sharing tasks with real and imagined co-
actors. Experimental Brain Research,211, 371–385.
Balkenius, C., & Morén, J. (2001). Emotional learning: a com-
putational model of the amygdala. Cybernetics and Sys-
tems: An International Journal,32, 611–636.
Balkenius, C., Morén, J., & Winberg, S. (2009). Interactions
between motivation, emotion and attention: from biology
to robotics. In Proceedings of the Ninth International Con-
ference on Epigenetic Robotics: Modeling Cognitive Devel-
opment in Robotic Systems (p. 146). Lund: Lund
University, Cognitive Studies.
Billing, E., & Balkenius, C. (2014). Modeling the Interplay
between conditioning and attention in a humanoid robot:
habituation and attentional blocking. In The Fourth Joint
IEEE Conference on Development and Learning and on Epi-
genetic Robotics, submitted.
Boureau, Y.-L., & Dayan, P. (2010). Opponency revisited:
competition and cooperation between dopamine and sero-
tonin. Neuropsychopharmacology Review,1, 1–24.
Damasio, A. R. (1999). The Feeling of What Happens: Body
and Emotion in the Making of Consciousness. Boston, MA:
Houghton Mifflin Harcourt.
Daw, N. D., Kakade, S., & Dayan, P. (2002). Opponent inter-
actions between serotonin and dopamine. Neural Net-
works,15, 603–616.
de Wit, S., & Dickinson, A. (2009). Associative theories of
goal-directed behaviour: A case for animal–human trans-
lational models. Psychological Research PRPF,73(4),
463–476.
Doya, K. (2008). Modulators of decision making. Nature
Neuroscience,11(4), 410–416.
Esteban, L., Plaza, V., López-Crespo, G., Vivas, A. B., & Estévez, A. F. (2014). Differential outcomes training
improves face recognition memory in children and in
adults with Down syndrome. Research in Developmental
Disabilities,35(6), 1384–1392.
Estévez, A. F. (2005). The differential outcomes effect: A use-
ful tool to improve discriminative learning in humans. The
Behavior Analyst Today,6(4), 216.
Estévez, A. F., & Fuentes, L. J. (2001). The differential out-
come effect as a useful tool to improve conditional discrim-
ination learning in children. Learning and Motivation,32,
48–64.
Estévez, A. F., Overmier, B., & Fuentes, L. J. (2003). Differ-
ential outcomes effect in children: Demonstration and
mechanisms. Learning and Motivation,34, 148–167.
Frémaux, N., Sprekeler, H., & Gerstner, W. (2013). Reinfor-
cement learning using a continuous time actor–critic
framework with spiking neurons. PLoS Computational
Biology,9(4), 1–21.
Friedrich, A. M., & Zentall, T. R. (2011). A differential-
outcome effect in pigeons using spatial hedonically nondif-
ferential outcomes. Learning and Behavior,39, 68–78.
Grossberg, S., & Seidman, D. (2006). Neural dynamics of
autistic behaviors: Cognitive, emotional, and timing sub-
strates. Psychological Review,113, 483–525.
Kennerley, S. W., & Walton, M. E. (2011). Decision making
and reward in frontal cortex: complementary evidence
from neurophysiological and neuropsychological studies.
Behavioral Neuroscience,125(3), 297–317.
Kruse, J. M., & Overmier, J. B. (1982). Anticipation of reward
omission as a cue for choice behavior. Learning and Moti-
vation,13, 505–525.
Lowe, R., & Ziemke, T. (2013). Exploring the relationship of
reward and punishment in reinforcement learning. In 2013
IEEE Symposium on Adaptive Dynamic Programming and
Reinforcement Learning (ADPRL) (pp. 140–147).
Lowe, R., Almér, A., Billing, E., Sandamirskaya, Y., & Balk-
enius, C. (2016b). Affective-Associative Two-Process the-
ory: A neurocomputational account of partial
reinforcement extinction effects (submitted).
Lowe, R., Almér, A., Lindblad, G., Gander, P., Michael, J.,
& Vesper, C. (2016). Minimalist social-affective value for
use in joint action: A neural-computational hypothesis.
Frontiers in Computational Neuroscience,10. Available at:
https://doi.org/10.3389/fncom.2016.00088
Lowe, R., Sandamirskaya, Y., & Billing, E. (2014). A neural
dynamic model of associative two-process theory: The dif-
ferential outcomes effect and infant development. In 4th
International Conference on Development and Learning and
on Epigenetic Robotics (pp. 440–447).
Maki, P., Overmier, J. B., Delos, S., & Gutmann, A. J. (1995).
Expectancies as factors influencing conditional discrimina-
tion performance of children. The Psychological Record,
45, 45–71.
Martínez, L., Marí-Beffa, P., Roldán-Tapia, D., Ramos-Lizana, J., Fuentes, L. J., & Estévez, A. F. (2012). Train-
ing with differential outcomes enhances discriminative
learning and visuospatial recognition memory in children
born prematurely. Research in Developmental Disabilities,
33, 76–84.
Miceli, M., & Castelfranchi, C. (2014). Expectancy and emo-
tion. Oxford: Oxford University Press.
Morén, J. (2002). Emotion and Learning: A Computational
Model of the Amygdala, PhD thesis, Lund University.
Morén, J., & Balkenius, C. (2000). A computational model of
emotional learning in the amygdala. In J.-A. Mayer,
A. Berthoz, D. Floreano, H. L. Roitblat, & S. W. Wilson
(eds.), From Animals to Animats 6 (pp. 383–391). Cam-
bridge, MA: MIT Press.
Mowrer, O. H. (1947). On the dual nature of learning: A rein-
terpretation of ‘‘conditioning’’ and ‘‘problem-solving’’.
Harvard Educational Review,17, 102–148.
Overmier, J. B., & Lawry, J.A. (1979). Pavlovian conditioning
and the mediation of behavior. The Psychology of Learning
and Motivation,13, 1–55.
Peterson, G. B., & Trapold, M. A. (1980). Effects of altering
outcome expectancies on pigeons’ delayed conditional dis-
crimination performance. Learning and Motivation,11,
267–288.
Redgrave, P., Prescott, T. J., & Gurney, K. (1999). The basal
ganglia: a vertebrate solution to the selection problem?
Neuroscience,89, 1009–1023.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlo-
vian conditioning: Variations in the effectiveness of rein-
forcement and non- reinforcement. In A. H. Black, & W.
F. Prokasy (eds), Classical Conditioning II: Current
Research and Theory. New York: Appleton-Century-
Crofts.
Rolls, E. (1992). Neurophysiology and functions of the pri-
mate amygdala. In J. Aggleton (ed.), The Amygdala: Neu-
robiological Aspects of Emotion, Memory and Mental
Dysfunction (pp. 143–165). New York: Wiley-Liss.
Rolls, E. T. (1999). The Brain and Emotion. Oxford: Oxford
University Press.
Rolls, E. T. (2013). What are emotional states, and why do we
have them? Emotion Review,5(3), 241–247.
Savage, L. M., & Ramos, R. L. (2009). Reward expectation
alters learning and memory: The impact of amygdala on
appetitive-driven behaviors. Behavioural Brain Research,
198, 1–12.
Schoenbaum, G., Saddoris, M. P., & Stalnaker, T.A. (2007).
Reconciling the roles of orbitofrontal cortex in reversal
learning and the encoding of outcome expectancies. Annals
of the New York Academy of Science,1121, 320–335.
Schultz, W. (2007). Multiple dopamine functions at different
time courses. Annual Review of Neuroscience,30, 259–288.
Sebanz, N., Knoblich, G., & Prinz, W. (2005). How two share
a task: Corepresenting stimulus–response mappings. Jour-
nal of Experimental Psychology: Human Perception and
Performance,31, 1234–1246.
Staddon, J. (2014). The New Behaviorism. Hove: Psychology
Press.
Suri, R. E. (2002). TD models of reward predictive responses
in dopamine neurons. Neural Networks,15, 523–533.
Suri, R., & Schultz, W. (1998). Learning of sequential move-
ments by neural network model with dopamine-like rein-
forcement signal. Experimental Brain Research,121(3),
350–354.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning:
An Introduction. Cambridge, MA: The MIT Press.
Trapold, M. A. (1970). Are expectancies based upon different
positive reinforcing events discriminably different? Learn-
ing and Motivation,1, 129–140.
Trapold, M. A., & Overmier, J. B. (1972). The second learning
process in instrumental learning. In Classical Conditioning
II: Current Research and Theory (pp. 427–452). New York:
Appleton-Century-Crofts.
Urcuioli, P. (1990). Some relationships between outcome
expectancies and sample stimuli in pigeons’ delayed match-
ing. Animal Learning and Behavior,18(3), 302–314.
Urcuioli, P. (2005). Behavioral and associative effects of dif-
ferential outcomes in discriminating learning. Learning and
Behavior, 33(1), 1–21.
Urcuioli, P. (2013). Stimulus control and stimulus class for-
mation. In Madden, G. J., Dube, W. V., Hackenberg, T.
D., Hanley, G. P., & Lattal, K. A. (eds), APA Handbook
of Behavior Analysis (Vol. 1, pp. 361–386). Washington,
DC: American Psychological Association.
Wasserman, E. A., & Zentall, T. R. (eds) (2006). Comparative
Cognition: Experimental Explorations of Animal Intelli-
gence. Oxford: Oxford University Press.
Watanabe, M., Hikosaka, K., Sakagami, M., & Shirakawa,
S. (2007). Reward expectancy-related prefrontal neuronal
activities: Are they neural substrates of ‘‘affective’’ work-
ing memory? Cortex,43, 53–64.
Author biographies
Robert Lowe is a Cognitive scientist (docent) whose research focus is on Affective and Emotion sci-
ence and computational modelling. He received his MSc in Computer Science and PhD at the
University of Hertfordshire and his BSc degree in Psychology from the University of Reading. Since
then he has been a Senior Lecturer in Cognitive Science at the School of Informatics, University of
Skövde, Sweden, and now works at the Department of Applied IT, University of Gothenburg. He is
on the editorial boards of Frontiers in Psychology, International Journal of Advanced Robotic
Systems, Adaptive Behavior and Constructivist Foundations.
Erik Billing is a Cognitive scientist whose research interests include Robotics and Human-Robot
Interaction. He completed his PhD at Umeå University, Sweden, and has since been an Associate Senior Lecturer in Informatics at the School of Informatics, University of Skövde. He is the chair for
the Swedish national Cognitive Science Society, SweCog.