Optimal decision network with distributed representation.
ARTICLE IN PRESS
Neural Networks ()–
www.elsevier.com/locate/neunet
Rafal Bogacz∗
Department of Computer Science, University of Bristol, Bristol BS8 1UB, UK
Received 22 October 2005; received in revised form 31 January 2007; accepted 31 January 2007
Abstract
On the basis of detailed analysis of reaction times and neurophysiological data from tasks involving choice, it has been proposed that the brain
implements an optimal statistical test during simple perceptual decisions. It has been shown recently how this optimal test can be implemented in
biologically plausible models of decision networks, but this analysis was restricted to very simplified localist models which include abstract units
describing activity of whole cell assemblies rather than individual neurons. This paper derives the optimal parameters in a model of a decision
network including individual neurons, in which the alternatives are represented by distributed patterns of neuronal activity. It is also shown how
the optimal weights in the decision network can be learnt via iterative rules using information accessible for individual synapses. Simulations
demonstrate that the network with the optimal synaptic weights achieves better performance and matches fundamental behavioural regularities
observed in choice tasks (Hick’s law and the relationship between the error rate and the time for decision) better than a network with synaptic
weights set according to a standard Hebb rule.
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Decision making; Distributed representation; SPRT; Perceptual choice; Cell assemblies
1. Introduction
Experimental studies have shed light on the neural bases
of simple perceptual decision-making and indicated that they
involve at least three basic processes. First, sensory cortical
areas provide noisy evidence in support of alternative choices
(Britten, Shadlen, Newsome, & Movshon, 1993; Ditterich,
Mazurek, & Shadlen, 2003; Hanks, Ditterich, & Shadlen,
2006). The noisy evidence supporting a particular alternative is
represented in the firing rate of the sensory neurons selective
for this alternative. Hence the goal of the decision process
may be formulated as choosing the alternative for which
the corresponding neuronal population has the highest mean
firing rate (Gold & Shadlen, 2001, 2002). Second, it has been
observed that in certain cortical regions (e.g. lateral intraparietal
area (LIP) and frontal eye field (FEF)) neuronal firing rates
gradually increase during the decision process, and it has been
proposed that these areas integrate sensory evidence over time
(Schall, 2001; Shadlen & Newsome, 2001). This integration
averages out the noise present in the sensory evidence. Third,
in the free-response paradigm, in which the animal can respond at
any time, it has been observed that when the firing rate of these
integrator neurons exceeds a certain threshold, the decision is
made and the action execution is initiated (Roitman & Shadlen,
2002).
∗Tel.: +44 117 954 5141; fax: +44 117 954 5208.
E-mail address: R.Bogacz@bristol.ac.uk.
The integration process during decision-making tasks takes a
certain amount of time which is referred to as the decision time
(DT). In tasks under the free-response paradigm, the reaction
time consists of the DT and an additional period connected
with visual and motor processes. DTs have been estimated from
behavioural data (Ratcliff, Van Zandt, & McKoon, 1999; Usher
& McClelland, 2001) and directly from neurophysiological data
(Reddi, 2001; Sato, Murthy, Thompson, & Schall, 2001). In
difficult tasks, the DT often constitutes the majority of the
reaction time (e.g. Ratcliff et al. (1999)).
Due to the evolutionary pressure for the speed and the
accuracy of choices, it may be plausible that the neural decision
circuits operate in an optimal or nearly optimal way, i.e.
minimizing the DT for a given accuracy. Indeed, on the basis of careful studies
of human DTs, psychologists (Laming, 1968; Ratcliff, 1978;
Ratcliff et al., 1999; Stone, 1960) have proposed that during
simple perceptual choice between two alternatives the brain
effectively performs a sequential probability ratio test (SPRT)
— an optimal algorithm allowing the fastest decisions for
any required accuracy (Barnard, 1946; Wald, 1947; Wald
0893-6080/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.neunet.2007.01.003
Please cite this article in press as: Bogacz, R. Optimal decision network with distributed representation. Neural Networks (2007), doi:10.1016/j.neunet.2007.01.003
Nomenclature
Throughout the paper the following notational convention
is used: all variables in localist models are denoted by
capital letters, while all variables in distributed models are
denoted by small letters.
$A$          number of alternative decisions
$a$          sparseness of coding in the distributed decision network
$C$          magnitude of noise in localist models
$c$          magnitude of noise in distributed stimuli
$dt$         integration constant
$h^v_i$      input to integrator neuron $i$ via recurrent weights $v_{i,j}$
$h^w_i$      input to integrator neuron $i$ via feedforward weights $w_{i,j}$
$K$          decay or leak in the localist decision network
$k$          decay or leak of integrator neurons
$K_{UM}$     decay or leak in the Usher and McClelland model
$l_{i,j}$    matrix of linear coefficients in the distributed decision network
$M_I$        mean input to unit $I$
$n$          number of neurons in each layer of the distributed decision network
$V$          weight of self-excitatory connections in the localist decision network
$v_{i,j}$    weights of connections between integrator neuron $j$ and integrator neuron $i$
$w_{i,j}$    weights of connections between input neuron $j$ and integrator neuron $i$
$W_{INH}$    weight of inhibitory connections in the localist decision network
$w_{inh,i}$  weights of connections between integrator and inhibitory neurons
$W_{UM}$     weight of inhibitory connections in the Usher and McClelland model
$X_I$        input to localist unit $I$
$x_j$        activity of input neuron $j$
$x_{j,I}$    membership of input neuron $j$ in assembly $I$
$Y_I$        activity level of localist unit $I$
$y_i$        activity of integrator neuron $i$
$y_{i,I}$    membership of integrator neuron $i$ in assembly $I$
$\alpha$     learning rate
$\eta_I$     independent Wiener processes
& Wolfowitz, 1948). The theory postulating that the brain
performs SPRT has also been shown to be consistent with
neurophysiological data (Gold & Shadlen, 2001, 2002; Shadlen
& Newsome, 2001; Smith & Ratcliff, 2004). This theory claims
that the decision is made as soon as the difference between
integrated evidence in support of the first and second alternative
exceeds a positive or a negative threshold. It has been recently
shown how SPRT may be implemented in biologically plausible
neural network models in which the difference between the
integrated evidence is computed via feedforward (Mazurek,
Roitman, Ditterich, & Shadlen, 2003; Shadlen & Newsome,
2001) or feedback inhibitory connections (Bogacz, Brown,
Moehlis, Holmes, & Cohen, 2006; Brown et al., 2005).
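The threshold rule described above can be illustrated with a minimal sketch of Wald's SPRT for two Gaussian hypotheses with equal variance. The function name `sprt_gaussian`, its parameters and the Gaussian observation model are assumptions made for illustration; they are not taken from the network models cited above.

```python
def sprt_gaussian(samples, mean_1, mean_2, sigma, threshold):
    """Sequential probability ratio test for two equal-variance Gaussians.

    Returns (choice, n_used): choice is 0 if the first hypothesis wins,
    1 if the second wins, and None if no threshold was crossed.
    """
    log_ratio = 0.0
    n_used = 0
    for n_used, x in enumerate(samples, start=1):
        # Log-likelihood ratio of one sample under hypothesis 1 versus 2.
        log_ratio += (mean_1 - mean_2) / sigma**2 * (x - 0.5 * (mean_1 + mean_2))
        if log_ratio >= threshold:
            return 0, n_used
        if log_ratio <= -threshold:
            return 1, n_used
    return None, n_used
```

The accumulated log-likelihood ratio plays the role of the difference between the integrated evidence for the two alternatives, and the symmetric thresholds correspond to the positive and negative decision bounds.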
However, the above analyses were restricted to very
simplified localist models which include abstract units or
variables describing activity of whole cell assemblies rather
than individual neurons. Cell assemblies (Hebb, 1949) are
the distributed patterns of neuronal activity which represent
visual stimuli in certain cortical areas (e.g. Gochin, Colombo,
Dorfman, Gerstein, and Gross (1994)). The computational
models including separate units or variables describing
activity of individual neurons are referred to as distributed
models. Wang (2002) developed a biologically realistic
distributed model of decision-making in area LIP (involved in
controlling eye movements) during a task in which a monkey
has to discriminate whether dots presented on the screen move
left or right and to indicate its decision by making a saccade
in the direction of movement. Recently, Wong and Wang
(2006) have shown how SPRT may be implemented in the
Wang (2002) model. In this model, each alternative decision is
represented by a separate population of neurons, because in area
LIP neurons selective for very different directions of saccade
are located in separate parts of the area. Hence in the Wang
(2002) model, the assemblies were non-overlapping: if a neuron
belongs to the assembly representing one alternative, it does not
belong to the assembly representing the other.
However, in the case of many other decisions, alternative
choices are likely to be represented by overlapping assemblies
so that a neuron selective for one alternative may also
be selective for others. Many computational models have
been proposed for how distributed overlapping representations
may develop (e.g. Olshausen and Field (1996)) and be
stored (e.g. Hopfield (1982)). Such overlapping distributed
representations allow coding of many more alternatives by a
group of neurons than non-overlapping representations. This
property is particularly important in choices with many possible
answers. For example, in the case of motor decisions, there is
a practically infinite number of possible movements, e.g. of an
arm (including all combinations of angular velocities of all
joints). And indeed, neurophysiological data suggest that the
motor actions are encoded in distributed overlapping patterns of
neuronal activity (e.g. Chapin (2004), Georgopoulos, DeLong,
and Crutcher (1983), Georgopoulos, Pellizzer, Poliakov, and
Schieber (1999), Schieber and Hibbard (1993)). Similar
arguments apply to perceptual decisions (e.g. deciding what is
the animal you are looking at, before choosing an appropriate
action), which are the focus of this paper. During such
perceptual decisions, alternative choices are also likely to be
represented by overlapping assemblies, because complex visual
stimuli are represented in this way in the visual areas in the last
stages of the ventral stream (Erickson, Jagadeesh, & Desimone,
2000) and in the prefrontal cortex (Averbeck, Crowe, Chafee,
& Georgopoulos, 2003; Miller, Erickson, & Desimone, 1996).
The following experimental data suggest that, during
perceptual decisions, the sensory evidence is integrated
over time and the alternatives are represented by overlapping
patterns of neuronal activity. First, fMRI data indicate that
prefrontal neurons integrate sensory information in the task
in which human subjects were asked to decide whether a
noisy stimulus is a face or a house (Heekeren, Marrett,
Bandettini, & Ungerleider, 2004), and the prefrontal neurons
have been shown to encode complex visual stimuli in
overlapping distributed patterns of their activity (Averbeck
et al., 2003; Miller et al., 1996). Second, DTs in tasks involving
word discrimination have been well described by the models
assuming evidence integration (e.g. Ratcliff (1978)), and the
overlapping representations are likely to encode words as such
representations encode vocalizations in primates (Romanski,
Averbeck, & Diltz, 2005).
A response may not be required immediately after
perceptual decisions (e.g. a predator may identify an animal
and wait with the action) and the data show that the integration
process is not limited to the tasks in which an immediate
response is required: the gradually increasing firing rates have
been observed in FEF also if there was a delay between stimulus
offset and the response (Kim & Shadlen, 1999). Furthermore,
in this study the FEF neurons represented the correct alternative
even in the passive viewing condition when no response was
required. Moreover, in the study of Gold and Shadlen (2003),
the animals did not know which motor response was required
for which alternative during stimulus viewing, and nevertheless,
the accuracy increased with viewing time providing evidence
for information integration.
Although the data reviewed above suggest the plausibility of
decision processes with overlapping distributed representations,
there is neither experimental evidence demonstrating them
directly nor theoretical work showing whether optimal decision
making is possible with overlapping distributed representations.
The latter theoretical question is addressed in this paper and the
experiments that would confirm the existence of these decision
processes are suggested in the discussion. This paper derives the
values of the weights of connections in decision networks using
distributed overlapping representation, which minimize DT
for any fixed accuracy, and, in the case of two alternatives, allow
the network to implement SPRT. It is shown that these weights
can be learnt via iterative rules using information accessible to
individual synapses.
The derivation of the optimal parameters is achieved
by finding the relationships between the parameters of the
distributed and the localist decision networks for which
both models perform exactly the same computations. Since
the optimal parameters of the localist decision network are
known, these relationships give the optimal parameters of the
distributed decision network. But these relationships may also
have value on their own as they bridge the two different levels of
modelling of the neural circuits: localist and distributed. Hence
they may be useful for deriving distributed versions also for
other localist models, and thus grounding these models in the
neurobiological implementation.
The model of decision network with distributed overlapping
representation considered in this paper is not intended to map
directly onto a particular cortical area, because the location
of input neurons and integrator neurons may depend on the
decision task. However, in general, the input during this kind
of perceptual decisions is likely to be provided by visual areas
in the late ventral stream, representing complex features. The
integration is likely to occur in frontal or in parietal areas, where
it has been shown before (e.g. Heekeren et al. (2004), Shadlen
and Newsome (2001)).
In this paper, the individual neurons are described as linear
elements to enable mathematical tractability and to allow
finding explicit conditions under which the localist and the
distributed decision networks perform equivalent computations.
The assumption of linearity in processing can be justified by
assuming that attention acts to place non-linear integrators in
the most sensitive, linear range of their response functions
(e.g. Cohen, Dunbar, and McClelland (1990)).
The paper is organized as follows. Section 2 reviews localist
models of decision making and conditions under which their
performance is optimal. Section 3 derives an optimal distributed
decision network. Section 4 shows that this network achieves
faster decision times and matches behavioural data better than
the networks in which the weights are set up according to a
standard Hebb rule. Section 5 discusses the predictions of the
theory and the direction of the future work.
2. Optimal localist decision networks
This section reviews three localist decision networks: the
simplest one, i.e. the race model, and two optimal networks.
Let A denote the number of alternative choices. All models
reviewed here include A integrator units corresponding to
assemblies accumulating evidence for each alternative. Let us
denote the activity levels of these units (which may correspond
to the total activity of neurons in cell assemblies) by $Y_I$
($I \in \{1, \ldots, A\}$). It is assumed that at the beginning of the
decision process all $Y_I(0) = 0$. Each unit receives noisy input

$$X_I = M_I + C\eta_I \tag{1}$$

with mean $M_I$ and standard deviation $C$ ($\eta_I$ denote
independent Wiener processes). All models reviewed here
assume that whenever the activity of any unit exceeds a
particular threshold, a decision is made in favour of the
alternative represented by the first unit which crossed the
threshold. The decision is considered to be correct if the mean
input $M_I$ to this unit is the highest among all units (this
assumption is common in many models of decision making,
e.g. Gurney, Prescott, and Redgrave (2001), Mazurek et al.
(2003), Shadlen and Newsome (2001), Usher and McClelland
(2001), Vickers (1970), Wang (2002)).
In the race model (Vickers, 1970) the units simply integrate
their input $X_I$, so that the change in $Y_I$ (or its derivative
over time) is equal to: $\dot{Y}_I = X_I$. The race model, however,
produces slower DTs than the models reviewed below (Bogacz
et al., 2006; McMillen & Holmes, 2006), and is inconsistent
with neurophysiological data: it predicts that integrators
representing all alternatives should increase their activity
during the integration process, while it has been observed that
neurons representing the “losing” alternative decrease their
activity (Shadlen & Newsome, 2001).
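The race model described above can be simulated with a simple Euler-Maruyama scheme. The following is a minimal sketch under stated assumptions: the function name `simulate_race`, the time step `dt`, and the `max_steps` cut-off are illustrative choices and do not come from the paper.

```python
import numpy as np

def simulate_race(means, noise_c, threshold, dt=0.001, rng=None, max_steps=1_000_000):
    """Simulate one trial of the race model: each unit integrates its own input.

    means     -- mean inputs M_I, one per alternative
    noise_c   -- noise magnitude C
    threshold -- decision threshold on the activity Y_I
    Returns (choice, decision_time); choice is None if no unit crossed in time.
    """
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    y = np.zeros_like(means)  # Y_I(0) = 0 for all units
    for step in range(1, max_steps + 1):
        # Euler-Maruyama step: drift M_I * dt plus Wiener increment C * sqrt(dt) * N(0, 1).
        y += means * dt + noise_c * np.sqrt(dt) * rng.standard_normal(means.shape)
        if np.any(y >= threshold):
            return int(np.argmax(y)), step * dt
    return None, max_steps * dt
```

Running many trials with a fixed seed and counting how often the returned choice has the highest mean input gives an estimate of the error rate for a given threshold.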
Fig. 1. The architectures of (a) the Usher and McClelland (2001), (b) the localist, and (c) the distributed decision networks. Small circles denote individual neurons,
large circles denote cell assemblies. Arrows denote excitatory connections, lines ended with black circles denote inhibitory connections. For clarity, in panel (c) only
sample connections are shown.
Usher and McClelland (2001) proposed a localist model
of decision making with the architecture shown in Fig. 1(a).
The changes in the activity of integrators are described by the
following stochastic differential equations:
$$\dot{Y}_I = X_I - K_{UM} Y_I - W_{UM} \sum_{J=1, J \neq I}^{A} Y_J. \tag{2}$$

The activity levels of units decay (or leak) with
proportionality constant $K_{UM}$. The model also includes
competition between all units in the form of all-to-all inhibitory
connections with weight $W_{UM}$.
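Eq. (2) can be sketched in the same Euler-Maruyama style. This is an illustrative sketch only: the names `um_drift` and `simulate_um` and the step size are assumptions, and the code follows the linear form of Eq. (2) as written here, without any additional non-linearity.

```python
import numpy as np

def um_drift(y, x, k_um, w_um):
    """Deterministic part of Eq. (2): input, leak, and inhibition from the other units."""
    y = np.asarray(y, dtype=float)
    inhibition = w_um * (y.sum() - y)  # for each unit I, the sum over all J != I
    return np.asarray(x, dtype=float) - k_um * y - inhibition

def simulate_um(means, noise_c, k_um, w_um, threshold, dt=0.001, rng=None, max_steps=1_000_000):
    """Simulate one trial; returns (choice, decision_time) or (None, elapsed time)."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    y = np.zeros_like(means)
    for step in range(1, max_steps + 1):
        noise = noise_c * np.sqrt(dt) * rng.standard_normal(means.shape)
        y = y + um_drift(y, means, k_um, w_um) * dt + noise
        if np.any(y >= threshold):
            return int(np.argmax(y)), step * dt
    return None, max_steps * dt
```

With two units and `k_um == w_um`, the leak and inhibition terms cancel in the difference of the two drifts, so the difference between the units integrates the difference between the inputs.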
Bogacz et al. (2006) have shown that for two alternatives
($A = 2$), when $K_{UM} = W_{UM}$ and the values of both
these parameters are high (relative to input and noise), the
performance of the Usher and McClelland (2001) model is
optimal. For these parameters, the activity levels of the units
are proportional to the difference between integrated evidence
in support of the two alternatives, and hence the model
approximates SPRT. As mentioned in the introduction, SPRT
gives the fastest DTs for any required accuracy, which can
be illustrated on the example of the race and the Usher and
McClelland models. In both models, for given $M_I$ and $C$, the
DT and the accuracy depend on the height of the decision
threshold. However, if the decision threshold is chosen in each
of the models to give the same accuracy (e.g. 90%), then
the optimally parameterized Usher and McClelland model will
give faster average DT than the race model. Intuitively, this
advantage of the Usher and McClelland model comes from
its ability to adaptively react to the level of conflict between
alternatives: if on a given trial the average input to the losing
alternative is higher (due to noise), the winning unit will have
to integrate for longer as it will receive more inhibition from the
losing unit. Such adaptation of integration time is not present in
the race model.
In the case of multiple alternatives ($A > 2$), there exists a
generalization of SPRT, known as Multihypothesis SPRT
(Dragalin, Tartakovsky, & Veeravalli, 1999), but its
implementation would require a network with architecture much
more complex than that of the Usher and McClelland
model. McMillen and Holmes (2006) have shown that for $A > 2$,
the Usher and McClelland model achieves the lowest DT
possible within this simple architecture also when $K_{UM} = W_{UM}$
and both these parameters are high. Let us call the parameters
satisfying these constraints optimal, and in general, in this paper
let us use the word optimal to refer to parameters that allow
theoretically best possible performance (i.e. the shortest DTs for a
fixed error rate) for $A = 2$, and that allow the best possible
performance within the architecture considered for $A > 2$.
Wang (2002) proposed a detailed distributed model of
decision making in area LIP. Fig. 1(b) shows a localist model
with the architecture (connections between various neuronal
populations) of the Wang (2002) model. It will be referred to as
the localist decision network. This model does not capture the
complexity of the Wang (2002) model, but the simplifications
made allow mathematical tractability. The localist decision
network is very similar to the Usher and McClelland (2001)
model with just two differences. First, the integrators do not
inhibit one another, but rather send excitatory connections to a
pool of inhibitory neurons, which then inhibit the integrators.
Second, the integrator neurons send excitatory connections
(denoted by V in Fig. 1(b)) within an assembly. These
connections were found to be necessary to enable a network
of individual model neurons, whose membrane voltages decay
rapidly (on a millisecond scale), to integrate information on
the timescales of decisions (hundreds of milliseconds). The
changes in the activity of the localist units are described by
the following stochastic differential equations (Bogacz et al.,
2006):
$$\dot{Y}_I = X_I - K Y_I + V Y_I - W_{INH} \sum_{J=1}^{A} Y_J. \tag{3}$$

To find the optimal parameters of the localist decision
network, let us rewrite Eq. (3) as:

$$\dot{Y}_I = X_I - (K - V + W_{INH}) Y_I - W_{INH} \sum_{J=1, J \neq I}^{A} Y_J. \tag{4}$$
Comparing Eqs. (2) and (4) shows that the Usher and
McClelland (2001) model and the localist decision network
are computationally equivalent when their parameters are
related as follows: $K_{UM} = K - V + W_{INH}$,
$W_{UM} = W_{INH}$. Hence given that the optimal parameters of the
Usher and McClelland model must satisfy $K_{UM} = W_{UM}$ with
both high, the optimal parameters of the localist decision
network must satisfy $K - V + W_{INH} = W_{INH}$, i.e. $K = V$, and
$W_{INH}$ is high (Bogacz et al., 2006).
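The parameter mapping above can be checked numerically. The sketch below evaluates the deterministic parts of Eqs. (2) and (3) for arbitrary activities and confirms that they coincide under the stated relationships; the function names are illustrative assumptions.

```python
import numpy as np

def localist_drift(y, x, k, v, w_inh):
    """Deterministic part of Eq. (3): leak, self-excitation, pooled inhibition."""
    return x - k * y + v * y - w_inh * y.sum()

def um_drift(y, x, k_um, w_um):
    """Deterministic part of Eq. (2): leak and inhibition from the other units."""
    return x - k_um * y - w_um * (y.sum() - y)

def equivalent_um_parameters(k, v, w_inh):
    """Map localist parameters (K, V, W_INH) to (K_UM, W_UM)."""
    return k - v + w_inh, w_inh

# Arbitrary test state for A = 4 alternatives.
rng = np.random.default_rng(0)
y, x = rng.normal(size=4), rng.normal(size=4)
k_um, w_um = equivalent_um_parameters(k=3.0, v=1.0, w_inh=4.0)
same = np.allclose(localist_drift(y, x, 3.0, 1.0, 4.0), um_drift(y, x, k_um, w_um))
```

Setting `k == v` in this mapping gives `k_um == w_um`, which is the optimality condition stated above.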
3. Optimal distributed decision networks
This section derives a distributed decision network
computationally equivalent to the optimal localist decision
network. Let us define that two networks are computationally
equivalent if for a given input both networks make the same
choice after the same DT. This definition implies that the
repeated simulation of two equivalent networks will yield
exactly the same error rate and DT distribution. Since we know
the optimal parameters of the localist decision network, the
relationship between the parameters of the localist and the
distributed decision networks giving computational equivalence
will allow us to compute the optimal parameters of the
distributed decision network.
3.1. Network architecture
The architecture of the distributed decision network is shown
in Fig. 1(c). In this model, both layers of integrators and inputs
are described as populations of $n$ neurons. Let us denote the
activities of the integrator neurons by $y_i$ and the activities of the
input neurons by $x_j$. Let us denote the weights of connections
between input neuron $j$ and integrator neuron $i$ by $w_{i,j}$, and
the weights of connections between integrator neuron $j$ and
integrator neuron $i$ by $v_{i,j}$ (see Fig. 1(c)).
For simplicity, the inhibitory neurons are not modelled
individually, but as a population, because they are not selective
for different alternatives, by contrast to integrator and input
neurons. Let us denote the weights of connections between
integrator neuron $i$ and the population of inhibitory neurons
by $w_{inh,i}$, and for simplicity assume that the weights of
connections between inhibitory and integrator neurons are
equal to 1. Finally, let us denote the decay rate of the integrator
neurons by $k$. For simplicity, let us model the individual
integrator neurons as simple linear units. Hence the changes
in the activity of these neurons are described by the following
stochastic differential equations:
$$\dot{y}_i = \sum_{j=1}^{n} w_{i,j} x_j - k y_i + \sum_{j=1}^{n} v_{i,j} y_j - \sum_{j=1}^{n} w_{inh,j} y_j. \tag{5}$$
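In matrix form, the right-hand side of Eq. (5) is a few lines of NumPy. The sketch below is illustrative: the function name `distributed_drift` and the argument layout (weights `w` and `v` as $n \times n$ arrays indexed `[i, j]`) are assumptions, not part of the paper.

```python
import numpy as np

def distributed_drift(y, x, w, v, w_inh, k):
    """Right-hand side of Eq. (5) for all integrator neurons at once.

    y, x  -- activities of integrator and input neurons (length n)
    w, v  -- feedforward and recurrent weight matrices, shape (n, n)
    w_inh -- weights from integrator neurons to the inhibitory pool (length n)
    k     -- decay rate of the integrator neurons
    """
    feedforward = w @ x        # sum_j w[i, j] * x[j]
    recurrent = v @ y          # sum_j v[i, j] * y[j]
    inhibition = w_inh @ y     # one scalar, fed back to every neuron with weight 1
    return feedforward - k * y + recurrent - inhibition
```

Because the inhibitory neurons are modelled as a single pool, the inhibition term is one scalar that is subtracted from the drift of every integrator neuron.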
3.2. Distributed representation
Let us assume that each alternative $I$ is represented by
an assembly of $an$ neurons (hence $a$ denotes the sparseness
of coding in the distributed network). Let us encode the
membership of integrator neurons in assemblies in
matrix $[y_{i,I}]$, in particular: $y_{i,I} = 1$ if neuron $i$ belongs to
assembly $I$, and $y_{i,I} = 0$ otherwise (note that one neuron
may belong to many assemblies). Similarly, let $x_{j,I} = 1$ if
input neuron $j$ belongs to assembly $I$, and $x_{j,I} = 0$ otherwise.
The theory described below works for any $0 < a < 1$,
and also generalizes to a more biologically realistic case of
neurons having different continuous responses to different
stimuli (rather than responding or not as implied before). To
represent this case the elements of matrices $[y_{i,I}]$ and $[x_{j,I}]$
would be continuous and would have to be normalized such
that the average of each column is equal to $a$. However, for the
clarity of argument, in the remainder of the paper only binary
patterns will be used.
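The binary membership matrices described above can be generated as follows. This is a minimal sketch assuming that assemblies are drawn at random with exactly $an$ members each; the helper names `random_assemblies` and `assembly_totals` are hypothetical and only serve the illustration.

```python
import numpy as np

def random_assemblies(n, num_alternatives, sparseness, rng):
    """Return an (n, A) binary matrix; column I marks the neurons of assembly I."""
    size = int(round(sparseness * n))  # a * n neurons per assembly
    membership = np.zeros((n, num_alternatives))
    for alt in range(num_alternatives):
        members = rng.choice(n, size=size, replace=False)
        membership[members, alt] = 1.0
    return membership

def assembly_totals(membership, activity):
    """Total activity of the neurons belonging to each assembly."""
    return membership.T @ activity
```

Since the columns are drawn independently, a neuron may belong to several assemblies, which is the overlapping case considered in this paper.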
Let us now relate the variables of the localist and the
distributed decision networks. It is most natural to assume that
the activity of localist unit I corresponds to the total activity of
neurons belonging to assembly I, which can be written as (all
the variables indexed once, e.g. yi, become column vectors):
$$[X_I] = [x_{j,I}]^T [x_j], \qquad [Y_I] = [y_{i,I}]^T [y_i]. \tag{6}$$

3.3. Parameters giving equivalence and optimal performance
To establish the equivalence between the localist and
the distributed decision networks, we seek the relationships
between parameters which will satisfy Eqs. (3), (5) and (6).
Appendix A shows that if the following relationships are
satisfied and pseudo-inverses of $[x_{j,I}]^T$ and $[y_{i,I}]^T$ exist,
the computations of the localist and the distributed decision
networks are equivalent:
$$k = K, \tag{7}$$

$$w_{inh,i} = W_{INH} \frac{1}{an} \sum_{I=1}^{A} y_{i,I}, \tag{8}$$

$$[v_{i,j}] = V [y_{i,I}] [y_{i,I}]^{-1}, \tag{9}$$

$$[w_{i,j}] = \left([y_{i,I}]^{T}\right)^{-1} [x_{j,I}]^{T}. \tag{10}$$

Below the intuition is provided for why the above conditions
need to be satisfied, and it is considered how the weights in
the distributed decision network described by Eqs. (8)–(10)
can be learnt in a biologically plausible manner, i.e. using
only information accessible to individual synapses. It is usually
assumed that a synapse (e.g. storing weight $v_{i,j}$) can only
“access” information about activity of presynaptic and
postsynaptic neurons (e.g. $y_i$ and $y_j$). Eqs. (9) and (10) seem to
violate this condition, as they involve computation of
pseudo-inverses (which requires “access” to all elements of
matrix $[y_{i,I}]$), but it will be shown that there exist simple
iterative learning algorithms using only information locally
accessible to synapses that converge to the weights satisfying
Eqs. (9) and (10).
The condition of Eq. (7) ensures that the individual neurons
decay with the same rate as units in the localist model. Eq. (8)
ensures that the inhibition received by each integrator neuron is
equal to (left-hand side of this equation comes from Eq. (5), the
two transformations use Eqs. (8) and (6)):
$$\sum_{j=1}^{n} w_{inh,j} y_j = \sum_{j=1}^{n} W_{INH} \frac{1}{an} \sum_{I=1}^{A} y_{j,I} y_j = W_{INH} \frac{1}{an} \sum_{I=1}^{A} Y_I. \tag{11}$$
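The relationships of Eqs. (7)–(11) can be checked numerically on a small example. The sketch below is illustrative and rests on several assumptions: the inverses in Eqs. (9) and (10) are computed as Moore-Penrose pseudo-inverses with `np.linalg.pinv`, the membership matrices are small hand-written patterns with linearly independent columns, and all function names are hypothetical.

```python
import numpy as np

def optimal_distributed_weights(y_mem, x_mem, K, V, W_INH):
    """Weights of Eqs. (7)-(10) for (n, A) membership matrices y_mem and x_mem."""
    n = y_mem.shape[0]
    a = y_mem[:, 0].sum() / n                     # each assembly has a * n neurons
    k = K                                         # Eq. (7)
    w_inh = W_INH / (a * n) * y_mem.sum(axis=1)   # Eq. (8)
    v = V * y_mem @ np.linalg.pinv(y_mem)         # Eq. (9), pseudo-inverse assumed
    w = np.linalg.pinv(y_mem.T) @ x_mem.T         # Eq. (10), pseudo-inverse assumed
    return k, w_inh, v, w

def distributed_drift(y, x, w, v, w_inh, k):
    return w @ x - k * y + v @ y - w_inh @ y      # Eq. (5)

def localist_drift(Y, X, K, V, W_INH):
    return X - K * Y + V * Y - W_INH * Y.sum()    # Eq. (3)

# Three overlapping assemblies of 3 neurons each among n = 6 neurons (a = 0.5).
y_mem = np.array([[1, 0, 0], [1, 1, 0], [1, 1, 0],
                  [0, 1, 1], [0, 0, 1], [0, 0, 1]], dtype=float)
x_mem = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 0],
                  [0, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
K, V, W_INH = 3.0, 1.0, 4.0
k, w_inh, v, w = optimal_distributed_weights(y_mem, x_mem, K, V, W_INH)

rng = np.random.default_rng(0)
y, x = rng.normal(size=6), rng.normal(size=6)
Y, X = y_mem.T @ y, x_mem.T @ x                   # Eq. (6)

# Eq. (11): pooled inhibition expressed through the localist activities.
inhibition_matches = np.isclose(w_inh @ y, W_INH / 3.0 * Y.sum())
# Projecting the distributed drift onto the assemblies gives the localist drift.
drift_matches = np.allclose(y_mem.T @ distributed_drift(y, x, w, v, w_inh, k),
                            localist_drift(Y, X, K, V, W_INH))
```

Both checks hold for any activity vectors `y` and `x` in this example, because the columns of `y_mem` are linearly independent and every assembly contains the same number of neurons.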