Simple Artificial Neural Networks That Match Probability
and Exploit and Explore When Confronting
a Multiarmed Bandit
Michael R. W. Dawson, Brian Dupuis, Marcia L. Spetch, and
Debbie M. Kelly
Abstract—The matching law (Herrnstein 1961) states that response rates
become proportional to reinforcement rates; this is related to the empirical
phenomenon called probability matching (Vulkan 2000). Here, we show
that a simple artificial neural network generates responses consistent with
probability matching. This behavior was then used to create an operant
procedure for network learning. We use the multiarmed bandit (Gittins
1989), a classic problem of choice behavior, to illustrate that operant
training balances exploiting the bandit arm expected to pay off most
frequently with exploring other arms. Perceptrons provide a medium for
relating results from neural networks, genetic algorithms, animal learning,
contingency theory, reinforcement learning, and theories of choice.
Index Terms—Instrumental learning, multiarmed bandit, operant con-
ditioning, perceptron, probability matching.
I. INTRODUCTION
The matching law states that the rate of a response reflects the rate
of its obtained reinforcement: if response A is reinforced twice as fre-
quently as response B, then A will appear twice as frequently as B
[1]. While modern variations exist [4], the matching law is usually expressed as $B = kr/(r + r_e)$, where $B$ is response rate, $k$ and $r_e$ are parameters, and $r$ is reinforcement rate. Intended to explain re-
sponse frequency, the matching law also predicts how response strength
varies with reinforcement frequency [5]. The matching law is a founda-
tional regularity, applying to many tasks in psychology and economics
[6]–[8]. An empirical phenomenon that is formally related [9] to the
matching law is probability matching, in which the probability that an
agent makes a choice among alternatives mirrors the probability as-
sociated with the outcome or reward of that choice [2]. This brief in-
vestigates whether a simple artificial neural network can vary response
strengths in accordance with such probability matching.
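For concreteness (the following display is our gloss of these two regularities, not an equation reproduced from this brief): for two responses A and B with obtained reinforcement rates $r_A$ and $r_B$, strict matching states that

\[
\frac{B_A}{B_A + B_B} = \frac{r_A}{r_A + r_B},
\]

where $B_A$ and $B_B$ are the response rates, while probability matching states that the probability of choosing alternative $i$ equals the probability that alternative $i$ is rewarded, $p(\mathrm{choose}\ i) = p(\mathrm{reward} \mid i)$.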
A perceptron [10], [11] is a simple artificial neural network whose
input units send signals about detected stimuli through weighted con-
nections to an output unit, which converts them into a response ranging
from 0 to 1 using a nonlinear activation function. Modern perceptrons
typically use the logistic equation $a_j = 1/(1 + e^{-net_j})$, where $a_j$ is the activity of output unit $j$, and $net_j$ is the incoming signal. Per-
ceptrons can be trained to produce desired responses to stimuli with a
gradient-descent learning rule [12] that modifies network weights using
response error scaled by the derivative of the activation function. Per-
ceptrons can simulate a large number of classic results in the learning
literature [13]. Given the ubiquity and importance of probability
matching, is it possible that perceptrons can exhibit such matching as
well? Our first simulation attempted to answer this question.
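To make the architecture and the learning rule concrete, the following Python sketch (ours, not the authors' Rosenblatt program; class and function names are invented for illustration) implements a single logistic output unit trained by the gradient-descent rule described above.

import numpy as np

def logistic(net):
    # Logistic activation squashes the net input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-net))

class Perceptron:
    """A single logistic output unit fed by weighted input connections."""

    def __init__(self, n_inputs, rng, weight_range=0.1):
        # Small random initial connection weights plus a bias weight
        # (the range is an assumed value for illustration).
        self.w = rng.uniform(-weight_range, weight_range, n_inputs)
        self.bias = rng.uniform(-weight_range, weight_range)

    def respond(self, x):
        # Network response: logistic of the summed weighted input.
        return logistic(np.dot(self.w, x) + self.bias)

    def update(self, x, target, lr=0.1):
        # Gradient-descent (delta) rule: response error scaled by the
        # derivative of the logistic activation function.
        a = self.respond(x)
        delta = (target - a) * a * (1.0 - a)
        self.w += lr * delta * x
        self.bias += lr * delta
        return a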
Manuscript received January 05, 2009; accepted June 09, 2009. First pub-
lished July 10, 2009; current version published August 05, 2009. The work
of M. R. W. Dawson, M. L. Spetch, and D. M. Kelly was supported by
NSERC Discovery Grants.
M. R. W. Dawson, B. Dupuis, and M. L. Spetch are with the Department of
Psychology, University of Alberta, Edmonton, AB T6G 2P9 Canada (e-mail:
mdawson@ualberta.ca; bdupuis@ualberta.ca; mspetch@ualberta.ca).
D. M. Kelly is with the Department of Psychology, University of
Saskatchewan, Saskatoon, SK S7N 5A5 Canada (e-mail: debbie.kelly@usask.
ca).
Digital Object Identifier 10.1109/TNN.2009.2025588
II. SIMULATION 1: MATCHING DIFFERENTIAL REINFORCEMENT
PROBABILITIES
A. Method
1) Network Architecture and Training Set: Perceptrons, with a
single output unit and four input units, were trained. The input units
were turned on or off to represent the presence or absence of four
different discriminative stimuli (DSs). For example, the input pattern (1, 0, 0, 0) indicated that DS1 was present and that all other DSs were absent.
Each DS was reinforced at different frequencies. For the first 300
epochs of training, DS1 was reinforced on 20% of its presentations
while DS2, DS3, and DS4 received 40%, 60%, and 80% reinforce-
ment, respectively. For the second 300 epochs of training, reinforce-
ment frequencies were reversed, so that DS1 was reinforced 80% of its
presentations while DS2, DS3, and DS4 received 60%, 40%, and 20%
reinforcement, respectively.
Reinforcement probabilities were manipulated by repeating the pat-
tern that coded the presence of one of the DSs ten times, building a
total training set of 40 input patterns. Each pattern was reinforced (or
not) by being paired with a desired network output value of either 1 or
0. Differential probabilities of reinforcement were produced by varying
the number of positive reinforcements applied to a DS’s set of ten input
patterns. For example, setting the desired response to DS1 to 1 for two
of its input patterns, and to 0 for its remaining eight patterns, produced
a 20% reinforcement probability.
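The construction of this 40-pattern set can be sketched as follows (our illustration, assuming the presence/absence coding described above; build_training_set is our own name).

import numpy as np

def build_training_set(reinforcement_probs, copies=10):
    # Returns (inputs, targets) for one-hot DS patterns, each repeated
    # `copies` times, with the proportion of targets set to 1 equal to
    # that DS's reinforcement probability.
    n_ds = len(reinforcement_probs)
    inputs, targets = [], []
    for ds, p in enumerate(reinforcement_probs):
        pattern = np.zeros(n_ds)
        pattern[ds] = 1.0                      # this DS present, all others absent
        n_reinforced = int(round(p * copies))  # e.g., 0.2 * 10 = 2 reinforced copies
        for i in range(copies):
            inputs.append(pattern.copy())
            targets.append(1.0 if i < n_reinforced else 0.0)
    return np.array(inputs), np.array(targets)

# The set used in the first phase: 20%, 40%, 60%, and 80% reinforcement.
X, y = build_training_set([0.2, 0.4, 0.6, 0.8])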
2) Network Training: Ten different perceptrons were trained using
the gradient-descent rule with a learning rate of 0.1, and with connection weights randomly set to small initial values prior to training. Training was accomplished with the Rosenblatt program that
is available as freeware [14]. During an epoch of training, a network
was presented each of the 40 patterns; connection weights were mod-
ified after each presentation. The order of DS presentations was ran-
domized in each epoch. Network responses to each DS were recorded
every 20 epochs of training. After 300 epochs of training, the rein-
forcement contingencies associated with the DSs were inverted without
reinitializing connection weights. Training continued for an additional
300 epochs.
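Putting the pieces together, here is a sketch of the Simulation 1 regime (ours; it reuses the Perceptron and build_training_set sketches above and assumes the reversal simply swaps the probability list without reinitializing the weights).

import numpy as np

def run_simulation_1(seed=0, epochs_per_phase=300, record_every=20):
    rng = np.random.default_rng(seed)
    net = Perceptron(n_inputs=4, rng=rng)
    history = []
    for phase_probs in ([0.2, 0.4, 0.6, 0.8], [0.8, 0.6, 0.4, 0.2]):
        X, y = build_training_set(phase_probs)
        for epoch in range(epochs_per_phase):
            for i in rng.permutation(len(X)):    # randomize pattern order each epoch
                net.update(X[i], y[i], lr=0.1)   # weights modified after every pattern
            if (epoch + 1) % record_every == 0:
                # Record the response to each DS presented alone.
                history.append([net.respond(np.eye(4)[d]) for d in range(4)])
    return np.array(history)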
B. Results
The results, presented in Fig. 1(a), indicated that perceptrons
matched response strength to reinforcement probabilities, and quickly
adjusted their behavior when reinforcement probabilities were altered.
After 60 epochs, the perceptrons generated output activity that equalled the probability of reinforcement for each DS (e.g., generating activity of
0.20 to DS1). When reinforcement probabilities were changed, the
perceptrons adjusted and again matched their responses to the new
reinforcement contingencies within 60 epochs.
A number of experiments have studied probability matching under
conditions in which reinforcement probabilities are changed or re-
versed at the midpoint of the study; subjects in these experiments have
included insects [15]–[18], fish [19], turtles [20], pigeons [21], and hu-
mans [22]. The simulation results reported in Fig. 1 are very similar to
the results obtained in these experiments. For example, in their classic
study of probability matching in the fish [19], Behrend and Bitterman
found that their subjects quickly matched their choice preference between two alternatives to the probability of reinforcement of the two. When reinforcement probabilities were altered, the animals quickly altered their choice behavior to reflect the new contingencies. Behrend and Bitterman’s graph of this choice behavior over time [19, Fig. 2] is
strikingly similar in shape to the curves illustrated in Fig. 1(a).
Fig. 1. Average responses of ten different perceptrons to each of the four
stimuli as a function of training epoch. (a) Responses for networks from the
first simulation which used standard training procedures. (b) Responses for
networks from the second simulation which used an operant training procedure.
These results are comparable to other algorithms in the machine
learning literature that do not involve artificial neural networks. For ex-
ample, at trial $t$, the pursuit algorithm [23] updates the expected probability $p_j(t)$ of source $j$ delivering reinforcement using the equation $p_j(t+1) = p_j(t) + \beta\,[R(t) - p_j(t)]$, where $R(t)$ is equal to 1 if reinforcement is received and 0 otherwise, and $\beta$ is a constant. When $\beta$ equals 0.005 and the pursuit algorithm is given the identical training set used
with the perceptrons, the results were indistinguishable from Fig. 1(a);
the total sum of squared differences between the 120 points plotted in
Fig. 1(a) and the same data from the pursuit algorithm was 0.005. In
short, the matching behavior of the perceptrons was identical to that
obtained using a standard machine learning algorithm.
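A sketch of that update as we have reconstructed it (the symbol names and the exponential-averaging form are our reading of [23], not code from this brief).

import random

def pursuit_estimate_update(p, reinforced, beta=0.005):
    # One update of the estimated payoff probability of a single source:
    # p <- p + beta * (R - p), with R = 1 if reinforced and 0 otherwise.
    R = 1.0 if reinforced else 0.0
    return p + beta * (R - p)

# Example: the estimate for a source reinforced on 20% of trials drifts toward 0.2.
random.seed(0)
p = 0.5
for _ in range(5000):
    p = pursuit_estimate_update(p, reinforced=(random.random() < 0.2))
print(round(p, 2))   # approximately 0.2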
III. SIMULATION 2: OPERANT LEARNING OF THE
MULTIARMED BANDIT
The training set used above is similar to the classic multiarmed
bandit problem [3]. In this problem, an agent is in a room with a number of different “one-armed bandit” gambling machines. When a machine’s
arm is pulled, the machine pays either 1 or 0 units; each machine
has a different (and usually fixed) probability of paying off, which is
not known by the agent. At each time step, the agent pulls the arm of
one machine. The goal is to choose machines to maximize the total
payoff over the game’s duration. To do this, the agent must explore the
different machines to determine payoff probabilities. The agent must
also exploit the results of this exploration in order to maximize reward.
As a result, there is a tradeoff between exploration and exploitation
that must be balanced [24]. A “greedy strategy” only pulls the arm of
the machine with the highest expected payoff probability. However,
as the duration of the game increases, an alternative strategy would be
to explore other machines as well, in case early probability estimates
were inaccurate.
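As a purely illustrative sketch of the greedy strategy (the payoff probabilities below are invented for the example, not taken from the simulations), an agent that always pulls the arm with the highest current estimate can lock onto a poor arm when its early estimates mislead it, which is why some exploration of the other arms is needed.

import random

payoff_probs = [0.2, 0.4, 0.6, 0.8]      # hypothetical machines
estimates = [0.5] * len(payoff_probs)    # initial payoff estimates
pulls = [0] * len(payoff_probs)

random.seed(1)
for t in range(1000):
    arm = max(range(len(estimates)), key=lambda i: estimates[i])   # greedy choice
    reward = 1 if random.random() < payoff_probs[arm] else 0
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]       # sample-average update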
Simulation 1 was not identical to the multiarmed bandit problem be-
cause there was no balance between exploitation and exploration: every
time a “machine” was presented to a perceptron, it “pulled the ma-
chine’s arm” and learned new information. This can be changed using
the knowledge gleaned from Simulation 1 that network responses esti-
mate reward likelihood.
Rather than modifying weights on every DS presentation, one can
implement operant network learning as follows. On every trial, com-
pute the network’s response to a presented stimulus. The magnitude
of this response is the network’s current estimate of reward likelihood.
This response is used as the probability of updating weights (i.e., of
learning on the trial). That is, connection weights are not always mod-
ified, as was the case in Simulation 1. Sometimes they will be changed,
but other times they will remain the same. As training proceeds, con-
nection weights will be updated more frequently for those DSs associ-
ated with a higher frequency of reinforcement than for those receiving
less reinforcement. Therefore, a perceptron trained (operantly) on this
problem would be functionally equivalent to an agent playing a mul-
tiarmed bandit. Operant learning also imposes a simple balance be-
tween exploitation and exploration, because the perceptron will occa-
sionally modify connection weights using a DS associated with a low
(but nonzero) estimated reinforcement contingency.
A. Method
The method for Simulation 2 was identical to that used in Simulation
1, except that output unit activity was used as a probability to determine
whether connection weights were modified. This was accomplished as
follows. After the network’s response to a pattern was calculated, a
random number between 0 and 1 was generated. If this number was
less than or equal to the network’s response, then the learning rule was
used to update all of the connection weights. Otherwise, the connection
weights were not changed, and the next pattern was presented to the
network.
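In code, the only change from the Simulation 1 sketch is a stochastic gate on learning (again our illustration, reusing the Perceptron sketch given earlier).

def operant_epoch(net, X, y, rng, lr=0.1):
    # One epoch of operant training: learning occurs on a trial with
    # probability equal to the network's response to the presented pattern.
    for i in rng.permutation(len(X)):
        response = net.respond(X[i])
        if rng.random() <= response:        # random number <= response: learn
            net.update(X[i], y[i], lr=lr)
        # otherwise the weights are left unchanged for this trial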
It might be argued that this method is a major departure from Simula-
tion 1, in the sense that an external controller is generating the random
number that is used, in conjunction with output unit activity, to deter-
mine whether a particular trial will involve learning. From this per-
spective, the perceptron itself is incapable of operant learning, because
it requires this external control. However, while it is possible to elab-
orate artificial neural network architectures to build learning rules di-
rectly into them, it is almost always the case that these learning rules
exist as controllers that are external to the network [25]. Thus, in our
view, Simulation 2 uses a slightly elaborated learning rule that is no
more external to the perceptron than was the learning rule employed
in Simulation 1, or than any learning rule that is typically used to train
artificial neural networks.
B. Results
Fig. 1(b) illustrates the results of using this operant procedure to
train perceptrons on exactly the same task used in Simulation 1. The
results were qualitatively very similar to the results in Fig. 1(a): per-
ceptrons quickly adjusted responses to match reinforcement contin-
gency, and then quickly readapted when reinforcement contingencies
were inverted. One difference in results was that the operant networks
were slightly slower at achieving matching behavior. Also, when re-
inforcement contingencies changed, operant networks adapted to DS
earlier than the other DSs, because, at epoch 300, operant training was
changing weights on the basis of DS information more frequently than
on the basis of the other DSs. Quantitatively, the total sum of squared
differences between the 120 data points used to create Fig. 1(a) and the
corresponding data points in Fig. 1(b) was 0.557.
IV. GENERAL DISCUSSION
In summary, we have shown that perceptrons generate responses that
accord with probability matching, and that this can be used to create
an operant training paradigm for these networks. There are four main
implications of these results.
First, formal accounts of matching are typically stated in terms
of observables (rates of reinforcement and responses) and not by
appealing to underlying mechanisms [6], [8]. Because the matching
law can emerge from modifying a perceptron’s connection weights, it
might be explained by appealing to general mechanisms of associative
learning.
Other mechanistic accounts of matching have been proposed. Mc-
Dowell et al. have developed a genetic algorithm that evolves a popu-
lation of behaviors over time, skewing the distribution of possible be-
haviors in the direction of those that have been reinforced [26]–[29].
The matching law is an emergent property of this selectionist account
of adaptive behavior. Such selectionist accounts are usually taken as
radical alternatives to instructionist theories such as artificial neural
networks [30]. However, it is possible to create neural networks that
are consistent with selectionist theory [31], [32]. Clearly, one area de-
serving future research is an examination of the computational and al-
gorithmic similarities and differences between selectionist and instruc-
tionist mechanisms that are capable of producing matching behavior.
A second implication of our results is a response to Herrnstein’s po-
sition [8, p. 68] that the matching law can be distinguished from other
theories of learning, such as the Rescorla–Wagner model [33]. That
perceptrons produce probability matching indicates that this distinct-
ness needs to be reevaluated. Formal equivalences between perceptron
learning and the Rescorla–Wagner model [34] and contingency theory
[35] have already been established. Genetic algorithms that produce
matching behavior reach the same equilibria as the Rescorla–Wagner
model [29], [36]. Matching behavior would be expected from these
approaches to learning, as well as from networks adapted via rein-
forcement learning [37]. Perceptrons might mediate a formal account
of the relationships between neural networks, these important theories
of learning, and the matching law and probability matching.
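As a pointer to that equivalence (our gloss, following [34], not an equation from this brief): the Rescorla–Wagner model updates the associative strength $V_i$ of each cue $i$ present on a trial by

\[
\Delta V_i = \alpha_i \beta \Big( \lambda - \sum_{j \in \mathrm{present}} V_j \Big),
\]

which has the same error-correcting form as the perceptron's gradient-descent rule, with $\lambda$ playing the role of the desired response and the summed associative strength playing the role of the network's net input.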
A third implication of our results concerns further explorations of
matching behavior in perceptrons, particularly with the goal of using
perceptrons to emulate animal behaviors that have been used to study
the matching law. The perceptrons reported here represent idealized
systems that demonstrated probability matching. They were trained in
a situation in which they distinguished between four discriminative
stimuli, and in which they were essentially trained using a random-
ratio reinforcement schedule. Animal subjects produce behavior in ac-
cord with Herrnstein’s matching law when trained under different rein-
forcement schedules (in particular, concurrent interval schedules) [6].
Furthermore, under a variety of conditions, animals produce system-
atic deviations from the strict matching law [6], [38] including under-
matching, where they respond less frequently to a DS than the matching
law would predict, and bias, where they have a stronger preference for
a DS than the matching law would predict. Importantly, the goal of
this brief was not to emulate extant data in the animal literature, but
instead to determine whether perceptrons were capable of probability
matching. Given that our results indicate that perceptrons can match
probabilities, this suggests a future line of research in which constraints
on network architecture, learning procedures, and learning rules can be
explored in an explicit attempt to emulate the complexity of the data
that is to be found in the experimental literature on matching.
A fourth implication of our results is that the ability of perceptrons
to simulate multiarmed bandit problems might serve to link statistical
theories of choice [3], and models of reinforcement learning [24], [37],
to other theories of learning, including artificial neural networks and
standard associative models. Theories of multiarmed bandits usually
view each machine as a unique whole. However, machines could be
viewed as feature collections, with the machine’s reinforcement esti-
mate based upon the sum of the estimates associated with each feature,
and not upon the (whole) machine itself. The reinforcement estimate
associated with a feature can depend on the payoff of several machines,
because a feature may be shared by more than one machine.
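A hypothetical illustration of this featural view (the machines, features, and weights below are invented for the example): each machine is coded as a binary feature vector, and its estimated payoff is a perceptron-style function of per-feature estimates, so machines that share a feature share part of their reinforcement estimate.

import numpy as np

machines = {
    "A": np.array([1, 0, 1, 0]),   # shares the third feature with machine B
    "B": np.array([0, 1, 1, 0]),
    "C": np.array([0, 0, 0, 1]),
}
feature_weights = np.zeros(4)      # per-feature reward estimates (connection weights)

def estimated_payoff(name):
    # The machine's estimate is based on the sum of its features' estimates.
    net = np.dot(feature_weights, machines[name])
    return 1.0 / (1.0 + np.exp(-net))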
When cast in this way, the multiarmed bandit can be related to other
learning problems, such as the reorientation task used to study spatial
representations [39]. In the reorientation task, an agent explores dif-
ferent locations, each describable as a feature set, with some features
present at multiple locations. Not all locations are rewarded, and the
agent must use feature sets to learn where rewards might be placed.
Perceptrons thus provide an opportunity to explore a featural elabora-
tion of the multiarmed bandit, relating it to a broader set of learning
paradigms than has been previously considered.
This featural elaboration of choice tasks such as the multiarmed
bandit is also crucial to comparing perceptrons to other models, such
as genetic algorithms [26]. In the perceptron, there is a very limited
behavioral repertoire (i.e., choose or not choose), but behavior can in principle be selected by a potentially huge variety of stimuli. In con-
trast, the genetic algorithm has an enormous variety of behaviors to
select from, but these are selected randomly without considering stim-
ulus properties. The fact that these two different approaches produce
matching is very interesting, and raises the possibility that they repre-
sent complementary mechanisms for generating such behavior as prob-
ability matching.
REFERENCES
[1] R. J. Herrnstein, “Relative and absolute strength of response as a func-
tion of frequency of reinforcement,” J. Exp. Anal. Behav., vol. 4, pp.
267–272, 1961.
[2] N. Vulkan, “An economist’s perspective on probability matching,” J.
Econom. Surv., vol. 14, pp. 101–118, Feb. 2000.
[3] J. C. Gittins, Multi-Armed Bandit Allocation Indices. Chichester,
U.K.: Wiley, 1989.
[4] J. J. McDowell, “On the classic and modern theories of matching,” J.
Exp. Anal. Behav., vol. 84, pp. 111–127, Jul. 2005.
[5] P. de Villiers and R. J. Herrnstein, “Toward a law of response strength,”
Psychol. Bull., vol. 83, pp. 1131–1153, 1976.
[6] M. Davison and D. McCarthy, The Matching Law: A Research Review. Hillsdale, NJ: L. Erlbaum, 1988.
[7] P. de Villiers, “Choice in concurrent schedules and a quantitative formulation of the law of effect,” in Handbook of Operant Behavior, W. K. Honig and J. E. R. Staddon, Eds. Englewood Cliffs, NJ: Prentice-Hall, 1977, pp. 233–287.
[8] R. J. Herrnstein, The Matching Law: Papers in Psychology and Economics. New York: Harvard Univ. Press, 1997.
[9] R. J. Herrnstein and D. H. Loveland, “Maximizing and matching on
concurrent ratio schedules,” J. Exp. Anal. Behav., vol. 24, pp. 107–116,
1975.
[10] F. Rosenblatt, “The perceptron: A probabilistic model for informa-
tion storage and organization in the brain,” Psychol. Rev., vol. 65, pp.
386–408, 1958.
[11] F. Rosenblatt, Principles of Neurodynamics. Washington, DC:
Spartan Books, 1962.
[12] M. R. W. Dawson, Minds and Machines: Connectionism and Psychological Modeling. Malden, MA: Blackwell, 2004.
[13] M. R. W. Dawson, “Connectionism and classical conditioning,” Comparat. Cogn. Behav. Rev., vol. 3, Monograph, pp. 1–115, 2008.
[14] M. R. W. Dawson, Connectionism: A Hands-on Approach, 1st ed. Malden, MA: Blackwell, 2005.
[15] M. E. Fischer, P. A. Couvillon, and M. E. Bitterman, “Choice in honey-
bees as a function of the probability of reward,” Animal Learn. Behav.,
vol. 21, pp. 187–195, Aug. 1993.
[16] T. Keasar, E. Rashkovich, D. Cohen, and A. Shmida, “Bees in two-
armed bandit situations: Foraging choices and possible decision mech-
anisms,” Behav. Ecol., vol. 13, pp. 757–765, Nov.-Dec. 2002.
[17] N. Longo, “Probability-learning and habit-reversal in the cockroach,”
Amer. J. Psychol., vol. 77, pp. 29–41, 1964.
[18] Y. Niv, D. Joel, I. Meilijson, and E. Ruppin, “Evolution of reinforcement learning in uncertain environments: A simple explanation for complex foraging behaviors,” Adapt. Behav., vol. 10, pp. 5–24, 2002.
[19] E. R. Behrend and M. E. Bitterman, “Probability-matching in the fish,”
Amer. J. Psychol., vol. 74, pp. 542–551, 1961.
[20] K. L. Kirk and M. E. Bitterman, “Probability-learning by the turtle,” Science, vol. 148, pp. 1484–1485, 1965.
[21] V. Graf, D. H. Bullock, and M. E. Bitterman, “Further experiments on
probability-matching in the pigeon,” J. Exp. Anal. Behav., vol. 7, pp.
151–157, 1964.
[22] W. K. Estes and J. H. Straughan, “Analysis of a verbal conditioning situation in terms of statistical learning theory,” J. Exp. Psychol., vol. 47, pp. 225–234, 1954.
[23] M. A. L. Thathachar and P. S. Sastry, “A new approach to the design of
reinforcement schemes for learning automata,” IEEE Trans. Syst. Man
Cybern., vol. SMC-15, no. 1, pp. 168–175, Feb. 1985.
[24] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996.
[25] M. R. W. Dawson and D. P. Schopflocher, “Autonomous processing in
PDP networks,” Philosoph. Psychol., vol. 5, pp. 199–219, 1992.
[26] J. J. McDowell, “Computational model of selection by consequences,” J. Exp. Anal. Behav., vol. 81, pp. 297–317, May 2004.
[27] J. J. McDowell and Z. Ansari, “The quantitative law of effect is a ro-
bust emergent property of an evolutionary algorithm for reinforcement
learning,” in Advances in Artificial Life. Cambridge, MA: MIT Press,
2005, vol. 3630, pp. 413–422.
[28] J. J. McDowell and M. L. Caron, “Undermatching is an emergent
property of selection by consequences,” Behav. Processes, vol. 75, pp.
97–106, Jun. 2007.
[29] J. J. McDowell, P. L. Soto, J. Dallery, and S. Kulubekova, M. Kei-
jzer, Ed., “A computational theory of adaptive behavior based on an
evolutionary reinforcement mechanism,” in Proc. Conf. Genetic Evol.
Comput., New York, 2006, pp. 175–182.
[30] M. Piattelli-Palmarini, “Evolution, selection and cognition: From
“learning” to parameter setting in biology and in the study of lan-
guage,” Cognition, vol. 31, pp. 1–44, 1989.
[31] J. W. Donahoe and D. C. Palmer, Learning and Complex Behavior.
Boston, MA: Allyn and Bacon, 1994.
[32] R. B. T. Lowry and M. R. W. Dawson, “Connectionist selectionism: A case study of parity,” Neural Inf. Process.—Lett. Rev., vol. 9, pp. 59–67, 2005.
[33] R. J. Herrnstein, “Derivatives of matching,” Psychol. Rev., vol. 86, pp. 486–495, 1979.
[34] R. S. Sutton and A. G. Barto, “Toward a modern theory of adaptive
networks: Expectation and prediction,” Psychol. Rev., vol. 88, pp.
135–170, 1981.
[35] D. R. Shanks, The Psychology of Associative Learning. Cambridge,
U.K.: Cambridge Univ. Press, 1995.
[36] D. Danks, “Equilibria of the Rescorla-Wagner model,” J. Math. Psychol., vol. 47, pp. 109–121, Apr. 2003.
[37] R. S. Sutton and A. G. Barto, Reinforcement Learning. Cambridge,
MA: MIT Press, 1998.
[38] W. M. Baum, “On two types of deviation from the matching law:
Bias and undermatching,” J. Exp. Anal. Behav., vol. 22, pp. 231–242,
1974.
[39] K. Cheng and N. S. Newcombe, “Is there a geometric module for spa-
tial orientation? Squaring theory and evidence,” Psychonom. Bull. Rev.,
vol. 12, pp. 1–23, Feb. 2005.