Reinforcement Learning in Contests

We study contests as an example of winner-take-all competition with a large, linearly ordered strategy space. We study a model in which each player optimizes the probability of winning above some subjective threshold. The environment we consider is one of limited information, where agents play the game repeatedly and know only their own efforts and outcomes. Players learn through reinforcement. Predictions are derived from the model dynamics and asymptotic analysis. The model is able to predict individual behavioral regularities found in experimental data and to track behavior at the aggregate level with reasonable accuracy.
Vikas Chaudhary
1 Introduction
We hypothesize that in winner-take-all contests agents optimize their subjective threshold
probability of winning when they have limited information about their environment.
Applications of winner-take-all contests include lobbying, political
polls, sports tournaments, patent races and pharmaceutical R&D. A contemporary
example of such a limited information environment is remote working in
many organizations.
A number of experiments in decision theory provide evidence
that subjects prioritize the probability with which they win, as
opposed to being expected utility maximizers. Roy (1952) introduced the “safety
first” principle where the investor chooses a portfolio such that it minimizes the
probability of returns going below the minimum rate of return (this minimum in
winner-take-all contests can be winning the contests). In a similar vein, Edwards
(1953) designed lotteries that have the same expected value to understand what
makes subjects deviate from the expected value. His findings point to an overall
preference for the probability of winning as middle gambles are more frequently
chosen. Likewise, evidence is found in Edwards (1954) and Tversky (1969).
Slovic and Lichtenstein (1968) find that agents place greater importance on the
probability of winning and payoffs rather than other factors in risky choices.
Lopes (1995) surveys literature that suggests that agents give importance to
the probability of achieving aspiration levels (this aspiration in winner-take-all
contests can be winning the contests). She rightly incorporates this as one of
the criteria in her SP/A dual theory of decision-making in risky choices. Payne
(2005) finds that a substantial proportion of agents make choices to increase their
probability of positive payoff (or decrease their probability of negative payoff) in
value allocation tasks. Equally, Venkatraman, Payne and Huettel (2014) use the
value allocation task in multi-outcome gambles involving possibilities of both
gains and losses and find that subjects often maximize the overall probability of
winning. Zeisberger (2016), in a series of experiments, notably finds that people
pay explicit attention to the probability of losing. Subjects’ willingness to take
risks and choice behavior is considerably influenced by loss probabilities, and
performance feedback seems unable to mitigate this effect.
One such winner-take-all contest is the Tullock contest (Tullock, 1980). In
the standard Tullock contest, the probability of winning of a player is given by
$P(a) = \frac{a}{a+b}$, where $a$ is the amount of effort one player has put in and $b$ is the
amount of effort the other player has put in (if $a + b = 0$ then $P(a) = P(b) = 0.5$). The
optimal effort for both players can be calculated from the Nash equilibrium.
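As a minimal sketch of the success function just described, the following Python snippet computes winning probabilities for integer efforts, including the equal-split rule when total effort is zero. The function names are illustrative. The symmetric benchmark effort $(m-1)V/m^2$ for $m$ risk-neutral players with linear costs and a prize $V$ is the textbook result for the standard model; it is an assumption added here and is not a formula stated in this text.

```python
from fractions import Fraction

def tullock_win_probs(efforts):
    """Winning probability of each player given integer efforts.

    If nobody exerts effort, every player wins with equal probability.
    """
    total = sum(efforts)
    if total == 0:
        return [Fraction(1, len(efforts))] * len(efforts)
    return [Fraction(e, total) for e in efforts]

def symmetric_nash_effort(prize, m):
    """Textbook symmetric equilibrium effort (assumption: risk-neutral
    players, linear effort cost, m players, common prize)."""
    return Fraction(prize * (m - 1), m * m)
```

For two players and a prize of 100, this benchmark gives an effort of 25 each.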
We are interested in competitive situations which can be represented by
Tullock contests where (almost) fixed players play the game repeatedly.
Agents are not able to see the effort of others or even the aggregate effort; what
they know is their own previous effort and outcome in terms of winning or losing.
This can serve as a base case for real-world situations where signals are not good
enough to form beliefs about others (for example, employees working remotely
in large departments). Such limited information environments feature
stochastic behavior by other players, a low chance of positive reinforcement
(success) at lower bid amounts (as predicted by the standard model) that
decreases further as the number of players increases, and a large action space that allows
agents to vary their probability of winning. These features suggest that agents would play games
in such environments differently.
In the literature, foresight-free stochastic learning models (Roth and Erev, 1995;
Camerer and Ho, 1999) have successfully explained many repeated games where
agents learn to make choices predicted by Nash equilibrium starting from ran-
dom choices. Sobel (2000) discusses the conditions which enable agents to learn
the equilibrium in the game. It states that when an agent starts with a sensible
model, has a stationary environment and it is costless to obtain and process
information then eventually agents learn enough about the environment to make
optimal decisions. These conditions do not hold for the environment of interest
in this chapter. The information is limited, the environment is not stationary
and it is costly to obtain information (since losing is costly). Erev and Roth
(2007) experimentally find that foresight-free learning models fail to explain
the data in individual choice tasks which suggests high sensitivity to feedback.
Payoff variability slows learning in games, and this effect can be considerably
magnified in multiplayer games with large strategy space. Charness and Levin
(2005) experimentally find that agents make better decisions when the direction
of Bayesian updating and reinforcement learning are the same. The situation
in our environment is that the agent should be bidding low, but the chances
of winning at lower efforts are reduced when other agents bid higher; hence
positive reinforcement is less likely. This may lead to agents moving towards
higher effort levels and staying there before dropping out. Teodorescu and Erev
(2014) experimentally demonstrate that it is not the uncontrollability of the
environment but rather the low frequency of reward which causes learned
helplessness. This finding may, for a large strategy space, translate into the agent
eventually dropping out after consecutively losing even after putting in a high
effort. In such an environment, repeated experience of outcomes may lead to
agents becoming risk-averse (March, 1996) and stabilizing at any effort (risk) level.
Conversely, it is also likely that some agents do not learn to be risk-averse as
evidence presented in one-armed restless bandit problems in Biele, Erev and Ert
Another type of learning theory, started by Reinhard Selten, is directional
learning (for example, Selten and Stoecker, 1986) where agents have a sequential
dependency in repeated decision tasks. Within this theory Ockenfels and Selten
(2005) formulate a quantitative model, impulse balance equilibrium, to explain
behavior in the first-price auction. An important factor is that losing in a
first-price auction is not costly. On this basis, Grosskopf (2003) discusses the need for
models that have both reinforcement and direction learning components to be
able to better explain experimental evidence. She rightly emphasizes this because
such models can be important when losing is costly, as reinforcement would
also play a role. In the case of Tullock contests, the probability of any effort
level leading to winning is linearly ordered. This means that if one strategy has
worked then it is not the case that all other strategies would not have worked.
All other strategies above the chosen strategy must have worked and maybe
some strategies below it might have worked too. Similarly, if one strategy has
not worked last period then definitely all the below strategies would not have
worked and maybe some strategies just above the chosen strategy might not have
worked either. This is the argument we put forward as strategy similarity[1] (Sarin
and Vahid, 2004; Goldstone and Son, 2012). This means that the model has a
feature of direction learning where the agent ex-post looks at what might have
been better last time and adjusts the decision in this direction. What the agent
will be learning is her subjective probability of each strategy to win the game.
The agent, depending on whether she has won or lost in the last period, can
adapt in that direction.
[1] The three parameters of similarity can be: 1) how far a strategy is located in strategy space
from the last period strategy, 2) whether a strategy is above or below the last period chosen
strategy and 3) whether the agent has won or lost in the last period.
We introduce a decision-making approach that agents possibly follow in this
highly uncertain environment. The limited information available to agents in
the environment under consideration is their own efforts and outcomes. The
assumption is that they start with the effort which, as per their initial subjective
probabilities of winning, gives them their target probability of winning. The
players play the game with one period in mind. At each period, the agent
chooses the lowest strategy which offers them their target probability of winning
without taking into account the possible future times she may face the same game
situation again. Such isolated decision-making could be due to short-sightedness
or believing that the game is not stationary. Players may be approaching the
game thinking that to be able to make a profit one has to win. On this basis, after
choosing their strategy the agents receive the outcome in terms of winning or
losing the game. They incorporate the experience they receive in the last period
in their decision for the next period. The agent uses this outcome information
to update her subjective probabilities of winning for some of the strategies.
The strategy similarity argument is used where agents ex-post update the
similar strategies similarly. This means that the agent treats strategies that are
a few steps away similar to the strategy she played last period. If she wins she
increases the subjective probabilities of winning for all the strategies above the
chosen strategy and some strategies below it. Similarly, if she loses then she
decreases her subjective assessment of the probability of winning for all the
strategies below it and a few strategies above it. The agent does not necessarily
update all the strategies, but she updates strategies for more than just the
one chosen. This similarity-based updating results in the agent switching to
the strategy immediately below after consecutive wins on any strategy. Likewise, she
switches to the strategy immediately above after consecutive losses. The learning
speed is exogenous, symmetric (the following analysis assumes this, but the
model and framework of the asymptotic analysis are general enough for any pair
of asymmetric learning speeds) for winning and losing, and does not change
with time.
The model predictions are that agents decrease effort after winning, increase
effort after losing, and drop out after consecutive losses in the direction of increasing
effort. Thus intermediate- and long-run aggregate efforts can be probabilistic in
terms of over- or under-dissipation.
Some results found in the experimental literature on contests run in a similar
information setting suggest that our model’s predictions can perform reasonably
well in tracking observed behavior. This two-parameter learning model has
features of reinforcement and direction learning and can provide a more plausible
explanation for some of the experimentally observed behavioral regularities
(increasing effort after losing, decreasing effort after winning, dropping out
after consecutive losses, and over-dissipation at the aggregate level) in Tullock
contests. Based on the proposed model of behavioral decision-making, the artifi-
cial agents are simulated in the repeated contest with the same settings as in the
experiment. The model can capture the frequent individual and approximate ag-
gregate behavior over the period. This behavioral model will be analyzed via the
discrete-time finite-space time-homogeneous Markov process. It will be shown
that the game reaches an absorbing state where one agent wins with minimal
effort and others drop out. If there is a positive probability that agents bounce
back after dropping out, then the game reaches a limiting distribution.
This chapter presents a reinforcement learning model that has
features of direction learning and contributes to the literature on learning in
games. This chapter adds a new dimension to decision making in winner-take-all
repeated games[2]. This model is for contests and hence contributes to the literature
on contests. This chapter adds to the economics literature on similarity (for
example, Sarin and Vahid, 2004; Grosskopf, Sarin and Watson, 2015). It adds
to the literature that attempts to explain dropout behavior in winner-take-all
competitions (for example, Muller and Schotter, 2010). In the environment
considered, it can explain behavioral regularities such as increasing effort
after losing, decreasing effort after winning, dropping out and overall
over-dissipation in a dynamic game. The strength of the contribution lies in the fact
that the model can explain all the behavioral regularities together, which does
not seem possible using the predominant theories (as discussed in Section 6,
Experimental Evidence).
The remainder of the chapter is organized as follows. Below are definitions
of some terms. Section 2 briefly outlines the static equilibrium characterization.
Section 3 introduces the basic learning model. Section 4 provides an analysis
of the model for Tullock contests. Section 5 states model predictions. Section 6
provides simulations of the models and compares them to experimental findings.
Section 7 states the applicability of the learning model to the all-pay auction.
Section 8 concludes with further discussion of the model.
[2] To our knowledge, this is the first model to bring in the probability of winning as an
independent criterion of decision making in winner-take-all competitions. An early version of
this chapter was presented at the GW4 Game Theory Workshop in May 2016 and at the conference
Contests: Theory and Evidence in June 2017.
Target Probability of Winning ($\alpha_i$): This is the exogenous probability of winning that agent $i$
aims to achieve. The agent trades off cost against the probability of winning
to decide how much effort to expend to have a sufficient chance
of winning the game with a positive profit. This is considered the intrinsic
behavior of the agent for this game.
Over-dissipation: A game is considered to be over-dissipated if the aggregate
effort in the game is higher than that of the risk-neutral symmetric Nash equilibrium
of the standard model.
Strategy Set: It is a set containing all possible effort levels (pure strategies/actions)
an agent can choose from. Formally, a strategy set ($e$) with $n$ strategies is
$\{e_1, \dots, e_j, \dots, e_n\}$. An effort level $e_j$ chosen by agent $i$ is denoted as $e^i_j$.
The strategies are linearly ordered, i.e. $e_j > e_k\ \forall\, j > k$.
Subjective Probability of Winning (SPW): Agents have a subjective probability of winning
for every strategy (effort level). Formally, $p^i_j(t)$ is agent $i$'s subjective
probability of winning by playing the strategy $e_j$ at period $t$. This is not the
actual probability of winning, nor any function of the effort of the agent and
the efforts of other agents.[3]
Dropout: An agent is considered to have dropped out if she played for a few
consecutive periods and then does not enter the game for at least the next few
periods. Formally, $p^i_j < \alpha_i\ \forall\, e^i_j$.
Learning Speed: It is the rate at which agents update their SPWs every period.
Formally, if the learning speed of agent $i$ is $\lambda_i$ (between 0 and 1) then its
$p^i_j \in \{0, \lambda_i, 2\lambda_i, 3\lambda_i, \dots, 1\}\ \forall\, j$.
Bounce Back: An agent is considered to have bounced back if she puts in positive
effort after dropping out. Formally, bounce back is a process where the agent,
after dropping out, enters the game again with a new set of arbitrary SPWs; the
small probability of bouncing back to any arbitrary SPW in any
period after dropout is denoted as $\epsilon$. If $\epsilon > 0$, the agent restarts decision making
from the new SPWs, that is, she bounces back. Once agents have dropped out, they do not
bounce back unless there is an exogenous perturbation.
[3] A justification for such an approach is that a) in a severely limited information environment
(of knowing only their own effort and outcome) b) agents convert a strategic game into a decision
problem and c) utilize the linearly ordered effort space to form such beliefs.
2 Static Equilibrium Characterization
It is supposed that agents face the one-shot game and have a finite set of
effort levels ($e$), with $e_{\min}$ as the lowest effort level and $e_{\max}$ as the highest
effort level. This effort set is denoted as $e = \{e_{\min}, e_2, \dots, e_j, \dots, e_{\max}\}$,
where $e_j < e_k\ \forall\, j < k$.
For the above effort set $e$ let us define a binary relation $e_j \succsim e_k$ with different
possible cases.
Case C1: Both $e_j$ and $e_k$ are s.t. $p^i_j$ and $p^i_k \ge \alpha_i$; then $e_j \succsim e_k$.
Case C2: Both $e_j$ and $e_k$ are s.t. $p^i_j$ and $p^i_k < \alpha_i$; then $e_j \succsim e_k$.
Case C3: When $e_j$ and $e_k$ are s.t. $p^i_j < \alpha_i$ and $p^i_k \ge \alpha_i$; then $e_k \succsim e_j$.
The above binary relation is complete; now it needs to be shown that it is transitive
as well.
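Case T1: Let us take $e_j$, $e_k$ and $e_l$ s.t. $p^i_j, p^i_k, p^i_l \ge \alpha_i$; then $e_j \succsim e_k$ and $e_k \succsim e_l$
from Case C1 above. Similarly, from Case C1 above it is known that $e_j \succsim e_l$, hence
transitivity holds.
Case T2: Let us take $e_j$, $e_k$ and $e_l$ s.t. $p^i_j, p^i_k, p^i_l < \alpha_i$; then $e_j \succsim e_k$ and $e_k \succsim e_l$
from Case C2 above. Similarly, from Case C2 above it is known that $e_j \succsim e_l$, hence
transitivity holds.
Case T3: Let us take $e_j$, $e_k$ and $e_l$ s.t. $p^i_j, p^i_k < \alpha_i$ and $p^i_l \ge \alpha_i$; then $e_j \succsim e_k$
from Case C2 and $e_l \succsim e_k$ from Case C3 above. Similarly, from Case C3 above it
is known that $e_l \succsim e_j$, hence transitivity is not violated.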
From Case C2 and T2 above it can be seen that for any effort set $e$ such that
$p^i_j < \alpha_i\ \forall\, e_j$ in $e = \{e_{\min}, e_2, \dots, e_j, \dots, e_{\max}\}$, $e_{\min}$ is the choice.
The above binary relation is found to be complete and transitive. Therefore it is
a preference relation. For the setup (effort levels are distinct and ordered), the
independence of irrelevant alternatives axiom is satisfied and the indifference axiom of
choice correspondence is not applicable. From this, it can be said that the preference
relation has a utility representation and rationalizes the decision rule below.
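Decision Rule: The agent $i$'s problem is:
$\min_j e^i_j$ s.t. $p^i_j \ge \alpha_i$;
otherwise, if $\nexists\, j$ s.t. $p^i_j \ge \alpha_i$, then $e^i_j = e_{\min}$,
where $e^i_j$ is the effort choice of agent $i$ and $p^i_j$
is the subjective probability with which agent $i$ thinks to win by putting in
effort $e^i_j$.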
Below, a structure is laid out for a pure strategy static equilibrium for different
cases of $\sum_{i=1}^{m} \alpha_i$. Agents know the number of players, $m$, in the game.
This decision rule can be understood as the best response function where these
types of agents have converted a game into a decision problem.
Case 1: If $\forall i$, $\alpha_i \le \frac{1}{m}$ and $\sum_{i=1}^{m} \alpha_i \le 1$ then $e^i_j = e_{\min}$ is a stable point.
It holds in case $e_{\min} = 0$. This is because in Tullock contests if the total effort is 0 then the
probability of winning for each agent is $\frac{1}{m}$. This means that agents will achieve
their minimum probability at the minimum possible effort level and hence
do not have an incentive to unilaterally deviate. The argument for the case when
$e_{\min} > 0$ is trivial as, $\forall i$, if $e^i_j = e_{\min}$ then the probability of winning for each agent is
$\frac{1}{m}$ and there is no incentive for an agent to increase their costly effort level.
Case 2: If $\sum_{i=1}^{m} \alpha_i \le 1$ then at least one stable point exists.
If $\sum_{i=1}^{m} \alpha_i < 1$ and $\exists\, i$ s.t. $\alpha_i > \frac{1}{m}$ then at least one stable point exists.
Let us take an example with two agents having $\alpha$'s of 0.1 and
0.6, respectively. Agent 1 chooses the minimum non-zero effort level, which is 1.
This means Agent 2 chooses the effort level of 2. So, the effort tuple (1,2) is a stable
point as it gives Agent 1 a 33.3% winning probability and Agent 2 a 66.7% winning probability. Note that
Agent 1 cannot go below this effort level. This can be generalized.
As another example, take 2 agents having $\alpha$'s of 0.4 and 0.55,
respectively. In this case, the stable point is (2,3). This can be generalized.
As another example, take 3 agents having $\alpha$'s of 0.05, 0.30 and
0.60, respectively. Agent 1 chooses the minimum non-zero effort level, which is
1. Then Agent 2 chooses 6 and Agent 3 chooses 11. Hence, (1,6,11) is a stable point.
This can be generalized.
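The examples above can be checked mechanically. The sketch below assumes integer effort levels running from a hypothetical `e_min = 1` to `e_max = 100`, and treats an effort tuple as stable when each agent's effort is the lowest one that meets her TPW under the Tullock success function, given the efforts of the others. The function names are illustrative and do not come from the source.

```python
from fractions import Fraction

def lowest_effort_meeting_target(others_total, alpha, e_min, e_max):
    """Smallest integer effort e in [e_min, e_max] with
    e / (e + others_total) >= alpha, or None if no effort qualifies.
    Assumes e_min >= 1 so the denominator is never zero."""
    for e in range(e_min, e_max + 1):
        if Fraction(e, e + others_total) >= alpha:
            return e
    return None

def is_stable_point(efforts, alphas, e_min=1, e_max=100):
    """True if every agent already plays her lowest target-meeting effort."""
    total = sum(efforts)
    return all(
        lowest_effort_meeting_target(total - e, a, e_min, e_max) == e
        for e, a in zip(efforts, alphas)
    )

# The three examples from the text, with targets as exact fractions.
examples = [
    ((1, 2), (Fraction("0.1"), Fraction("0.6"))),
    ((2, 3), (Fraction("0.4"), Fraction("0.55"))),
    ((1, 6, 11), (Fraction("0.05"), Fraction("0.30"), Fraction("0.60"))),
]
results = [is_stable_point(e, a) for e, a in examples]
```

Exact fractions are used so that boundary cases such as 2/5 against a target of 0.4 are compared without floating-point rounding.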
If $\sum_{i=1}^{m} \alpha_i = 1$ then a stable point may or may not exist.
Let the $\alpha$'s be expressed as a ratio $R$ ($= (R_1, \dots, R_i, \dots, R_m)$) s.t. the highest common factor of
$R_1, \dots, R_i, \dots, R_m$ is 1. For simplicity, the assumption made is that the $\alpha$'s
when expressed as percentages are integers. Then
$r \cdot R$ are all NE $\forall\, r \in \mathbb{Z}^+$ s.t. $e_{\min} \le r \cdot R_i \le e_{\max}\ \forall i$ and $\sum_{i=1}^{m} R_i \le e_{\max}$.[4]
For example, take 2 agents having $\alpha$'s of 0.4 and 0.6, respectively.
The corresponding $R = (2, 3)$. Then $\{(2,3), (4,6), \dots, (40,60)\}$ is a set of stable points
for $e_{\max} = 100$.
Case 3: If $\sum_{i=1}^{m} \alpha_i > 1$ then no stable point exists.
For example, take 2 agents having $\alpha$'s of 0% and 101%, respectively.
Then the second agent can never attain her TPW because even if agent 1
chooses the lowest effort of 0, the maximum probability with which agent 2 can
win is 100%. Similarly, if agents have $\alpha$'s of 0.2 and 0.9 then (0,0) as an equilibrium
effort is ruled out. Say agent 1's chosen effort equals 1; then the minimum
effort agent 2 will choose is 9, which does not satisfy agent 1's TPW. So, agent
1 will (virtually) respond by choosing 2.25 (for the purpose of argument, say
efforts are allowed to be real numbers). Likewise, the virtual responses can
be calculated and it can be seen that there is no effort tuple that satisfies the
TPWs of both agents.
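The two-agent argument can be written out as a short derivation. This is a sketch under the assumptions already used in the example (two agents, real-valued positive efforts, targets strictly below 1), reading the targets in the second example as 0.2 and 0.9, which is the reading consistent with the stated responses of 9 and 2.25.

```latex
% Two agents, real-valued efforts e_1, e_2 > 0, targets \alpha_1, \alpha_2 < 1.
\begin{align*}
\frac{e_1}{e_1 + e_2} \ge \alpha_1 &\iff e_1 \ge \frac{\alpha_1}{1-\alpha_1}\, e_2, \\
\frac{e_2}{e_1 + e_2} \ge \alpha_2 &\iff e_2 \ge \frac{\alpha_2}{1-\alpha_2}\, e_1. \\
\intertext{Chaining the two inequalities, both can hold only if}
\frac{\alpha_1}{1-\alpha_1} \cdot \frac{\alpha_2}{1-\alpha_2} \le 1
&\iff \alpha_1 \alpha_2 \le (1-\alpha_1)(1-\alpha_2)
\iff \alpha_1 + \alpha_2 \le 1. \\
\intertext{With $\alpha_1 = 0.2$ and $\alpha_2 = 0.9$ the multipliers are $1/4$ and $9$, so the
virtual responses grow by a factor of $9/4$ each round:}
e_1 = 1 \;\to\; e_2 = 9 \;&\to\; e_1 = 2.25 \;\to\; e_2 = 20.25 \;\to\; \cdots
\end{align*}
```

The responses therefore never settle, which is the two-agent version of the claim that no stable point exists when the targets sum to more than 1.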
[4] The latter condition may not be true as such; for example, in the example above, say
$R_1 = 49$ and $R_2 = 51$ and $e_{\max} < 100$. Here, no stable point will exist.
3 Basic Learning Model
A decision-maker $i$ chooses from a finite set of effort levels (pure strategies/actions)
$\{e_1, e_2, \dots, e_j, \dots, e_n\}$. The effort levels are linearly ordered. The cost of
$e_j$ is greater than the cost of $e_k$ $\forall\, j > k$. Every effort level has a probability of winning given by
$\frac{e^i}{\sum_{k=1}^{m} e^k}$, where $m$
is the number of players in the game.[5] The game is winner-take-all. If an
agent puts in the effort of $e_j$ then it costs her $e_j$ regardless of whether she
wins or loses. In the case that she wins, she receives the full prize money, which
equals the highest possible effort level $e_n$. The assumption is that agents face the
same game repeatedly with one repetition of the decision problem per unit of
time (period). Agents know the number of players in the game. We formulate
a decision-making approach in these games; agents are not identical, and they
have their own characteristics. Agents have a subjective probability of winning (SPW)
for each effort level, denoted, for agent $i$, by $P^i = \{p^i_1, \dots, p^i_j, \dots, p^i_n\}$.
For every agent, based on its learning speed $\lambda_i$ (between 0 and 1), its
$p^i_j \in \{0, \lambda_i, 2\lambda_i, \dots, 1\}\ \forall\, j$.
They do not explicitly[6] form beliefs about the effort levels of other agents. These
are their perceptions rather than any estimation or actual probability of winning.
The initial SPWs are denoted by $P^i(0)$, which may have been formed by hearsay,
strategy labels, or other factors. The assumption is that $P^i(0)$ is given. Since the
effort levels are linearly ordered, so are the SPWs, that is $p^i_j \ge p^i_k\ \forall\, j > k\ \forall\, i$.
Agents are not driven by expected payoffs but rather by their exogenous target
probability of winning (TPW)[7]. The model does not say how agents form their
target probability of winning. This is their approach to choosing the effort level to
be played: a strategy that gives them a “good enough” chance of winning. Their
TPW is denoted, for agent $i$, by $\alpha_i$.
The agent chooses the smallest effort level which can ensure her TPW. On
the one hand, agents know their own effort level last period and the outcome. On
the other hand, they know neither the effort of the other agents in the game nor
the outcome of other agents. They are playing the game with one period in mind
and use their experience in the following period to update their SPWs. Let $q$ and $r$
be positive integers. If they win, they increase their SPWs for all the effort
levels above the chosen effort level and for $q$ effort levels below the chosen one. If they
lose, they decrease their SPWs for all the effort levels below the chosen effort
level and for $r$ effort levels above the chosen one. The effort level zero is considered as
dropout, which an agent (passively) selects when no SPW satisfies her TPW.
[5] This is the standard Tullock contest other than that the agent is not allowed to choose the zero effort
level as an active strategy. The probability of winning will change for all-pay auctions, but the
basic learning model does not rely on this probability.
[6] It cannot be ruled out that agents implicitly form beliefs about the effort levels of other agents.
SPWs can be an outcome of their implicit beliefs about other agents’ efforts. The model does not
say how these SPWs are formed. What we are saying is that they do not approach the game as
fictitious play where they best respond based on beliefs about other agents.
[7] In our model only TPW and cost matter.
The basic model can be divided into two qualitatively different models. One
is where $\epsilon = 0$; this is termed the unperturbed model, where agents do not bounce
back. The other is where $\epsilon > 0$; this is termed the perturbed model, where agents
do bounce back with some positive probability to an arbitrary new set of
SPWs. Both models will be analyzed for their asymptotic results. The model
can be stated as below.
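Initial Values: $P^i(0)$
If $p^i_j(t) < \alpha_i\ \forall\, j \in \{0, 1, 2, \dots, n\}$ then:
$p^i_j(t+1) = p^i_j(t) \in \{0, \lambda_i, 2\lambda_i, \dots, 1\}\ \forall\, j$, and
Unperturbed Model:
$\Pr(\text{at least one } l \in \{0, 1, 2, \dots, n\} \text{ s.t. } p^i_l(t+1) \ge \alpha_i) = 0$
Perturbed Model:
$\Pr(\text{at least one } l \in \{0, 1, 2, \dots, n\} \text{ s.t. } p^i_l(t+1) \ge \alpha_i) = \epsilon$,
where $0 < \epsilon \le 1$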
The initial SPWs for each strategy could be the outcome of an agent’s perception
or previous experiences in similar situations. The model does not say how these
initial SPWs are formed or what factors affect them, including the environment
of the game. Given that the strategies are ordered linearly, so are the SPWs:
$p^i_j(t) \ge p^i_k(t)$ if $j \ge k$ (2)
Decision Rule: The agent $i$'s problem every period $t$ is:
$\min_j e^i_j$ s.t. $p^i_j(t) \ge \alpha_i$
If agent iwins after playing strategy k:
j(t+ 1) =
min pi
j(t) + λi,1for jkq
j(t),otherwise (4)
If agent iloses after playing strategy k:
j(t+ 1) =
max pi
j(t)λi,0for jk+r
j(t),otherwise (5)
Boundary Conditions:
It is assumed that the agents cannot move below the smallest positive effort level.
This means that when an agent has reached the smallest positive effort level, she
just updates her SPWs but does not make a decision to go further downwards
based on any new SPWs. One can also say that the zero effort level is not part of
the strategy space; rather, it is considered as the condition of dropout. Similarly,
once the agent has dropped out she does not move to any positive effort level
unless there is an exogenous perturbation.
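The decision and updating rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: SPWs are stored as exact fractions in a list indexed from 0 (index j stands for the (j+1)-th positive effort level), and the function names and the defaults q = r = 1 are choices made here.

```python
from fractions import Fraction

def choose_effort_index(spw, alpha):
    """Decision rule: index of the lowest effort level whose SPW meets the
    target alpha, or None when no level qualifies (dropout)."""
    for j, p in enumerate(spw):
        if p >= alpha:
            return j
    return None

def update_spw(spw, k, won, lam, q=1, r=1):
    """Updating rule after playing index k.

    Win:  raise SPWs by lam (capped at 1) for all j >= k - q.
    Loss: lower SPWs by lam (floored at 0) for all j <= k + r.
    Other entries are left unchanged.
    """
    new = list(spw)
    for j in range(len(spw)):
        if won and j >= k - q:
            new[j] = min(spw[j] + lam, Fraction(1))
        elif not won and j <= k + r:
            new[j] = max(spw[j] - lam, Fraction(0))
    return new

# Hypothetical starting point: four effort levels, target 1/2, step 1/10.
spw0 = [Fraction(k, 10) for k in (2, 4, 6, 8)]
alpha, lam = Fraction(1, 2), Fraction(1, 10)
k0 = choose_effort_index(spw0, alpha)
after_win = update_spw(spw0, k0, True, lam)
after_loss = update_spw(spw0, k0, False, lam)
after_two_losses = update_spw(after_loss, k0, False, lam)
```

In this particular example a single win moves the agent one level down, while it takes two consecutive losses to move her one level up; how many wins or losses are needed in general depends on the starting SPWs, the target and the step size.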
$P^i(t)$ is updated every period based on the outcome, that is, a win or a loss. The
SPW update on strategies starts from $q$ strategies below and $r$ strategies above the
last period strategy, in the case of winning and losing respectively. The updating
rule above states how an agent transitions from one strategy to another based on
the outcome, given that all strategies are not exhausted, which means that the SPW
for at least the highest strategy is above the threshold. When all strategies are
exhausted, the agent has a small probability $\epsilon$ of bouncing back to any arbitrary
SPW and restarts decision making as stated below.
4 Asymptotic Analysis -Tullock Contests
We are interested in understanding the asymptotic state of the game. Given that the SPWs
depend on their immediately previous values, they serve as the state space in a
discrete-time, finite-state Markov chain.[8] The state space can be defined as
$SPW = \{p^1_1, \dots, p^1_n, \dots, p^N_1, \dots, p^N_n\}$. If the learning speed of an agent is high, then there
will be fewer elements in the SPW set of that agent. The transition probability
will depend on the SPW profile and target probabilities of all agents, which will
not change with time. This means that the Markov process is time-homogeneous.
A simplifying assumption made is that $q = r = 1$.
[8] Note that this is a non-standard Markov model where TPW cannot be modeled as part of the
transition matrix. The complexity of the setup makes deriving analytical results difficult due to
intractability. Indeed, no literature was found providing a method of analysis for such a process.
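To make the remark about learning speed and the size of the SPW set concrete, the sketch below counts, for a single agent, how many weakly ordered SPW vectors exist on the grid $\{0, \lambda, 2\lambda, \dots, 1\}$. It assumes that $1/\lambda$ is a whole number and that SPWs are non-decreasing in the effort level; the function names are illustrative, and the count uses the standard multiset formula rather than anything stated in the source.

```python
from fractions import Fraction
from math import comb

def spw_grid(lam):
    """Return the grid {0, lam, 2*lam, ..., 1}; assumes 1/lam is an integer."""
    lam = Fraction(lam)
    steps = 1 / lam
    if steps.denominator != 1:
        raise ValueError("1/lam must be an integer for the grid to end at 1")
    return [k * lam for k in range(int(steps) + 1)]

def count_ordered_spw_vectors(lam, n):
    """Number of non-decreasing SPW vectors of length n on the grid
    (multisets of size n drawn from the grid values)."""
    return comb(len(spw_grid(lam)) + n - 1, n)

slow = count_ordered_spw_vectors(Fraction(1, 10), 5)  # fine grid, slow learner
fast = count_ordered_spw_vectors(Fraction(1, 4), 5)   # coarse grid, fast learner
```

A faster learner has a coarser grid and therefore far fewer possible SPW vectors, which keeps the joint state space finite and smaller.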
Absorbing State: In this model, the term absorbing state is used in its usual meaning
with a specific[9] description that in this state one agent wins by putting in
minimum effort and the other agents have dropped out, and hence no agent changes
her SPWs thereafter.
Winning Set: The set $A_i$ which has agent $i$ as the final winner with all possible
dropout SPWs ($p^i_j = 1$, $p^k_j < \alpha_k\ \forall\, k \ne i$)[10] for the other agents who have
dropped out is called ‘winning set $i$’.
Unperturbed Model:
Lemma 1: From an initial state where all agents have their SPW above their TPW for at
least one of the effort levels, there is a positive probability that the process is absorbed
in any of the winning sets.
Proof: If all agents have at least one SPW above the threshold, then all agents
will put in a positive effort, which gives a positive probability for every agent
to win. There is a positive probability that agent $i$ wins every period, which
means that there is a positive probability that the process will be absorbed in
the winning set $A_i$.
Lemma 2: In the unperturbed model, from every state, there is a possible path to
reach at least one of the winning sets.
Proof: From Lemma 1 it is known that this is true for any state where all the
agents have SPW for at least one strategy above the threshold. A similar argument
can be made for any general state. For any given state, consider those
agents who have SPW for at least one strategy above the threshold[11]. The way
the model has been defined, not all the agents will drop out. All these
agents will put in a positive effort and will have a positive probability of winning
in this period. There is a positive probability that any one of these agents wins
every following period, which means that there is a positive probability that the
process will be absorbed in the corresponding winning set.
[9] The specific description is only to bring contextual clarity; other than that, the term holds its
usual meaning.
[10] A general element in set $A_i$ can be written as $(1, \dots, <\alpha_k, \dots, <\alpha_k, <\alpha_l, \dots, <\alpha_l\ (n \text{ times}), \dots)$
(for a total of $m - 1$ players).
[11] A special initial condition can arise where all the SPWs for all the agents are below their
TPWs. Here it can be assumed that the game is restarted.
Proposition 1: If agents do not bounce back (unperturbed model) then the process
will be absorbed in one of the winning sets.
Proof: In Lemma 2 it is shown that from any state it is possible to reach at least
one of the absorbing states. There are only a finite number of states; hence, the
process will be absorbed.[12]
With the dropouts not bouncing back, the transition
matrix can be identified as that of an absorbing Markov process with at least $m$ (the number
of players) absorbing states, with at least one state associated with the dropout
of each agent. Using the theorem for absorbing Markov processes, it can
be stated that the process will be absorbed by one of the states in finite time.
This means that the aggregate effort in the game will decrease below the Nash
equilibrium of the standard model after some time. Depending on the relative
parameters of the agents in the group, there would be different trajectories of
aggregate effort with a different average effort (across periods) for each trajectory.
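The absorption argument can be explored numerically with a small simulation of the unperturbed model. The sketch below is an illustration under assumptions, not the simulation code used for the results reported here: it takes q = r = 1, stores SPWs as exact fractions, lets index j stand for effort level j + 1 (effort 0 means dropout), and draws the winner with probability proportional to effort. The prize does not appear because payoffs play no role in the decision rule. All names and parameter values are hypothetical.

```python
import random
from fractions import Fraction

def simulate_unperturbed(alphas, lams, init_spws, periods, seed=0):
    """Run the unperturbed learning model and return the effort history,
    one tuple of efforts per period."""
    rng = random.Random(seed)
    spws = [list(p) for p in init_spws]  # private copy per agent
    history = []
    for _ in range(periods):
        # Decision rule: lowest effort whose SPW meets the target, else 0.
        efforts = [
            next((j + 1 for j, p in enumerate(spw) if p >= alpha), 0)
            for spw, alpha in zip(spws, alphas)
        ]
        history.append(tuple(efforts))
        if sum(efforts) == 0:
            break  # nobody is active, so there is no contest to run
        # Tullock lottery: winning probability proportional to effort.
        winner = rng.choices(range(len(efforts)), weights=efforts)[0]
        for i, (spw, lam) in enumerate(zip(spws, lams)):
            if efforts[i] == 0:
                continue  # dropped-out agents never update (no bounce back)
            k = efforts[i] - 1
            for j in range(len(spw)):
                if i == winner and j >= k - 1:
                    spw[j] = min(spw[j] + lam, Fraction(1))
                elif i != winner and j <= k + 1:
                    spw[j] = max(spw[j] - lam, Fraction(0))
    return history

init = [Fraction(k, 10) for k in (1, 3, 5, 7, 9)]
history = simulate_unperturbed(
    alphas=[Fraction(1, 2)] * 2,
    lams=[Fraction(1, 10)] * 2,
    init_spws=[init, init],
    periods=500,
    seed=1,
)
active = [sum(e > 0 for e in row) for row in history]
```

Two properties follow directly from the rules and can be checked on any run: the number of active agents never increases, and at least one agent always stays active because the winner's SPWs only rise. Whether a given run has already been absorbed within the chosen horizon depends on the random draws.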
Proposition 2[13]: An agent’s dropout probability increases with an increase in her relative
$\alpha$ (TPW) and relative $\lambda$ (learning speed).
Proof: The above proposition is supported by the simulation results below.
1000 two-player games with different parametric values (generated randomly) are
simulated to study the significance of the relative values of parameters for an agent’s
probability of dropout. The parameters are classified into four categories as
shown in Table 2.1 below. The simulation results are shown in Table 2.2, from which
it is clear that as the agent’s relative TPW or relative learning speed increases,
the dropout probability of the agent increases. The results hold for any value of
the sum of the TPWs of the two agents.
Table 2.1: Parameter classification for a two-player game.
12 Refer to the Appendix for the mathematical theorem used.
Holt and Roth (2004) conclude with the following: “Looking ahead if game theory’s next 50
years are to be as productive, the challenges facing game theorists include learning to incorporate
more varied and realistic models of individual behavior into the study of strategic behavior and
learning to better use analytical, experimental, and computational tools in concert to deal with
complex strategic environments.” Given the non-standard modeling of learning in
Tullock contests as a Markov process, computational methods are called for to achieve
more insightful results. Our learning model and simulation results are in this spirit.
Table 2.2: Dropout frequency based on parameter classification for a two-player game.
Proposition 3: As the aggregate target probability of winning increases in the game,
agents drop out sooner.
Proof: The above proposition is supported by the simulation results. Tables
2.3 and 2.4 show the number of agents who drop out after a fixed period for
various values of aggregate target probability of winning. It is observed that as
the aggregate TPW increases the number of agents dropping out, within a fixed
time, increases.
Table 2.3: Number of dropouts after 1000 periods in a three-player game for various
values of aggregate TPW.
Table 2.4: Number of dropouts after 1000 periods in a three-player game for various
values of aggregate TPW.
Perturbed Model:
In the unperturbed model, we assumed that agents do not bounce back. However,
it is possible, and at times seen in real life, that agents who drop out bounce
back. This possibility is incorporated into the perturbed model analyzed in the
next section, although this model does not explain why the agent bounces back.
One possible reason could be that agents experiment. Now, the attempt is to
understand what will happen to the aggregate effort levels if all the agents bounce
back with some positive probability. No pattern is assumed in terms of where
agents bounce back; rather, the bounce back is allowed with an unrestricted set
of new SPWs for the dropout agents. Although the bounced-back agent’s SPW
is unrestricted (only defined by its learning speed), the states to which it can
bounce back in the state space are restricted. This is because the agents who
have not dropped out will not change their SPWs. These states will depend on
where the other agents are present in the period when the agent bounces back.
If a player is allowed to bounce back with an arbitrary set of new SPWs, then
agents may not stay in the same set of states to which the process was confined
when no bounce-back phenomenon occurred.
It is assumed that, irrespective of the SPW at which an agent dropped out, all
agents have a positive probability of bouncing back to any arbitrary set of SPWs.
This means that there is a positive probability of reaching any of the absorbing
states within any winning sets.
Proposition 4: If players after dropping out bounce back with positive probability
(perturbed model) to an arbitrary set of new SPWs, then a limiting distribution will
be reached.
Proof: The state space is such that at least one of the winning sets is reachable
from every state. The model is such that the process does not stay at the same
state next period. Proposition 1 states that the process will be absorbed if there
is no bounce back. Thus, irrespective of where the process starts, it reaches one
of the absorbing states.
Assume that each agent drops out at a unique SPW. This means that if the
agent has dropped out, she will have a unique SPW which will not change unless
she bounces back. Nevertheless, this is only a simplifying assumption. Later it is
argued what happens if dropout SPWs are not unique.
Let A_i be the set of states to which the other agents except i can bounce after
dropping out. This includes all the states where all agents (except i) drop out.
This also includes the state where agent i is the final winner and is
putting in minimum effort to win the game.
The state space can be defined as the union of T and the sets A_i, where A_i is for agent i
and T is the union of states which are not reached by any agent after bouncing
back. For each A_i there are (m - 1)! sequences in which the other agents can drop out
to make A_i. It is not known for sure whether, in the state when the last agent
drops out, agent i has reached her highest SPW or not. If not, then
agent i continues to decrease her effort and increase her SPWs to reach A_i.
Now let’s take the extreme case where each sequence of dropout for any final
winner consists of a unique set of states, transitioning from one to another to
reach A_i. Our objective is to show that there is one recurrent class.
From the state in which agent i is the final winner, all the agents except i have some
positive probability to bounce to any set of SPWs above the threshold at which they dropped
out. States to which agents (except i) can bounce will be part of some chain
corresponding to a dropout sequence with some player as the final winner. Hence, all
the states following the bounce-back state will become a part of a communication
class.
The SPW profile (with each player above threshold SPWs) to which agents
have bounced back has a positive probability for any player to be the final
winner (from Proposition 1). Hence, at least one sequence of dropouts for each
player being the final winner will be part of this communication class. Since
without bounce back all the states in the state space lead to some sequence
of dropout (absorbing process), there cannot be any other recurrent class not
communicating with this communication class. Hence there will be a single
recurrent class. The model is such that the agent can transition to two states with
unequal probabilities, in general, so an irreducible chain is aperiodic. Hence, an
irreducible aperiodic chain will reach a limiting distribution.
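The last step of the argument, that an irreducible aperiodic chain reaches a limiting distribution regardless of where it starts, can be checked on a toy example. The sketch below assumes a hypothetical three-state chain with no self-transitions (mirroring the property that the process never stays in the same state); the numbers are illustrative and are not derived from the model.

```python
def step(dist, P):
    """Advance a state distribution by one period."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def limiting_distribution(P, start, n_steps=500):
    """Iterate the chain from a point mass at `start`."""
    dist = [0.0] * len(P)
    dist[start] = 1.0
    for _ in range(n_steps):
        dist = step(dist, P)
    return dist


# Hypothetical irreducible chain without self-transitions. It contains cycles
# of length 2 (0 -> 1 -> 0) and length 3 (0 -> 1 -> 2 -> 0), so it is aperiodic.
P = [
    [0.0, 0.7, 0.3],
    [0.4, 0.0, 0.6],
    [0.5, 0.5, 0.0],
]

from_state_0 = limiting_distribution(P, start=0)
from_state_2 = limiting_distribution(P, start=2)
print([round(p, 4) for p in from_state_0])
print([round(p, 4) for p in from_state_2])
```

Both starting points lead to the same distribution, and applying one more step leaves it unchanged.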
In Proposition 1, it is assumed that there is only one absorbing state in each
winning set. Here it is argued that even if there are multiple absorbing states in
all the winning sets, the limiting distribution will still be reached. This is
because each absorbing state within the winning set has a positive probability to
bounce back to any arbitrary set of SPWs. The states which could be reached
from one absorbing state in a winning set can also be reached from any other
absorbing state in that winning set. This means that there will be a positive
probability of being absorbed in any of the absorbing states within a winning set.
5 Model Predictions
Agents will decrease effort on consecutive winning while they increase effort
on consecutive losing. This is because agents draw directional feedback from
outcomes and update their SPWs accordingly. The prediction does not depend on
the number of strategies for which they update their SPWs, as long as they update at least one
strategy below the winning strategy and one strategy above the losing strategy.
Following overbidding, some players will drop out (they may bounce back).
This can be understood as learned helplessness. When agents are not able to win
even after trying harder, they quit. In terms of the model, after agents lose, they
decrease their SPWs for some strategies higher than the lost strategy and, with
consecutive losses, move to a higher strategy. This continues as long as there is
some strategy for which SPW is greater than or equal to TPW.
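The directional pattern described above can be sketched in a few lines. The code is a minimal illustration under assumptions that are not taken from the model's formal definition: the agent is assumed to pick the lowest effort level whose SPW meets her TPW, and the update is assumed to be additive, shifting the SPW of the played strategy and of one neighbouring strategy by the learning speed. The function names, the strategy grid and the numbers are hypothetical.

```python
def choose_strategy(spw, tpw):
    """Index of the lowest effort level whose SPW meets the TPW, or None (dropout)."""
    for k, p in enumerate(spw):
        if p >= tpw:
            return k
    return None


def update_spw(spw, k, won, speed):
    """Return updated SPWs after playing strategy k.

    Assumed rule: a win raises the SPW of k and of the strategy just below it;
    a loss lowers the SPW of k and of the strategy just above it.
    """
    new = list(spw)
    neighbour = k - 1 if won else k + 1
    shift = speed if won else -speed
    for j in (k, neighbour):
        if 0 <= j < len(new):
            new[j] = min(1.0, max(0.0, new[j] + shift))
    return new


def play(spw, tpw, speed, outcomes):
    """Strategy indices chosen along a given sequence of wins (True) and losses (False)."""
    path = []
    for won in outcomes:
        k = choose_strategy(spw, tpw)
        path.append(k)
        if k is None:  # no strategy satisfies the TPW any more: the agent drops out
            break
        spw = update_spw(spw, k, won, speed)
    return path


# Hypothetical agent: five effort levels ordered from low to high.
initial_spw = [0.1, 0.2, 0.4, 0.5, 0.6]
losing_path = play(initial_spw, tpw=0.35, speed=0.1, outcomes=[False] * 5)
winning_path = play(initial_spw, tpw=0.35, speed=0.1, outcomes=[True] * 4)
print("after consecutive losses:", losing_path)
print("after consecutive wins:  ", winning_path)
```

Under these assumptions a run of losses pushes the agent to higher effort levels until no strategy meets the TPW, while a run of wins moves her to lower effort levels.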
The aggregate effort will be higher than the risk-neutral symmetric Nash
equilibrium prediction in the intermediate run and will then start falling if
agents do not bounce back. In a winner-take-all contest, only one agent wins whilst the
remaining agents lose and increase their effort. This dynamic continues until
agents start dropping out and then aggregate effort starts to decrease. From
the proof of Proposition 4, it can be said that the average aggregate effort is
probabilistic if agents bounce back. In this case, the aggregate effort could be
higher or lower compared to the Nash equilibrium of the standard model.
The objective of this learning model is to be able to explain how agents may
learn if they are driven by the probability of winning. One may like to ask
whether this model with specific values of the parameter can lead agents to reach
close to the equilibrium predicted by the standard model. It can be a valuable
exercise to analyze this model for the case when learning speeds are asymmetric
for the case of positive and negative reinforcement. We believe that when agents
, where
is the learning speed in case of positive reinforcement
is the learning speed in case of negative reinforcement, then the agents
will reach around some effort levels and stay close to them for a long time. If they
start with the initial effort levels predicted by the standard model and their
minimum probability as
then they will stay close to equilibrium predicted by
the standard model for a reasonable time.
6 Experimental Evidence
We now discuss what some known models considered relevant in
the related literature predict and whether these can explain the above behavioral
regularities together. Next, simulations are run for this learning model to see if
it can track the experimental data.
The primary data considered is from one of the experiments in Fallucchi,
Renner and Sefton (2013) (FRS, henceforth) where the Tullock contest is played
by fixed players with their own feedback. There are ten games played among
three agents for sixty periods each. Prize money is 1000
0.15 ) and
agents are given the endowment of 1000
at the start of each period. Agents
know their own efforts and outcomes each period, but do not know the efforts
and outcomes of other players. The contest success function, prize money,
endowment, number of periods in the game, and number of players are common knowledge.
We briefly restate the behavioral regularities observed in this experimental
data. Over-dissipation is a consistent finding across the games. Following
over-dissipation, some players drop out over the course of the game. In terms
of individual behavior, it is frequently observed that agents decrease effort on
consecutive winning while they increase effort on consecutive losing (Table
2.6). There are agents whose participation is limited throughout the experiment.
Conversely, some agents persist with remarkably high efforts despite losing so
frequently that they make a net loss. Many agents start with a considerably high
level of effort. It is observed in these games that, on average, agents start high
compared to the risk-neutral symmetric Nash equilibrium prediction and further
increase their efforts in the next few periods before aggregate efforts start
decreasing. Similar data from one of the experiments in Mago, Samak and Sheremeta
(2016) (MSS, henceforth) is also studied. In this experiment, 15 Tullock contests
are played, each having four fixed players for 20 periods with their own feedback
only. The endowment and prize money is 80 francs (15 francs = US$1) and
the experimental setting is common knowledge. The agents’ behavior in this
experiment is similar to FRS. The aggregate data in MSS are less noisy while the
individual data are noisier compared to FRS.
Table 2.6: Impact of winning and losing on the next period effort decision of the
agent in FRS.
Standard Model
The behavioral regularities observed in the experimental data cannot be explained
by the standard model. If agents are playing as per the contest reaction
function based on forming beliefs on other agents’ effort, then the direction of
dropout (if at all) should be preceded by a decrease in effort, not an increase in
effort as more often observed in the data. The agent should not put in any effort
more than the prize value divided by the number of agents, for a risk-neutral
case (Figure 2.6).
Figure 2.7: The contest reaction function for any beliefs of other agents’ aggregate
effort, assuming risk neutrality (r = 0). The reaction function is written in terms of
y = e_i and x = e_{-i}.
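For reference, the risk-neutral reaction function of the standard lottery contest can be computed directly. The sketch assumes the textbook contest success function in which the probability of winning is own effort divided by total effort, so the expected payoff prize * y / (y + x) - y is maximized at y = sqrt(prize * x) - x, truncated at zero. The function names and the prize value are illustrative choices, not taken from the figure.

```python
import math


def best_response(x, prize):
    """Risk-neutral best response y to rivals' aggregate effort x (lottery contest)."""
    return max(0.0, math.sqrt(prize * x) - x)


def symmetric_nash_effort(n_players, prize):
    """Individual effort in the risk-neutral symmetric Nash equilibrium."""
    return prize * (n_players - 1) / n_players ** 2


prize = 1000.0  # illustrative prize value
n = 3
e_star = symmetric_nash_effort(n, prize)

# At the symmetric equilibrium each agent best-responds to the others' total effort.
print(f"equilibrium effort per agent: {e_star:.2f}")
print(f"best response to rivals' total effort: {best_response((n - 1) * e_star, prize):.2f}")
```

The two printed values coincide, which is the fixed-point property that defines the symmetric equilibrium. Once rivals' aggregate effort exceeds the prize, the best response is to exert no effort.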
Risk Preferences
Millner and Pratt (1991) find that risk preference can explain the differences in
the effort levels while Potters, de Vries, and van Winden (1998) find the opposite. However,
broadly speaking, being more risk-averse means less effort and risk-seeking
means higher effort. In Figure 2.8 below, the general equation for symmetric
NE with risk as a parameter and its graph is provided for the case when agents
have a CRRA utility function. One can argue that agents are playing symmetric
Nash equilibrium with different risk levels. While it can explain high effort by
high risk-seeking preference, it cannot explain the dropout preceded by high
efforts. One may further argue that agents are risk-seeking and when they learn
that they cannot win, they drop out. A simple objection to this explanation is
the observation that agents who do not drop out fail to stabilize at any effort
level. This raises the question as to whether it is only agents who drop out who
stabilize. Agents frequently change the direction of their effort level even after
many periods in the game, which makes a risk-preference-based explanation
on its own unconvincing. Another treatment from FRS is that of full feedback,
where it is observed that very few agents drop out. This suggests that the absence
of feedback makes agents approach the game differently. Cornes and Hartley
(2003) explain over-dissipation using heterogeneity in risk aversion of agents
maximizing earnings. In Chapter One we show that TPWs and EUMs have
the same risk distribution and this learning model explains other behavioral
regularities in addition to over-dissipation.
Figure 2.8: A general expression and graph showing effort for symmetric Nash
equilibrium with different values of risk for CRRA utility function. The X-axis shows
r (risk preference) and the Y-axis shows the optimal effort.
Joy-of-Winning and Relative Payoff Maximization
In the literature, several behavioral explanations have been proposed for over-dissipation
in the Tullock contest. Sheremeta (2010) finds that, in addition to
monetary incentives, subjects derive a non-monetary utility from winning which
leads to higher effort levels. This can partially explain the higher effort levels,
but cannot explain the dynamic behavior of agents over the course of the
multiple-period games. MSS state that the residual reason for overbidding, in addition
to the utility from winning, is that the subjects care about their relative payoffs. It
seems less likely for the experiment in consideration that anonymous players
in the lab can track the relative payoffs when no feedback is present on players’
cumulative winning in the game.
Mistakes and Judgmental Biases
Another set of explanations is based on the argument that subjects are prone to
mistakes and judgmental biases such as non-linear probability weighting (Baharad
and Nitzan, 2008) and the hot hand fallacy. These mistakes add noise to the
Nash equilibrium solution and thus may cause overbidding in contests. While
this may be true in general for the first few periods of the game, the data under
consideration show that agents put in higher effort even after many periods in
the game and respond to winning and losing in a predictable way. This suggests
that high bids are not mere mistakes but rather a deliberate choice. In a game with
a large number of agents, winning should be a rare event, which means that with
consecutive losing (experience) an agent will be under-weighting the probability
of winning, which should make agents decrease the effort and not increase it.
Regret Theory and Impulse Balance Equilibrium
In the first-price auction literature, regret theory (Engelbrecht-Wiggans and Katok,
2007; Filiz-Ozbay and Ozbay, 2007) and impulse balance equilibrium (Ockenfels
and Selten, 2005) can explain the directional changes in case of winning and
losing every period. The environment under consideration is fundamentally different
in two aspects: first, losing is not costly in the first-price auction; second,
agents can deduce whether the last period winning bid was higher or lower than
their bid. The aggregate bidding behavior is different from the all-pay auction, in
which bimodal bidding is witnessed.
A model where agents have exogenous aspiration levels as the reference point
can also explain this increase and decrease in the effort after losing and winning,
respectively. One can see that mathematically ‘target probability of winning’ and
aspiration look equivalent. In a repeated game, let’s say an agent has a ‘target
probability of winning’ equal to 0.5 and the prize money is 1000; then the aspiration
(reference point) will be 500. Another (mis)interpretation of target probability in
the repeated game can be the share of periods agents want to win; in the above
example, it will be 50%. As can be seen from the data, agents appear to play the
game as a one-shot game, incorporating their experience. In such a case, aspiration in a
one-shot game with a fixed prize is not appealing. A fundamental difference
also lies in the psychological interpretation of the two. A higher value of the
parameter ‘target probability of winning’ can be psychologically classified as
the need for security while the higher aspiration can be classified as a desire for
If the agents are driven by an endogenous aspiration as a reference point (Borgers
and Sarin, 2000) they would increase the effort on winning and decrease effort
on losing. The two theories (learning model based on a target probability of
winning vs endogenous aspiration) predict the opposite behavior. If it can be
understood in which environment each one holds, it can help understand
the agents’ decision-making approach and can prescribe the efficient prize allocation
scheme for different environments.
In an all-pay auction with fixed matching treatment, Lugovskyy, Puzzello and
Tucker (2010) argued that agents may learn how to collude by reducing their
effort. This may seem an explanation for the case when agents under-dissipate,
but in the limited feedback environment collusion seems less likely.
If McKelvey and Palfrey’s (1995) model is applied to the data under consideration,
it will not be able to explain them, since over time many agents move further
away from NE, and dropout behavior cannot be explained convincingly. Lim,
Matros and Turocy (2014) combine cognitive hierarchy and quantal response
equilibrium to model responses of agents that can explain over-dissipation in
contest experiments. In the FRS data studied in this chapter, the responses,
though stochastic, are skewed in one predictable direction based on whether the
outcome is a win or a loss, and move substantially across the effort levels, including dropout.
Evolutionarily Stable Strategy
Hehenkamp, Leininger and Possajennikov (2004) explain over-dissipation by arguing
that agents play an evolutionarily stable strategy. While a high target
probability of winning may give agents an evolutionary advantage if winning
every game (not net payoff) is the criterion of survival, a static model cannot
explain the dynamic behavior observed in this data.
6.1 Simulations
The objective of the simulation exercise is to see if it can track the aggregate data
and understand the possible trajectories for various values of the parameters
(Fig 2.15-2.17 in Appendix). Given the probabilistic nature of the game and
the difficulty of estimating the parameters of an individual player, a period-wise
comparison of the actual behavior of an agent in the experimental data
with the artificial agent in the simulation cannot be made. What can be said
is that the behavioral regularities observed and aggregate dynamics found in the
experimental data are also present in the simulation data. The artificial agent is
driven by the model while a real agent would be subject to sharp changes and
some random behavior. This calls for a qualitative match of individual
behavior rather than a strict quantitative comparison. A simplifying assumption
made is that q=r=1.
Based on this descriptive model of decision-making, simulations are run
for the three-player game played for 60 periods, which is then compared with
experimental data. It is assumed that agents do not bounce back. A total of
1000 games are simulated for 60 periods, with α assigned to each agent randomly
from [0.2, 0.4] for each learning speed considered. Here all agents
have the same learning speed. Further, simulations are run for the cases where
agents are assigned the learning speed randomly from the uniform distribution
on [0.05, 0.1], [0.1, 0.15] and [0.05, 0.15]. The experimental data is of 10 games
for 60 periods with 3 players each. The objective is to see whether the model-based
simulation can track the experimental data, assuming the experimental data
is representative of the large sample behavior despite its size. Nevertheless, it
needs to be emphasized that as the size of the data increases, smoother aggregate
behavior is expected. The code used to simulate is given in the appendix;
below, the results are stated directly. The following are the simulation results for
aggregate effort compared period-wise with experimental data, followed by a comparison
in blocks of periods. Figures 2.9 to 2.12 refer to the FRS data described
above. Figures 2.13 and 2.14 refer to the MSS data described in the literature review.
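The parameter assignment described above can be sketched as follows. This is not the appendix code; it is a minimal illustration that assumes TPWs are drawn independently and uniformly from the stated interval, and that the learning speed is either a single value shared by all agents or an independent uniform draw per agent. The function name draw_game, the seed and the shared value 0.125 are hypothetical.

```python
import random


def draw_game(n_players, tpw_range, speed, rng):
    """Draw one game's parameters.

    speed: a float (shared by all agents) or a (low, high) tuple for
    independent uniform draws per agent.
    """
    tpws = [rng.uniform(*tpw_range) for _ in range(n_players)]
    if isinstance(speed, tuple):
        speeds = [rng.uniform(*speed) for _ in range(n_players)]
    else:
        speeds = [float(speed)] * n_players
    return tpws, speeds


rng = random.Random(0)  # fixed seed so the draws are reproducible

# 1000 three-player games with heterogeneous learning speeds on [0.05, 0.15].
games = [draw_game(3, (0.2, 0.4), (0.05, 0.15), rng) for _ in range(1000)]

# One game in which all agents share the same (hypothetical) learning speed.
shared_tpws, shared_speeds = draw_game(3, (0.2, 0.4), 0.125, rng)
print(games[0])
print(shared_tpws, shared_speeds)
```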
Figure 2.9: Average effort across groups in FRS and various simulations based
on the proposed behavioral model.
Figure 2.10: The above graph is drawn for data in blocks of ten periods each.
Table 2.11: The table shows mean deviation (MD), mean absolute deviation
(MAD) and root mean square error (RMSE) between experimental data and
simulation data for various learning speeds for period-wise data.
Table 2.12: The table shows mean deviation (MD), mean absolute deviation
(MAD) and root mean square error (RMSE) between experimental data and
simulation data for various learning speeds for data in blocks of 10 periods each.
Figure 2.13: The graph shows, period-wise, the average effort across groups in
MSS and simulations based on the proposed behavioral model.
Figure 2.14: The above graph is drawn for data in blocks of five periods each.
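The three fit measures named in the table captions can be computed as in the sketch below. It assumes the usual definitions applied to the period-wise difference between simulated and experimental aggregate effort; the sign convention (simulated minus experimental) is an assumption, and the two short series are made-up numbers used only to exercise the function.

```python
import math


def deviation_metrics(experimental, simulated):
    """Return (MD, MAD, RMSE) of simulated minus experimental values."""
    if len(experimental) != len(simulated):
        raise ValueError("series must have the same length")
    diffs = [s - e for e, s in zip(experimental, simulated)]
    n = len(diffs)
    md = sum(diffs) / n                              # mean deviation (signed)
    mad = sum(abs(d) for d in diffs) / n             # mean absolute deviation
    rmse = math.sqrt(sum(d * d for d in diffs) / n)  # root mean square error
    return md, mad, rmse


# Made-up series for illustration only (not experimental data).
experimental = [10.0, 20.0, 30.0]
simulated = [12.0, 18.0, 33.0]
md, mad, rmse = deviation_metrics(experimental, simulated)
print(f"MD={md:.3f}  MAD={mad:.3f}  RMSE={rmse:.3f}")
```

Because MD keeps the sign of each difference, positive and negative deviations can cancel, so MAD and RMSE are the more informative measures of tracking error.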
Table 2.11 shows mean deviation, mean absolute deviation, and root mean
square error for the aggregate effort for different learning speeds. These are for
α in the interval [0.2, 0.4]. It can be seen that the learning speed attributed to
agents from the random uniform distribution of [10, 15] percent, and all agents with
a 12.5 percent learning speed, track the data better. It is expected that as the experimental
data size increases, the aggregate behavior observed for all the games
(black dotted line above) would become smoother. The simulations
for various parameters suggest that the gradual fall in aggregate effort
at later periods is a common behavior. In initial periods, depending on the
parameters, the aggregate effort may rise before starting to decline. These
simulations show that the decision-making model proposed is general enough
to capture the various trajectories of aggregate effort for different values of
the parameters.
It is seen in the simulations that for low and medium values of the parameters
(refer to simulations for the six-player game provided in Appendix) there is
an increase in the aggregate effort before it starts falling. The decline in the
aggregate effort is primarily due to the dropouts and in later periods due to
the consistent winning of the single remaining player. This is because agents
start with low- or middle-level strategies, and they end up competing for prizes
before becoming exhausted with strategies. What is happening is that agents are
trying to search for effort levels where they could have a good chance to win,
which is confirmed only when they win. In cases where parameters are high, the
aggregate effort is predicted to start high and fall steeply. This is because the
agents would start with higher strategies and end up dropping out sooner as
their strategies would be exhausted faster.
The difference in the game dynamics when the number of agents increases
can be the marginal difference in the probability of winning when one agent
bids very high and others are bidding low or average. If the number of players is
low, then there is a significant increase in the probability of a win by increasing
effort. If the number of players is high, the increase in the probability of a
win is not significant. This is important in a scenario in which agents are
trying to locate the effort level whereby they can win with a good probability and
they are revising decisions every period. So, if increasing the effort does not
translate into winning (chances of which would be higher in the case of a large
number of agents) then it is likely that the agent drops out soon. The model
parameters which fit the experimental data closely are higher for the
four-player game compared to the three-player game.
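The effect of group size on what a high bid can achieve can be illustrated with the lottery contest success function. The sketch is a toy calculation under assumed numbers: every rival exerts an effort of 30, and one agent raises her effort from 30 to 100. The helper names and effort values are hypothetical.

```python
def win_probability(own, rivals):
    """Probability of winning: own effort divided by total effort."""
    total = own + sum(rivals)
    if total == 0:
        # Assumption: the prize is allocated uniformly when nobody exerts effort.
        return 1.0 / (len(rivals) + 1)
    return own / total


def effect_of_raising(n_players, low=30.0, high=100.0, rival_effort=30.0):
    """Win probabilities before and after one agent raises her effort."""
    rivals = [rival_effort] * (n_players - 1)
    return win_probability(low, rivals), win_probability(high, rivals)


results = {n: effect_of_raising(n) for n in (3, 6, 10)}
for n, (p_low, p_high) in results.items():
    print(f"n={n:2d}: {p_low:.3f} -> {p_high:.3f} (gain {p_high - p_low:.3f})")
```

In this example the winning probability reached with the high bid falls as the group grows, and the gain from raising effort is smaller with ten players than with three, so trying harder is less likely to be confirmed by a win in larger groups.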
With the maximum effort being 100, the deviation of aggregate effort by three
agents averaged over 60 periods is not negligible. The deviations substantially
decrease when the comparison is done in blocks of ten periods. From the charts,
it can be seen that the simulations approximate the aggregate experimental data
better when compared period-wise for MSS than for FRS. This may be
due to the increased size of the data in MSS. The deviations further decrease for
this data set when the comparison is done in blocks of five periods. This suggests
that the model fits aggregate behavior approximately while also explaining the
individual behavioral regularities.
7 Model Applicability to All Pay Auction
An all-pay auction is another example of a winner-take-all competition that
has a linearly ordered strategy set. The basic learning model applies to all-pay
auctions that have the same information environment. The difference lies in the
transition probabilities in determining the winner after the players select their effort levels
in the game each period.
In the case of Tullock contests without perturbation, it is found that any agent
has a positive probability of eventually dropping out when both exert positive
efforts. This was due to the probabilistic nature of prize allocation, where the
probability of winning was proportional to the ratio of effort to the total effort.
In the case of an all-pay auction, the prize allocation is deterministic. For any
set of efforts in a game, the player with the maximum effort receives the prize
with certainty unless there is a tie. The nature of the learning rule is such that
the agent after losing increases effort, while the agent after winning decreases
effort. This allows for the possibility that agents may keep cycling in a few states
without any perturbation and may not eventually drop out. One needs to find
parametric conditions that lead to such cyclic behavior.
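The contrast between the two allocation rules can be made concrete with a short sketch. It assumes the lottery rule for the Tullock contest and a highest-effort-wins rule for the all-pay auction; breaking ties uniformly at random and drawing uniformly when all efforts are zero are assumptions made for the illustration, as are the function names and the effort profile.

```python
import random


def tullock_winner(efforts, rng):
    """Draw the winner with probability equal to each player's share of total effort."""
    if sum(efforts) == 0:
        return rng.randrange(len(efforts))  # assumed: uniform draw when nobody exerts effort
    return rng.choices(range(len(efforts)), weights=efforts, k=1)[0]


def all_pay_winner(efforts, rng):
    """The highest effort wins; ties are assumed to be broken uniformly at random."""
    top = max(efforts)
    leaders = [i for i, e in enumerate(efforts) if e == top]
    return rng.choice(leaders)


rng = random.Random(0)
efforts = [10, 40, 30]  # hypothetical effort profile; player 1 exerts the most
draws = 10_000

tullock_share = sum(tullock_winner(efforts, rng) == 1 for _ in range(draws)) / draws
all_pay_share = sum(all_pay_winner(efforts, rng) == 1 for _ in range(draws)) / draws
print(f"player 1 wins {tullock_share:.3f} of Tullock draws")
print(f"player 1 wins {all_pay_share:.3f} of all-pay draws")
```

Under the lottery rule the highest bidder wins only about half the time in this example, whereas under the all-pay rule she always wins. With a deterministic outcome, the win and loss feedback that drives the learning rule follows a fixed pattern for a given effort profile, which is what makes cycling possible.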
At the preliminary stage, we try to see, through simulations, whether such cyclic
behavior is found. If so, then in some two-player games neither of the two
agents will drop out even after a very long time. Table 2.5 shows the number
of agents who drop out after 1000, 10000, and 100000 periods. It can be seen
that there is no substantial increase in the number of agents dropping out even
after the number of periods increases by 100 times; around 25% of games have
no agent dropping out. This proportion of games indicates that agents cycle
between some states.
Table 2.5: Number of dropouts after 1000, 10000 and 100000 periods in a
two-player all-pay auction game.
8 Discussion
A big-picture summary of the model is that agents convert a game to a decision
problem and learn through reinforcement. With reinforcement learning their
effective choice set changes, and hence so does their effort choice based on the decision rule.
We have made two behavioral assumptions in this model. One is that the
agents use the approach of the target probability of winning, not payoff, to make
their decisions in contests. The second is on how agents update their subjective
winning probabilities upon winning and losing. In this model, strategy similarity
gives the reinforcement learning model a feature of direction learning.
In the Tullock contest, agents can choose a target probability of winning
which they like. They do this by choosing the level of effort, making the trade-off
between cost (potential profit if they win) and the probability of winning.
The deterministic model has two parameters: a target probability of winning
(α) and learning speed (λ). Although initial SPWs and learning speed would
impact the transition and asymptotic state that the process would reach, what
drives the behavior is the parameter TPW and strategy similarity.
The intuition based on which we postulate TPW is that agents in a probabilistic
environment think that only if they win will they make a positive profit, while if
they lose, they will make a loss. Either agents are not farsighted, or they do not
believe that the environment is stationary, so they gather directional feedback
only from the outcome of the last period. Agents’ rapid change in effort choices
can be captured by this model. That is possibly the reason they do not put in the
same effort across all the periods of the game.
This deterministic model can explain the observed individual pattern of
behavior qualitatively, which is in contrast to what the standard model would
predict; hence it serves its basic purpose. It could be observed in the three-player
experimental data that agents’ chosen strategy (effort) profile can be explained
by the deterministic model. Note that agents, after changing the SPWs, may still be at
the same effort levels.
We have modeled the behavioral parameter TPW as an exogenous intrinsic
behavior. The change in effort level is explained by a change in SPWs. It could
be interpreted that in this memory-less model, agents infer that after winning
their minimum probability is met and in case of losing they infer the opposite.
This occurs independently from the behavior of other agents in the game which
is acceptable as the behavior of the other agents is not observable. Indeed,
the number of agents in the game is known but their target probabilities are
unknown and unobservable for other agents.
In this model, we have assumed learning speed to be symmetric for positive
and negative reinforcement. In FRS data, we see that the degree of positive reinforcement
is stronger. This may be because in this environment reinforcement
is based on its information value to guide direction. Winning gives agents a
stronger sense of direction than losing. The model predicts that there can be
variation in aggregate efforts. A survey of the experimental contest literature
by Dechenaux, Kovenock and Sheremeta (2015) has highlighted that on average
agents spend considerably more than the equilibrium prediction of the standard
model. Shogren and Baik (1991) find that agents expend an amount closer to
the rational model prediction, while in Millner and Pratt (1989) and Sheremeta
(2010) agents expend significantly more.
The model can also explain dropout behavior in a dynamic setting. Muller and Schotter
(2010) find that low-ability workers drop out and exert little or no effort, while
high-ability workers make excessive efforts. In a static model, this is explained
by assuming loss aversion on the part of the subjects; such an explanation will fall
short in a dynamic setting.
Further study shows that there is an overall decrease in expenditures when
subjects repeatedly play (Fallucchi, Niederreiter and Riccaboni, 2020) and some
agents choose to stay at zero bid for a period of time or for the remaining periods
in the game after bidding for some initial periods.
The specific game would impact the target probability of winning agents
may have. For example, in a first-price auction, the agent may have a higher
target probability of winning compared to the Tullock contest, which may
further increase in the case of an all-pay auction. Conversely, in the literature
(Muller and Schotter, 2010), the dropout behavior in equilibrium is explained
using loss-aversion in the case of an all-pay auction. The dropout in this model is
explained as a dynamic behavior where agents learn that their target probability
of winning is not achievable. Similarly, the remarkably high effort could be
explained as agents trying to ensure their minimum probability before they
learn to drop out.
The behavioral regularities (increasing effort after losing and decreasing
effort after winning) and aggregate behavior in the four-player game in MSS
are similar to those in the three-player game in FRS. The simulation can track the data
for MSS with slightly higher parameter values. This suggests that the underlying
cause of such behavior is that agents use the target probability of winning to
make decisions under uncertainty in the severely limited information game.
This model also explains why agents keep exerting effort even if their cumulative
profit is negative. Some agents do not vary their effort substantially over the
periods of the game; this could be explained by assuming that their learning speed
is low. The asymptotic prediction (Proposition 1) can explain why generally
no more than two (out of three) players bid consistently toward
the end of the sixty-period game.
This decision-making model is deterministic, and agents never stay in
the same state. One may wish to change it to a stochastic model in which the process
transitions with higher probability to the state predicted by the deterministic
model, stays in the same state with some probability, and transitions in the direction
opposite to what the deterministic model predicts with a lower probability.
The reason for preferring a deterministic model over the stochastic model is the
ease of modeling the behavior and highlighting the decision-making process
that the agent experiences.
One objection to our model of TPW can be that the sum
of the TPWs of the subjects in the game does not necessarily add to one. Tversky and
Kahneman (1974) give examples to support that probabilities not being
additive is not unusual. That paper states that the use of heuristics
can lead to systematic errors in the judgment of probabilities. They provide examples
of some of the approaches, along with evidence from the lab, used by
people (including experts in the field) to make real-life decisions.
In this model,
the agents do not change their target probability of winning over the periods of
the game. The choice of efforts by the agents observed in the simulation data
is the result of their SPWs and TPW. One may further develop a model where
TPW evolves with learning and analyze it to find intermediate predictions. Here,
all the players are considered to be of the TPW type. The experimental evidence in
Chapter One suggests it is a mixed population. It will be interesting to study
game dynamics when two types (EUM and TPW) are present in some proportion. It
seems that TPW will be wiped out in the evolutionary process, but it might have
interesting results if intermediate winning scores are also important.
The model
is developed for Tullock contests; nevertheless, we believe it can be adapted
to other winner-take-all games, and we have indicated its applicability to all-pay auctions.
Similarly, its predictions may also hold in different informational environments,
since agents could be playing the game similarly. Fallucchi, Niederreiter
and Riccaboni (2020) find that in Tullock contests, many players learn more
from their own past payoffs despite having information feedback on total bid
Morgan, Orzen and Sefton (2012) find that a key difference between the
theory and actual behavior in contests is the variability of investment decisions
over time, and that endogenous entry could not discipline the market. The paper
also summarizes findings of over- and under-dissipation in different experimental
studies of Tullock contests. The trajectory of aggregate dissipation appears similar
to what our model tries to explain for standard Tullock contests. The model
in this chapter contributes to an understanding of over- and under-dissipation. It
is a matter for further analysis whether it can be adapted to explain the trajectory
in the case of endogenous entry in market games.
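The stochastic variant mentioned earlier in this discussion (follow the deterministic prediction with the highest probability, stay put with some probability, move the opposite way with a lower probability) could be sketched as below. This is only an illustration under stated assumptions: a state is taken to be an index on the effort grid, and the names `nextState`, `pFollow` and `pStay` are hypothetical and do not appear in the model.

```java
import java.util.Random;

// Hypothetical sketch: a "state" is taken to be an index on the effort grid,
// and pFollow / pStay are illustrative parameters not defined in the paper.
class StochasticStepSketch {

    /**
     * One transition of the stochastic variant.
     * predictedDirection is +1 or -1, the move the deterministic rule would make.
     */
    static int nextState(int current, int predictedDirection,
                         double pFollow, double pStay, int maxIndex, Random rng) {
        double pOpposite = 1.0 - pFollow - pStay;
        // Small tolerance guards against floating-point rounding in the subtraction.
        if (pStay < 0 || pOpposite < -1e-12 || pFollow <= pOpposite) {
            throw new IllegalArgumentException(
                "need pStay >= 0, pFollow + pStay <= 1, and pFollow > pOpposite");
        }
        double u = rng.nextDouble();
        int move;
        if (u < pFollow) {
            move = predictedDirection;      // follow the deterministic prediction
        } else if (u < pFollow + pStay) {
            move = 0;                       // stay in the same state
        } else {
            move = -predictedDirection;     // move against the prediction
        }
        // Keep the state on the grid 0..maxIndex.
        return Math.max(0, Math.min(maxIndex, current + move));
    }

    public static void main(String[] args) {
        Random rng = new Random(7L);
        int state = 5;
        // Suppose the deterministic rule predicts an upward move in every period.
        for (int t = 0; t < 10; t++) {
            state = nextState(state, +1, 0.7, 0.2, 10, rng);
            System.out.println("period " + (t + 1) + ": state " + state);
        }
    }
}
```

Setting `pFollow = 1` and `pStay = 0` recovers a rule that always moves in the predicted direction, which is one way to check that the stochastic version nests the deterministic one.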
Eyal Baharad and Shmuel Nitzan. “Contest efforts in light of behavioural
considerations”. In: The Economic Journal 118.533 (2008), pp. 2047–2059.
Guido Biele, Ido Erev, and Eyal Ert. “Learning, risk attitude and hot stoves
in restless bandit problems”. In: Journal of Mathematical Psychology 53.3
(2009), pp. 155–167.
Tilman Börgers and Rajiv Sarin. “Naive reinforcement learning with
endogenous aspirations”. In: International Economic Review 41.4 (2000),
pp. 921–950.
Colin Camerer and Teck Hua Ho. “Experience-weighted attraction learn-
ing in normal form games”. In: Econometrica 67.4 (1999), pp. 827–874.
Gary Charness and Dan Levin. “When optimal choices feel wrong: A laboratory
study of Bayesian updating, complexity, and affect”. In: American
Economic Review 95.4 (2005), pp. 1300–1309.
Richard Cornes and Roger Hartley. “Risk aversion, heterogeneity and
contests”. In: Public Choice 117.1 (2003), pp. 1–25.
Emmanuel Dechenaux, Dan Kovenock, and Roman M Sheremeta. “A
survey of experimental research on contests, all-pay auctions and tourna-
ments”. In: Experimental Economics 18.4 (2015), pp. 609–669.
Ward Edwards. “Probability-preferences among bets with differing expected
values”. In: The American Journal of Psychology 67.1 (1954), pp. 56–
Ward Edwards. “Probability-preferences in gambling”. In: The American
Journal of Psychology 66.3 (1953), pp. 349–364.
Richard Engelbrecht-Wiggans and Elena Katok. “Regret and feedback
information in first-price sealed-bid auctions”. In: Management Science
54.4 (2008), pp. 808–819.
Richard Engelbrecht-Wiggans and Elena Katok. “Regret in auctions: The-
ory and evidence”. In: Economic Theory 33.1 (2007), pp. 81–101.
Ido Erev and Alvin E Roth. “Multi-agent learning and the descriptive value
of simple models”. In: Artificial Intelligence 171.7 (2007), pp. 423–428.
Francesco Fallucchi, Jan Niederreiter, and Massimo Riccaboni. “Learn-
ing and dropout in contests: an experimental approach”. In: Theory and
Decision (2020), pp. 1–34.
Francesco Fallucchi, Elke Renner, and Martin Sefton. “Information feed-
back and contest structure in rent-seeking games”. In: European Economic
Review 64 (2013), pp. 223–240.
Emel Filiz-Ozbay and Erkut Y Ozbay. “Auctions with anticipated re-
gret: Theory and Experiment”. In: American Economic Review 97.4 (2007),
pp. 1407–1418.
Robert L Goldstone and Ji Yun Son. Similarity. Oxford University Press,
Brit Grosskopf. “Reinforcement and directional learning in the ultimatum
game with responder competition”. In: Experimental Economics 6.2 (2003),
pp. 141–158.
Brit Grosskopf, Rajiv Sarin, and Elizabeth Watson. “An experiment on
case-based decision making”. In: Theory and Decision 79.4 (2015), pp. 639–
Burkhard Hehenkamp, Wolfgang Leininger, and Alexandre Possajennikov.
“Evolutionary equilibrium in Tullock contests: spite and overdissipation”.
In: European Journal of Political Economy 20.4 (2004), pp. 1045–1057.
Charles A Holt and Alvin E Roth. “The Nash equilibrium: A perspective”.
In: Proceedings of the National Academy of Sciences 101.12 (2004), pp. 3999–
Wooyoung Lim, Alexander Matros, and Theodore L Turocy. “Bounded
rationality and group size in Tullock contests: Experimental evidence”. In:
Journal of Economic Behavior & Organization 99 (2014), pp. 155–167.
Volodymyr Lugovskyy, Daniela Puzzello, and Steven Tucker. “An experi-
mental investigation of overdissipation in the all pay auction”. In: European
Economic Review 54.8 (2010), pp. 974–997.
Shakun D Mago, Anya C Samak, and Roman M Sheremeta. “Facing your
opponents: Social identification and information feedback in contests”. In:
Journal of Conflict Resolution 60.3 (2016), pp. 459–481.
James G March. “Learning to be risk averse.” In: Psychological Review 103.2
(1996), p. 309.
Richard D McKelvey and Thomas R Palfrey. “Quantal response equilibria
for normal form games”. In: Games and Economic Behavior 10.1 (1995),
pp. 6–38.
Edward L Millner and Michael D Pratt. “An experimental investigation of
efficient rent-seeking”. In: Public Choice 62.2 (1989), pp. 139–151.
Edward L Millner and Michael D Pratt. “Risk aversion and rent-seeking:
An extension and some experimental evidence”. In: Public Choice 69.1
(1991), pp. 81–92.
John Morgan, Henrik Orzen, and Martin Sefton. “Endogenous entry in
contests”. In: Economic Theory 51.2 (2012), pp. 435–463.
Wieland Müller and Andrew Schotter. “Workaholics and dropouts in
organizations”. In: Journal of the European Economic Association 8.4 (2010),
pp. 717–743.
M Kathleen Ngangoué and Andrew Schotter. “The Common-Probability
Auction Puzzle”. In: (2019).
Axel Ockenfels and Reinhard Selten. “Impulse balance equilibrium and
feedback in first price auctions”. In: Games and Economic Behavior 51.1
(2005), pp. 155–170.
John W Payne. “It is whether you win or lose: The importance of the overall
probabilities of winning or losing in risky choice”. In: Journal of Risk and
Uncertainty 30.1 (2005), pp. 5–19.
Jan Potters, Casper G De Vries, and Frans Van Winden. “An experimental
examination of rational rent-seeking”. In: European Journal of Political
Economy 14.4 (1998), pp. 783–800.
Alvin E Roth and Ido Erev. “Learning in extensive-form games: Experi-
mental data and simple dynamic models in the intermediate term”. In:
Games and Economic Behavior 8.1 (1995), pp. 164–212.
Andrew Donald Roy. “Safety first and the holding of assets”. In: Economet-
rica: Journal of the econometric society (1952), pp. 431–449.
Rajiv Sarin and Farshid Vahid. “Strategy similarity and coordination”. In:
The Economic Journal 114.497 (2004), pp. 506–527.
Reinhard Selten and Rolf Stoecker. “End behavior in sequences of finite
Prisoner’s Dilemma supergames: A learning theory approach”. In: Journal
of Economic Behavior & Organization 7.1 (1986), pp. 47–70.
Roman M Sheremeta. “Experimental comparison of multi-stage and one-
stage contests”. In: Games and Economic Behavior 68.2 (2010), pp. 731–
Jason F Shogren and Kyung H Baik. “Reexamining efficient rent-seeking
in laboratory markets”. In: Public Choice 69.1 (1991), pp. 69–79.
Paul Slovic and Sarah Lichtenstein. “Relative importance of probabilities
and payoffs in risk taking.” In: Journal of experimental psychology 78.3p2
(1968), p. 1.
Joel Sobel. “Economists’ models of learning”. In: Journal of Economic Theory
94.2 (2000), pp. 241–261.
Kinneret Teodorescu and Ido Erev. “Learned helplessness and learned
prevalence: Exploring the causal relations among perceived controllability,
reward prevalence, and exploration”. In: Psychological Science 25.10 (2014),
pp. 1861–1869.
Gordon Tullock. Toward a Theory of the Rent-Seeking Society. Efficient Rent
Seeking. Texas A & M University, 1980.
Amos Tversky. “Intransitivity of preferences.” In: Psychological review 76.1
(1969), p. 31.
Amos Tversky and Daniel Kahneman. “Rational choice and the framing of
decisions”. In: Journal of Business (1986), S251–S278.
Vinod Venkatraman, John W Payne, and Scott A Huettel. “An overall
probability of winning heuristic for complex risky decisions: Choice and
eye fixation evidence”. In: Organizational Behavior and Human Decision
Processes 125.2 (2014), pp. 73–87.
Stefan Zeisberger. “Do people care about loss probabilities?” In: Available
at SSRN 2169394 (2020).
9 Appendix
9.1 Further Simulations
We further simulate the model to examine trajectories of aggregate effort over
time and find that the model can track different trajectories for various
parametric values.
Figure 2.15: Simulations for the game with group size equal to six, for different parametric values, where pT stands for $\alpha_i^c$ and x stands for $\lambda_i$.
Figure 2.16: Simulations for the game with group size equal to six, for different parametric values, where pT stands for $\alpha_i^c$ and x stands for $\lambda_i$.
Figure 2.17: Simulations for the game with group size equal to six, for different parametric values, where pT stands for $\alpha_i^c$ and x stands for $\lambda_i$.
9.2 Reference for Proposition 1
Theorem 2.15 [Page 83, Markov Processes - Advances in applied mathematics
by James R Kirkwood (2015)]: The probability that any state in a finite absorbing
Markov chain is absorbed after $n$ steps approaches 1 as $n$ goes to infinity.
Proof: We show that the probability that any state in an absorbing Markov chain
is not absorbed after $n$ steps approaches 0 as $n$ goes to infinity. Let $i$ be a
non-absorbing state. Since the Markov chain is absorbing, there is a positive integer
$m_i$ for which there is a positive probability $p_i$ that $i$ has moved to an absorbing
state after $m_i$ steps. Note that this implies that if $l \geq m_i$, then the probability
that $i$ has been absorbed after $l$ steps is greater than or equal to $p_i$. Thus, the
probability that state $i$ has not been absorbed after $m_i$ steps is less than or equal
to $1 - p_i$. Repeat the aforementioned procedure for each transient state, and
let $m = \max\{m_i\}$ and $p = \min\{p_i\}$. Then, beginning in any state, the probability the
process has not been absorbed after $m$ steps is less than or equal to $1 - p$. Thus,
for any positive integer $N$, the probability the process has not been absorbed
after $Nm$ steps is less than or equal to $(1 - p)^N$, and $\lim_{N \to \infty} (1 - p)^N = 0$. This
also means that each entry of $Q^n$ goes to 0 as $n$ becomes large.
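As a numerical illustration of the theorem's last remark (each entry of Q^n goes to 0), the sketch below raises a small transient block Q to increasing powers and prints the probability of not yet being absorbed next to a geometric bound of the form (1 - p)^n. The matrix entries are hypothetical and chosen only for illustration; they are not taken from the contest model, and with these numbers absorption is possible in one step from every transient state, so m = 1 and p = 0.1.

```java
// Hypothetical example with two transient states; the numbers in Q are illustrative only.
class AbsorptionSketch {

    /** Plain square-matrix product a * b. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    /** Row sums of Q^n: probability of still being transient after n steps, by start state. */
    static double[] notAbsorbedAfter(double[][] q, int n) {
        int size = q.length;
        double[][] power = new double[size][size];
        for (int i = 0; i < size; i++) power[i][i] = 1.0;   // Q^0 is the identity
        for (int step = 0; step < n; step++) power = multiply(power, q);
        double[] result = new double[size];
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
                result[i] += power[i][j];
        return result;
    }

    public static void main(String[] args) {
        // Transitions among transient states only; the missing row mass
        // (0.2 and 0.1) is the one-step probability of absorption.
        double[][] q = { {0.5, 0.3}, {0.2, 0.7} };
        double p = 0.1;   // smallest one-step absorption probability
        for (int n : new int[] {1, 10, 50, 200}) {
            double[] r = notAbsorbedAfter(q, n);
            System.out.printf("n=%3d  not absorbed: %.3e %.3e  bound (1-p)^n: %.3e%n",
                    n, r[0], r[1], Math.pow(1 - p, n));
        }
    }
}
```

Both printed probabilities should stay at or below the bound and shrink toward zero as n grows, which is the behavior Proposition 1 relies on.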
9.3 Learning Model Simulation Code
C:\Users\Vikas\workspace\contestsimulation\src\contestsimulation\ Saturday, January 12, 2019 8:47 AM
/** Simulation code for Chapter: Learning in Contests, Section: Simulations */
package contestsimulation;
/** @author Vikas */
public class targetprobabilityzone3 {
public static void main(String[] args){
// TODO Auto-generated method stub
//for(int k=0;k<10;k++){
int session=25;
int period=100;
int prizemoney=80;
int playercount=4;
int effortcount=11;
double[] agentpriorslope =new double[playercount]; // m in y=mx
double[] agentpT =new double[playercount]; // pT- target probability zone
double[] agentx =new double[playercount]; //
double[][] effortprior =new double[playercount][effortcount]; // prior for each
double[][][] effortchoice =new double [session][period][playercount]; // effort
double[][][] effortoutcome =new double [session][period][playercount]; // effort
double[][] aggeffortchoice =new double [period][playercount]; // agg effort choice
double[][] aggeffortoutcome =new double [period][playercount]; // agg effort
//this program is target probability experiment, where agent tries to reach target probability then updates her belief.
//first agent chooses slope of her prior 'm', y=mx then she chooses the level of target probability she is looking for
//then she chooses how to update her priors depending on the result of her last
//Then there are exogenous shocks if agents has stopped working.
//for(int j=0;j<10;j++){
for(int s=0;s<session;s++)
//agent choosing the target probability zone and slope of its priors and making sure the slope
//is such that target probability zone choice is actionable
for(int agent=0;agent<playercount;agent++)
{double a,b,c;
a=(Math.random()*20+20)/100;//20 to 40 % minimum target probability pT
b=(Math.random());//*60+40)/100; //40 to 100 % at 100 effort m
//}while(b<a); // choose again if m<pT
//agent setting its priors for each effort
for(int agent=0;agent<playercount;agent++)
{for(int efforti=0;efforti<effortcount;efforti++)
for(int p=0;p<period;p++)
double totaleffort=0;
int winner=0;
int [] PWin=new int[playercount];
//agents making the strategy decision
for(int agent=0;agent<playercount;agent++)
{for(int efforti=0;efforti<effortcount;efforti++)
{if (effortprior[agent][efforti]>=agentpT[agent])
else {effortchoice[s][p][agent]=0;}
totaleffort=totaleffort +effortchoice[s][p][agent];
//generate probabilistic prize
double randomnumber=(Math.random()*totaleffort);
double teffort=0;
for (int agent=0;agent<playercount;agent++)
{teffort=teffort +effortchoice[s][p][agent];
if(randomnumber<= teffort)
//screen output
for(int agent=0;agent<playercount;agent++)
{if(agent==winner )
//print screen
/* System.out.printf("%4d %4d %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.0f
%5.0f %5.0f %5.0f %5.0f %5.0f%n",
// belief revision of monotonically higher(win) and lower(lose) strategies with immediate below(win)/above(lose) strategies treated as similar
for(int agent=0;agent<playercount;agent++)
{for(int efforti=0;efforti<effortcount;efforti++)
if (PWin[agent]==1&& ((efforti+1)*10 >= effortchoice[s][p][agent]))
if (effortprior[agent][efforti]>1)
if (PWin[agent]==0&& ( (efforti-1)*10 <= effortchoice[s][p][agent]))
if (effortprior[agent][efforti]<0)
// print belief revision
/* for(int agent=0;agent<playercount;agent++)
{ System.out.printf("%5.2f %5.2f %5.2f %5.2f %5.2f %5.2f %5.2f %5.2f
%5.2f %5.2f%n",
for(int agent=0;agent<playercount;agent++){
for(int p=0;p<period;p++){
for(int s=0;s<session;s++){
for(int p=0;p<period;p++){
System.out.printf("%4d %5.1f %5.1f %5.1f %5.2f %5.2f %5.2f%n",
for(int agent=0;agent<playercount;agent++){
for(int p=0;p<period;p++){
//}//closing for(int k;)
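Because several statements of the listing above did not survive, the following is a compact, self-contained sketch of the same kind of simulation rather than a restoration of the original program. It follows what the surviving lines and comments indicate: an effort grid of eleven levels, a linear prior y = mx, the choice of the lowest effort whose subjective probability of winning meets the target, a proportional (Tullock) lottery for the prize, and belief revision bounded between 0 and 1. The additive size of the belief step, the handling of periods in which nobody bids, and all identifiers (`TpwContestSketch`, `chooseEffort`, `drawWinner`, `updateBeliefs`, `simulate`) are assumptions made for illustration; the exogenous shocks mentioned in the comments are omitted.

```java
import java.util.Random;

// Simplified sketch, not the author's listing. Assumptions are marked "ASSUMPTION".
class TpwContestSketch {
    static final int LEVELS = 11;  // effort grid 0, 10, ..., 100 (effortcount = 11)
    static final int STEP = 10;

    /** Lowest positive effort whose subjective win probability meets the target; 0 means dropout. */
    static int chooseEffort(double[] spw, double target) {
        for (int k = 1; k < LEVELS; k++) {
            if (spw[k] >= target) return k * STEP;
        }
        return 0;
    }

    /** Tullock lottery: player i wins with probability effort_i / total effort; -1 if nobody bids. */
    static int drawWinner(int[] efforts, Random rng) {
        int total = 0;
        for (int e : efforts) total += e;
        if (total == 0) return -1;              // ASSUMPTION: no winner when all efforts are zero
        double r = rng.nextDouble() * total;    // uniform on [0, total)
        double cumulative = 0;
        for (int i = 0; i < efforts.length; i++) {
            cumulative += efforts[i];
            if (r < cumulative) return i;
        }
        return efforts.length - 1;              // not reached, since r < total
    }

    /** Win: raise beliefs from one level below the chosen effort upward. Loss: lower them from one level above downward. */
    static void updateBeliefs(double[] spw, int effort, boolean won, double speed) {
        for (int k = 0; k < LEVELS; k++) {
            if (won && (k + 1) * STEP >= effort) {
                spw[k] = Math.min(1.0, spw[k] + speed);   // ASSUMPTION: additive step; cap at 1 as in the listing
            } else if (!won && (k - 1) * STEP <= effort) {
                spw[k] = Math.max(0.0, spw[k] - speed);   // ASSUMPTION: additive step; floor at 0 as in the listing
            }
        }
    }

    /** Returns aggregate effort per period for one session. */
    static double[] simulate(double[] target, double[] slope, double speed, int periods, long seed) {
        int players = target.length;
        double[][] spw = new double[players][LEVELS];
        for (int i = 0; i < players; i++)
            for (int k = 0; k < LEVELS; k++)
                spw[i][k] = Math.min(1.0, slope[i] * k * STEP / 100.0);  // ASSUMPTION: prior equals m at effort 100
        Random rng = new Random(seed);
        double[] aggregate = new double[periods];
        for (int t = 0; t < periods; t++) {
            int[] efforts = new int[players];
            for (int i = 0; i < players; i++) {
                efforts[i] = chooseEffort(spw[i], target[i]);
                aggregate[t] += efforts[i];
            }
            int winner = drawWinner(efforts, rng);
            for (int i = 0; i < players; i++)
                updateBeliefs(spw[i], efforts[i], i == winner, speed);
        }
        return aggregate;
    }

    public static void main(String[] args) {
        double[] target = {0.25, 0.30, 0.35, 0.40};   // illustrative target probabilities
        double[] slope = {0.8, 0.8, 0.8, 0.8};        // illustrative prior slopes
        double[] aggregate = simulate(target, slope, 0.05, 100, 42L);
        for (int t = 0; t < 10; t++)
            System.out.printf("%3d %6.1f%n", t + 1, aggregate[t]);
    }
}
```

In this sketch a win makes lower efforts look sufficient and a loss makes the current effort look insufficient, so effort tends to fall after winning and rise after losing; an agent drops out once no effort level meets her target.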