Learning to Cooperate via a Selectionist Algorithm
Andreas Birk
International University Bremen, Germany
a.birk@iu-bremen.de
ABSTRACT
Evolutionary game theory is a popular way to investigate models of cooperation. But it has the obvious disadvantage that evolution, with its propagation of genes over generations, is an unrealistic assumption for investigating fast-changing social interactions. In this paper, it is shown how a transition from evolutionary game theory to a learning model can be made. In particular, results from experiments in an N-player Prisoners' Dilemma are presented in which the agents learn to cooperate despite significant temptations to cheat.
KEY WORDS
Cooperation, Machine Learning, Prisoners' Dilemma, Multi-Agent System, Artificial Life
Appeared in:
A. Birk, "Learning to Cooperate via a Selectionist Algorithm," in Artificial Intelligence and Applications, AIA'04, Innsbruck, Austria, 2004.

@inproceedings{AIA04_coop_learning,
  author    = {Birk, Andreas},
  title     = {Learning to Cooperate via a Selectionist Algorithm},
  booktitle = {Artificial Intelligence and Applications, AIA'04},
  year      = {2004},
  type      = {Conference Proceedings}
}
1 Introduction
Evolutionary game theory [1, 2] is a powerful tool for the investigation of interactions between individuals. It has especially become popular with research on cooperation (see e.g. [3] for an overview), but it also has been applied to many other domains.
Evolutionary methods are built upon a transfer of encoded information, i.e., genes, between individuals. This transfer of genes rests on two main assumptions. The first one is the "feasibility of breeding" assumption: evolution includes the generation of new individuals and the deletion or death of others. The second one is the "obey mother nature" assumption: when an offspring is generated, it has no choice whether to incorporate a particular gene or not; the "decision" is made by mother nature or by stochastic operators in simulated evolution.
Let us examine these assumptions from the perspective of social phenomena. On the one hand, genetic evolution is very likely to influence animal and especially human behavior, also with respect to social interactions. On the other hand, social developments happen on a completely different time-scale than evolution. Therefore, it should be clear that evolution cannot serve as the only explanation. The work presented here shows a possible way out of the problems with evolutionary schemes. Instead of evolution, a learning approach somewhat in the spirit of selectionism [4, 5] is used here.
The main feature of this learning algorithm is that it is based on a pool of potential solutions or so-called hypotheses in each individual. This type of algorithm can also be highly efficient for the learning of individual skills, as demonstrated in experiments with learning eye-hand coordination in simulations and real robot systems [6, 7] and experiments on learning several basic behaviors in a robotic ecosystem [8].
This paper builds on previous work based on evolutionary game theory [9]. The latest results of this line of research have been submitted to this conference in another paper [10]. It is shown in this paper that cooperation in a continuous-case N-player prisoner's dilemma can not only evolve, but that it can also be learned. Furthermore, it is shown that cooperation can be learned without the special trust function introduced in [11], which uses preferences of the agents to interact with each other.
The rest of this paper is structured as follows. In section 2, the basic ideas of the transition from evolutionary game theory to social learning are explained. A continuous-case N-player prisoner's dilemma is introduced in section 3 as a basis for the experimental framework. In section 4, the concrete learning algorithms and results are presented. Section 5 concludes the paper.
2 Selectionist Learning within Individual Minds
Evolutionary algorithms, with their major classes Genetic Algorithms [12, 13], Evolutionary Programming [14], Evolutionary Strategies [15, 16] and Genetic Programming [17, 18], imitate, or at least are inspired by, the principle of evolution in nature. They use a set of potential solutions (the population) to a particular problem domain. Populations are generated in iterations (generations) using operations for selection and transformation. In doing so, the selection and transformation operations focus on members of the population that are good with respect to a fitness function. As better members are more likely to be chosen, an improvement over time is expected.
For the sake of simplicity, we refer here to any representation of a potential solution as a gene, typically a fixed-length binary string or a parse-tree. When using evolutionary algorithms to investigate models of intelligent behavior, as in the fields of evolutionary game theory [1, 2] or evolutionary robotics [19, 20, 21, 22, 23], a single gene determines a crucial aspect of an individual system, such as its morphology [24], its control [22], or high-level behavior like a strategy in social interactions [25].
Here we propose a mechanism which is not based on evolution, but is a learning mechanism inspired by the evolutionary driving forces of selection and the generation of diversity, somewhat in the spirit of selectionism [4, 5]. Here, a crucial aspect of an individual system is not determined by a fixed gene but by a so-called hypothesis. In the domain of robot control, for example, a certain hypothesis h would represent that, given a situation s, the behavior b would be appropriate. The crucial aspect of a hypothesis is that, unlike in the case of a gene, there is not a single hypothesis for a particular problem. Instead, an individual has multiple hypotheses about potential solutions for a single instance from the domain. In the case of robot control, this means that given a situation s there is a set of hypotheses HS linking s to several possible behaviors.
Hypotheses are ranked within the individual by a so-called preference function pref(). The best hypothesis according to this ranking is most likely to be activated, e.g., to be expressed as a behavior or to serve as a (partial) model of the world. Lower-ranking hypotheses also have a chance to become activated. The retrieval of the hypothesis which becomes active can for example be done with the roulette-wheel (RW) principle as follows. Given a hypotheses-set HS and the preferences pref(h) for all h in HS, the likelihood prob that a particular hypothesis h is retrieved from HS for activation is proportional to its preference, i.e.,

prob(h is activated) = pref(h) / Σ_{h' ∈ HS} pref(h')
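To make this retrieval step concrete, the following Python sketch (the function name and data layout are illustrative, not taken from the paper) draws one hypothesis from a hypotheses-set with probability proportional to its preference:

    import random

    def roulette_wheel_retrieve(hypotheses, pref):
        # Draw one hypothesis with probability proportional to its
        # (positive) preference value pref[h].
        total = sum(pref[h] for h in hypotheses)
        pick = random.uniform(0.0, total)
        cumulative = 0.0
        for h in hypotheses:
            cumulative += pref[h]
            if pick <= cumulative:
                return h
        return hypotheses[-1]  # numerical safeguard

    # Example: three competing hypotheses for the same situation
    hs = ["b1", "b2", "b3"]
    pref = {"b1": 5.0, "b2": 3.0, "b3": 2.0}
    chosen = roulette_wheel_retrieve(hs, pref)  # "b1" with probability 0.5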
Note that it is important not to confuse RW-retrieval with RW-selection from evolutionary algorithms. In the case of evolutionary selection, a gene is transferred into the next generation. If this does not happen, the gene dies out, i.e., it disappears from the population. When a hypothesis h is selected by RW-retrieval, it is applied and tested. This does not necessarily result in a change in the set HS of hypotheses with which h is in competition.
Through the retrieval and activation of h, information about the usefulness of h is gathered and pref(h) is updated. This in turn can lead to changes of h and even its elimination, but as mentioned above, this is not necessarily the case. With evolution, potential solutions are encoded in genes and transferred among individuals as generations progress. With multiple-hypotheses learning, the potential solutions in the form of hypotheses are never transferred between different individuals. Nevertheless, similar hypotheses-sets in different individuals and a coordinated usage of hypotheses can emerge through suitable social interactions, as will be shown later on in the experiments.
As already mentioned in the introduction, variations of the learning algorithm presented here have also been applied to the learning of individual skills on the level of sensor-motor control and on the level of behaviors for robots in real-world environments [6, 7, 8]. In these experiments, it has been shown that a pool of potential solutions in the "mind" of a single individual increases the robustness against distortions from real-world noise, and it can increase the learning speed through the re-use of partial solutions found in the pool.
3 The Experimental Framework
3.1 A Continuous-Case N-Player Prisoner’s
Dilemma
The basis for the experiments described later on is a version of the prisoner's dilemma with N players and continuous cases of investment and payoffs (CN-PD). It is motivated and described in more detail in [9, 11].
Each agent a_i has a so-called cooperation-level co_i ∈ [0.0, 1.0]. In a game, the cooperation-level determines the agent's investment I_i, which serves for the benefit of the group (including a_i itself). Concretely, the investment is determined by:

I_i = co_i · 75

Let ¯co denote the average cooperation-level of the group, i.e.:

¯co = (Σ_{1≤i≤N} co_i) / N

The so-called gain G_i for an agent a_i is determined by:

G_i = ¯co · 100
Roughly, all investments are collected, some profit is generated with the investments, and finally investments and profit are distributed among the investors. The dilemma arises as investments and profit are equally shared among all. Thus there is the temptation to invest less than the others and to exploit their contribution to the profit. This becomes even clearer when we look at the netgain or payoff for each agent. This payoff po_i for an agent a_i is the difference between gain and investment, i.e.:

po_i = G_i − I_i = (Σ_{1≤j≤N} co_j / N) · 100 − co_i · 75
/* update preferences for strategies */
if po_i ≥ 0:
    pref(s)[t] = q · pref(s)[t−1] + (1 − q) · po_i
else:
    for all s' ∈ HS^s_i \ {s}:  pref(s')[t] = q · pref(s')[t−1] + (1 − q) · |po_i|

Figure 1. The update of the preference function for strategies as hypotheses of how to behave towards other players in the game.
So, on the one hand, it is in the interest of each agent that there is a high overall investment. On the other hand, there is the temptation to leave the task of investing to others, as the overall gain is distributed among all, independent of the individual investment. Note that the payoff for an agent depends on its own cooperation level co_i and on the average cooperation level ¯co. Its profit function f_p : [0, 1] × [0, 1] → IR is thus

f_p(co_i, ¯co) = −co_i · 75 + ¯co · 100
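As a small illustration (the function and variable names are placeholders, not taken from the paper), the payoffs of one CN-PD round can be computed directly from the cooperation levels:

    def cnpd_payoffs(coop_levels):
        # coop_levels: list of cooperation levels co_i in [0.0, 1.0].
        # Each agent invests I_i = 75 * co_i, receives G_i = 100 * avg(co),
        # and obtains the payoff po_i = G_i - I_i.
        n = len(coop_levels)
        avg_co = sum(coop_levels) / n
        gain = 100.0 * avg_co
        return [gain - 75.0 * co for co in coop_levels]

    # Example with N = 4: one full defector among three full cooperators
    print(cnpd_payoffs([0.0, 1.0, 1.0, 1.0]))  # defector: 75.0, cooperators: 0.0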
Based on this, we can extend the well-known terminology for payoff values in the standard prisoner's dilemma with payoff types for cooperation (C), punishment (P), temptation (T), and sucking (S), as follows:

Full cooperation as all fully invest: C_all = f_p(1.0, 1.0) = 25
All punished as nobody invests: P_all = f_p(0.0, 0.0) = 0
Maximum temptation: T_max = f_p(0.0, (N−1)/N) ≥ 50
Maximum sucking: S_max = f_p(1.0, 1/N) ≤ −25
For values of co and ¯co that are not restricted to 0.0 and 1.0, we get the following additional types of payoffs: the so-called partial temptation, the weak cooperation, the single punishment, and the partial sucking. They are not constants (for a fixed N) like the previous ones, but actual functions in (co, ¯co). Concretely, they are sub-functions of f_p(co, ¯co), operating on sub-spaces defined by relations of co with respect to ¯co.
3.2 Strategies for Iterated Games
When playing iterated games, the concept of a strategy [1] can be used to determine the behavior of an agent. This means the outcome of previous games is used to compute whether to cooperate or not in the current game, or to compute the degree of cooperation in the continuous case [26].
In [9] it is shown that the so-called justified snobism (JS) is a successful strategy for the continuous-case N-player prisoner's dilemma. JS cooperates slightly more than the average cooperation level of the group of N players if a non-negative payoff was achieved in the previous iteration, and it cooperates exactly at the previous average cooperation level of the group otherwise.

Justified-Snobism (JS):
  po_i(t−1) ≥ 0 : co_i(t) = ¯co(t−1) + c_JS
  po_i(t−1) < 0 : co_i(t) = ¯co(t−1)

So, JS tries to be slightly more cooperative than the average. This motivates the strategy's name: the snobbish belief of being "better" (in terms of altruism) than the average of the group is somehow justified for players which use this strategy.
In addition, the following strategies are used in the experiments described later on to challenge JS (a code sketch of several of these strategies is given after the list):

Follow-the-masses (FTM): match the average cooperation level from the previous iteration, i.e., co_i[t] = ¯co[t−1]

Hide-in-the-masses (HIM): subtract a small constant c from the average cooperation level, i.e., co_i[t] = ¯co[t−1] − c

Occasional-short-changed-JS (OSC-JS): a slight variation of JS, where occasionally a small constant c is subtracted from the JS investment

Occasional-cheating-JS (OC-JS): another slight variation of JS, where occasionally nothing is invested

Challenge-the-masses (CTM): zero cooperation when the previous average cooperation is below one's own cooperation level, a constant cooperation level c otherwise, i.e.,
  ¯co[t−1] ≥ co_i[t−1] : co_i[t] = c
  ¯co[t−1] < co_i[t−1] : co_i[t] = 0

Non-altruism (NA): always completely defect, i.e., co_i[t] = 0

Anything-will-do (AWD): always cooperate at a fixed level, i.e., co_i[t] = c
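The following Python sketch illustrates some of these strategies as simple functions of the previous average cooperation level, the agent's own previous cooperation level, and its previous payoff. The concrete constants (c_JS, c, and the probability of an "occasional" deviation) are not specified numerically in this text and are chosen here purely for illustration:

    import random

    C_JS  = 0.05  # increment used by JS (value assumed for illustration)
    C     = 0.10  # small constant used by HIM, CTM, AWD (assumed)
    P_OCC = 0.10  # probability of an "occasional" deviation (assumed)

    def js(prev_avg, prev_own, prev_payoff):
        # Justified Snobism: slightly above the previous average if the last
        # payoff was non-negative, exactly the average otherwise.
        return min(1.0, prev_avg + C_JS) if prev_payoff >= 0 else prev_avg

    def ftm(prev_avg, prev_own, prev_payoff):
        # Follow-the-masses: match the previous average.
        return prev_avg

    def him(prev_avg, prev_own, prev_payoff):
        # Hide-in-the-masses: slightly below the previous average.
        return max(0.0, prev_avg - C)

    def osc_js(prev_avg, prev_own, prev_payoff):
        # Occasional-short-changed JS: like JS, but occasionally invest less.
        base = js(prev_avg, prev_own, prev_payoff)
        return max(0.0, base - C) if random.random() < P_OCC else base

    def ctm(prev_avg, prev_own, prev_payoff):
        # Challenge-the-masses: defect if the group cooperated less than
        # oneself in the previous iteration, otherwise cooperate at level c.
        return 0.0 if prev_avg < prev_own else C

    def na(prev_avg, prev_own, prev_payoff):
        # Non-altruism: always defect completely.
        return 0.0

    def awd(prev_avg, prev_own, prev_payoff):
        # Anything-will-do: always cooperate at a fixed level.
        return C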
In evolutionary game theory, each agent has exactly one strategy, which is encoded in a single gene. The survival of the agent and the number of its offspring depend on the performance of this strategy. Here, each agent has a set of strategies from which it can choose. This means strategies are encoded as multiple hypotheses. The set of strategy-hypotheses for an agent a_i is denoted with HS^s_i.
When playing games, an agent must first retrieve a strategy s from HS^s_i. This is done using roulette-wheel retrieval as introduced above. The outcome of the game is then used to update the preference pref(s) for the hypothesis that strategy s is useful for getting a high payoff in a game.
4 Learning to Cooperate
The overall experiments are done as follows. Before going into the details, let us first have another look at the overall game. There is a set of agents with a fixed cardinality n_S, the so-called society S. The society plays iterated CN-PDs in time-steps t. At the beginning of each time-step, the society is randomly split into groups of size N.
After the groups are formed, several CN-PD games are played and payoffs for each agent are generated. The payoffs are used to update the preferences of the different hypotheses, i.e., strategies. The groups are afterwards mixed together into a uniform society again and the overall process proceeds to the next time-step, t + 1.
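A minimal sketch of one such time step is given below. It reuses the roulette-wheel retrieval and the payoff computation sketched earlier, plus an update_preferences() helper that is sketched after the next paragraph; the Agent layout itself is purely illustrative, not prescribed by the paper:

    import random

    class Agent:
        # Illustrative agent: a dict mapping strategy functions to preferences,
        # plus the information the strategies need (previous group average,
        # own previous cooperation level, previous payoff).
        def __init__(self, strategies):
            self.prefs = {s: random.random() for s in strategies}
            self.prev_avg, self.prev_own, self.prev_payoff = 0.5, 0.5, 0.0

    def run_time_step(society, group_size=10, games_per_grouping=50):
        # One time step: random grouping, several CN-PD games per group,
        # preference updates, then the groups are merged again.
        random.shuffle(society)
        for i in range(0, len(society), group_size):
            group = society[i:i + group_size]
            for _ in range(games_per_grouping):
                # each agent first retrieves one strategy hypothesis ...
                chosen = [roulette_wheel_retrieve(list(a.prefs), a.prefs)
                          for a in group]
                # ... which then determines its cooperation level
                coop = [s(a.prev_avg, a.prev_own, a.prev_payoff)
                        for a, s in zip(group, chosen)]
                payoffs = cnpd_payoffs(coop)
                avg = sum(coop) / len(coop)
                for a, s, co, po in zip(group, chosen, coop, payoffs):
                    update_preferences(a.prefs, s, po)  # figure 1, see below
                    a.prev_avg, a.prev_own, a.prev_payoff = avg, co, po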
The main idea for the update is simply that the running average of payoffs is used as an indication of how preferable a certain strategy is. As a minor problem, negative payoffs have to be taken care of: the preferences must never become negative, as the standard roulette-wheel principle is only applicable with positive weights. Therefore, negative payoffs do not decrease the preference for a strategy s which was active in the last time-step, but they lead to an increase in the preference of all other strategies except s (figure 1).
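Written out as a Python sketch of the rule in figure 1 (the averaging factor q is not given numerically in this text, so the default value here is an assumption):

    def update_preferences(prefs, active, payoff, q=0.9):
        # prefs:  dict mapping each strategy hypothesis to its preference
        # active: the strategy that was retrieved and played
        # payoff: the payoff po_i obtained with it
        if payoff >= 0:
            # running average of the payoffs obtained with the active strategy
            prefs[active] = q * prefs[active] + (1 - q) * payoff
        else:
            # a negative payoff rewards all *other* strategies instead, so the
            # preferences stay positive for roulette-wheel retrieval
            for s in prefs:
                if s is not active:
                    prefs[s] = q * prefs[s] + (1 - q) * abs(payoff)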
4.1 Results
In the experiments reported here, the size n_S of the society is 100, and the group size is always 10 agents. After grouping, the agents always play 50 games together to collect payoffs before they proceed to the next time-step and new groups are formed. The preferences for the different hypotheses, i.e., strategies, are randomly initialized. In doing so, the different strategies are all equally likely in the standard experiments.
Figure 2 shows a result from a typical experiment. Much like in work on the evolution of cooperation, the cooperation level increases in the learning experiments presented here. This is also shown in figure 3, where the averaged results from 500 experiments are presented. A corresponding analysis of the most preferred strategies shows that Justified Snobism and its variants become the most dominantly preferred hypotheses. So, the agents learn to play this cooperative strategy most of the time despite the significant temptation to cheat on others.
To further test the stability of the presented learning algorithm, a series of experiments was done where the starting societies had a strong bias to be non-altruistic. Concretely, three series of experiments were done where 10%, 25%, and 50%, respectively, of the agents in the starting society had an initial likelihood of p = 0.9 to choose NA as their preferred strategy. In the remaining part of the society, non-altruism is of course also present; it is simply as likely as the other strategies. As shown in figure 4, even in the case of a large amount of non-altruism in the starting societies, this non-altruism disappears, i.e., cooperation takes over.
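One possible reading of this biased initialization, sketched on top of the illustrative Agent class from above (the boost factor and the exact mechanism are assumptions; the text only states the resulting likelihood of p = 0.9):

    import random

    def biased_society(n_agents, strategies, na_strategy, biased_fraction):
        # Build a starting society in which `biased_fraction` of the agents
        # have non-altruism (NA) as their most preferred strategy with
        # probability 0.9; the rest are initialized uniformly at random.
        society = [Agent(strategies) for _ in range(n_agents)]
        for a in society[:int(biased_fraction * n_agents)]:
            if random.random() < 0.9:
                # make NA clearly dominate the other preferences (assumed boost)
                a.prefs[na_strategy] = 2.0 * max(a.prefs.values())
        return society

    # Example: 100 agents, 50% biased towards non-altruism
    # society = biased_society(100, [js, ftm, him, ctm, na, awd], na, 0.5)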
5 Conclusion
Evolutionary game theory is a well-known tool for investigating basic properties of interactions between individuals. But the transfer of encoded information, i.e., genes, is unsuited as the main basis for models of social interactions. Social interactions do not operate on the time-scale of natural evolution, nor do they provide such powerful means of information exchange as the transfer of genes. Here, we show how learning can be used to overcome this severe drawback. The so-called multiple-hypotheses approach is used to successfully develop cooperation in scenarios modeled by a continuous-case N-player prisoners' dilemma.
References
[1] Robert Axelrod. The Evolution of Cooperation. Basic
Books, 1984.
[2] J. Maynard Smith. Evolution and the Theory of
Games. Cambridge University Press, Cambridge,
1982.
[3] Robert Axelrod and Lisa D’Ambrosio. An annotated
bibliography on the evolution of cooperation, 1994.
[4] Gerald M. Edelman. Neural Darwinism: The The-
ory of Neuronal Group Selection. Basic Books, New
York, 1987.
[5] Gerald M. Edelman. Neural darwinism: Population
thinking and higher brain function. In Michael Shafto,
editor, How We Know, pages 1–30. Harper and Row,
1985.
Figure 2. The development of the cooperation level (in %) over time steps in a typical experiment.
Figure 3. The development of the cooperation level (average cooperation in %) over time steps, averaged over 500 experiments.
Figure 4. The decrease of non-altruism as the most popular strategy (in % of agents) over time. The graphs show averaged data from 500 experiments with three differently seeded societies with 50%, 25%, and 10% of the agents, respectively, having non-altruism as their preferred strategy. In all cases, the agents learn that cooperation is beneficial despite the inherent temptation in the Prisoners' Dilemma to cheat on others.
[6] Andreas Birk and Wolfgang J. Paul. Schemas and
genetic programming. In Ritter, Cruse, and Dean,
editors, Prerational Intelligence, volume 2. Kluwer,
2000.
[7] Andreas Birk. Learning geometric concepts with an
evolutionary algorithm. In Proceedings of The Fifth
Annual Conference on Evolutionary Programming.
The MIT Press, Cambridge, 1996.
[8] Andreas Birk. Robot learning and self-sufficiency:
What the energy-level can tell us about a robot’s per-
formance. In Proceedings of the Sixth European
Workshop on Learning Robots, LNAI 1545. Springer,
1998.
[9] Andreas Birk and Julie Wiernik. An n-players pris-
oner’s dilemma in a robotic ecosystem. Interna-
tional Journal of Robotics and Autonomous Systems,
39:223–233, 2002.
[10] Andreas Birk. The evolution of cooperation in groups
with up to 25 agents. In The Third IASTED Interna-
tional Conference on Artificial Intelligence and Ap-
plications, AIA’03. 2003.
[11] Andreas Birk. Boosting cooperation by evolving
trust. Applied Artificial Intelligence Journal, 14(8),
2000.
[12] David Goldberg. Genetic Algorithms in Search Opti-
mization and Machine Learning. Kluwer Academic
Publishers, 1989.
[13] John H. Holland. Adaptation in Natural and Artifi-
cial Systems. The University of Michigan Press, Ann
Arbor, 1975.
[14] L.J. Fogel, A.J. Owens, and M.J. Walsh. Artificial In-
telligence through Simulated Evolution. Wiley, New
York, 1966.
[15] Hans Paul Schwefel. Numerische Optimierung von
Computer-Modellen mittels der Evolutions-Strategie.
Birkhäuser, Basel, 1977.
[16] Ingo Rechenberg. Evolutionsstrategie: Optimierung
technischer Systeme nach Prinzipien der biologischen
Evolution. Fromman-Holzboog, Stuttgart, 1973.
[17] John R. Koza. Genetic programming II. The MIT
Press, Cambridge, 1994.
[18] John R. Koza. Genetic programming. The MIT Press,
Cambridge, 1992.
[19] R. Ghanea-Hercock and A. P Fraser. Evolution of au-
tonomous robot control architectures. In T. C. Foga-
rty, editor, Evolutionary Computing, Lecture notes in
Computer Science. Springer-Verlag, 1994.
[20] Richard A. Watson, Sevan G. Ficici, and Jordan B.
Pollack. Embodied evolution: Embodying an evolu-
tionary algorithm in a population of robots. In Pe-
ter J. Angeline, Zbyszek Michalewicz, Marc Schoe-
nauer, Xin Yao, and Ali Zalzala, editors, Proceedings
of the Congress on Evolutionary Computation, vol-
ume 1, pages 335–342. IEEE Press, 1999.
[21] Peter Dittrich, Andreas Burgel, and Wolfgang
Banzhaf. Random morphology robot - A test plat-
form for online evolution. In Robots and Autonomous
Systems, 1998.
[22] D. Floreano and F. Mondada. Automatic creation of
an autonomous agent: Genetic evolution of a neural-
network driven robot. In Proceedings of the Confer-
ence on Simulation of Adaptive Behavior. 1994.
[23] John R. Koza. Evolution of a subsumption architec-
ture that performs a wall following task for an au-
tonomous mobile robot, chap. 19. In Computational
Learning Theory and Natural Learning Systems, vol-
ume II: Intersections Between Theory and Experi-
ment, pages 321–346. MIT Press, 1994.
[24] Karl Sims. Evolving 3D morphology and behavior
by competition. In R. Brooks and P. Maes, editors, Ar-
tificial Life IV, pages 28–39. MIT Press, Cambridge,
MA, 1994.
[25] Robert Axelrod and William D. Hamilton. The evolu-
tion of cooperation. Science, 211:1390–1396, 1981.
[26] Gilbert Roberts and Thomas N. Sherratt. Develop-
ment of cooperative relationships through increasing
investment. Nature, 394 (July):175–179, 1998.