
Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning

Abstract

We consider the problem of learning in repeated general-sum matrix games when a learning algorithm can observe the actions but not the payoffs of its associates. Because of the non-stationarity of the environment caused by learning associates in these games, most state-of-the-art algorithms perform poorly in some important repeated games due to an inability to make profitable compromises. To make these compromises, an agent must effectively balance competing objectives, including bounding losses, playing optimally with respect to current beliefs, and taking calculated, but profitable, risks. In this paper, we present, discuss, and analyze M-Qubed, a reinforcement learning algorithm designed to overcome these deficiencies by encoding and balancing best-response, cautious, and optimistic learning biases. We show that M-Qubed learns to make profitable compromises across a wide range of repeated matrix games played with many kinds of learners. Specifically, we prove that M-Qubed’s average payoffs meet or exceed its maximin value in the limit. Additionally, we show that, in two-player games, M-Qubed’s average payoffs approach the value of the Nash bargaining solution in self-play. Furthermore, it performs very well when associating with other learners, as evidenced by its robust behavior in round-robin and evolutionary tournaments of two-player games. These results demonstrate that an agent can learn to make good compromises, and hence receive high payoffs, in repeated games by effectively encoding and balancing best-response, cautious, and optimistic learning biases.
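As a rough illustration of how these three biases can coexist in a single learner, the sketch below combines Q-learning over joint-action states with optimistic initial values and a maximin fallback. The state encoding, constants, and fallback rule are simplifying assumptions for illustration, not M-Qubed itself.

```python
import random

# Minimal sketch (assumptions, not the published algorithm): a Q-learner over
# joint-action states that is optimistic (high initial Q-values),
# best-responding (greedy w.r.t. Q), and cautious (falls back to its maximin
# action if accumulated losses grow too large).
class BiasBalancedLearner:
    def __init__(self, my_actions, their_actions, maximin_action, maximin_value,
                 gamma=0.95, alpha=0.1, r_max=1.0, loss_tolerance=5.0):
        self.my_actions = my_actions
        self.maximin_action = maximin_action     # security (maximin) strategy
        self.maximin_value = maximin_value       # guaranteed long-run payoff
        self.gamma, self.alpha = gamma, alpha
        self.loss_tolerance = loss_tolerance
        self.cumulative_loss = 0.0
        # Optimistic bias: high initial Q-values steer exploration toward
        # cooperative, high-payoff solutions early in learning.
        optimistic = r_max / (1.0 - gamma)
        states = [(m, t) for m in my_actions for t in their_actions] + [None]
        self.q = {(s, a): optimistic for s in states for a in my_actions}

    def act(self, state, epsilon=0.05):
        # Cautious bias: if realized payoffs have fallen too far below the
        # maximin value, retreat to the security strategy to bound losses.
        if self.cumulative_loss > self.loss_tolerance:
            return self.maximin_action
        if random.random() < epsilon:
            return random.choice(self.my_actions)
        # Best-response bias: play greedily with respect to current beliefs.
        return max(self.my_actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        self.cumulative_loss += max(0.0, self.maximin_value - reward)
        best_next = max(self.q[(next_state, a)] for a in self.my_actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```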
... Nash-Q (Hu and Wellman 1998); **Q-learning (Sandholm and Crites 1996); *WoLF-PHC (Bowling and Veloso 2002); *GIGA-WoLF (Bowling 2005); *Exp3 (Auer et al. 1995); Minimax-Q (Littman 1994); M-Qubed (Crandall and Goodrich 2011); *Satisficing (Stimpson and Goodrich 2003); Tit-for-tat (Axelrod 1984); Godfather (Littman and Stone 2001); Bully (Littman and Stone 2001). * The algorithm can be used in games played with minimal information. ** The cited version of the algorithm cannot be used in games played with minimal information, but other versions of the algorithm can. ...
... For example, Tit-for-tat defines world state by its associate's previous action (Axelrod 1984). Similarly, learning algorithms have used the previous actions of all agents (called joint actions) to define state (Sandholm and Crites 1996; Crandall and Goodrich 2011; Bouzy and Metivier 2010). However, it is not possible to define state with associates' previous actions in games played with minimal information. ...
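A minimal illustration of the two settings contrasted in this excerpt (joint-action state versus minimal information), with hypothetical action labels and payoffs:

```python
# With full action observability, the state can be the previous joint action.
prev_joint_action = ("C", "D")          # (my last action, associate's last action)
state = prev_joint_action

# With minimal information, only the agent's own history is available, so the
# state must be built from its own past actions and payoffs.
own_history = [("C", 3.0), ("C", 0.0)]  # hypothetical (my action, my payoff) pairs
state_minimal = tuple(a for a, _ in own_history[-2:])
```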
... For example, tit-for-tat sets the initial state of the world to cooperate, which essentially gives associates the benefit of the doubt. Similarly, M-Qubed and the satisficing learning algorithm rely on high initial Q-values and aspirations (Crandall and Goodrich 2011; Stimpson and Goodrich 2003). This biases these algorithms toward solutions that produce higher long-term payoffs. ...
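The aspiration idea can be sketched as follows; the update rule and constants are illustrative assumptions rather than the cited algorithms' exact mechanics.

```python
import random

# Sketch of aspiration-based (satisficing) action selection with a high
# initial aspiration; decay rate and update rule are illustrative assumptions.
def satisficing_step(current_action, payoff, aspiration, actions, decay=0.99):
    if payoff >= aspiration:
        next_action = current_action            # satisfied: keep the action
    else:
        next_action = random.choice(actions)    # dissatisfied: search
    # Aspiration relaxes slowly toward recent payoffs; starting it high biases
    # the agent toward solutions with higher long-term payoffs.
    next_aspiration = decay * aspiration + (1.0 - decay) * payoff
    return next_action, next_aspiration
```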
Article
Automated agents for electricity markets, social networks, and other distributed networks must repeatedly interact with other intelligent agents, often without observing associates' actions or payoffs (i.e., minimal information). Given this reality, our goal is to create algorithms that learn effectively in repeated games played with minimal information. As in other applications of machine learning, the success of a learning algorithm in repeated games depends on its learning bias. To better understand what learning biases are most successful, we analyze the learning biases of previously published multi-agent learning (MAL) algorithms. We then describe a new algorithm that adapts a successful learning bias from the literature to minimal information environments. Finally, we compare the performance of this algorithm with ten other algorithms in repeated games played with minimal information.
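The minimal-information setting can be pictured as an interaction loop in which each learner is told only its own action and payoff; the learner interface below is hypothetical.

```python
# Hypothetical interface: in a minimal-information repeated game, each learner
# observes only its own action and payoff, never associates' actions or payoffs.
def play_repeated_game(learners, payoff_fn, rounds=1000):
    totals = [0.0] * len(learners)
    for _ in range(rounds):
        actions = [learner.act() for learner in learners]
        payoffs = payoff_fn(actions)                      # joint payoffs this round
        for i, learner in enumerate(learners):
            learner.observe(actions[i], payoffs[i])       # own action and payoff only
            totals[i] += payoffs[i]
    return totals
```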
... Finally, the theory behind interacting agents is important for many machine learning applications in general [65] and, in particular, in adversarial settings [66], where one agent tries to trick the other agent into thinking that a generated output is good. Understanding prosocial dynamics in multiagent systems [67] and fostering cooperation in them [68] is essential for developing robust and trustworthy artificial intelligence systems that can navigate complex social environments [69]. ...
Article
Full-text available
Large language models (LLMs) are increasingly used in applications where they interact with humans and other agents. We propose to use behavioural game theory to study LLMs’ cooperation and coordination behaviour. Here we let different LLMs play finitely repeated 2 × 2 games with each other, with human-like strategies, and with actual human players. Our results show that LLMs perform particularly well at self-interested games such as the iterated Prisoner’s Dilemma family. However, they behave suboptimally in games that require coordination, such as the Battle of the Sexes. We verify that these behavioural signatures are stable across robustness checks. We also show how GPT-4’s behaviour can be modulated by providing additional information about its opponent and by using a ‘social chain-of-thought’ strategy. This also leads to better scores and more successful coordination when interacting with human players. These results enrich our understanding of LLMs’ social behaviour and pave the way for a behavioural game theory for machines.
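The 2 × 2 games mentioned above can be written as small payoff tables; the matrices below use standard textbook values, which may differ from those used in the study.

```python
# Standard textbook payoff matrices for two of the 2 x 2 games mentioned above,
# keyed by (row action, column action) -> (row payoff, column payoff).
PRISONERS_DILEMMA = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
BATTLE_OF_THE_SEXES = {
    ("F", "F"): (2, 1), ("F", "O"): (0, 0),
    ("O", "F"): (0, 0), ("O", "O"): (1, 2),
}

def payoffs(game, row_action, col_action):
    return game[(row_action, col_action)]
```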
... Reinforcement learning (RL) focuses on training agents to make intelligent decisions through their interactions with the environment [10]. It has found extensive application in a variety of industries, such as robotics [11] and game playing [12]. Furthermore, multi-agent reinforcement learning (MARL) has also developed rapidly in recent years as an extension of reinforcement learning. ...
Article
Full-text available
In this paper, we propose a distributed multi-agent reinforcement learning (MARL) method to learn cooperative searching and tracking policies for multiple unmanned aerial vehicles (UAVs) with limited sensing range and communication ability. First, we describe the system model for multi-UAV cooperative searching and tracking of moving targets and adopt the average observation rate and average exploration rate as performance metrics. Moreover, we propose information update and fusion mechanisms to enhance the environment perception ability of the multi-UAV system. We then present the details of our method, including the observation and action space representation, the reward function design, and a training framework based on multi-agent proximal policy optimization (MAPPO). Simulation results show that our method converges well and outperforms baseline algorithms in terms of average observation rate and average exploration rate.
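The two metrics can be sketched roughly as follows; these formulas are one plausible reading of "average observation rate" and "average exploration rate", not necessarily the paper's exact definitions.

```python
# Plausible sketches of the two metrics; the paper's exact definitions may differ.
def average_observation_rate(observed_targets_per_step, num_targets):
    # Fraction of moving targets observed by at least one UAV, averaged over time.
    return sum(n / num_targets for n in observed_targets_per_step) / len(observed_targets_per_step)

def average_exploration_rate(visited_cells_per_step, num_cells):
    # Fraction of the search area's grid cells visited so far, averaged over time.
    return sum(len(v) / num_cells for v in visited_cells_per_step) / len(visited_cells_per_step)
```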
... Finally, the theory behind interacting agents is important for many machine learning applications in general [Crandall and Goodrich, 2011], and in particular, in adversarial settings [Goodfellow et al., 2020], where one agent tries to trick the other agent into thinking that a generated output is good. In Step (1), we turn the payoff matrix into textual game rules. ...
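Turning a payoff matrix into textual game rules, as in Step (1), might look like the following; the wording and the matrix format are hypothetical, not the authors' prompt.

```python
# Hypothetical prompt construction: convert a 2 x 2 payoff matrix, keyed by
# (my action, their action) -> (my payoff, their payoff), into textual rules.
def matrix_to_rules(game):
    actions = sorted({a for a, _ in game})
    rules = [f"You are playing a repeated game. Each round, choose one of: {', '.join(actions)}."]
    for (mine, theirs), (p_me, p_them) in game.items():
        rules.append(f"If you choose {mine} and the other player chooses {theirs}, "
                     f"you get {p_me} points and they get {p_them} points.")
    return "\n".join(rules)
```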
Preprint
Large Language Models (LLMs) are transforming society and permeating into diverse applications. As a result, LLMs will frequently interact with us and other agents. It is, therefore, of great societal value to understand how LLMs behave in interactive social settings. Here, we propose to use behavioral game theory to study LLMs' cooperation and coordination behavior. To do so, we let different LLMs (GPT-3, GPT-3.5, and GPT-4) play finitely repeated games with each other and with other, human-like strategies. Our results show that LLMs generally perform well in such tasks and also uncover persistent behavioral signatures. In a large set of two-player, two-strategy games, we find that LLMs are particularly good at games where valuing their own self-interest pays off, like the iterated Prisoner's Dilemma family. However, they behave sub-optimally in games that require coordination. We, therefore, further focus on two games from these distinct families. In the canonical iterated Prisoner's Dilemma, we find that GPT-4 acts particularly unforgivingly, always defecting after another agent has defected only once. In the Battle of the Sexes, we find that GPT-4 cannot match the behavior of the simple convention to alternate between options. We verify that these behavioral signatures are stable across robustness checks. Finally, we show how GPT-4's behavior can be modified by providing further information about the other player as well as by asking it to predict the other player's actions before making a choice. These results enrich our understanding of LLMs' social behavior and pave the way for a behavioral game theory for machines.
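The two behaviours highlighted here, unforgiving defection and the alternation convention, can be written as simple strategies; the code is illustrative, not the study's implementation.

```python
# Illustrative strategies for the behaviours described above.
def unforgiving(their_history):
    # Defect forever once the other player has defected even once.
    return "D" if "D" in their_history else "C"

def alternate(round_number, options=("F", "O")):
    # Battle-of-the-Sexes convention: alternate between the two options each round.
    return options[round_number % 2]
```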
Chapter
This volume provides a unique perspective on an emerging area of scholarship and legislative concern: the law, policy, and regulation of human-robot interaction (HRI). The increasing intelligence and human-likeness of social robots points to a challenging future for determining appropriate laws, policies, and regulations related to the design and use of AI robots. Japan, China, South Korea, and the US, along with the European Union, Australia and other countries are beginning to determine how to regulate AI-enabled robots, which concerns not only the law, but also issues of public policy and dilemmas of applied ethics affected by our personal interactions with social robots. The volume's interdisciplinary approach dissects both the specificities of multiple jurisdictions and the moral and legal challenges posed by human-like robots. As robots become more like us, so too will HRI raise issues triggered by human interactions with other people.
Article
Brian Skyrms' study of ideas of cooperation and collective action explores the implications of a prototypical story found in Rousseau's A Discourse on Inequality. It is therein that Rousseau contrasts the pay-off of hunting hare (where the risk of non-cooperation is small and the reward equally small) against the pay-off of hunting the stag (where maximum cooperation is required but the reward is much greater). Thus, rational agents are pulled in one direction by considerations of risk and in another by considerations of mutual benefit. Written with Skyrms' characteristic clarity and verve, The Stag Hunt will be eagerly sought by readers who enjoyed his earlier work Evolution of the Social Contract. Brian Skyrms, Distinguished Professor of Logic and Philosophy of Science and Economics at the University of California at Irvine and director of its interdisciplinary program in history and philosophy of science, has published widely in the areas of inductive logic, decision theory, rational deliberation and causality. Seminal works include Evolution of the Social Contract (Cambridge, 1996), The Dynamics of Rational Deliberation (Harvard, 1990), Pragmatics and Empiricism (Yale, 1984), and Causal Necessity (Yale, 1980).
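The tension between risk and mutual benefit can be made concrete with a standard stag hunt payoff table; the numbers below are illustrative.

```python
# Illustrative stag hunt payoffs, keyed by (row action, column action)
# -> (row payoff, column payoff). Hunting stag together pays most, but
# hunting hare is safe regardless of what the other player does.
STAG_HUNT = {
    ("Stag", "Stag"): (4, 4), ("Stag", "Hare"): (0, 3),
    ("Hare", "Stag"): (3, 0), ("Hare", "Hare"): (3, 3),
}
# (Stag, Stag) is the payoff-dominant equilibrium; (Hare, Hare) is risk-dominant,
# since Hare guarantees 3 while Stag risks 0 if the other player chooses Hare.
```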
Article
We consider a class of matrix games in which successful strategies are rewarded by high reproductive rates, so become more likely to participate in subsequent playings of the game. Thus, over time, the strategy mix should evolve to some type of optimal or stable state. Maynard Smith and Price (1973) have introduced the concept of ESS (evolutionarily stable strategy) to describe a stable state of the game. We attempt to model the dynamics of the game both in the continuous case, with a system of non-linear first-order differential equations, and in the discrete case, with a system of non-linear difference equations. Using this model, we look at the notions of stability and asymptotic behavior. Our notion of stable equilibrium for the continuous dynamic includes, but is somewhat more general than, the notion of ESS.
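For orientation, the continuous dynamic is commonly written as the replicator equation below, where x_i is the frequency of strategy i and A is the payoff matrix; the discrete analogue shown assumes positive payoffs. These are standard statements of the dynamic, given here as a reference point rather than the paper's exact notation.

```latex
\dot{x}_i = x_i\left[(Ax)_i - x^{\top} A x\right], \qquad i = 1,\dots,n,
\qquad\text{and, in discrete time,}\qquad
x_i' = x_i\,\frac{(Ax)_i}{x^{\top} A x}.
```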
Article
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
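As a concrete instance of learning from delayed reinforcement, the standard one-step Q-learning update (a canonical example, not tied to any particular system surveyed) is:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

An ε-greedy policy then trades off exploration and exploitation by playing the action that maximizes Q(s, a) with probability 1 − ε and a random action otherwise.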