Article

Superhuman AI for multiplayer poker

Authors: Noam Brown and Tuomas Sandholm

Abstract

AI now masters six-player poker. Computer programs have shown superiority over humans in two-player games such as chess, Go, and heads-up no-limit Texas hold'em poker. However, poker games usually include six players—a much trickier challenge for artificial intelligence than the two-player variant. Brown and Sandholm developed a program, dubbed Pluribus, that learned how to play six-player no-limit Texas hold'em by playing against five copies of itself (see the Perspective by Blair and Saffidine). When pitted against five elite professional poker players, or with five copies of Pluribus playing against one professional, the computer performed significantly better than humans over the course of 10,000 hands of poker. Science, this issue p. 885; see also p. 864


... AlphaStar exploited the game's fog-of-war mechanics to feint: to pretend to move its troops in one direction while secretly planning an alternative attack (Piper 2019). • Bluffs: Pluribus, a poker-playing model created by Meta, successfully bluffed human players into folding (Brown et al. 2019). • Cheating the safety test: AI agents learned to play dead, in order to avoid being detected by a safety test designed to eliminate faster-replicating variants of the AI (Lehman et al. 2020). ...
... Some situations naturally lend themselves to AIs learning how to deceive. For example, consider the poker-playing AI system Pluribus, developed by Meta and Carnegie Mellon University (Brown et al. 2019). Because players cannot see each others' cards, poker offers many opportunities for players to misrepresent their own strength and gain an advantage. ...
Preprint
This paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false beliefs in the pursuit of some outcome other than the truth. We first survey empirical examples of AI deception, discussing both special-use AI systems (including Meta's CICERO) built for specific competitive situations, and general-purpose AI systems (such as large language models). Next, we detail several risks from AI deception, such as fraud, election tampering, and losing control of AI systems. Finally, we outline several potential solutions to the problems posed by AI deception: first, regulatory frameworks should subject AI systems that are capable of deception to robust risk-assessment requirements; second, policymakers should implement bot-or-not laws; and finally, policymakers should prioritize the funding of relevant research, including tools to detect AI deception and to make AI systems less deceptive. Policymakers, researchers, and the broader public should work proactively to prevent AI deception from destabilizing the shared foundations of our society.
... Poker, a game that intertwines strategy and chance, has emerged as an exciting and challenging domain for artificial intelligence (AI) research, primarily due to its inherent nature of imperfect information. Pioneering AI agents such as DeepStack and Pluribus have leveraged poker, more specifically Texas hold'em, as a benchmark for evaluating their algorithms [1], [2]. Nonetheless, poker is not a monolithic game but a collection of numerous variants, each introducing its unique set of complexities. ...
... The continuous action space and the imperfect information nature inherent to poker present a compelling challenge for artificial intelligence. Most of the breakthroughs in Poker AI agents were limited to Texas hold'em variants [1], [2]. PokerKit's capacity to support a comprehensive set of poker variants, coupled with its robust and error-free nature, makes it an ideal framework for developing, testing, and benchmarking poker AI models that can generalize beyond Texas hold'em. ...
Preprint
Full-text available
PokerKit is an open-source Python library designed to overcome the restrictions of existing poker game simulation and hand evaluation tools, which typically support only a handful of poker variants and lack flexibility in game state control. In contrast, PokerKit significantly expands this scope by supporting an extensive array of poker variants and provides a flexible architecture for users to define their custom games. This paper details the design and implementation of PokerKit, including its intuitive programmatic API, multi-variant game support, and a unified hand evaluation suite across different hand types. The flexibility of PokerKit allows for applications in diverse areas, such as poker AI development, tool creation, and online poker casino implementation. PokerKit's reliability has been established through static type checking, extensive doctests, and unit tests, achieving 97% code coverage. The introduction of PokerKit represents a significant contribution to the field of computer poker, fostering future research and advanced AI development for a wide variety of poker games.
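To make the kind of programmatic, variant-parameterized simulation API described above concrete, here is a deliberately toy Python sketch. The names (GameSpec, deal, high_card_score) and the simplistic evaluation are assumptions of this summary for illustration only, not PokerKit's actual interface.

# Toy illustration of a variant-parameterized poker simulation API.
# All names here are hypothetical; consult PokerKit's documentation for its real interface.
from dataclasses import dataclass
import random

RANKS = "23456789TJQKA"

@dataclass(frozen=True)
class GameSpec:
    variant: str      # e.g. "no-limit hold'em" or "pot-limit Omaha"
    players: int
    hole_cards: int   # 2 for hold'em, 4 for Omaha

def deal(spec: GameSpec, rng=random):
    deck = [rank + suit for rank in RANKS for suit in "shdc"]
    rng.shuffle(deck)
    return [[deck.pop() for _ in range(spec.hole_cards)] for _ in range(spec.players)]

def high_card_score(hand):
    # Stand-in for a unified hand-evaluation suite: rank hands by their highest card only.
    return max(RANKS.index(card[0]) for card in hand)

spec = GameSpec("no-limit hold'em", players=6, hole_cards=2)
hands = deal(spec)
winner = max(range(spec.players), key=lambda seat: high_card_score(hands[seat]))
print(f"toy winner: seat {winner} with {hands[winner]}")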
... [8] Or Pluribus, which consistently wins Texas hold'em poker tournaments? [9] Or AlphaGo, which conquered the Go world. [10] Agreed, fantastic achievements, especially in the case of Pluribus, which addresses uncertainty in every hand. ...
... Note that since the B/T was not an exact match to the Taylor series as shown in Table (3), the model was applied only to a subset of vessels in the dataset with the highest B/T; specifically, the subset contained 13 vessels with a B/T range between 1.74 and 1.98. Also, since the hull shape is not an exact fit as shown in Figure (9), this analysis should be considered only as an approximate estimate of resistance. ...
Article
Full-text available
Today’s naval operations are often faced with challenging decision spaces as technology exponentially expands and threat environments become more complex. Advances in artificial intelligence (AI) offer potential solutions to address the growing complexity in naval operations. Future AI systems offer potentially far-reaching benefits—improving situation awareness, increasing knowledge of threats and adversary capabilities and intents, identifying and evaluating possible tactical courses of action, and offering methods to predict outcomes and effects of course of action decisions. AI systems will play a critical role in supporting future naval warfighters and maintaining operational and tactical mission superiority. AI systems offer advantages to naval warfare, but only if these systems are engineered and implemented in a manner that supports effective warfighter-machine teaming, reduces uncertainty in the operational situation, and makes recommendations that improve operational and tactical outcomes. Implementing AI systems that meet these demanding needs for naval applications presents challenges for the engineering design community. This paper identifies four challenges and describes how they affect warfare operations, the engineering community, and naval missions. This paper offers solution ideas for addressing the challenges through research and engineering initiatives.
... Applications of MCTS can also be found in Poker (Brown et al. [19]). The authors employed tree search with Counterfactual Regret Minimization described in section 1.1. ...
... The network's training is performed on randomly generated positions so that a sufficient degree of generalization of the game states can be achieved. The proposed extension allows DeepStack to defeat professional Texas Hold'em players. Brown et al. [18], [19] combine CFR with Monte Carlo sampling (MCCFR). MC simulation is used to sample actions on every iteration to avoid passing through the entire game tree. ...
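For readers unfamiliar with the regret-matching step that CFR iterates (and that MCCFR drives with sampled rather than full-tree regret estimates), here is a minimal generic Python sketch; it illustrates the textbook update, not the cited authors' code.

# Regret matching: play each action in proportion to its positive cumulative regret.
# MCCFR uses the same update but estimates the regrets from sampled actions.
import numpy as np

def regret_matching(cumulative_regret):
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

regrets = np.array([3.0, -1.0, 1.0])   # cumulative counterfactual regrets for 3 actions
print(regret_matching(regrets))        # -> [0.75 0.   0.25]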
Thesis
Full-text available
The thesis presents machine learning methods used to create agents for The Lord of the Rings card game (LOTRCG). The distinguishing features of LOTRCG are the non-competitive nature and deep strategic dimension of the gameplay. The player makes decisions at five well-defined moments during a round. Actions are sequential, meaning that a decision made in a particular game phase impacts a decision in the next phase. This mechanism of the round places emphasis on strategic card management. In addition, random events in the form of card discovery occur between decisions. These events prevent simplifications of the gameplay mechanism by combining several decision moments into one. LOTRCG is a collectible card game, which means a wide selection of cards is available to the player. These cards have different statistics and special abilities profiling them for a particular phase of the game. The wide variety of cards translates into the high popularity of this title among card game enthusiasts. This element determining the game's appeal also poses a major challenge for AI agents. The complex nature of the round is the first object of research of this dissertation. A random agent was implemented to analyze the different phases of the game. The obtained results made it possible to identify key decision moments. In addition, the random agent was used to perform sampling during iterations of the MCTS algorithm. The dissertation uses two families of computational intelligence techniques. The first is Monte Carlo Tree Search (MCTS). This method is based on a heuristic search that uses random sampling of game states. MCTS stores gameplay as a tree, which is expanded iteratively. The MCTS algorithm was extended to include action reduction, which reduces the tree size. This modification was based on expert knowledge, which allows the elimination of cards with low utility value. In addition, the optimization of MCTS hyperparameters was carried out not only in terms of efficiency but also in terms of computational time. In this way, an optimal set of MCTS agents was constructed, which achieved a win rate of 82.8%. The second family of techniques is reinforcement learning (RL). Reinforcement learning is based on a trial-and-error method in which the agent interacts with the environment. The agent receives an observation of the state of the game and then chooses an action, which is executed in the environment. The environment returns the reward and the next observation. The agent's goal is to maximize the sum of rewards throughout the episodes. Due to the agent's operation in an environment with a variable number of actions, the selection of RL algorithms was based on how the actions were encoded. Two types of coding were introduced. The first is macro actions, which allow a fixed number of actions. The second type is direct action choice, which is an agent's playing of individual cards. Q-Learning and Actor-Critic (AC) algorithms were implemented for macro-actions. Q-Learning is a widely used solution in the RL field that approximates the utility value of a game state. Actor-Critic uses an approximation of both utility and strategy functions. The AC agent is programmed for macro-actions as well as direct actions. As a result of the optimization, the best set of RL agents showed an efficiency of 95.3%. The paper compares the best MCTS and RL agents.
The conducted experiments show a significant advantage of reinforcement learning, but this solution requires a long learning time, which strongly depends on the available computational power. MCTS seemingly has fewer requirements, but the decision time is much longer.
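A compact sketch of the UCT selection rule such an MCTS agent iterates, with a hook for the expert-knowledge action reduction the thesis describes; this is a generic illustration under assumed data structures, not the LOTRCG agent itself.

# UCT child selection with an optional action-reduction filter (illustrative).
import math

def uct_select(children, total_visits, c=1.4, allowed=None):
    # children: list of (action, visit_count, total_value) tuples for one tree node.
    # allowed: optional set of actions kept after pruning low-utility cards.
    best_action, best_score = None, -math.inf
    for action, visits, value in children:
        if allowed is not None and action not in allowed:
            continue
        if visits == 0:
            return action  # expand unvisited children first
        score = value / visits + c * math.sqrt(math.log(total_visits) / visits)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

print(uct_select([("a", 10, 6.0), ("b", 3, 2.5), ("c", 0, 0.0)], total_visits=13))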
... This commercial model made THHUP a true skill-based gambling format, as considered by the gambling literature (Turner & Fritz, 2001). This was not an unwise bet, as another computer had in 2008 beaten a top team of professional players in this poker format (Newall, 2018), the use of computers as training aides by professionals was just about to begin (Newall, 2013), and computers would soon beat top professionals at no-limit hold'em, a more complex poker game (Brown & Sandholm, 2017, 2019). ...
... It is possible that the machine was of little interest to amateur gamblers, given that no-limit hold'em is more popular than limit hold'em, which the machine plays. However, computers can now play no-limit hold'em better than top professionals can (Brown & Sandholm, 2017, 2019), and so if this was the reason for THHUP being discontinued, it would seem like a more popular version playing no-limit hold'em could be developed. Amateur players can also be suspicious when they lose, and the machine made many plays that appear unconventional to amateurs and yet have theoretical justification (Chen & Ankenman, 2006). ...
... obtained in the future from a hybrid approach using interaction with a quantum device to define and calculate the objective function. We consider reinforcement learning (RL) because it has already achieved beyond human level performance in a variety of areas, including Chess, Go, Poker, and Atari games [18][19][20][21]. In order to learn these strategies, all that is required is lots of experience, that is, the ability to play the game many times over and over. ...
... To fix the ambiguity of going from a tensor network to a code, we will always take the first qubit of the first lego to be the only logical leg. As a baseline, we compare to naïve code concatenation, requiring five copies of the base code, that yields a distance-4 [[21,1,4]] code (Fig. 3). This code can clearly be constructed as a T6 QL code. ...
Preprint
Full-text available
The recently introduced Quantum Lego framework provides a powerful method for generating complex quantum error correcting codes (QECCs) out of simple ones. We gamify this process and unlock a new avenue for code design and discovery using reinforcement learning (RL). One benefit of RL is that we can specify arbitrary properties of the code to be optimized. We train on two such properties: maximizing the code distance, and minimizing the probability of logical error under biased Pauli noise. For the first, we show that the trained agent identifies ways to increase code distance beyond naive concatenation, saturating the linear programming bound for CSS codes on 13 qubits. With a learning objective to minimize the logical error probability under biased Pauli noise, we find the best known CSS code at this task for $\lesssim 20$ qubits. Compared to other (locally deformed) CSS codes, including Surface, XZZX, and 2D Color codes, our $[[17,1,3]]$ code construction actually has lower adversarial distance, yet better protects the logical information, highlighting the importance of QECC desiderata. Lastly, we comment on how this RL framework can be used in conjunction with physical quantum devices to tailor a code without explicit characterization of the noise model.
... Board games such as checkers [1] and chess [2] were the first to be solved by AI algorithms in the last century. With the boost in computational power and the application of new algorithms, in the last decade, AI systems have achieved superhuman performance in many games that once only humans were believed to be able to master, from board games such as Go [3][4][5] and card games such as Texas Hold'em [6][7][8] to video games such as StarCraft [9], Dota 2 [10], and HoK [11]. These games are all popular among humans, with various annual and seasonal competitions held all over the world. ...
... In the last decade, AI systems have achieved superhuman performance in many popular games played by humans, thanks to the advances in AI techniques and the boost in computational power. These games include card games such as Texas Hold'em [6][7][8], which is the most popular card game played in America, and multi-player video games such as StarCraft [9] and Dota 2 [10], which have well-established e-sports events popular all over the world. One important reason these games draw attention to AI research is that these games are famous enough to form communities of players who organize annual or seasonal tournaments both for human players and for AI programs. ...
Article
Full-text available
Games have long been benchmarks and testbeds for AI research. In recent years, with the development of new algorithms and the boost in computational power, many popular games played by humans have been solved by AI systems. Mahjong is one of the most popular games played in China and has spread worldwide. It presents challenges for AI research due to its multi-agent nature, rich hidden information, and complex scoring rules, but it has been somewhat overlooked in the game AI research community. In 2020 and 2022, we held two AI competitions of Official International Mahjong, the standard variant of Mahjong rules, in conjunction with a top-tier AI conference called IJCAI. We are the first to adopt the duplicate format in evaluating Mahjong AI agents to mitigate the high variance in this game. By comparing the algorithms and performance of AI agents in the competitions, we conclude that supervised learning and reinforcement learning are the current state-of-the-art methods in this game and perform much better than heuristic methods based on human knowledge. We also held a human-versus-AI competition and found that the top AI agent still could not beat professional human players. We claim that this game can be a new benchmark for AI research due to its complexity and popularity among people.
... In general, artificial systems based on neural networks and deep learning have achieved impressive success in many domains based on pattern recognition, like speech recognition, lipreading, and game-playing. Examples in this area include LipNet (Assael et al. 2016), a program for lung cancer screening (Ardila et al. 2019), AlphaFold, which predicts protein structure (Jumper et al. 2021), AlphaTensor, which discovers novel matrix multiplication algorithms (Fawzi et al. 2022), Deep Blue (Campbell 2002), and AlphaGo and other game-playing programs (Brown & Sandholm 2019; Silver et al. 2016, 2018). With respect to language, we have observed impressive progress in automatic translation (DeepL) and computer code generation (GitHub Copilot) over the past few years, all relying on LLMs and deep learning. ...
... Likewise, the increased quality of translation software is indeed helpful, although it is still advisable not to release a translation without human review. In other domains, such as game-playing AIs, neural network architectures have led to success (Campbell 2002; Brown & Sandholm 2019; Silver et al. 2016, 2018). And impressive results can also be pointed to in scientific fields, such as a program for lung cancer screening (Ardila et al. 2019), AlphaFold, which predicts protein structure (Jumper et al. 2021), and AlphaTensor, which discovers novel matrix multiplication algorithms (Fawzi et al. 2022). ...
Preprint
Full-text available
Natural language processing based on large language models (LLMs) is a booming field of AI research. After neural networks have proven to outperform humans in games and practical domains based on pattern recognition, we may now stand at a road junction where artificial entities eventually enter the realm of human communication. However, this comes with serious risks. Due to the inherent limitations regarding the reliability of neural networks, overreliance on LLMs can have disruptive consequences. Since it will be increasingly difficult to distinguish between human-written and machine-generated text, one is confronted with new ethical challenges. These begin with human authorship that is no longer verifiable beyond doubt and continue with various types of fraud, such as a new form of plagiarism. They also concern the violation of privacy rights, the possibility of circulating counterfeits of humans, and, last but not least, the potential for a massive spread of misinformation.
... Existing VSMM trackers rarely take data accumulation and learning methods into account when performing MSA. Reinforcement learning (RL) [31,32] is a machine learning technique that considers both data learning and decision making [33], and its online decision mechanism has the potential to improve MSA. As a result, we incorporate RL into the MSA mechanism. ...
Article
Full-text available
The variable structure multi-model (VSMM) algorithm is a widely used approach to maneuvering target tracking. One of the crucial elements determining the tracking effect is successful model set adaptation (MSA). The ability to further enhance tracking accuracy for the conventional VSMM method is constrained by the absence of a mechanism to thoroughly utilise observation and tracking data to optimise MSA. We incorporate the reinforcement learning (RL) approach into the MSA procedure to address this issue and provide a VSMM algorithm based on Monte Carlo (MC) learning. To formulate the challenge of optimising the number of effective models as an RL problem, we first used the prediction error, the number of effective model sets, and the tracking accuracy to build models of the appropriate state space, decision space, and reward. The number of effective models was optimised using MC learning, and the entire VSMM algorithm was then created. The proposed approach was compared with five maneuvering target tracking algorithms in simulated experiments. The outcomes demonstrate that the suggested algorithm has a lower computational cost and good tracking accuracy.
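As a rough illustration of the tabular Monte Carlo control update the abstract builds on, with the state assembled from a discretized prediction error and the current number of active models: this is a schematic assumption of this summary, and the paper's actual state, decision and reward designs are richer.

# Schematic every-visit Monte Carlo control update for model-set adaptation.
# state: (discretized prediction error, number of active models); action: grow/shrink/keep.
from collections import defaultdict

Q = defaultdict(float)   # action-value estimates
N = defaultdict(int)     # visit counts

def mc_update(episode, gamma=1.0):
    # episode: list of (state, action, reward) triples from one tracking run.
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        N[(state, action)] += 1
        Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]

mc_update([(("low_err", 3), "keep", 1.0), (("high_err", 3), "grow", 0.5)])
print(dict(Q))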
... Although a theoretical understanding of its success is still a mystery (Farina et al. 2023), in practice CFR+ converges much faster than vanilla CFR. CFR+ was used to solve heads-up limit Texas hold'em poker (Bowling et al. 2015) and heads-up no-limit Texas hold'em poker (Brown and Sandholm 2019b). Warm start is also a technique to accelerate convergence. ...
Preprint
Full-text available
Counterfactual Regret Minimization (CFR) and its variants are the best algorithms so far for solving large-scale incomplete-information games. Building upon CFR, this paper proposes a new algorithm named Pure CFR (PCFR) for achieving better performance. PCFR can be seen as a combination of CFR and Fictitious Play (FP), inheriting the concept of counterfactual regret (value) from CFR and using the best-response strategy instead of the regret-matching strategy for the next iteration. Our theoretical proof that PCFR achieves Blackwell approachability allows PCFR to be combined with any CFR variant, including Monte Carlo CFR (MCCFR). The resultant Pure MCCFR (PMCCFR) can significantly reduce time and space complexity. In particular, PMCCFR converges at least three times faster than MCCFR. In addition, since PMCCFR does not pass through the paths of strictly dominated strategies, we developed a new warm-start algorithm inspired by the strictly-dominated-strategy elimination method. Consequently, PMCCFR with the new warm-start algorithm converges two orders of magnitude faster than the CFR+ algorithm.
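The difference the abstract describes between the regret-matching step and PCFR's pure best-response step can be sketched in a few lines; this is one plausible reading of the abstract, not the authors' reference implementation.

# CFR next strategy: mix actions in proportion to positive cumulative regret.
# PCFR next strategy (as described): a pure best response, i.e. all mass on the argmax.
import numpy as np

def cfr_next_strategy(regret):
    positive = np.maximum(regret, 0.0)
    if positive.sum() > 0:
        return positive / positive.sum()
    return np.full_like(regret, 1.0 / len(regret))

def pcfr_next_strategy(regret):
    strategy = np.zeros_like(regret)
    strategy[np.argmax(regret)] = 1.0
    return strategy

r = np.array([2.0, 5.0, -1.0])
print(cfr_next_strategy(r))    # approx. [0.286, 0.714, 0.0]
print(pcfr_next_strategy(r))   # -> [0., 1., 0.]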
... One of the significant milestones was witnessed in 2016 when an AI player, AlphaGo [9,10], was able to defeat the human world champion Lee Sedol in a game of Go. Since then, deep reinforcement learning (RL)-based programs and applications have been developed to determine the degree to which they can compete with human champion players in various games like Texas hold'em poker [11] and Dota 2 [12]. Taking into consideration the revolutionary success of AI, researchers predict that it will snowball in the next few years, reaching a $190.61 billion market value in 2025 [13][14][15]. ...
Article
Full-text available
Recent years have seen tremendous growth in Artificial Intelligence (AI)-based methodological development in a broad range of domains. In this rapidly evolving field, a large number of methods are being reported using machine learning (ML) and Deep Learning (DL) models. The majority of these models are inherently complex and lack explanations of their decision-making process, causing them to be termed 'black boxes'. One of the major bottlenecks to adopting such models in mission-critical application domains, such as banking, e-commerce, healthcare, and public services and safety, is the difficulty in interpreting them. Due to the rapid proliferation of these AI models, explaining their learning and decision-making processes is getting harder, which requires transparency and predictability. Aiming to collate the current state of the art in interpreting black-box models, this study provides a comprehensive analysis of explainable AI (XAI) models. Finding flaws in these black-box models, so as to reduce their false negative and false positive outcomes, is still difficult and inefficient. In this paper, the development of XAI is reviewed meticulously through careful selection and analysis of the current state of the art of XAI research. The paper also provides a comprehensive and in-depth evaluation of XAI frameworks and their efficacy, to serve as a starting point of XAI for applied and theoretical researchers. Towards the end, it highlights emerging and critical issues pertaining to XAI research to showcase major, model-specific trends for better explanation, enhanced transparency, and improved prediction accuracy.
... The success of ML primarily hinges on the design and training of deep neural networks, which approximate complex, high-dimensional mappings with desirable accuracy and rapid evaluation. As a result, ML has become a leading force in different areas of artificial intelligence (AI), such as natural language processing (large language models such as ChatGPT [66]), computer vision (e.g., NeRF [43], Diffusion Models [30]) and games (e.g., mastering the game of Go [56], Poker [6], StarCraft II [63]). Moreover, the excellent approximation capabilities of deep neural networks have helped discover new patterns or principles within large, multidimensional data sets. ...
Preprint
This paper presents a novel, interdisciplinary study that leverages a Machine Learning (ML) assisted framework to explore the geometry of affine Deligne-Lusztig varieties (ADLV). The primary objective is to investigate the nonemptiness pattern, dimension and enumeration of irreducible components of ADLV. Our proposed framework demonstrates a recursive pipeline of data generation, model training, pattern analysis, and human examination, presenting an intricate interplay between ML and pure mathematical research. Notably, our data-generation process is nuanced, emphasizing the selection of meaningful subsets and appropriate feature sets. We demonstrate that this framework has a potential to accelerate pure mathematical research, leading to the discovery of new conjectures and promising research directions that could otherwise take significant time to uncover. We rediscover the virtual dimension formula and provide a full mathematical proof of a newly identified problem concerning a certain lower bound of dimension. Furthermore, we extend an open invitation to the readers by providing the source code for computing ADLV and the ML models, promoting further explorations. This paper concludes by sharing valuable experiences and highlighting lessons learned from this collaboration.
... Multi-agent reinforcement learning (MARL) is a discipline developed to analyze strategic interactions in a dynamically changing environment, where agents' actions affect both the state of nature and the rewards of the other agents. Guided by game-theoretic principles, several cornerstone results in benchmark domains in AI have been witnessed [Bowling et al., 2015, Silver et al., 2017, Moravčík et al., 2017, Brown and Sandholm, 2018, 2019, Brown et al., 2020, Perolat et al., 2022]. Most of the outstanding progress relies on the scalability and the decentralized manner of the algorithms for computing Nash equilibria (NE), a standard game-theoretic notion characterizing the rationality of the agents, in complex multi-agent settings. ...
Preprint
Computing approximate Nash equilibria in multi-player general-sum Markov games is a computationally intractable task. However, multi-player Markov games with certain cooperative or competitive structures might circumvent this intractability. In this paper, we focus on multi-player zero-sum polymatrix Markov games, where players interact in a pairwise fashion while remaining overall competitive. To the best of our knowledge, we propose the first policy optimization algorithm, called Entropy-Regularized Optimistic-Multiplicative-Weights-Update (ER-OMWU), for finding approximate Nash equilibria in finite-horizon zero-sum polymatrix Markov games with full-information feedback. We provide last-iterate convergence guarantees for finding an $\epsilon$-approximate Nash equilibrium within $\tilde{O}(1/\epsilon)$ iterations, which is near-optimal compared to the optimal $O(1/\epsilon)$ iteration complexity of two-player zero-sum Markov games, a degenerate case of zero-sum polymatrix games with only two players involved. Our algorithm combines regularized and optimistic learning dynamics with a separated smooth value update within a single loop, where players update strategies in a symmetric and almost uncoupled manner. It provides natural dynamics for finding equilibria and is more amenable to a future sample-efficient and fully decentralized implementation where only partial-information feedback is available.
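As a rough single-decision illustration of the optimistic multiplicative-weights step in the algorithm's name (the paper's ER-OMWU additionally applies entropy regularization and couples the step with smooth value updates across players and stages, which this sketch omits):

# Schematic optimistic multiplicative-weights update for one decision with full feedback.
import numpy as np

def omwu_step(strategy, grad_now, grad_prev, eta=0.1):
    # Optimism: use 2*g_t - g_{t-1} as a prediction of the next payoff gradient.
    logits = np.log(strategy) + eta * (2 * grad_now - grad_prev)
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

x = np.full(3, 1 / 3)                      # current mixed strategy over 3 actions
g_prev = np.zeros(3)
g_now = np.array([1.0, 0.2, -0.5])         # payoff gradient observed this round
print(omwu_step(x, g_now, g_prev))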
... Citing the rise of AI, the human Go champion Lee Sedol (who lost four games, but won one, against AlphaGo in 2016) recently announced his retirement! Poker - an imperfect-information game, as other players' cards are hidden - has also seen tremendous advances of late, with machines trumping humans (Brown and Sandholm 2019). ...
Article
Full-text available
Innovative, bold initiatives that capture the imagination of researchers and system builders are often required to spur a field of science or technology forward. A vision for the future of artificial intelligence was laid out by Turing Award winner Raj Reddy in his 1988 Presidential address to the Association for the Advancement of Artificial Intelligence. It is time to provide an accounting of the progress that has been made in the field, over the last three decades, toward the challenge goals. While some tasks such as the world‐champion chess machine were accomplished in short order, many others, such as self‐replicating systems, require more focus and breakthroughs for completion. A new set of challenges for the current decade is also proposed, spanning the health, wealth, and wisdom spheres.
... Since their early success in computer vision [328-332], they have set new standards in natural language processing [333-336] and the playing of complex games, such as Go [337,338] or Poker [339-341]. Deep learning also increasingly impacts the natural sciences [342]; for example, deep neural networks recently helped predict the 3D structure of nearly every human protein [343] in a breakthrough for structural biology. Further applications of machine learning to solve physics problems are also given in Sec. ...
Article
Adaptivity is a dynamical feature that is omnipresent in nature, socio-economics, and technology. For example, adaptive couplings appear in various real-world systems, such as the power grid, social, and neural networks, and they form the backbone of closed-loop control strategies and machine learning algorithms. In this article, we provide an interdisciplinary perspective on adaptive systems. We reflect on the notion and terminology of adaptivity in different disciplines and discuss which role adaptivity plays for various fields. We highlight common open challenges and give perspectives on future research directions, looking to inspire interdisciplinary approaches.
... Recently, reinforcement learning (RL), including single-agent RL and multi-agent RL (MARL), has received significant research interest, partly due to its many applications in a variety of scenarios such as autonomous driving, traffic signal control, cooperative robotics, economic policy-making, and video games [1,2,3,4,5,6,7,8]. In MARL, at each state, each agent takes its own action, and these actions jointly determine the next state of the environment and the reward of each agent. ...
Preprint
Full-text available
Due to the broad range of applications of multi-agent reinforcement learning (MARL), understanding the effects of adversarial attacks against MARL models is essential for their safe application. Motivated by this, we investigate the impact of adversarial attacks on MARL. In the considered setup, there is an exogenous attacker who is able to modify the rewards before the agents receive them or manipulate the actions before the environment receives them. The attacker aims to guide each agent into a target policy, or to maximize the cumulative rewards under some specific reward function chosen by the attacker, while minimizing the amount of manipulation of feedback and actions. We first show the limitations of action-poisoning-only attacks and reward-poisoning-only attacks. We then introduce a mixed attack strategy with both action poisoning and reward poisoning. We show that the mixed attack strategy can efficiently attack MARL agents even if the attacker has no prior information about the underlying environment and the agents' algorithms.
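A schematic of the threat model just described: a wrapper around one environment step that can perturb the joint action before the environment sees it and the rewards before the agents see them. The function and its gym-style step signature are assumptions of this summary; the paper's contribution is how the attacker chooses these perturbations, which are left here as simple stubs.

# Illustrative mixed action/reward poisoning wrapper around a multi-agent env step.
def poisoned_step(env, joint_action, poison_action, poison_reward):
    # poison_action: maps the agents' joint action to the (possibly) modified one.
    # poison_reward: maps (next_state, applied joint action, rewards) to modified rewards.
    attacked_action = poison_action(joint_action)
    next_state, rewards, done, info = env.step(attacked_action)
    attacked_rewards = poison_reward(next_state, attacked_action, rewards)
    return next_state, attacked_rewards, done, info

# Example stubs: leave actions untouched, scale every agent's reward toward zero.
identity_action = lambda a: a
damp_rewards = lambda s, a, r: [0.5 * x for x in r]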
... However, solving for equilibria in multi-player zero-sum games with three or more players remains a tricky challenge. There are three main reasons for this: firstly, the CFR-like algorithms for finding NE are widely used in 2p0s games, but no theoretical guarantees are provided in the literature as to whether they can be directly used in multi-player games [4]; secondly, NEs are not unique in multi-player games, and the independent strategies of each player cannot easily form a unique NE [8]; thirdly, computing NEs is PPAD-complete for multi-player zero-sum games [13]. ...
Preprint
Many recent practical and theoretical breakthroughs focus on adversarial team multi-player games (ATMGs) in ex ante correlation scenarios. In this setting, team members are allowed to coordinate their strategies only before the game starts. Although there exist algorithms for solving extensive-form ATMGs, the size of the game tree generated by the previous algorithms grows exponentially with the number of players. Therefore, how to deal with large-scale zero-sum extensive-form ATMG problems close to the real world is still a significant challenge. In this paper, we propose a generic multi-player transformation algorithm, which can transform any multi-player game tree satisfying the definition of ATMGs into a 2-player game tree, such that finding a team-maxmin equilibrium with correlation (TMECor) in large-scale ATMGs can be transformed into solving NE in 2-player games. To achieve this goal, we first introduce a new structure named the private information pre-branch, which consists of a temporary chance node and coordinator nodes and aims to make decisions for all potential private information on behalf of the team members. We also show theoretically that an NE in the transformed 2-player game is equivalent to a TMECor in the original multi-player game. This work significantly reduces the growth of the action space and of the number of nodes from exponential to constant level. This enables our work to outperform all the previous state-of-the-art algorithms in finding a TMECor, with significant improvements of 182.89, 168.47, 694.44, and 233.98 in the different Kuhn Poker and Leduc Poker cases (21K3, 21K4, 21K6 and 21L33). In addition, this work is the first to practically solve ATMGs in a 5-player case, which cannot be handled by existing algorithms.
... Competitive games are a common domain for A.I. research [8, 81-84]. In our studies, game-competitor systems consistently appeared competent yet cold, potentially challenging the establishment of human-A.I. cooperation and trust. This insight may be particularly relevant in the labor domain, given the widespread concerns about job displacement that A.I. development has prompted. ...
Article
Full-text available
Artificial intelligence (A.I.) increasingly suffuses everyday life. However, people are frequently reluctant to interact with A.I. systems. This challenges both the deployment of beneficial A.I. technology and the development of deep learning systems that depend on humans for oversight, direction, and regulation. Nine studies (N = 3,300) demonstrate that social-cognitive processes guide human interactions across a diverse range of real-world A.I. systems. Across studies, perceived warmth and competence emerge prominently in participants' impressions of A.I. systems. Judgments of warmth and competence systematically depend on human-A.I. interdependence and autonomy. In particular, participants perceive systems that optimize interests aligned with human interests as warmer and systems that operate independently from human direction as more competent. Finally, a prisoner's dilemma game shows that warmth and competence judgments predict participants' willingness to cooperate with a deep-learning system. These results underscore the generality of intent detection to perceptions of a broad array of algorithmic actors.
... Deep Reinforcement Learning (DRL) has shown great potential in computer games [1][2][3] and other games like chess [4][5][6] and poker [7][8][9]. In those areas, DRL agents can achieve even better performance than human players. ...
Preprint
Full-text available
Deep reinforcement learning (DRL) has advanced robot manipulation by offering an alternative way to design a control strategy that uses the raw image directly as input. Although the image usually carries more knowledge about the environment, it requires the policy to achieve representation learning and task learning simultaneously, which is sample-inefficient. Previous approaches, such as Variational Autoencoder (VAE)-based DRL algorithms, have attempted to solve this problem by learning a visual representation model that encodes the entire image into a low-dimensional vector. However, since the vector contains both robot and object information, coupling within the state is inevitable, which can mislead the training of the DRL policy. In this study, a novel method named Reinforcement Learning with Decoupled State Representation (RLDS) is proposed to decouple the robot and object information and increase learning efficiency and effectiveness. The experimental results show that the proposed method learns faster and achieves better performance than previous methods in several typical robot tasks. Additionally, with only 3,096 offline images, the proposed method can be successfully applied to a real robot pushing task, which demonstrates its high practicality.
... The field of material science heavily relies on data, and as such, the community places significant emphasis on AI due to its exceptional data-mining capabilities. By utilizing both big data and AI, vast amounts of pre-existing information can be synthesized into untested hypotheses that may guide future research endeavors [2][3][4][5][6][7][8]. Meanwhile, this approach is suitable for addressing complex composite spaces or nonlinear processes, which facilitates the resolution of current challenges encountered in material research. ...
... The algorithmic game theory of strategy solving, which searches for the optimal strategy under a fixed mechanism, is more familiar to researchers in computing. The methods for finding equilibria are constantly being improved, from Deep Blue's victory over chess master Kasparov and the defeats of Go world champions Lee Sedol and Ke Jie by AlphaGo [139] and AlphaGo Zero [140], to the successive conquests by Cepheus [18], DeepStack [104], Libratus [16] and Pluribus [19] of the challenges of two-player limit, two-player no-limit, and multiplayer no-limit Texas Hold'em. State-of-the-art research also includes the agent that defeated a team of professional players in Glory of Kings, as well as work exploring correlated equilibria [23,48,176,177], Stackelberg equilibria [10,81,89,182] and other kinds of equilibrium solutions. ...
Article
Full-text available
Mechanism design theory can be applied not only in the economy but also in many other fields, such as politics and military affairs, which has important practical and strategic significance for countries in a period of system innovation and transformation. As Nobel Laureate Paul said, the complexity of the real economy makes it difficult for “Unorganized Markets” to ensure supply-demand balance and the efficient allocation of resources. When traditional economic theory cannot explain and calculate the complex scenarios of reality, we require a high-performance computing solution based on traditional theory to evaluate mechanisms while achieving better social welfare. Mechanism design theory is undoubtedly the best option. Different from other existing works, which are based on the theoretical exploration of optimal solutions or single-perspective analysis of scenarios, this paper focuses on more realistic and complex markets. It explores the common difficulties and feasible solutions for these applications. Firstly, we review the history of traditional mechanism design and algorithmic mechanism design. Subsequently, we present the main challenges in designing actual data-driven market mechanisms, including the inherent challenges of mechanism design theory, the challenges brought by new markets, and the common challenges faced by both. In addition, we also survey and discuss theoretical support and computer-aided methods in detail. This paper guides cross-disciplinary researchers who wish to explore the resource allocation problem in real markets for the first time and offers a different perspective for researchers struggling to solve complex social problems. Finally, we discuss and propose new ideas and look to the future.
... Reinforcement learning (RL) (Sutton and Barto, 2018) is a prominent approach to solving sequential decision making problems. Its tremendous successes (Kober et al., 2013; Silver et al., 2016, 2017; Brown and Sandholm, 2019) can be attributed, in large part, to the advent of deep learning (LeCun et al., 2015) and the development of powerful deep RL algorithms (Mnih et al., 2015; Schulman et al., 2015, 2017; Haarnoja et al., 2018). Among these algorithms, the proximal policy optimization (PPO) (Schulman et al., 2017) stands out as a particularly significant approach. ...
Preprint
The proximal policy optimization (PPO) algorithm stands as one of the most prosperous methods in the field of reinforcement learning (RL). Despite its success, the theoretical understanding of PPO remains deficient. Specifically, it is unclear whether PPO or its optimistic variants can effectively solve linear Markov decision processes (MDPs), which are arguably the simplest models in RL with function approximation. To bridge this gap, we propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback, and establish a $\tilde{\mathcal{O}}(d^{3/4}H^2K^{3/4})$ regret for it. Here $d$ is the ambient dimension of linear MDPs, $H$ is the length of each episode, and $K$ is the number of episodes. Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both stochastic linear MDPs and adversarial linear MDPs with full information. Additionally, our algorithm design features a novel multi-batched updating mechanism and the theoretical analysis utilizes a new covering number argument of value and policy classes, which might be of independent interest.
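For reference, a minimal sketch of the clipped surrogate objective that defines standard PPO; the optimistic variant proposed in the abstract changes the update and the analysis in ways not reflected here.

# Standard PPO clipped surrogate objective for one batch (a quantity to be maximized).
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # ratio: pi_new(a|s) / pi_old(a|s) per sample; advantage: estimated advantages.
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

print(ppo_clip_objective(np.array([1.3, 0.7]), np.array([2.0, -1.0])))  # -> 0.8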
... Population-based approaches take advantage of the high parallelism and large search space typically found in optimization problems. This class of methods has demonstrated remarkable performance in MARL, producing systems such as Pluribus [10] and OpenAI Five [11] that reach exceptional performance without any expert experience. Czarnecki et al. [12] conducted research on the importance of population-based training techniques for large-scale multiagent environments, which can include both virtual and real-world scenarios. ...
Article
Full-text available
Many real-world applications can be described as large-scale games of imperfect information, which require extensive prior domain knowledge, especially in competitive or human–AI cooperation settings. Population-based training methods have become a popular solution to learn robust policies without any prior knowledge, which can generalize to policies of other players or humans. In this survey, we shed light on population-based deep reinforcement learning (PB-DRL) algorithms, their applications, and general frameworks. We introduce several independent subject areas, including naive self-play, fictitious self-play, population-play, evolution-based training methods, and the policy-space response oracle family. These methods provide a variety of approaches to solving multi-agent problems and are useful in designing robust multi-agent reinforcement learning algorithms that can handle complex real-life situations. Finally, we discuss challenges and hot topics in PB-DRL algorithms. We hope that this brief survey can provide guidance and insights for researchers interested in PB-DRL algorithms.
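A minimal skeleton of the naive self-play / population-play loop that the survey takes as its starting point; fictitious self-play and PSRO refine how opponents are sampled and how responses are computed. The train_against callback is an assumed placeholder, not part of any cited framework.

# Skeleton of a population-based training loop with uniform opponent sampling.
import random

def population_play(train_against, n_generations=10, rng=random):
    # train_against(opponent) must return a new policy trained against the given opponent;
    # an opponent of None stands for an initial/random policy.
    population = [None]
    for _ in range(n_generations):
        opponent = rng.choice(population)      # population-play: sample any past policy
        population.append(train_against(opponent))
    return population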
... Reinforcement learning (RL) studies how a world-agnostic agent makes sequential decisions to maximize utility. It has gained increasing popularity and achieved milestone progress in Atari [34,15], Go [36], Poker [5,6,27], video games [1,3], bioinformatics [17], economics [45], etc. The canonical RL formulation requires the environment to be stationary, meaning that other agents cannot respond to the ego agent's policy by adapting their own policies [38]. ...
Preprint
Reinforcement learning (RL) mimics how humans and animals interact with the environment. The setting is somewhat idealized because, in actual tasks, other agents in the environment have their own goals and behave adaptively to the ego agent. To thrive in those environments, the agent needs to influence other agents so their actions become more helpful and less harmful. Research in computational economics distills two ways to influence others directly: by providing tangible goods (mechanism design) and by providing information (information design). This work investigates information design problems for a group of RL agents. The main challenges are twofold. One is that the information provided will immediately affect the transitions of the agent trajectories, which introduces additional non-stationarity. The other is that the information can be ignored, so the sender must provide information that the receivers are willing to respect. We formulate the Markov signaling game, and develop the notions of the signaling gradient and the extended obedience constraints that address these challenges. Our algorithm is efficient on various mixed-motive tasks and provides further insights into computational economics. Our code is available at https://github.com/YueLin301/InformationDesignMARL.
... During the past decade, under the joint driving force of big data and the availability of powerful computing hardware, Deep Neural Networks (DNNs) [1]-[4] have made revolutionary progress, and are now applied in a wide spectrum of fields including computer vision [5], [6], speech recognition [7], natural language processing [8], autonomous driving [9], cancer detection [10], medicine discovery [11], playing complex games [12]-[14], recommendation systems [15], and robotics [16]. Admittedly, obtaining high-quality models is costly, as training accurate, large DNNs requires the following essential resources: • massive amounts of high-quality data that are costly to gather and manage, usually requiring the painstaking efforts of experienced human annotators or even experts; • expensive high-performance computing hardware like GPUs; • advanced DNNs themselves, which require experienced experts to design their architectures and tune their hyperparameters, which is also costly and time-consuming. ...
Preprint
With their widespread application in industrial manufacturing and commercial services, well-trained deep neural networks (DNNs) are becoming increasingly valuable and crucial assets due to their tremendous training cost and excellent generalization performance. These trained models can be utilized by users without much expert knowledge, benefiting from the emerging 'Machine Learning as a Service' (MLaaS) paradigm. However, this paradigm also exposes the expensive models to various potential threats like model stealing and abuse. As an urgent requirement to defend against these threats, Deep Intellectual Property (DeepIP) protection, covering private training data, painstakingly tuned hyperparameters, and costly learned model weights, has become a consensus of both industry and academia. To this end, numerous approaches have been proposed to achieve this goal in recent years, especially to prevent or discover model stealing and unauthorized redistribution. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field. More than 190 research contributions are included in this survey, covering many aspects of Deep IP Protection: challenges/threats, invasive solutions (watermarking), non-invasive solutions (fingerprinting), evaluation metrics, and performance. We finish the survey by identifying promising directions for future research.
... Over the last few years, the capabilities of artificial intelligence (AI) have undergone considerable technical advances. Nowadays, the performance of AI models is similar to, and in certain application areas even exceeds, that of human experts [12,29,64]. For example, in the medical domain, AI models can detect certain diseases as accurately as radiologists [18,27,36]. ...
Article
Full-text available
Anticipatory thinking (AT) and design have many commonalities. We identify three challenges for all computational AT systems: representation, generation, and evaluation. We discuss how existing artificial intelligence techniques provide some methods for addressing these, but also fall significantly short. Next, we articulate where AT concepts appear in three computational design paradigms: configuration design, design for resilience, and conceptual design. We close by identifying two promising future directions at the intersection of AT and design: modeling other humans and new interfaces to support human decision‐makers.
Article
Can large language models produce expert‐quality philosophical texts? To investigate this, we fine‐tuned GPT‐3 with the works of philosopher Daniel Dennett. To evaluate the model, we asked the real Dennett 10 philosophical questions and then posed the same questions to the language model, collecting four responses for each question without cherry‐picking. Experts on Dennett's work succeeded at distinguishing the Dennett‐generated and machine‐generated answers above chance but substantially short of our expectations. Philosophy blog readers performed similarly to the experts, while ordinary research participants were near chance distinguishing GPT‐3's responses from those of an “actual human philosopher”.
Article
One of the most popular methods for learning Nash equilibrium (NE) in large-scale imperfect information extensive-form games (IIEFGs) are the neural variants of counterfactual regret minimization (CFR). CFR is a special case of Follow-The-Regularized-Leader (FTRL). At each iteration, the neural variants of CFR update the agent's strategy via the estimated counterfactual regrets. Then, they use neural networks to approximate the new strategy, which incurs an approximation error. These approximation errors will accumulate since the counterfactual regrets at iteration t are estimated using the agent's past approximated strategies. Such accumulated approximation error causes poor performance. To address this accumulated approximation error, we propose a novel FTRL algorithm called FTRL-ORW, which does not utilize the agent's past strategies to pick the next iteration strategy. More importantly, FTRL-ORW can update its strategy via the trajectories sampled from the game, which makes it suitable for solving large-scale IIEFGs since sampling multiple actions for each information set is too expensive in such games. However, it remains unclear which algorithm to use to compute the next iteration strategy for FTRL-ORW when only such sampled trajectories are revealed at iteration t. To address this problem and scale FTRL-ORW to large-scale games, we provide a model-free method called Deep FTRL-ORW, which computes the next iteration strategy using model-free Maximum Entropy Deep Reinforcement Learning. Experimental results on two-player zero-sum IIEFGs show that Deep FTRL-ORW significantly outperforms existing model-free neural methods and OS-MCCFR.
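For context, the FTRL step with an entropy regularizer reduces to a softmax over cumulative (counterfactual) rewards, as sketched below; FTRL-ORW's contribution is computing this from sampled trajectories without reusing past approximated strategies, which the sketch does not show.

# FTRL with an entropy regularizer: the next strategy is a softmax of cumulative rewards.
import numpy as np

def ftrl_entropy_step(cumulative_reward, eta=0.5):
    logits = eta * cumulative_reward
    weights = np.exp(logits - logits.max())   # subtract max for numerical stability
    return weights / weights.sum()

print(ftrl_entropy_step(np.array([1.0, 0.0, -2.0])))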
Article
Full-text available
Counterfactual Regret Minimization algorithms are the most popular way of estimating the Nash Equilibrium in imperfect-information zero-sum games. In particular, DeepStack -- the state-of-the-art Poker bot -- employs the so-called Deep Counterfactual Value Network (DCVN) to learn the Counterfactual Values (CFVs) associated with various states in the game. Each CFV is the product of two factors: (1) the probability that the opponent would reach a given state in a game, which can be explicitly calculated from the input data, and (2) the expected value (EV) of a payoff in that state, which is a complex function of the input data that is hard to calculate. In this paper, we propose a simple yet powerful modification to the CFVs estimation process, which consists in utilizing a deep neural network to estimate only the EV factor of the CFV. This new target setting significantly simplifies the learning problem and leads to much more accurate CFV estimation. A direct comparison, in terms of CFV prediction losses, shows a significant prediction accuracy improvement of the proposed approach (DEVN) over the original DCVN formulation (by 9.18-15.70% relative when using card abstraction, and by 3.37-8.39% without card abstraction, depending on the particular setting). Furthermore, the application of DEVN improves the theoretical lower bound of the error by 29.05-31.83% compared to the DCVN pipeline when card abstraction is applied. Additionally, DEVN is able to achieve the goal using significantly smaller, and faster to infer, networks. While the proposed modification may seem to be of a rather technical nature, it, in fact, presents a fundamentally different approach to the overall process of learning and estimating CFVs, since the distributions of the training signals differ significantly between DCVN and DEVN. The former estimates CFVs, which are biased by the probability of reaching a given game state, while training the latter relies on a direct EV estimation, regardless of the state probability. In effect, the learning signal of DEVN presents a better estimation of the true value of a given state, thus allowing more accurate CFV estimation.
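The factorization DEVN exploits can be stated in one line of code: the counterfactual value is the product of an explicitly computable opponent reach probability and a learned expected value, so only the latter needs a network. The names below are illustrative placeholders, not the DeepStack or DEVN implementation.

# CFV factorization as described above: cfv = (opponent reach probability) * (expected value).
# DCVN regresses the product directly; DEVN regresses only the EV factor and multiplies after.
def counterfactual_value(opponent_reach_prob, ev_net, state_features):
    return opponent_reach_prob * ev_net(state_features)

ev_net = lambda features: 1.5   # stand-in for a trained EV network predicting 1.5 chips
print(counterfactual_value(0.2, ev_net, state_features=None))   # prints roughly 0.3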
Article
Function approximation (FA) has been a critical component in solving large zero-sum games. Yet, little attention has been given to FA in solving general-sum extensive-form games, despite their being widely regarded as computationally more challenging than their fully competitive or cooperative counterparts. A key challenge is that for many equilibria in general-sum games, no simple analogue to the state value function used in Markov Decision Processes and zero-sum games exists. In this paper, we propose learning the Enforceable Payoff Frontier (EPF)---a generalization of the state value function for general-sum games. We approximate the optimal Stackelberg extensive-form correlated equilibrium by representing EPFs with neural networks and training them using appropriate backup operations and loss functions. This is the first method that applies FA to the Stackelberg setting, allowing us to scale to much larger games while still enjoying performance guarantees based on the FA error. Additionally, our proposed method guarantees incentive compatibility and is easy to evaluate without having to depend on self-play or approximate best-response oracles.
Article
Full-text available
Board games are extensively studied in the AI community because of their ability to reflect/represent real-world problems with a high level of abstraction, and their irreplaceable role as testbeds of state-of-the-art AI algorithms. Modern board games are commonly featured with partially observable state spaces and imperfect information. Despite some recent successes in AI tackling perfect information board games like chess and Go, most imperfect information games are still challenging and have yet to be solved. This paper empirically explores the capabilities of a state-of-the-art Reinforcement Learning (RL) algorithm -- Proximal Policy Optimization (PPO) in playing Ticket to Ride, a popular board game with features of imperfect information, large state-action space, and delayed rewards. This paper explores the feasibility of the proposed generalizable modelling and training schemes using a general-purpose RL algorithm with no domain knowledge-based heuristics beyond game rules, game states and scores to tackle this complex imperfect information game. The performance of the proposed methodology is demonstrated in a scaled-down version of Ticket to Ride with a range of RL agents obtained with different training schemes. All RL agents achieve clear advantages over a set of well-designed heuristic agents. The agent constructed through a self-play training scheme outperforms the other RL agents in a Round Robin tournament. The high performance and versatility of this self-play agent provide a solid demonstration of the capabilities of this framework.
Preprint
Full-text available
The design of autonomous agents that can interact effectively with other agents without prior coordination is a core problem in multi-agent systems. Type-based reasoning methods achieve this by maintaining a belief over a set of potential behaviours for the other agents. However, current methods are limited in that they assume full observability of the state and actions of the other agent or do not scale efficiently to larger problems with longer planning horizons. Addressing these limitations, we propose Partially Observable Type-based Meta Monte-Carlo Planning (POTMMCP) - an online Monte-Carlo Tree Search based planning method for type-based reasoning in large partially observable environments. POTMMCP incorporates a novel meta-policy for guiding search and evaluating beliefs, allowing it to search more effectively to longer horizons using less planning time. We show that our method converges to the optimal solution in the limit and empirically demonstrate that it effectively adapts online to diverse sets of other agents across a range of environments. Comparisons with the state-of-the-art method on problems with up to $10^{14}$ states and $10^8$ observations indicate that POTMMCP is able to compute better solutions significantly faster.
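Type-based reasoning maintains a posterior over a set of candidate opponent policies and updates it from observed behaviour. The snippet below is a generic Bayesian belief update over types (an illustration of the general idea, not POTMMCP's actual belief machinery, which also has to handle partial observability of the state).

```python
def update_type_belief(belief, types, history, observed_action):
    """Bayesian update of a belief over opponent types.

    belief: dict mapping type -> prior probability
    types:  dict mapping type -> policy, where policy(history) returns a
            dict of action probabilities for that opponent type
    """
    posterior = {}
    for t, prior in belief.items():
        likelihood = types[t](history).get(observed_action, 0.0)
        posterior[t] = prior * likelihood
    total = sum(posterior.values())
    if total == 0.0:           # observation inconsistent with every candidate type
        return dict(belief)    # fall back to the prior
    return {t: p / total for t, p in posterior.items()}
```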
Preprint
Recently, sequence learning methods have been applied to the problem of off-policy Reinforcement Learning, including the seminal work on Decision Transformers, which employs transformers for this task. Since transformers are parameter-heavy, cannot benefit from history longer than a fixed window size, and are not computed using recurrence, we set out to investigate the suitability of the S4 family of models, which are based on state-space layers and have been shown to outperform transformers, especially in modeling long-range dependencies. In this work we present two main algorithms: (i) an off-policy training procedure that works with trajectories, while still maintaining the training efficiency of the S4 model. (ii) An on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism. Our results indicate that our method outperforms multiple variants of decision transformers, as well as the other baseline methods on most tasks, while reducing the latency, number of parameters, and training time by several orders of magnitude, making our approach more suitable for real-world RL.
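For background, a state-space layer applies a linear recurrence of the form $x_k = A x_{k-1} + B u_k$, $y_k = C x_k + D u_k$. The toy fragment below uses a plain diagonal recurrence (not the actual S4 parameterization, which relies on a structured initialization and a convolutional training mode) to show why such layers can be evaluated step by step at inference time with constant memory, unlike a fixed-window transformer.

```python
import numpy as np

def ssm_scan(A_diag, B, C, D, inputs):
    """Run a diagonal linear state-space recurrence over a 1-D input sequence:
        x_k = A * x_{k-1} + B * u_k
        y_k = C . x_k + D * u_k
    A_diag, B, C: length-N vectors (state size); D: scalar; inputs: length-T array."""
    x = np.zeros_like(A_diag)
    outputs = []
    for u in inputs:
        x = A_diag * x + B * u               # element-wise, since A is diagonal
        outputs.append(float(C @ x + D * u))
    return np.array(outputs)

# Example: a stable decaying state responding to a short impulse.
y = ssm_scan(A_diag=np.full(4, 0.9), B=np.ones(4), C=np.ones(4) / 4, D=0.0,
             inputs=np.array([1.0, 0.0, 0.0, 0.0]))
```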
Article
Full-text available
A hallmark of human intelligence is the ability to plan multiple steps into the future [1,2]. Despite decades of research [3–5], it is still debated whether skilled decision-makers plan more steps ahead than novices [6–8]. Traditionally, the study of expertise in planning has used board games such as chess, but the complexity of these games poses a barrier to quantitative estimates of planning depth. Conversely, common planning tasks in cognitive science often have a lower complexity [9,10] and impose a ceiling for the depth to which any player can plan. Here we investigate expertise in a complex board game that offers ample opportunity for skilled players to plan deeply. We use model fitting methods to show that human behaviour can be captured using a computational cognitive model based on heuristic search. To validate this model, we predict human choices, response times and eye movements. We also perform a Turing test and a reconstruction experiment. Using the model, we find robust evidence for increased planning depth with expertise in both laboratory and large-scale mobile data. Experts memorize and reconstruct board features more accurately. Using complex tasks combined with precise behavioural modelling might expand our understanding of human planning and help to bridge the gap with progress in artificial intelligence.
Preprint
Full-text available
The works of Daskalakis et al. (2009, 2022), Jin et al. (2022), and Deng et al. (2023) indicate that computing Nash equilibria in multi-player Markov games is a computationally hard task. This fact raises the question of whether computational intractability can be circumvented by focusing on specific classes of Markov games. One such example is two-player zero-sum Markov games, for which efficient ways to compute a Nash equilibrium are known. Inspired by zero-sum polymatrix normal-form games (Cai et al., 2016), we define a class of zero-sum multi-agent Markov games in which there are only pairwise interactions, described by a graph that changes per state. For this class of Markov games, we show that an $\epsilon$-approximate Nash equilibrium can be found efficiently. To do so, we generalize the techniques of Cai et al. (2016) by showing that the set of coarse-correlated equilibria collapses to the set of Nash equilibria. Afterwards, any algorithm in the literature that computes Markovian policies forming approximate coarse-correlated equilibria can be used to obtain an approximate Nash equilibrium.
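In a polymatrix game, each player's payoff is the sum of payoffs from two-player games played along the edges of an interaction graph. The toy fragment below (illustrative only; all names are made up for the example) computes payoffs under that decomposition; in the zero-sum polymatrix case, the payoff matrices on every edge sum to zero, so the players' total payoff is zero for any strategy profile.

```python
import numpy as np

def polymatrix_payoffs(edges, strategies):
    """Expected payoffs in a polymatrix game.

    edges: dict mapping (i, j) to a pair of payoff matrices (A_ij, A_ji),
           where A_ij[a_i, a_j] is player i's payoff against j.
    strategies: dict mapping player -> mixed strategy (1-D probability vector)."""
    payoffs = {i: 0.0 for i in strategies}
    for (i, j), (A_ij, A_ji) in edges.items():
        payoffs[i] += strategies[i] @ A_ij @ strategies[j]
        payoffs[j] += strategies[j] @ A_ji @ strategies[i]
    return payoffs

# Zero-sum polymatrix example on a single edge: A_ji = -A_ij.T, so payoffs sum to 0.
A = np.array([[1.0, -1.0], [0.0, 2.0]])
p = polymatrix_payoffs({(0, 1): (A, -A.T)},
                       {0: np.array([0.5, 0.5]), 1: np.array([0.25, 0.75])})
```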
Article
In this paper, we propose a new algorithm for solving convex-concave saddle-point problems using regret minimization in the repeated game framework. To do so, we introduce the Conic Blackwell Algorithm⁺ (CBA⁺), a new parameter- and scale-free regret minimizer for general convex compact sets. CBA⁺ is based on Blackwell approachability and attains $O(\sqrt{T})$ regret. We show how to efficiently instantiate CBA⁺ for many decision sets of interest, including the simplex, $\ell_p$ norm balls, and ellipsoidal confidence regions in the simplex. Based on CBA⁺, we introduce SP-CBA⁺, a new parameter-free algorithm for solving convex-concave saddle-point problems achieving an $O(1/\sqrt{T})$ ergodic convergence rate. In our simulations, we demonstrate the wide applicability of SP-CBA⁺ on several standard saddle-point problems from the optimization and operations research literature, including matrix games, extensive-form games, distributionally robust logistic regression, and Markov decision processes. In each setting, SP-CBA⁺ achieves state-of-the-art numerical performance and outperforms classical methods, without the need for any choice of step sizes or other algorithmic parameters. Funding: J. Grand-Clément is supported by the Agence Nationale de la Recherche [Grant 11-LABX-0047] and by Hi! Paris. C. Kroer is supported by the Office of Naval Research [Grant N00014-22-1-2530] and by the National Science Foundation [Grant IIS-2147361].
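To ground the "regret minimization in the repeated game framework" idea, the sketch below solves a zero-sum matrix game by self-play between two regret-matching⁺ learners, a simple parameter-free regret minimizer closely related to the simplex case discussed above. This is an illustrative baseline, not the CBA⁺ or SP-CBA⁺ algorithm itself: the average strategies of the two learners converge to an approximate saddle point.

```python
import numpy as np

def solve_matrix_game(A, iters=10000):
    """Approximate the saddle point of max_x min_y x^T A y by regret-matching+ self-play."""
    m, n = A.shape
    Rx, Ry = np.zeros(m), np.zeros(n)        # cumulative positive regrets
    x_sum, y_sum = np.zeros(m), np.zeros(n)
    for _ in range(iters):
        x = Rx / Rx.sum() if Rx.sum() > 0 else np.full(m, 1.0 / m)
        y = Ry / Ry.sum() if Ry.sum() > 0 else np.full(n, 1.0 / n)
        ux = A @ y                            # row player's payoff per pure action
        uy = -(A.T @ x)                       # column player's payoff per pure action
        Rx = np.maximum(Rx + ux - x @ ux, 0.0)   # regret-matching+ update
        Ry = np.maximum(Ry + uy - y @ uy, 0.0)
        x_sum += x
        y_sum += y
    return x_sum / iters, y_sum / iters       # average strategies ~ equilibrium

# Example: rock-paper-scissors converges towards the uniform strategy.
rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
x_bar, y_bar = solve_matrix_game(rps, iters=5000)
```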
Article
Full-text available
In recent years, deep neural networks for strategy games have made significant progress. AlphaZero-like frameworks, which combine Monte-Carlo tree search with reinforcement learning, have been successfully applied to numerous games with perfect information. However, they have not been developed for domains where uncertainty and unknowns abound, and are therefore often considered unsuitable when observations are imperfect. Here, we challenge this view and argue that they are a viable alternative for games with imperfect information, a domain currently dominated by heuristic approaches or methods explicitly designed for hidden information, such as oracle-based techniques. To this end, we introduce a novel algorithm based solely on reinforcement learning, called AlphaZe∗∗, an AlphaZero-based framework for games with imperfect information. We examine its learning convergence on the games Stratego and DarkHex and show that it is a surprisingly strong baseline while using a model-based approach: it achieves win rates against other Stratego bots similar to those of Pipeline Policy Space Response Oracle (P2SRO), although it does not win in direct comparison against P2SRO or reach the much stronger results of DeepNash. Compared to heuristics and oracle-based approaches, AlphaZe∗∗ can easily deal with rule changes, e.g., when more information than usual is given, and drastically outperforms other approaches in this respect.
Preprint
Full-text available
Deep reinforcement learning (DRL) has been applied to a variety of problems over the past decade and has provided effective control strategies in high-dimensional, non-linear situations that are challenging for traditional methods. Applications are now flourishing in the field of fluid dynamics, and specifically in active flow control (AFC). In the AFC community, the encouraging results obtained in two-dimensional and chaotic conditions have raised interest in studying increasingly complex flows. In this review, we first provide a general overview of the reinforcement-learning (RL) and DRL frameworks, as well as their recent advances. We then focus on the application of DRL to AFC, highlighting the current limitations of DRL algorithms in this field, suggesting some of the potential upcoming milestones to reach, and raising open questions that are likely to attract the attention of the fluid-mechanics community.
Article
Full-text available
When applying artificial intelligence to the traditional Chinese poker game Doudizhu, many challenging issues arise from the characteristics of the game. One of these is sparse reward: valid feedback is obtained only at the end of a round. To address this, this paper proposes a deep neural framework, DQN-IRL, which combines a Deep Q-Network with Inverse Reinforcement Learning to tackle the sparse-reward problem in Doudizhu. The experimental results demonstrate the effectiveness of DQN-IRL in terms of winning rate.
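The abstract does not detail the architecture; one common way to combat sparse rewards with IRL, sketched below purely as an illustration (the reward network, its training, and all names are assumptions, not the paper's design), is to replace the end-of-round-only reward in the Q-learning target with a dense reward predicted by a learned reward model.

```python
import torch

def dqn_update(q_net, target_net, reward_net, optimizer, batch, gamma=0.99):
    """One DQN step where a learned reward model supplies dense per-step rewards
    (illustrative only; the actual DQN-IRL design may differ)."""
    states, actions, next_states, dones = batch
    with torch.no_grad():
        dense_r = reward_net(states, actions).squeeze(-1)   # IRL-learned reward
        next_q = target_net(next_states).max(dim=1).values
        targets = dense_r + gamma * (1.0 - dones) * next_q
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = torch.nn.functional.smooth_l1_loss(q, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```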
Article
Full-text available
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.
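The training target described here amounts to a two-headed network fitted to the search's move probabilities and the game outcome. The loss below is a minimal PyTorch sketch of that objective (the function names are illustrative, and weight decay is left to the optimizer; this is not the exact AlphaGo Zero configuration).

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, search_policy, outcome):
    """Combined loss for an AlphaGo Zero-style network.

    policy_logits: (B, n_moves) raw network outputs
    value_pred:    (B,) predicted outcome in [-1, 1]
    search_policy: (B, n_moves) visit-count distribution from MCTS
    outcome:       (B,) game result from the current player's view (+1/-1/0)"""
    value_loss = F.mse_loss(value_pred, outcome)
    # Cross-entropy against the (soft) search distribution.
    policy_loss = -(search_policy * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    return value_loss + policy_loss  # L2 regularization is typically added via the optimizer
```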
Article
Full-text available
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
Article
Full-text available
How long does it take until economic agents converge to an equilibrium? By studying the complexity of the problem of computing a mixed Nash equilibrium in a game, we provide evidence that there are games in which convergence to such an equilibrium takes prohibitively long. Traditionally, computational problems fall into two classes: those that have a polynomial-time algorithm and those that are NP-hard. However, the concept of NP-hardness cannot be applied to the rare problems where "every instance has a solution"; for example, in the case of games, Nash's theorem asserts that every game has a mixed equilibrium (now known as the Nash equilibrium, in honor of that result). We show that finding a Nash equilibrium is complete for a class of problems called PPAD, containing several other known hard problems; all problems in PPAD share the same style of proof that every instance has a solution.
Article
Libratus versus humans Pitting artificial intelligence (AI) against top human players demonstrates just how far AI has come. Brown and Sandholm built a poker-playing AI called Libratus that decisively beat four leading human professionals in the two-player variant of poker called heads-up no-limit Texas hold'em (HUNL). Over nearly 3 weeks, Libratus played 120,000 hands of HUNL against the human professionals, using a three-pronged approach that included precomputing an overall strategy, adapting the strategy to actual gameplay, and learning from its opponent. Science , this issue p. 418
Article
Artificial intelligence has seen a number of breakthroughs in recent years, with games often serving as significant milestones. A common feature of games with these successes is that they involve information symmetry among the players, where all players have identical information. This property of perfect information, though, is far more common in games than in real-world problems. Poker is the quintessential game of imperfect information, and it has been a longstanding challenge problem in artificial intelligence. In this paper we introduce DeepStack, a new algorithm for imperfect information settings such as poker. It combines recursive reasoning to handle information asymmetry, decomposition to focus computation on the relevant decision, and a form of intuition about arbitrary poker situations that is automatically learned from self-play games using deep learning. In a study involving dozens of participants and 44,000 hands of poker, DeepStack becomes the first computer program to beat professional poker players in heads-up no-limit Texas hold'em. Furthermore, we show this approach dramatically reduces worst-case exploitability compared to the abstraction paradigm that has been favored for over a decade.
Article
Artificial neural networks could become the technological driver that replaces Moore's law, boosting computers' utility through a process akin to automatic programming, although physics and computer architecture would also factor in.
Article
Poker is a family of games that exhibit imperfect information, where players do not have full knowledge of past events. Whereas many perfect-information games have been solved (e.g., Connect Four and checkers), no nontrivial imperfect-information game played competitively by humans has previously been solved. Here, we announce that heads-up limit Texas hold'em is now essentially weakly solved. Furthermore, this computation formally proves the common wisdom that the dealer in the game holds a substantial advantage. This result was enabled by a new algorithm, CFR⁺, which is capable of solving extensive-form games orders of magnitude larger than previously possible.
Article
Imperfect-information games model settings where players have private information. Tremendous progress has been made in solving such games over the past 20 years, especially since the Annual Computer Poker Competition was established in 2006, where programs play each other. This progress can fuel the operationalization of seminal game-theoretic solution concepts into detailed game models, powering a host of applications in business (e.g., auctions and negotiations), medicine (e.g., making sophisticated sequential plans against diseases), (cyber)security, and other domains. On page 145 of this issue, Bowling et al. (1) report on having computed a strategy for two-player limit Texas hold'em poker that is so close to optimal that, at the pace a human plays poker, it cannot be beaten with statistical significance in a lifetime. While strong strategies have been computed for larger imperfect-information games as well (2-6), this is, to my knowledge, the largest imperfect-information game essentially solved to date, and the first one competitively played by humans that has now been essentially solved.
Article
Why did I write this book? I'm still not sure. After all, I'm a researcher, which means I think I know how to write technical papers. But writing for a non-technical audience is something I know nothing about. It took a lot of effort before I could force myself to sit down to write the first word. Once I did, however, it was hard to stop! When I started this project, I didn't know that I had a lot to say and, in some sense, the results show this. The book is much longer than I ever imagined it would be. Worse yet is that there is a lot of material that I decided not to include. It's a good thing that the publishers decided to limit how long the book could be! However, after much soul searching, I think I now know the reasons why I wrote this book. First and foremost, this book tells an interesting story. It's about the life of a checkers-playing computer program, Chinook, from its creation in 1989 to its retirement in 1996. In reality the story revolves around two people with different views of the program. As the creator of Chinook, I wanted to push the program to become the best player in the world, in much the same way that a father might encourage his son to excel at sports.
Article
Deep Blue is the chess machine that defeated then-reigning World Chess Champion Garry Kasparov in a six-game match in 1997. A number of factors contributed to this success, including: a single-chip chess search engine, a massively parallel system with multiple levels of parallelism, a strong emphasis on search extensions, a complex evaluation function, and effective use of a Grandmaster game database. This paper describes the Deep Blue system, and gives some of the rationale that went into the design decisions behind Deep Blue.
Article
Poker is an interesting test-bed for artificial intelligence research. It is a game of imperfect information, where multiple competing agents must deal with probabilistic knowledge, risk assessment, and possible deception, not unlike decisions made in the real world. Opponent modeling is another difficult problem in decision-making applications, and it is essential to achieving high performance in poker.This paper describes the design considerations and architecture of the poker program Poki. In addition to methods for hand evaluation and betting strategy, Poki uses learning techniques to construct statistical models of each opponent, and dynamically adapts to exploit observed patterns and tendencies. The result is a program capable of playing reasonably strong poker, but there remains considerable research to be done to play at a world-class level.
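Poki's opponent models are statistical; as a generic illustration of the idea (not Poki's actual feature set or model), the snippet below tracks simple per-opponent action frequencies in a given context and turns them into smoothed probability estimates that a betting strategy could exploit.

```python
from collections import defaultdict

class FrequencyOpponentModel:
    """Tracks per-opponent action frequencies in a given context
    (e.g. pre-flop vs. post-flop) and exposes simple probability estimates."""

    def __init__(self):
        # counts[opponent][context][action] -> number of observations
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def observe(self, opponent, context, action):
        self.counts[opponent][context][action] += 1

    def action_prob(self, opponent, context, action, prior=1.0, n_actions=3):
        ctx = self.counts[opponent][context]
        total = sum(ctx.values())
        # Laplace-smoothed estimate so unseen opponents default to uniform play.
        return (ctx[action] + prior) / (total + prior * n_actions)

model = FrequencyOpponentModel()
model.observe("player_3", "post_flop_facing_raise", "fold")
p_fold = model.action_prob("player_3", "post_flop_facing_raise", "fold")
```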
Article
In December 2009 and November 2010, the first and second Lemonade Stand game competitions were held. In each competition, 9 teams competed, from University of Southampton, University College London, Yahoo!, Rutgers, Carnegie Mellon, Brown, Princeton, et cetera. The competition, in the spirit of Axelrod's iterated prisoner's dilemma competition, which addressed whether or not you should cooperate, asks the questions, "how should you cooperate, and with whom?" The third competition (whose results will be announced at IJCAI 2011) is open for submissions until July 1st, 2011.
Article
Finding an equilibrium of an extensive-form game of imperfect information is a fundamental problem in computational game theory, but current techniques do not scale to large games. To address this, we introduce the ordered game isomorphism and the related ordered game isomorphic abstraction transformation. For a multi-player sequential game of imperfect information with observable actions and an ordered signal space, we prove that any Nash equilibrium in an abstracted smaller game, obtained by one or more applications of the transformation, can be easily converted into a Nash equilibrium in the original game. We present an algorithm, GameShrink, for abstracting the game by applying our isomorphism exhaustively. Its complexity is $\tilde{O}(n^2)$, where n is the number of nodes in a structure we call the signal tree. The signal tree is no larger than the game tree, and on nontrivial games it is drastically smaller, so GameShrink has time and space complexity sublinear in the size of the game tree. Using GameShrink, we find an equilibrium to a poker game with 3.1 billion nodes, over four orders of magnitude more than in the largest poker game solved previously. To address even larger games, we introduce approximation methods that do not preserve equilibrium, but nevertheless yield (ex post) provably close-to-optimal strategies.
Article
Ever since the days of Shannon's proposal for a chess-playing algorithm [12] and Samuel's checkers-learning program [10] the domain of complex board games such as Go, chess, checkers, Othello, and backgammon has been widely regarded as an ideal testing ground for exploring a variety of concepts and approaches in artificial intelligence and machine learning. Such board games offer the challenge of tremendous complexity and sophistication required to play at expert level. At the same time, the problem inputs and performance measures are clear-cut and well defined, and the game environment is readily automated in that it is easy to simulate the board, the rules of legal play, and the rules regarding when the game is over and determining the outcome.
Article
We prove that Bimatrix, the problem of finding a Nash equilibrium in a two-player game, is complete for the complexity class PPAD (Polynomial Parity Argument, Directed version) introduced by Papadimitriou in 1991. Our result, building upon the work of Daskalakis et al. [2006a] on the complexity of four-player Nash equilibria, settles a long-standing open problem in algorithmic game theory. It also serves as a starting point for a series of results concerning the complexity of two-player Nash equilibria. In particular, we prove the following theorems: (i) Bimatrix does not have a fully polynomial-time approximation scheme unless every problem in PPAD is solvable in polynomial time; (ii) the smoothed complexity of the classic Lemke-Howson algorithm, and in fact of any algorithm for Bimatrix, is not polynomial unless every problem in PPAD is solvable in randomized polynomial time. Our results also have a complexity implication in mathematical economics: Arrow-Debreu market equilibria are PPAD-hard to compute.