We consider the problem of learning in repeated general-sum matrix games when a learning algorithm can observe the actions
but not the payoffs of its associates. Because learning associates make the environment non-stationary, most state-of-the-art
algorithms perform poorly in some important repeated games, owing to an inability to make profitable
compromises. To make these compromises, an agent must effectively balance competing objectives, including bounding losses,
playing optimally with respect to current beliefs, and taking calculated, but profitable, risks. In this paper, we present,
discuss, and analyze M-Qubed, a reinforcement learning algorithm designed to overcome these deficiencies by encoding and balancing
best-response, cautious, and optimistic learning biases. We show that M-Qubed learns to make profitable compromises across
a wide range of repeated matrix games played with many kinds of learners. Specifically, we prove that M-Qubed’s average payoffs
meet or exceed its maximin value in the limit. Additionally, we show that, in two-player games, M-Qubed’s average payoffs
approach the value of the Nash bargaining solution in self play. Furthermore, it performs very well when associating with
other learners, as evidenced by its robust behavior in round-robin and evolutionary tournaments of two-player games. These
results demonstrate that an agent can learn to make good compromises, and hence receive high payoffs, in repeated games by
effectively encoding and balancing best-response, cautious, and optimistic learning biases.
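As a concrete illustration of the two benchmarks referenced above, the sketch below computes a player's maximin (security) value by linear programming and contrasts it with the Nash bargaining payoff. It is not part of M-Qubed itself; the prisoner's-dilemma payoffs, the function name, and the use of scipy are assumptions made purely for illustration.

# Illustrative sketch only: computes the maximin (security) value of a matrix
# game and compares it with the Nash bargaining payoff, using standard
# prisoner's dilemma payoffs as an assumed example. This is NOT M-Qubed.
import numpy as np
from scipy.optimize import linprog

# Row player's payoffs; rows = own action (Cooperate, Defect), columns = opponent's.
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])

def maximin_value(payoffs):
    """Largest payoff the row player can guarantee with a mixed strategy,
    regardless of how the opponent plays (solved as a linear program)."""
    m, n = payoffs.shape
    # Decision variables: mixed strategy x_1..x_m and guaranteed payoff v.
    c = np.zeros(m + 1)
    c[-1] = -1.0                       # maximize v  <=>  minimize -v
    # For each opponent column j: v - sum_i x_i * payoffs[i, j] <= 0.
    A_ub = np.hstack([-payoffs.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)   # strategy sums to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]          # x_i >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

security = maximin_value(A)   # 1.0: always defecting guarantees a payoff of 1
bargain = 3.0                 # mutual cooperation (3, 3) maximizes the product
                              # of gains over the disagreement point (1, 1)
print(f"maximin value: {security:.2f}  Nash bargaining payoff: {bargain:.2f}")

In this example the security value is 1 (always defect), while the Nash bargaining solution corresponds to mutual cooperation with payoff 3; the gap between the two is exactly the kind of profitable compromise the abstract argues a learner should reach.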