Abstract
Forward model learning algorithms enable the application of simulation-based search methods in environments for which the forward model is unknown. Multiple studies have shown strong performance in game-related and motion control applications. In these, forward model learning agents often required less training time while achieving performance similar to state-of-the-art reinforcement learning methods. However, several problems can emerge when the environment's true model is replaced with a learned approximation. While the true forward model allows accurate prediction of future time steps, a learned forward model may be inaccurate in its predictions. These inaccuracies become problematic when planning long action sequences, since the confidence in predicted time steps decreases with increasing simulation depth. In this work, we explore methods for balancing risk and reward in decision-making with inaccurate forward models. To this end, we propose methods for measuring the variance of a forward model and the confidence in the predicted outcome of planned action sequences. Based on these metrics, we define methods for learning and using forward models that take their current prediction accuracy into account. The proposed methods have been tested on various motion control tasks of the OpenAI Gym framework. Results show that information on the model's accuracy can be used to increase the efficiency of the agent's training and the agent's performance during evaluation.
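To make the idea concrete, the following minimal Python sketch weights each predicted reward of a simulated action sequence by a confidence value that shrinks with rollout depth. The interface (model.predict returning a predicted state and reward) and the constant per-step confidence are illustrative assumptions, not the exact method proposed in the paper.

# Minimal sketch of confidence-weighted rollout evaluation with a learned
# forward model. The names (step_confidence, model.predict) are illustrative
# assumptions, not the paper's actual interface.

def evaluate_action_sequence(model, state, actions, step_confidence=0.9):
    """Simulate an action sequence and weight each predicted reward by the
    agent's confidence in the prediction, which decays with rollout depth."""
    total, confidence = 0.0, 1.0
    for action in actions:
        state, reward = model.predict(state, action)  # learned, possibly inaccurate
        confidence *= step_confidence                 # trust decays with depth
        total += confidence * reward
    return total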
... In this article, we have presented four types of transition models together with their model-building heuristics, and we can now demonstrate their use in a varied selection of scenarios. The results of these case studies originate from earlier work [3,6,7,8,10,11,12,13,18] and are compiled here to give a complete picture of the advantages and disadvantages of the presented methods. ...
... Furthermore, non-parametric multi-step prediction models can be used to predict non-linear dynamics [17]. Finally, measuring the uncertainty of a prediction can be used to assess the confidence in a simulation, allowing for more robust decision-making [8]. ...
... Furthermore, the use of a dependency analysis can support model building [12]. However, a suitable point in time must be found at which sufficient data are already available to produce a representative result. In further work, methods for the efficient use of training time to improve model accuracy are proposed [8]. That work shows that non-random exploration during training can lead to a significant improvement in model accuracy. ...
Abstract
Forward model learning, i.e., learning forward-directed models from data, is used in prediction-based control. To this end, the system's inputs and outputs are observed in order to build a transition model and enable predictions of future time steps. Complex state spaces in particular require the use of specialized search and model-building methods. In this work, we present abstraction heuristics for high-dimensional state spaces that make it possible to reduce model complexity and, in many cases, to produce an interpretable result. In two case studies, we demonstrate the effectiveness of the presented approach using methods of artificial intelligence in games and in motion control scenarios. Transferring these methods enables promising applications in automation technology.
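As an illustration of the learning setup described above, the following sketch builds a tabular transition model from observed input/output pairs, using a simple rounding abstraction to keep the state space small. The rounding abstraction is an assumed stand-in for the abstraction heuristics proposed in the paper.

# Minimal sketch of learning a transition model from observed system
# input/output pairs, with a rounding abstraction that keeps the state
# space tractable. The abstraction choice is illustrative only.
from collections import defaultdict

def abstract_state(state, precision=1):
    return tuple(round(x, precision) for x in state)

class TabularForwardModel:
    def __init__(self):
        self.transitions = defaultdict(dict)

    def observe(self, state, action, next_state):
        self.transitions[abstract_state(state)][action] = abstract_state(next_state)

    def predict(self, state, action):
        # falls back to the current state when the transition is still unknown
        return self.transitions[abstract_state(state)].get(action, abstract_state(state))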
Monte Carlo Tree Search techniques have generally dominated General Video Game Playing, but recent research has started looking at Evolutionary Algorithms and their potential to match or even outperform tree search methods. Online or Rolling Horizon Evolution is one of the options available to evolve sequences of actions for planning in General Video Game Playing, but to date no research has explored the capabilities of the vanilla version of this algorithm across multiple games. This study aims to critically analyse different configurations of population size and individual length in a set of 20 games from the General Video Game AI corpus. Distinctions are made between deterministic and stochastic games, and the implications of using larger time budgets are studied. Results show that there is scope for the use of these techniques, which in some configurations outperform Monte Carlo Tree Search, and also suggest that further research in these methods could boost their performance.
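The following sketch outlines one decision step of vanilla Rolling Horizon Evolution of the kind analysed in the study: a population of fixed-length action sequences is evaluated against a forward model, mutated over a number of generations, and the first action of the best sequence is returned. The forward-model interface and mutation rate are assumptions.

# Minimal sketch of vanilla Rolling Horizon Evolution for one decision step.
import random

def rolling_horizon_evolution(model, state, n_actions, pop_size=10,
                              seq_length=10, generations=20):
    population = [[random.randrange(n_actions) for _ in range(seq_length)]
                  for _ in range(pop_size)]

    def fitness(seq):
        s, total = state, 0.0
        for a in seq:
            s, r = model.predict(s, a)  # learned or true forward model
            total += r
        return total

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]
        # refill the population with mutated copies of elite individuals
        population = elite + [
            [a if random.random() > 0.1 else random.randrange(n_actions) for a in seq]
            for seq in random.choices(elite, k=pop_size - len(elite))
        ]
    best = max(population, key=fitness)
    return best[0]  # only the first action of the best sequence is executed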
Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. The NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.
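The core of the NAF parameterisation can be summarised in a few lines: the Q-function is decomposed into a state value and a quadratic advantage term whose maximum lies at the predicted action mu(s). The functions v, mu, and P below are assumed to be provided (e.g., heads of a shared network); this is a sketch of the decomposition, not of the full training algorithm.

# Sketch of the normalized advantage function (NAF) decomposition:
# Q(s, a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s)),
# so the greedy continuous action is simply mu(s).
import numpy as np

def naf_q_value(v, mu, P, state, action):
    """Q-value under the NAF parameterisation; maximised at action = mu(state)."""
    diff = np.asarray(action) - mu(state)
    return v(state) - 0.5 * diff @ P(state) @ diff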
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
General Video Game Playing is a sub-field of Game Artificial Intelligence, where the goal is to find algorithms capable of playing many different real-time games, some of them unknown a priori. In this scenario, the presence of domain knowledge must be severely limited, or the algorithm will overfit to the training games and perform poorly on the unknown games of the test set. Research in this area has been of special interest in the last years, with emerging contests like the General Video Game AI (GVG-AI) Competition. This paper introduces three different open loop techniques for dealing with this problem. First, a simple directed depth first search algorithm is employed as a baseline. Then, a tree search algorithm with a multi-armed bandit based tree policy is presented, followed by a Rolling Horizon Evolutionary Algorithm (RHEA) approach. In order to test these techniques, the games from the GVG-AI Competition framework are used as a benchmark, with evaluation on a training set of 29 games and submission to the 10 unknown games on the competition website. Results show that the proposed general game-independent heuristic works well across all algorithms and games, and that the RHEA becomes the best evolutionary technique in the rankings of the test set.
Monte Carlo tree search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarize the results from the key game and nongame domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
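As a small illustration, the sketch below shows the UCB1-based tree policy that most MCTS variants use to balance exploitation against exploration when descending the tree. The node attributes (visits, total_reward, children) are illustrative assumptions, and unvisited children are assumed to be expanded before this rule is applied.

# Sketch of UCB1 child selection as used by UCT-style MCTS.
import math

def uct_select(node, c=math.sqrt(2)):
    """Pick the child maximising average reward plus an exploration bonus.
    Assumes every child has been visited at least once."""
    return max(
        node.children,
        key=lambda child: child.total_reward / child.visits
        + c * math.sqrt(math.log(node.visits) / child.visits),
    )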
This article introduces a class of incremental learning procedures specialized for prediction, that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
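A minimal tabular TD(0) update illustrates the idea of learning from the difference between temporally successive predictions; V is a value table indexed by state, and the learning rate and discount factor are assumed parameters.

# Sketch of the tabular TD(0) update: credit is assigned using the difference
# between successive predictions rather than waiting for the final outcome.
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error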
Forward models enable a robot to predict the effects of its actions on its own motor system and its environment. This is a vital aspect of intelligent behaviour, as the robot can use predictions to decide the best set of actions to achieve a goal. The ability to learn forward models enables robots to be more adaptable and autonomous; this paper describes a system whereby they can be learnt and represented as a Bayesian network. The robot's motor system is controlled and explored using 'motor babbling'. Feedback about its motor system comes from computer vision techniques requiring no prior information to perform tracking. The learnt forward model can be used by the robot to imitate human movement.
Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states.
This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed each iteration, rather than just one.
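For illustration, a minimal tabular Q-learning update of the kind analysed in the convergence theorem; Q is assumed to be a nested mapping from states to action values (e.g., a defaultdict of defaultdicts).

# Sketch of the tabular Q-learning update from Watkins (1989): move the
# estimate Q[s][a] towards the reward plus the discounted value of the best
# action in the successor state.
def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])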
Models are among the most essential tools in robotics, such as kinematics and dynamics models of the robot's own body and controllable external objects. It is widely believed that intelligent mammals also rely on internal models in order to generate their actions. However, while classical robotics relies on manually generated models that are based on human insights into physics, future autonomous, cognitive robots need to be able to automatically generate models that are based on information which is extracted from the data streams accessible to the robot. In this paper, we survey the progress in model learning with a strong focus on robot control on a kinematic as well as dynamical level. Here, a model describes essential information about the behavior of the environment and the influence of an agent on this environment. In the context of model-based learning control, we view the model from three different perspectives. First, we need to study the different possible model learning architectures for robotics. Second, we discuss what kinds of problems these architectures and the domain of robotics imply for the applicable learning methods. From this discussion, we deduce future directions for real-time learning algorithms. Third, we show where these scenarios have been used successfully in several case studies.
The cross-entropy method is an efficient and general optimization algorithm. However, its applicability in reinforcement learning (RL) seems to be limited because it often converges to suboptimal policies. We apply noise for preventing early convergence of the cross-entropy method, using Tetris, a computer game, for demonstration. The resulting policy outperforms previous RL algorithms by almost two orders of magnitude.
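A sketch of the noise-injected cross-entropy method described above: a Gaussian is repeatedly fitted to the elite samples, and extra noise is added to the standard deviation so the distribution does not collapse prematurely. The objective f and its dimensionality are assumptions.

# Sketch of the cross-entropy method with added noise to prevent early
# convergence to a suboptimal policy.
import numpy as np

def noisy_cross_entropy(f, dim, iterations=50, pop_size=100,
                        elite_frac=0.1, extra_noise=0.05):
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(iterations):
        samples = np.random.randn(pop_size, dim) * std + mean
        scores = np.array([f(x) for x in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + extra_noise  # added noise keeps exploring
    return mean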
Recent progress in legged locomotion research has produced robots that can perform agile blind-walking with robustness comparable to a blindfolded human. However, this walking approach has not yet been integrated with planners for high-level activities. In this paper, we take a step towards high-level task planning for these robots by studying a planar simulated biped that captures their essential dynamics. We investigate variants of Monte-Carlo Tree Search (MCTS) for selecting an appropriate blind-walking controller at each decision cycle. In particular, we consider UCT with an intelligently selected rollout policy, which is shown to be capable of guiding the biped through treacherous terrain. In addition, we develop a new MCTS variant, called Monte-Carlo Discrepancy Search (MCDS), which is shown to make more effective use of limited planning time than UCT for this domain. We demonstrate the effectiveness of these planners in both deterministic and stochastic environments across a range of algorithm parameters. In addition, we present results for using these planners to control a full-order 3D simulation of Cassie, an agile bipedal robot, through complex terrain.
This paper provides an overview of the recently proposed forward model approximation framework for learning games of the general video game artificial intelligence (GVGAI) framework. In contrast to other general game-playing algorithms, the proposed agent model does not need a full description of the game but can learn the game's rules by observing game state transitions. Based on hierarchical knowledge bases, the forward model can be learned and revised during game-play, improving the accuracy of the agent's state predictions over time. This allows the application of simulation-based search algorithms and belief revision techniques to previously unknown settings. We show that the proposed framework is able to quickly learn a model for dynamic environments in the context of the GVGAI framework.
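As a simplified illustration of learning a forward model from observed state transitions, the sketch below predicts each tile's next value from its local neighbourhood and the chosen action using a decision tree. The actual framework relies on hierarchical knowledge bases and belief revision; the scikit-learn classifier and grid representation here are assumptions made for the example.

# Simplified sketch of a locally learned grid-game forward model.
from sklearn.tree import DecisionTreeClassifier

def neighbourhood(grid, x, y):
    h, w = len(grid), len(grid[0])
    return [grid[(y + dy) % h][(x + dx) % w]
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def fit_local_forward_model(transitions):
    """transitions: iterable of (grid, action, next_grid) observations."""
    X, y = [], []
    for grid, action, next_grid in transitions:
        for yy in range(len(grid)):
            for xx in range(len(grid[0])):
                X.append(neighbourhood(grid, xx, yy) + [action])
                y.append(next_grid[yy][xx])
    return DecisionTreeClassifier().fit(X, y)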
This is the first textbook dedicated to explaining how artificial intelligence (AI) techniques can be used in and for games. After introductory chapters that explain the background and key techniques in AI and games, the authors explain how to use AI to play games, to generate content for games and to model players.
The book will be suitable for undergraduate and graduate courses in games, artificial intelligence, design, human-computer interaction, and computational intelligence, and also for self-study by industrial game developers and practitioners. The authors have developed a website (http://www.gameaibook.org) that complements the material covered in the book with up-to-date exercises, lecture slides and reading.
We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.
Simulation and the Monte Carlo Method, Third Edition reflects the latest developments in the field and presents a fully updated and comprehensive account of the state-of-the-art theory, methods and applications that have emerged in Monte Carlo simulation since the publication of the classic First Edition more than a quarter of a century ago. While maintaining its accessible and intuitive approach, this revised edition features a wealth of up-to-date information that facilitates a deeper understanding of problem solving across a wide array of subject areas, such as engineering, statistics, computer science, mathematics, and the physical and life sciences. The book begins with a modernized introduction that addresses the basic concepts of probability, Markov processes, and convex optimization. Subsequent chapters discuss the dramatic changes that have occurred in the field of the Monte Carlo method, with coverage of many modern topics including: Markov Chain Monte Carlo, variance reduction techniques such as importance (re-)sampling and the transform likelihood ratio method, the score function method for sensitivity analysis, the stochastic approximation method and the stochastic counterpart method for Monte Carlo optimization, the cross-entropy method for rare-event estimation and combinatorial optimization, and application of Monte Carlo techniques for counting problems. An extensive range of exercises is provided at the end of each chapter, as well as a generous sampling of applied examples. The Third Edition features a new chapter on the highly versatile splitting method, with applications to rare-event estimation, counting, sampling, and optimization. A second new chapter introduces the stochastic enumeration method, which is a new fast sequential Monte Carlo method for tree search. In addition, the Third Edition features new material on: Random number generation, including multiple-recursive generators and the Mersenne Twister. Simulation of Gaussian processes, Brownian motion, and diffusion processes. Multilevel Monte Carlo method. New enhancements of the cross-entropy (CE) method, including the "improved" CE method, which uses sampling from the zero-variance distribution to find the optimal importance sampling parameters. Over 100 algorithms in modern pseudo code with flow control. Over 25 new exercises. Simulation and the Monte Carlo Method, Third Edition is an excellent text for upper-undergraduate and beginning graduate courses in stochastic simulation and Monte Carlo techniques. The book also serves as a valuable reference for professionals who would like to achieve a more formal understanding of the Monte Carlo method.
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
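A sketch of the learning target behind a deep Q-network: the online network regresses Q(s, a) towards the reward plus the discounted maximum Q-value of a periodically updated target network. PyTorch and the batch layout are assumed purely for illustration; this is not the exact training code of the cited work.

# Sketch of a deep Q-network loss with a separate target network.
import torch

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1 - dones) * next_q
    return torch.nn.functional.mse_loss(q_values, targets)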
Covering both noncooperative and cooperative games, this comprehensive introduction to game theory also includes some advanced chapters on auctions, games with incomplete information, games with vector payoffs, stable matchings and the bargaining set. Mathematically oriented, the book presents every theorem alongside a proof. The material is presented clearly and every concept is illustrated with concrete examples from a broad range of disciplines. With numerous exercises the book is a thorough and extensive guide to game theory from undergraduate through graduate courses in economics, mathematics, computer science, engineering and life sciences to being an authoritative reference for researchers.
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
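A brief usage sketch of the procedure described above, with each tree grown on an independent bootstrap sample and a random subset of features considered at every split; the synthetic data set and hyperparameters are illustrative.

# Minimal random forest usage example with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # random subset of features tried at each split
    bootstrap=True,        # each tree sees an independent bootstrap sample
    random_state=0,
).fit(X, y)
print(forest.score(X, y))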
It is shown how a system consisting of two neuronlike adaptive elements can solve a difficult learning control problem. The task is to balance a pole that is hinged to a movable cart by applying forces to the cart's base. It is argued that the learning problems faced by adaptive elements that are components of adaptive networks are at least as difficult as this version of the pole-balancing problem. The learning system consists of a single associative search element (ASE) and a single adaptive critic element (ACE). In the course of learning to balance the pole, the ASE constructs associations between input and output by searching under the influence of reinforcement feedback, and the ACE constructs a more informative evaluation function than reinforcement feedback alone can provide. The differences between this approach and other attempts to solve problems using neuronlike elements are discussed, as is the relation of this work to classical and instrumental conditioning in animal learning studies and its possible implications for research in the neurosciences.
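A conceptual sketch of the ASE/ACE pair: the adaptive critic element turns sparse reinforcement feedback into a richer internal evaluation signal (a TD-like error), which the associative search element uses to adjust its action preferences. The linear features, learning rates, and binary action encoding are assumptions made for the example.

# Conceptual sketch of one ASE/ACE update step with linear features.
import numpy as np

def ace_ase_step(w_critic, w_actor, x, x_next, reward, action,
                 alpha=0.1, beta=0.5, gamma=0.95):
    """x, x_next: feature vectors of successive states; action in {-1, +1}."""
    internal_reinforcement = reward + gamma * w_critic @ x_next - w_critic @ x
    w_critic += beta * internal_reinforcement * x            # ACE: improve evaluation
    w_actor += alpha * internal_reinforcement * action * x   # ASE: reinforce the action taken
    return internal_reinforcement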
Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to control: Learning behaviors by latent imagination (2019). URL http://arxiv.org/abs/1912.01603
Dockhorn, A.: Prediction-based search for autonomous game-playing. Ph.D. thesis, Otto-von-Guericke-Universität Magdeburg, Fakultät für Informatik (2020). DOI 10.25673/34014. URL https://opendata.uni-halle.de//handle/1981185920/34209
Dockhorn, A., Kruse, R.: Detecting Sensor Dependencies for Building Complementary Model Ensembles. In: Proceedings of the 28. Workshop Computational Intelligence, Dortmund, 29.–30. November 2018, pp. 217–234 (2018)
Ha, D., Schmidhuber, J.: Recurrent world models facilitate policy evolution. In: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (eds.) Advances in Neural Information Processing Systems 31, pp. 2450–2462. Curran Associates, Inc. (2018)
Sutton, R.S., Barto, A.G.: Reinforcement Learning, 2nd edn. The MIT Press, Cambridge (2018)