Preprint

Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions


Abstract

While the history of machine learning so far encompasses a series of problems posed by researchers and algorithms that learn their solutions, an important question is whether the problems themselves can be generated by the algorithm at the same time as they are being solved. Such a process would in effect build its own diverse and expanding curricula, and the solutions to problems at various stages would become stepping stones towards solving even more challenging problems later in the process. The Paired Open-Ended Trailblazer (POET) algorithm introduced in this paper does just that: it pairs the generation of environmental challenges and the optimization of agents to solve those challenges. It simultaneously explores many different paths through the space of possible problems and solutions and, critically, allows these stepping-stone solutions to transfer between problems if better, catalyzing innovation. The term open-ended signifies the intriguing potential for algorithms like POET to continue to create novel and increasingly complex capabilities without bound. The results show that POET produces a diverse range of sophisticated behaviors that solve a wide range of environmental challenges, many of which cannot be solved by direct optimization alone, or even through a direct, single-path curriculum-based control algorithm introduced to highlight the critical role of open-endedness in solving ambitious challenges. The ability to transfer solutions from one environment to another proves essential to unlocking the full potential of the system as a whole, demonstrating the unpredictable nature of fortuitous stepping stones. We hope that POET will inspire a new push towards open-ended discovery across many domains, where algorithms like POET can blaze a trail through their interesting possible manifestations and solutions.
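To make the paired loop concrete, below is a deliberately toy Python sketch of the outer loop the abstract describes: maintain a population of environment-agent pairs, occasionally mutate environments (keeping only those that pass a minimal-criterion check of being neither trivially easy nor unsolvable for current agents), optimize each agent in its paired environment, and periodically attempt cross-pair transfers. It is not the authors' implementation; the scalar "roughness" environment, the hill-climbing optimizer, and every helper name are illustrative assumptions.

    # Hypothetical, minimal POET-style outer loop (illustrative, not the paper's code).
    import random

    def eval_agent(theta, env):
        # Toy score: an agent "solves" an environment when its parameter matches
        # the environment's roughness.
        return -abs(theta - env)

    def optimize_step(theta, env, sigma=0.1):
        # One paired optimization step: simple stochastic hill climbing.
        candidate = theta + random.gauss(0, sigma)
        return candidate if eval_agent(candidate, env) > eval_agent(theta, env) else theta

    def poet(iterations=200, mutate_every=20, transfer_every=10):
        pairs = [(0.0, 0.0)]  # list of (environment, agent) pairs, starting trivially flat
        for t in range(1, iterations + 1):
            if t % mutate_every == 0:
                env, theta = random.choice(pairs)
                child_env = env + random.gauss(0, 0.5)
                # Minimal-criterion check: neither already solved nor hopelessly hard.
                if -2.0 < eval_agent(theta, child_env) < -0.1:
                    pairs.append((child_env, theta))
            # Optimize every agent within its own paired environment.
            pairs = [(env, optimize_step(theta, env)) for env, theta in pairs]
            if t % transfer_every == 0:
                # Transfer: adopt whichever existing agent scores best in each environment.
                agents = [theta for _, theta in pairs]
                pairs = [(env, max(agents, key=lambda a: eval_agent(a, env)))
                         for env, _ in pairs]
        return pairs

    if __name__ == "__main__":
        for env, theta in poet():
            print(f"environment roughness {env:+.2f} -> agent parameter {theta:+.2f}")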

References
Article
Full-text available
We present a method for using neural networks to model evolutionary population dynamics, and draw parallels to recent deep learning advancements in which adversarially-trained neural networks engage in coevolutionary interactions. We conduct experiments which demonstrate that models from evolutionary game theory are capable of describing the behavior of these neural population systems.
Article
Full-text available
Evolution strategies (ES) are a family of black-box optimization algorithms able to train deep neural networks roughly as well as Q-learning and policy gradient methods on challenging deep reinforcement learning (RL) problems, but are much faster (e.g. hours vs. days) because they parallelize better. However, many RL problems require directed exploration because they have reward functions that are sparse or deceptive (i.e. contain local optima), and it is not known how to encourage such exploration with ES. Here we show that algorithms that have been invented to promote directed exploration in small-scale evolved neural networks via populations of exploring agents, specifically novelty search (NS) and quality diversity (QD) algorithms, can be hybridized with ES to improve its performance on sparse or deceptive deep RL tasks, while retaining scalability. Our experiments confirm that the resultant new algorithms, NS-ES and a version of QD we call NSR-ES, avoid local optima encountered by ES to achieve higher performance on tasks ranging from playing Atari to simulated robots learning to walk around a deceptive trap. This paper thus introduces a family of fast, scalable algorithms for reinforcement learning that are capable of directed exploration. It also adds this new family of exploration algorithms to the RL toolbox and raises the interesting possibility that analogous algorithms with multiple simultaneous paths of exploration might also combine well with existing RL algorithms outside ES.
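As a rough illustration of the hybridization just described (not the paper's exact pipeline), the snippet below blends rank-normalized reward and novelty scores into the per-perturbation weights an ES update would use; the 50/50 weighting and the rank normalization are assumptions made for the sketch.

    # Illustrative blend of reward and novelty for an NSR-ES-style update (assumed details).
    import numpy as np

    def rank_normalize(scores):
        # Map scores to [-0.5, 0.5] by rank, a common trick to tame outliers in ES.
        ranks = np.argsort(np.argsort(scores))
        return ranks / (len(scores) - 1) - 0.5

    def blended_weights(rewards, novelties, w=0.5):
        # w trades off exploitation (reward) against exploration (novelty).
        return (w * rank_normalize(np.asarray(rewards, dtype=float))
                + (1 - w) * rank_normalize(np.asarray(novelties, dtype=float)))

    print(blended_weights([1.0, 3.0, 2.0], [0.2, 0.1, 0.9]))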
Article
Full-text available
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.
Article
Full-text available
The reinforcement learning paradigm allows, in principle, for complex behaviours to be learned directly from simple reward signals. In practice, however, it is common to carefully hand-design the reward function to encourage a particular solution, or to derive it from demonstration data. In this paper we explore how a rich environment can help to promote the learning of complex behaviour. Specifically, we train agents in diverse environmental contexts, and find that this encourages the emergence of robust behaviours that perform well across a suite of tasks. We demonstrate this principle for locomotion -- behaviours that are known for their sensitivity to the choice of reward. We train several simulated bodies on a diverse set of challenging terrains and obstacles, using a simple reward function based on forward progress. Using a novel scalable variant of policy gradient reinforcement learning, our agents learn to run, jump, crouch and turn as required by the environment without explicit reward-based guidance. A visual depiction of highlights of the learned behaviour can be viewed at https://goo.gl/8rTx2F.
Article
Full-text available
We describe the content and outcomes of the First Workshop on Open-Ended Evolution: Recent Progress and Future Milestones (OEE1), held during the ECAL 2015 conference at the University of York, UK, in July 2015. We briefly summarize the content of the workshop's talks, and identify the main themes that emerged from the open discussions. Two important conclusions from the discussions are: (1) the idea of pluralism about OEE (it seems clear that there is more than one interesting and important kind of OEE); and (2) the importance of distinguishing observable behavioral hallmarks of systems undergoing OEE from hypothesized underlying mechanisms that explain why a system exhibits those hallmarks. We summarize the different hallmarks and mechanisms discussed during the workshop, and list the specific systems that were highlighted with respect to particular hallmarks and mechanisms. We conclude by identifying some of the most important open research questions about OEE that are apparent in light of the discussions. The York workshop provides a foundation for a follow-up OEE2 workshop taking place at the ALIFE XV conference in Cancún, Mexico, in July 2016. Additional materials from the York workshop, including talk abstracts, presentation slides, and videos of each talk, are available at http://alife.org/ws/oee1. This paper is freely available open access at http://www.mitpressjournals.org/doi/abs/10.1162/ARTL_a_00210.
Article
Full-text available
While evolutionary computation and evolutionary robotics take inspiration from nature, they have long focused mainly on problems of performance optimization. Yet evolution in nature can be interpreted as more nuanced than a process of simple optimization. In particular, natural evolution is a divergent search that optimizes locally within each niche as it simultaneously diversifies. This tendency to discover both quality and diversity at the same time differs from many of the conventional algorithms of machine learning, and also thereby suggests a different foundation for inferring the approach of greatest potential for evolutionary algorithms. In fact, several recent evolutionary algorithms called quality diversity (QD) algorithms (e.g. novelty search with local competition and MAP-Elites) have drawn inspiration from this more nuanced view, aiming to fill a space of possibilities with the best possible example of each type of achievable behavior. The result is a new class of algorithms that return an archive of diverse, high-quality behaviors in a single run. The aim in this paper is to study the application of QD algorithms in challenging environments (in particular complex mazes) to establish their best practices for ambitious domains in the future. In addition to providing insight into cases when QD succeeds and fails, a new approach is investigated that hybridizes multiple views of behaviors (called behavior characterizations) in the same run, which succeeds in overcoming some of the challenges associated with searching for QD with respect to a behavior characterization that is not necessarily sufficient for generating both quality and diversity at the same time.
Article
Full-text available
The Achilles Heel of stochastic optimization algorithms is getting trapped on local optima. Novelty Search mitigates this problem by encouraging exploration in all interesting directions by replacing the performance objective with a reward for novel behaviors. This reward for novel behaviors has traditionally required a human-crafted, behavioral distance function. While Novelty Search is a major conceptual breakthrough and outperforms traditional stochastic optimization on certain problems, it is not clear how to apply it to challenging, high-dimensional problems where specifying a useful behavioral distance function is difficult. For example, in the space of images, how do you encourage novelty to produce hawks and herons instead of endless pixel static? Here we propose a new algorithm, the Innovation Engine, that builds on Novelty Search by replacing the human-crafted behavioral distance with a Deep Neural Network (DNN) that can recognize interesting differences between phenotypes. The key insight is that DNNs can recognize similarities and differences between phenotypes at an abstract level, wherein novelty means interesting novelty. For example, a DNN-based novelty search in the image space does not explore in the low-level pixel space, but instead creates a pressure to create new types of images (e.g. churches, mosques, obelisks, etc.). Here we describe the long-term vision for the Innovation Engine algorithm, which involves many technical challenges that remain to be solved. We then implement a simplified version of the algorithm that enables us to explore some of the algorithm's key motivations. Our initial results, in the domain of images, suggest that Innovation Engines could ultimately automate the production of endless streams of interesting solutions in any domain: e.g. producing intelligent software, robot controllers, optimized physical components, and art.
Article
Full-text available
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task involving finding rewards in random 3D mazes using a visual input.
Article
Full-text available
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
Article
Full-text available
Many fields use search algorithms, which automatically explore a search space to find high-performing solutions: chemists search through the space of molecules to discover new drugs; engineers search for stronger, cheaper, safer designs; scientists search for models that best explain data; etc. The goal of search algorithms has traditionally been to return the single highest-performing solution in a search space. Here we describe a new, fundamentally different type of algorithm that is more useful because it provides a holistic view of how high-performing solutions are distributed throughout a search space. It creates a map of high-performing solutions at each point in a space defined by dimensions of variation that a user gets to choose. This Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) algorithm illuminates search spaces, allowing researchers to understand how interesting attributes of solutions combine to affect performance, either positively or, equally of interest, negatively. For example, a drug company may wish to understand how performance changes as the size of molecules and their cost-to-produce vary. MAP-Elites produces a large diversity of high-performing, yet qualitatively different solutions, which can be more helpful than a single, high-performing solution. Interestingly, because MAP-Elites explores more of the search space, it also tends to find a better overall solution than state-of-the-art search algorithms. We demonstrate the benefits of this new algorithm in three different problem domains ranging from producing modular neural networks to designing simulated and real soft robots. Because MAP-Elites (1) illuminates the relationship between performance and dimensions of interest in solutions, (2) returns a set of high-performing, yet diverse solutions, and (3) improves finding a single, best solution, it will advance science and engineering.
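A minimal sketch of the MAP-Elites loop, here on a toy one-dimensional behavior space, is shown below; the objective, the behavior descriptor, the mutation scale, and the bin count are illustrative stand-ins rather than anything from the paper.

    # Toy MAP-Elites: keep the best solution found in each cell of a behavior grid.
    import random

    def fitness(x):        # performance to maximize (stand-in objective)
        return -(x - 0.3) ** 2

    def behavior(x):       # user-chosen dimension of variation; here, x itself
        return x

    def map_elites(bins=10, iterations=2000):
        archive = {}  # cell index -> (solution, fitness)
        for _ in range(iterations):
            if archive and random.random() < 0.9:
                x, _ = random.choice(list(archive.values()))
                x = min(1.0, max(0.0, x + random.gauss(0, 0.1)))  # mutate a stored elite
            else:
                x = random.random()                                # random bootstrap
            cell = min(bins - 1, int(behavior(x) * bins))
            if cell not in archive or fitness(x) > archive[cell][1]:
                archive[cell] = (x, fitness(x))  # keep only the best per cell
        return archive

    for cell, (x, f) in sorted(map_elites().items()):
        print(f"cell {cell}: x = {x:.2f}, fitness = {f:.3f}")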
Article
Full-text available
While animals can quickly adapt to a wide variety of injuries, current robots cannot "think outside the box" to find a compensatory behavior when damaged: they are either limited to the contingency plans provided by their designers, or need many hours to search for suitable compensatory behaviors. We introduce an intelligent trial and error algorithm that allows robots to adapt to damage in less than 2 minutes, thanks to intuitions that they develop before their mission and experiments that they conduct to validate or invalidate them after damage. The result is a creative process that adapts to a variety of injuries, including damaged, broken, and missing legs. This new technique will enable more robust, effective, autonomous robots and suggests principles that animals may use to adapt to injury.
Conference Paper
Full-text available
In 1994 Karl Sims showed that computational evolution can produce interesting morphologies that resemble natural organisms. Despite nearly two decades of work since, evolved morphologies are not obviously more complex or natural, and the field seems to have hit a complexity ceiling. One hypothesis for the lack of increased complexity is that most work, including Sims', evolves morphologies composed of rigid elements, such as solid cubes and cylinders, limiting the design space. A second hypothesis is that the encodings of previous work have been overly regular, not allowing complex regularities with variation. Here we test both hypotheses by evolving soft robots with multiple materials and a powerful generative encoding called a compositional pattern-producing network (CPPN). Robots are selected for locomotion speed. We find that CPPNs evolve faster robots than a direct encoding and that the CPPN morphologies appear more natural. We also find that locomotion performance increases as more materials are added, that diversity of form and behavior can be increased with different cost functions without stifling performance, and that organisms can be evolved at different levels of resolution. These findings suggest the ability of generative soft-voxel systems to scale towards evolving a large diversity of complex, natural, multi-material creatures. Our results suggest that future work that combines the evolution of CPPN-encoded soft, multi-material robots with modern diversity-encouraging techniques could finally enable the creation of creatures far more complex and interesting than those produced by Sims nearly twenty years ago.
Article
Full-text available
Coevolutionary algorithms approach problems for which no function for evaluating potential solutions is present or known. Instead, algorithms rely on the aggregation of outcomes from interactions among evolving entities in order to make selection decisions. Given the lack of an explicit yardstick, understanding the dynamics of coevolutionary algorithms, judging whether a given algorithm is progressing, and designing effective new algorithms present unique challenges unlike those faced by optimization or evolutionary algorithms. The purpose of this chapter is to provide a foundational understanding of coevolutionary algorithms and to highlight critical theoretical and empirical work done over the last two decades. The chapter outlines the ends and means of coevolutionary algorithms: what they are meant to find, and how they should find it.
Chapter
Full-text available
By synthesizing a growing body of work in search processes that are not driven by explicit objectives, this paper advances the hypothesis that there is a fundamental problem with the dominant paradigm of objective-based search in evolutionary computation and genetic programming: Most ambitious objectives do not illuminate a path to themselves. That is, the gradient of improvement induced by ambitious objectives tends to lead not to the objective itself but instead to dead-end local optima. Indirectly supporting this hypothesis, great discoveries often are not the result of objective-driven search. For example, the major inspiration for both evolutionary computation and genetic programming, natural evolution, innovates through an open-ended process that lacks a final objective. Similarly, large-scale cultural evolutionary processes, such as the evolution of technology, mathematics, and art, lack a unified fixed goal. In addition, direct evidence for this hypothesis is presented from a recently-introduced search algorithm called novelty search. Though ignorant of the ultimate objective of search, in many instances novelty search has counter-intuitively outperformed searching directly for the objective, including a wide variety of randomly-generated problems introduced in an experiment in this chapter. Thus a new understanding is beginning to emerge that suggests that searching for a fixed objective, which is the reigning paradigm in evolutionary computation and even machine learning as a whole, may ultimately limit what can be achieved. Yet the liberating implication of this hypothesis argued in this paper is that by embracing search processes that are not driven by explicit objectives, the breadth and depth of what is reachable through evolutionary methods such as genetic programming may be greatly expanded. Keywords: novelty search, objective-based search, non-objective search, deception, evolutionary computation.
Article
Full-text available
This paper investigates how an evolutionary algorithm with an indirect encoding exploits the property of phenotypic regularity, an important design principle found in natural organisms and engineered designs. We present the first comprehensive study showing that such phenotypic regularity enables an indirect encoding to outperform direct encoding controls as problem regularity increases. Such an ability to produce regular solutions that can exploit the regularity of problems is an important prerequisite if evolutionary algorithms are to scale to high-dimensional real-world problems, which typically contain many regularities, both known and unrecognized. The indirect encoding in this case study is HyperNEAT, which evolves artificial neural networks (ANNs) in a manner inspired by concepts from biological development. We demonstrate that, in contrast to two direct encoding controls, HyperNEAT produces both regular behaviors and regular ANNs, which enables HyperNEAT to significantly outperform the direct encodings as regularity increases in three problem domains. We also show that the types of regularities HyperNEAT produces can be biased, allowing domain knowledge and preferences to be injected into the search. Finally, we examine the downside of a bias toward regularity. Even when a solution is mainly regular, some irregularity may be needed to perfect its functionality. This insight is illustrated by a new algorithm called HybrID that hybridizes indirect and direct encodings, which matched HyperNEAT's performance on regular problems yet outperformed it on problems with some irregularity. HybrID's ability to improve upon the performance of HyperNEAT raises the question of whether indirect encodings may ultimately excel not as stand-alone algorithms, but by being hybridized with a further process of refinement, wherein the indirect encoding produces patterns that exploit problem regularity and the refining process modifies that pattern to capture irregularities. This paper thus paints a more complete picture of indirect encodings than prior studies because it analyzes the impact of the continuum between irregularity and regularity on the performance of such encodings, and ultimately suggests a path forward that combines indirect encodings with a separate process of refinement.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them "curriculum learning". In the context of recent research studying the difficulty of training in the presence of non-convex training criteria (for deep deterministic and stochastic neural networks), we explore curriculum learning in various set-ups. The experiments show that significant improvements in generalization can be achieved. We hypothesize that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
Conference Paper
Full-text available
This paper presents natural evolution strategies (NES), a novel algorithm for performing real-valued 'black box' function optimization: optimizing an unknown objective function where algorithm-selected function measurements constitute the only information accessible to the method. Natural evolution strategies search the fitness landscape using a multivariate normal distribution with a self-adapting mutation matrix to generate correlated mutations in promising regions. NES shares this property with covariance matrix adaptation (CMA), an evolution strategy (ES) which has been shown to perform well on a variety of high-precision optimization tasks. The natural evolution strategies algorithm, however, is simpler, less ad-hoc and more principled. Self-adaptation of the mutation matrix is derived using a Monte Carlo estimate of the natural gradient towards better expected fitness. By following the natural gradient instead of the 'vanilla' gradient, we can ensure efficient update steps while preventing early convergence due to overly greedy updates, resulting in reduced sensitivity to local suboptima. We show NES has competitive performance with CMA on unimodal tasks, while outperforming it on several multimodal tasks that are rich in deceptive local optima.
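Written out, the Monte Carlo search-gradient estimate that underlies this self-adaptation is the standard log-likelihood-trick estimator, which NES then turns into a natural-gradient step by premultiplying with the inverse Fisher matrix F (standard notation, stated here for orientation rather than taken from this page):

    \nabla_\theta \, \mathbb{E}_{x \sim \pi(\cdot \mid \theta)}\left[ f(x) \right]
      = \mathbb{E}_{x \sim \pi(\cdot \mid \theta)}\left[ f(x)\, \nabla_\theta \log \pi(x \mid \theta) \right]
      \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)\, \nabla_\theta \log \pi(x_i \mid \theta),
    \qquad
    \tilde{\nabla}_\theta J = F^{-1} \nabla_\theta J .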
Conference Paper
Full-text available
Legged robots show promise for complex mobility tasks, such as navigating rough terrain, but the design of their control software is both challenging and laborious. Traditional evolutionary algorithms can produce these controllers, but require manual decomposition or other problem simplification because conventionally-used direct encodings have trouble taking advantage of a problem's regularities and symmetries. Such active intervention is time consuming, limits the range of potential solutions, and requires the user to possess a deep understanding of the problem's structure. This paper demonstrates that HyperNEAT, a new and promising generative encoding for evolving neural networks, can evolve quadruped gaits without an engineer manually decomposing the problem. Analyses suggest that HyperNEAT is successful because it employs a generative encoding that can more easily reuse phenotypic modules. It is also one of the first neuroevolutionary algorithms that exploits a problem's geometric symmetries, which may aid its performance. We compare HyperNEAT to FT-NEAT, a direct encoding control, and find that HyperNEAT is able to evolve impressive quadruped gaits and vastly outperforms FT-NEAT. Comparative analyses reveal that HyperNEAT individuals are more holistically affected by genetic operators, resulting in better leg coordination. Overall, the results suggest that HyperNEAT is a powerful algorithm for evolving control systems for complex, yet regular, devices, such as robots.
Conference Paper
Full-text available
An ambitious challenge in artificial life is to craft an evolutionary process that discovers a wide diversity of well-adapted virtual creatures within a single run. Unlike in nature, evolving creatures in virtual worlds tend to converge to a single morphology because selection therein greedily rewards the morphology that is easiest to exploit. However, novelty search, a technique that explicitly rewards diverging, can potentially mitigate such convergence. Thus in this paper an existing creature evolution platform is extended with multi-objective search that balances drives for both novelty and performance. However, there are different ways to combine performance-driven search and novelty search. The suggested approach is to provide evolution with both a novelty objective that encourages diverse morphologies and a local competition objective that rewards individuals outperforming those most similar in morphology. The results in an experiment evolving locomoting virtual creatures show that novelty search with local competition discovers more functional morphological diversity within a single run than models with global competition, which are more predisposed to converge. The conclusions are that novelty search with local competition may complement recent advances in evolving virtual creatures and may in general be a principled approach to combining novelty search with pressure to achieve.
Article
Full-text available
Of all the issues discussed at Alife VII: Looking Forward, Looking Backward, the issue of whether it was possible to create an artificial life system that exhibits open-ended evolution of novelty is by far the biggest. Of the 14 open problems settled on as a result of debate at the conference, some six are directly or indirectly related to this issue. Most people equate open-ended evolution with complexity growth, although a priori these seem to be different things. In this paper I report on experiments to measure the complexity of Tierran organisms, and show the results for a size-neutral run of Tierra. In this run, no increase in organismal complexity was observed, although organism size did increase through the run. This result is discussed, offering some signposts on the path to solving the issue of open-ended evolution.
Article
Full-text available
In evolutionary computation, the fitness function normally measures progress toward an objective in the search space, effectively acting as an objective function. Through deception, such objective functions may actually prevent the objective from being reached. While methods exist to mitigate deception, they leave the underlying pathology untreated: Objective functions themselves may actively misdirect search toward dead ends. This paper proposes an approach to circumventing deception that also yields a new perspective on open-ended evolution. Instead of either explicitly seeking an objective or modeling natural evolution to capture open-endedness, the idea is to simply search for behavioral novelty. Even in an objective-based problem, such novelty search ignores the objective. Because many points in the search space collapse to a single behavior, the search for novelty is often feasible. Furthermore, because there are only so many simple behaviors, the search for novelty leads to increasing complexity. By decoupling open-ended search from artificial life worlds, the search for novelty is applicable to real world problems. Counterintuitively, in the maze navigation and biped walking tasks in this paper, novelty search significantly outperforms objective-based search, suggesting the strange conclusion that some problems are best solved by methods that ignore the objective. The main lesson is the inherent limitation of the objective-based paradigm and the unexploited opportunity to guide search through other means.
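The scoring rule at the heart of novelty search can be sketched in a few lines: a candidate's novelty is its mean behavioral distance to its k nearest neighbors among previously seen behaviors. The 2-D "final position" behavior characterization and the value of k below are illustrative choices, not prescriptions.

    # Novelty score: mean distance to the k nearest behaviors seen so far.
    import numpy as np

    def novelty(behavior, seen_behaviors, k=15):
        seen = np.asarray(seen_behaviors, dtype=float)
        dists = np.linalg.norm(seen - np.asarray(behavior, dtype=float), axis=1)
        k = min(k, len(dists))
        return float(np.sort(dists)[:k].mean())

    archive = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # e.g. final (x, y) positions in a maze
    print(novelty((2.0, 2.0), archive, k=2))  # far from everything seen -> high novelty
    print(novelty((0.1, 0.0), archive, k=2))  # close to known behaviors -> low novelty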
Article
Full-text available
Research in neuroevolution-that is, evolving artificial neural networks (ANNs) through evolutionary algorithms-is inspired by the evolution of biological brains, which can contain trillions of connections. Yet while neuroevolution has produced successful results, the scale of natural brains remains far beyond reach. This article presents a method called hypercube-based NeuroEvolution of Augmenting Topologies (HyperNEAT) that aims to narrow this gap. HyperNEAT employs an indirect encoding called connective compositional pattern-producing networks (CPPNs) that can produce connectivity patterns with symmetries and repeating motifs by interpreting spatial patterns generated within a hypercube as connectivity patterns in a lower-dimensional space. This approach can exploit the geometry of the task by mapping its regularities onto the topology of the network, thereby shifting problem difficulty away from dimensionality to the underlying problem structure. Furthermore, connective CPPNs can represent the same connectivity pattern at any resolution, allowing ANNs to scale to new numbers of inputs and outputs without further evolution. HyperNEAT is demonstrated through visual discrimination and food-gathering tasks, including successful visual discrimination networks containing over eight million connections. The main conclusion is that the ability to explore the space of regular connectivity patterns opens up a new class of complex high-dimensional tasks to neuroevolution.
Article
Full-text available
Several researchers have demonstrated how complex behavior can be learned through neuro-evolution (i.e. evolving neural networks with genetic algorithms). However, complex general behavior such as evading predators or avoiding obstacles, which is not tied to specific environments, turns out to be very difficult to evolve. Often the system discovers mechanical strategies (such as moving back and forth) that help the agent cope, but are not very effective, do not appear believable and would not generalize to new environments. The problem is that a general strategy is too difficult for the evolution system to discover directly. This paper proposes an approach where such complex general behavior is learned incrementally, by starting with simpler behavior and gradually making the task more challenging and general. The task transitions are implemented through successive stages of delta-coding (i.e. evolving modifications), which allows even converged populations to adapt to the new task. The...
Conference Paper
Building on the recent successes of distributed training of RL agents, in this paper we investigate the training of RNN-based RL agents from distributed prioritized experience replay. We study the effects of parameter lag resulting in representational drift and recurrent state staleness and empirically derive an improved training strategy. Using a single network architecture and fixed set of hyperparameters, the resulting agent, Recurrent Replay Distributed DQN, quadruples the previous state of the art on Atari-57, and matches the state of the art on DMLab-30. It is the first agent to exceed human-level performance in 52 of the 57 Atari games.
Article
In many reinforcement learning tasks, the goal is to learn a policy to manipulate an agent, whose design is fixed, to maximize some notion of cumulative reward. The design of the agent’s physical structure is rarely optimized for the task at hand. In this work, we explore the possibility of learning a version of the agent’s design that is better suited for its task, jointly with the policy. We propose an alteration to the popular OpenAI Gym framework, where we parameterize parts of an environment, and allow an agent to jointly learn to modify these environment parameters along with its policy. We demonstrate that an agent can learn a better structure of its body that is not only better suited for the task, but also facilitates policy learning. Joint learning of policy and structure may even uncover design principles that are useful for assisted-design applications.
Article
One program to rule them all: Computers can beat humans at increasingly complex games, including chess and Go. However, these programs are typically constructed for a particular game, exploiting its properties, such as the symmetries of the board on which it is played. Silver et al. developed a program called AlphaZero, which taught itself to play Go, chess, and shogi (a Japanese version of chess) (see the Editorial, and the Perspective by Campbell). AlphaZero managed to beat state-of-the-art programs specializing in these three games. The ability of AlphaZero to adapt to various game rules is a notable step toward achieving a general game-playing system. Science, this issue p. 1140; see also pp. 1087 and 1118.
Article
An important challenge in reinforcement learning, including evolutionary robotics, is to solve multimodal problems, where agents have to act in qualitatively different ways depending on the circumstances. Because multimodal problems are often too difficult to solve directly, it is helpful to take advantage of staging, where a difficult task is divided into simpler subtasks that can serve as stepping stones for solving the overall problem. Unfortunately, choosing an effective ordering for these subtasks is difficult, and a poor ordering can reduce the speed and performance of the learning process. Here, we provide a thorough introduction and investigation of the Combinatorial Multi-Objective Evolutionary Algorithm (CMOEA), which avoids ordering subtasks by allowing all combinations of subtasks to be explored simultaneously. We compare CMOEA against two algorithms that can similarly optimize on multiple subtasks simultaneously: NSGA-II and Lexicase Selection. The algorithms are tested on a multimodal robotics problem with six subtasks as well as a maze navigation problem with a hundred subtasks. On these problems, CMOEA either outperforms or is competitive with the controls. Separately, we show that adding a linear combination over all objectives can improve the ability of NSGA-II to solve these multimodal problems. Lastly, we show that, in contrast to NSGA-II and Lexicase Selection, CMOEA can effectively leverage secondary objectives to achieve state-of-the-art results on the robotics task. In general, our experiments suggest that CMOEA is a promising, state-of-the-art algorithm for solving multimodal problems. Paper is available at: https://arxiv.org/abs/1807.03392
Conference Paper
An evolution strategy (ES) variant based on a simplification of a natural evolution strategy recently attracted attention because it performs surprisingly well in challenging deep reinforcement learning domains. It searches for neural network parameters by generating perturbations to the current set of parameters, checking their performance, and moving in the aggregate direction of higher reward. Because it resembles a traditional finite-difference approximation of the reward gradient, it can naturally be confused with one. However, this ES optimizes for a different gradient than just reward: It optimizes for the average reward of the entire population, thereby seeking parameters that are robust to perturbation. This difference can channel ES into distinct areas of the search space relative to gradient descent, and also consequently to networks with distinct properties. This unique robustness-seeking property, and its consequences for optimization, are demonstrated in several domains. They include humanoid locomotion, where networks from policy gradient-based reinforcement learning are significantly less robust to parameter perturbation than ES-based policies solving the same task. While the implications of such robustness and robustness-seeking remain open to further study, this work's main contribution is to highlight such differences and their potential importance.
Article
We propose a distributed architecture for deep reinforcement learning at scale, that enables agents to learn effectively from orders of magnitude more data than previously possible. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory; the learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors. Our architecture substantially improves the state of the art on the Arcade Learning Environment, achieving better final performance in a fraction of the wall-clock training time.
Article
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
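The minimax game described above is usually written with the following value function (standard formulation; p_data is the data distribution and p_z the generator's input noise):

    \min_G \max_D \; V(D, G) =
      \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[ \log D(x) \right]
      + \mathbb{E}_{z \sim p_z(z)}\left[ \log\left(1 - D(G(z))\right) \right],

at whose equilibrium G reproduces p_data and D(x) = 1/2 everywhere.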
Conference Paper
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
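The activation itself is a one-parameter generalization of the rectifier, with the negative-side slope a_i learned per channel:

    f(y_i) =
      \begin{cases}
        y_i,        & y_i > 0 \\
        a_i \, y_i, & y_i \le 0
      \end{cases}

Setting a_i = 0 recovers the ordinary ReLU, and freezing a_i at a small constant recovers the leaky ReLU.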
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
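In its constrained form, the trust-region step sketched above maximizes a surrogate advantage subject to a bound on how far the policy may move, with δ the trust-region size (standard notation, stated here for orientation):

    \max_{\theta} \;
      \mathbb{E}_{s, a \sim \pi_{\theta_{\mathrm{old}}}}\left[
        \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\,
        A^{\pi_{\theta_{\mathrm{old}}}}(s, a)
      \right]
    \quad \text{subject to} \quad
      \mathbb{E}_{s}\left[
        D_{\mathrm{KL}}\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\middle\|\, \pi_\theta(\cdot \mid s)\right)
      \right] \le \delta .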
Conference Paper
Recent studies have emphasized the merits of search processes that lack overarching objectives, instead promoting divergence by rewarding behavioral novelty. While this less objective search paradigm is more open-ended and divergent, it still differs significantly from nature's mechanism of divergence. Rather than measuring novelty explicitly, nature is guided by a single, fundamental constraint: survive long enough to reproduce. Surprisingly, this simple constraint produces both complexity and diversity in a continual process unparalleled by any algorithm to date. Inspired by the relative simplicity of open-endedness in nature in comparison to recent non-objective algorithms, this paper investigates the extent to which interactions between two coevolving populations, both subject to their own constraint, or minimal criterion, can produce results that are both functional and diverse even without any behavior characterization or novelty archive. To test this new approach, a novel maze navigation domain is introduced wherein evolved agents must learn to navigate mazes whose structures are simultaneously coevolving and increasing in complexity. The result is a broad range of maze topologies and successful agent trajectories in a single run, thereby suggesting the viability of minimal criterion coevolution as a new approach to non-objective search and a step towards genuinely open-ended algorithms.
Article
We explore the use of Evolution Strategies, a class of black box optimization algorithms, as an alternative to popular RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using hundreds to thousands of parallel workers, ES can solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training time. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
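The update behind this style of ES fits in a few lines; below is a minimal sketch on a toy quadratic objective, where the objective, population size, σ, learning rate, and return standardization are placeholders rather than the paper's exact settings.

    # Minimal ES update: perturb parameters, weight perturbations by (normalized) return.
    import numpy as np

    def objective(theta):
        return -np.sum((theta - 1.0) ** 2)  # stand-in for an episode return

    def es_step(theta, pop=50, sigma=0.1, alpha=0.02):
        eps = np.random.randn(pop, theta.size)              # perturbation directions
        returns = np.array([objective(theta + sigma * e) for e in eps])
        advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
        grad = eps.T @ advantages / (pop * sigma)           # stochastic gradient estimate
        return theta + alpha * grad

    theta = np.zeros(5)
    for _ in range(300):
        theta = es_step(theta)
    print(theta)  # approaches the optimum at all ones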
Article
OpenAI Gym is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.
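The common interface amounts to a reset/step loop. The snippet below uses the classic signature from the era of the whitepaper (newer Gym/Gymnasium releases return additional values from reset and step); CartPole is just one stock environment, and the random policy is a placeholder.

    # Classic Gym control loop with a random policy.
    import gym

    env = gym.make("CartPole-v1")
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()                  # placeholder policy
        observation, reward, done, info = env.step(action)
        total_reward += reward
    env.close()
    print("episode return:", total_reward)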
Conference Paper
Many argue that to evolve artificial intelligence that rivals that of natural animals, we need to evolve neural networks that are structurally organized in that they exhibit modularity, regularity, and hierarchy. It was recently shown that a cost for network connections, which encourages the evolution of modularity, can be combined with an indirect encoding, which encourages the evolution of regularity, to evolve networks that are both modular and regular. However, the bias towards regularity from indirect encodings may prevent evolution from independently optimizing different modules to perform different functions, unless modularity in the phenotype is aligned with modularity in the genotype. We test this hypothesis on two multi-modal problems—a pattern recognition task and a robotics task—that each require different phenotypic modules. In general, we find that performance is improved only when genotypic and phenotypic modularity are encouraged simultaneously, though the role of alignment remains unclear. In addition, intuitive manual decompositions fail to provide the performance benefits of automatic methods on the more challenging robotics problem, emphasizing the importance of automatic, rather than manual, decomposition methods. These results suggest encouraging modularity in both the genotype and phenotype as an important step towards solving large-scale multi-modal problems, but also indicate that more research is required before we can evolve structurally organized networks to solve tasks that require multiple, different neural modules.
Book
Why does modern life revolve around objectives? From how science is funded, to improving how children are educated -- and nearly everything in-between -- our society has become obsessed with a seductive illusion: that greatness results from doggedly measuring improvement in the relentless pursuit of an ambitious goal. In Why Greatness Cannot Be Planned, Stanley and Lehman begin with a surprising scientific discovery in artificial intelligence that leads ultimately to the conclusion that the objective obsession has gone too far. They make the case that great achievement can’t be bottled up into mechanical metrics; that innovation is not driven by narrowly focused heroic effort; and that we would be wiser (and the outcomes better) if instead we whole-heartedly embraced serendipitous discovery and playful creativity. Controversial at its heart, yet refreshingly provocative, this book challenges readers to consider life without a destination and discovery without a compass.
Article
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
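Concretely, the deep Q-network is trained by regressing toward a bootstrapped target built from a periodically frozen copy of the network, θ⁻, over transitions drawn from the replay memory D (standard formulation):

    L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}
      \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right] .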
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
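For reference, the per-parameter update with the bias-corrected moment estimates, where g_t is the stochastic gradient at step t and α, β1, β2, ε are the usual hyper-parameters:

    m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
    v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^{2},
    \qquad
    \hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}, \qquad
    \hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}, \qquad
    \theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} .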
Conference Paper
Humans and animals acquire their wide repertoire of motor skills through an incremental learning process, during which progressively more complex skills are acquired and subsequently integrated with prior abilities. Inspired by this general idea, we develop an approach for learning motor skills based on a two-level curriculum. At the high level, the curriculum specifies an order in which different skills should be learned. At the low level, the curriculum defines a process for learning within a skill. We develop a set of integrated motor skills for a planar articulated figure capable of doing parameterized hops, flips, rolls, and acrobatic sequences. The same curriculum can be applied to yield individualized motor skill sets for articulated figures of varying proportions.
Article
Natural DNA can encode complexity on an enormous scale. Researchers are attempting to achieve the same representational efficiency in computers by implementing developmental encodings, i.e. encodings that map the genotype to the phenotype through a process of growth from a small starting point to a mature form. A major challenge in this effort is to find the right level of abstraction of biological development to capture its essential properties without introducing unnecessary inefficiencies. In this paper, a novel abstraction of natural development, called Compositional Pattern Producing Networks (CPPNs), is proposed. Unlike currently accepted abstractions such as iterative rewrite systems and cellular growth simulations, CPPNs map to the phenotype without local interaction, that is, each individual component of the phenotype is determined independently of every other component. Results produced with CPPNs through interactive evolution of two-dimensional images show that such an encoding can nevertheless produce structural motifs often attributed to more conventional developmental abstractions, suggesting that local interaction may not be essential to the desirable properties of natural encoding in the way that is usually assumed.
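A toy illustration of the idea, using a hand-written composition instead of an evolved one, is given below: each output value is computed independently from its (x, y) coordinates by composing a few pattern-friendly functions, which already yields symmetric and repeating motifs.

    # Hand-written CPPN-like pattern: each pixel depends only on its own coordinates.
    import math

    def cppn(x, y):
        h1 = math.sin(4.0 * x)               # periodic node: repetition
        h2 = math.exp(-(x * x + y * y))      # Gaussian node: radial symmetry
        return math.tanh(h1 + 2.0 * h2 * y)  # output node

    for j in range(20):
        row = ""
        for i in range(40):
            x, y = i / 20.0 - 1.0, j / 10.0 - 1.0
            row += "#" if cppn(x, y) > 0 else "."
        print(row)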
Article
We present a model-free reinforcement learning method for partially observable Markov decision problems. Our method estimates a likelihood gradient by sampling directly in parameter space, which leads to lower variance gradient estimates than obtained by regular policy gradient methods. We show that for several complex control tasks, including robust standing with a humanoid robot, this method outperforms well-known algorithms from the fields of standard policy gradients, finite difference methods and population based heuristics. We also show that the improvement is largest when the parameter samples are drawn symmetrically. Lastly we analyse the importance of the individual components of our method by incrementally incorporating them into the other algorithms, and measuring the gain in performance after each step.