Preprint

Topological Guided Actor-Critic Modular Learning of Continuous Systems with Temporal Objectives


Abstract

This work investigates formal policy synthesis for continuous-state stochastic dynamic systems given high-level specifications in linear temporal logic. To learn an optimal policy that maximizes the satisfaction probability, we take the product of the dynamic system and the automaton translated from the specification, and solve an optimal planning problem on the resulting product system. Since this product system has a hybrid product state space that results in reward sparsity, we introduce a generalized optimal backup order, in reverse topological order, to guide the value backups and accelerate learning. We prove the optimality of using the generalized optimal backup order for this planning problem. Further, this paper presents an actor-critic reinforcement learning algorithm for the case where a topological order applies. This algorithm leverages advanced mathematical techniques and enjoys hyperparameter self-tuning; we prove its optimality and convergence. We use neural networks to approximate the value function and policy function over the hybrid product state space. Furthermore, we observe that encoding automaton states as integer inputs imposes an ordinal ranking on the value (or policy) functions approximated by neural networks. To break this ordinal relationship, we use an individual neural network for the value (policy) function of each automaton state, termed modular learning. We conduct two experiments. First, to show the efficacy of our reinforcement learning algorithm, we compare it with baselines on a classic control task, CartPole. Second, we demonstrate the empirical performance of our formal policy synthesis framework on motion planning of a Dubins car with a temporal specification.
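To make the modular-learning idea concrete, the sketch below keeps one small value network per automaton state, so the choice of which network to evaluate never encodes the automaton state as an ordinal input. This is a minimal NumPy illustration under assumed shapes (a two-layer network, a three-dimensional continuous state), not the authors' implementation.

```python
import numpy as np

class ModularValueFunction:
    """One small value network per automaton state q, so no ordinal
    relationship between automaton states is implied by the approximation."""

    def __init__(self, automaton_states, state_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        # Independent parameters for each automaton state (assumed 2-layer MLP).
        self.nets = {
            q: {
                "W1": rng.normal(0.0, 0.1, (hidden, state_dim)),
                "b1": np.zeros(hidden),
                "W2": rng.normal(0.0, 0.1, (1, hidden)),
                "b2": np.zeros(1),
            }
            for q in automaton_states
        }

    def value(self, x, q):
        """Evaluate V(x, q) with the network belonging to automaton state q."""
        p = self.nets[q]
        h = np.tanh(p["W1"] @ x + p["b1"])
        return (p["W2"] @ h + p["b2"]).item()

# Hypothetical product state (x, q): a 3-dimensional continuous state and a
# 3-state automaton; shapes are illustrative only.
V = ModularValueFunction(automaton_states=[0, 1, 2], state_dim=3)
print(V.value(np.array([0.5, -0.2, 0.1]), q=1))
```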


Conference Paper
We study the problem of learning control policies for complex tasks given by logical specifications. A typical approach automatically generates a reward function from the given specification and uses a suitable reinforcement learning algorithm to learn a policy that maximizes the expected reward. Such approaches, however, scale poorly to complex tasks that require high-level planning. In this work, we develop a compositional learning approach, called DiRL, that leverages the specification to decompose the task into a high-level planning problem and a set of simpler reinforcement learning tasks. An evaluation of DiRL on a challenging control benchmark with continuous state and action spaces demonstrates that it outperforms state-of-the-art baselines.
Article
We consider synthesis of control policies that maximize the probability of satisfying given temporal logic specifications in unknown, stochastic environments. We model the interaction between the system and its environment as a Markov decision process (MDP) with initially unknown transition probabilities. The solution we develop builds on the so-called model-based probably approximately correct Markov decision process (PAC-MDP) methodology. The algorithm attains an ε-approximately optimal policy with probability 1−δ using samples (i.e. observations), time and space that grow polynomially with the size of the MDP, the size of the automaton expressing the temporal logic specification, 1/ε, 1/δ and a finite time horizon. In this approach, the system maintains a model of the initially unknown MDP, and constructs a product MDP based on its learned model and the specification automaton that expresses the temporal logic constraints. During execution, the policy is iteratively updated using observation of the transitions taken by the system. The iteration terminates in finitely many steps. With high probability, the resulting policy is such that, for any state, the difference between the probability of satisfying the specification under this policy and the optimal one is within a predefined bound.
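A generic sketch of the product construction this abstract refers to, synchronizing a (learned) MDP with a specification automaton; the data layout (nested dictionaries for transition estimates, callables for the labeling and automaton transition functions) is an illustrative assumption, and the PAC-MDP learning loop itself is omitted.

```python
from itertools import product

def build_product_mdp(mdp_states, actions, trans_prob, label, aut_states,
                      aut_delta, aut_accept):
    """Synchronize a (learned) MDP with a specification automaton.

    trans_prob[s][a]  -- dict {s_next: probability}, e.g. learned estimates
    label(s_next)     -- atomic propositions holding at s_next
    aut_delta(q, l)   -- deterministic automaton transition on label set l
    aut_accept        -- set of accepting automaton states
    """
    prod_states = list(product(mdp_states, aut_states))
    prod_trans = {}
    for (s, q) in prod_states:
        for a in actions:
            dist = {}
            for s_next, p in trans_prob[s][a].items():
                sq_next = (s_next, aut_delta(q, label(s_next)))
                dist[sq_next] = dist.get(sq_next, 0.0) + p
            prod_trans[((s, q), a)] = dist
    accepting = {(s, q) for (s, q) in prod_states if q in aut_accept}
    return prod_states, prod_trans, accepting
```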
Article
Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide both inspiration, impact, and validation for developments in reinforcement learning. The relationship between disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning as well as notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value-function-based and policy-search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.
Conference Paper
Linear systems are one of the most commonly used models to represent physical systems. Yet, only a few automated tools have been developed to check their behaviors over time. In this paper, we propose a linear temporal logic for specifying complex properties of discrete-time linear systems. The proposed logic can also be used in a control system to generate control input in the process of model checking. Although developing a full feedback control system is beyond the scope of this paper, the authors believe that a feedback loop can be easily introduced by adopting the receding-horizon scheme of predictive controllers. In this paper we explain the syntax, the semantics, a model checking algorithm, and an example application of our proposed logic.
Article
Value iteration is a powerful yet inefficient algorithm for Markov decision processes (MDPs) because it puts the majority of its effort into backing up the entire state space, which turns out to be unnecessary in many cases. In order to overcome this problem, many approaches have been proposed. Among them, ILAO* and variants of RTDP are state-of-the-art ones. These methods use reachability analysis and heuristic search to avoid some unnecessary backups. However, none of these approaches builds the graphical structure of the state transitions in a pre-processing step or uses that structural information to systematically decompose a problem and thereby generate an intelligent backup sequence over the state space. In this paper, we present two optimal MDP algorithms. The first algorithm, topological value iteration (TVI), detects the structure of MDPs and backs up states based on topological sequences. It (1) divides an MDP into strongly-connected components (SCCs), and (2) solves these components sequentially. TVI vastly outperforms VI and other state-of-the-art algorithms when an MDP has multiple, close-to-equal-sized SCCs. The second algorithm, focused topological value iteration (FTVI), is an extension of TVI. FTVI restricts its attention to connected components that are relevant for solving the MDP. Specifically, it uses a small amount of heuristic search to eliminate provably sub-optimal actions; this pruning allows FTVI to find smaller connected components, thus running faster. We demonstrate that FTVI outperforms TVI by an order of magnitude, averaged across several domains. Surprisingly, FTVI also significantly outperforms popular heuristically-informed MDP algorithms such as ILAO*, LRTDP, BRTDP and Bayesian-RTDP in many domains, sometimes by as much as two orders of magnitude. Finally, we characterize the type of domains where FTVI excels, suggesting a way to make an informed choice of solver.
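The backup order that TVI exploits can be sketched compactly: decompose the transition graph into SCCs, then run Bellman backups component by component so that every component is solved only after all of its successor components have converged. The data structures below (P[s][a] as a dictionary of successor probabilities, R[s][a] as rewards) are assumptions for illustration; this is not the authors' code.

```python
from collections import defaultdict

def sccs_sources_first(states, succ):
    """Kosaraju's algorithm: SCCs of the transition graph, listed with the
    source components of the condensation first. (Recursive DFS; adequate
    for a sketch, not for very large state spaces.)"""
    visited, order = set(), []
    def dfs(u):
        visited.add(u)
        for v in succ[u]:
            if v not in visited:
                dfs(v)
        order.append(u)
    for s in states:
        if s not in visited:
            dfs(s)
    pred = defaultdict(list)
    for u in states:
        for v in succ[u]:
            pred[v].append(u)
    comps, assigned = [], set()
    for u in reversed(order):
        if u in assigned:
            continue
        comp, stack = [], [u]
        assigned.add(u)
        while stack:
            x = stack.pop()
            comp.append(x)
            for w in pred[x]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def topological_value_iteration(states, actions, P, R, gamma=0.95, tol=1e-8):
    """Back up one SCC at a time, in reverse topological order, so that every
    component is solved only after all of its successor components."""
    succ = {s: {s2 for a in actions for s2 in P[s][a]} for s in states}
    V = {s: 0.0 for s in states}
    for comp in reversed(sccs_sources_first(states, succ)):  # sinks first
        delta = tol + 1.0
        while delta > tol:
            delta = 0.0
            for s in comp:
                v_new = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                            for a in actions)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
    return V
```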
Chapter
The combination of data-driven learning methods with formal reasoning has seen a surge of interest, as each area has the potential to bolster the other. For instance, formal methods promise to expand the use of state-of-the-art learning approaches in the direction of certification and sample efficiency. In this work, we propose a deep Reinforcement Learning (RL) method for policy synthesis in continuous-state/action unknown environments, under requirements expressed in Linear Temporal Logic (LTL). We show that this combination lifts the applicability of deep RL to complex temporal and memory-dependent policy synthesis goals. We express an LTL specification as a Limit Deterministic Büchi Automaton (LDBA) and synchronise it on-the-fly with the agent/environment. The LDBA in practice monitors the environment, acting as a modular reward machine for the agent: accordingly, a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy that maximises the probability of satisfying the given LTL formula. We evaluate our framework in a cart-pole example and in a Mars rover experiment, where we achieve near-perfect success rates, while baselines based on standard RL are shown to fail in practice.
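The reward-machine role of the automaton described above amounts to one synchronized step per interaction: advance the environment, advance the automaton on the labels of the new state, and pay a sparse reward on acceptance. The sketch below assumes state-based acceptance and generic callables; it omits LDBA ε-transitions and the paper's exact reward shaping.

```python
def product_step(env_step, aut_delta, accepting, x, q, action, label):
    """One on-the-fly step of the agent/automaton product.

    env_step(x, action) -> next continuous state
    label(x)            -> atomic propositions true at x
    aut_delta(q, props) -> next automaton state
    accepting           -> accepting automaton states (state-based acceptance)
    """
    x_next = env_step(x, action)
    q_next = aut_delta(q, label(x_next))
    reward = 1.0 if q_next in accepting else 0.0  # sparse acceptance reward
    return (x_next, q_next), reward
```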
Article
Of special interest in formal verification are safety properties, which assert that the system always stays within some allowed region. Proof rules for the verification of safety properties have been developed in the proof-based approach to verification, making verification of safety properties simpler than verification of general properties. In this paper we consider model checking of safety properties. A computation that violates a general linear property reaches a bad cycle, which witnesses the violation of the property. Accordingly, current methods and tools for model checking of linear properties are based on a search for bad cycles. A symbolic implementation of such a search involves the calculation of a nested fixed-point expression over the system's state space, and is often infeasible. Every computation that violates a safety property has a finite prefix along which the property is violated. We use this fact in order to base model checking of safety properties on a search for finite bad prefixes. Such a search can be performed using a simple forward or backward symbolic reachability check. A naive methodology that is based on such a search involves a construction of an automaton (or a tableau) that is doubly exponential in the property. We present an analysis of safety properties that enables us to prevent the doubly-exponential blow up and to use the same automaton used for model checking of general properties, replacing the search for bad cycles by a search for bad prefixes.
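The observation that every violation of a safety property has a finite bad prefix can be made concrete with a small monitor that runs a deterministic bad-prefix automaton over a finite trace; the automaton interface here is an illustrative assumption.

```python
def first_bad_prefix(trace, delta, q0, bad_states):
    """Return the length of the shortest bad prefix of `trace`, or None.

    delta(q, letter) is the transition function of a deterministic automaton
    for bad prefixes; reaching a state in bad_states means the prefix read so
    far already violates the safety property.
    """
    q = q0
    for i, letter in enumerate(trace):
        q = delta(q, letter)
        if q in bad_states:
            return i + 1  # trace[:i + 1] is a bad prefix
    return None
```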
Conference Paper
We present Spot 2.0, a C++ library with Python bindings and an assortment of command-line tools designed to manipulate LTL and ω-automata in batch. New automata-manipulation tools were introduced in Spot 2.0; they support arbitrary acceptance conditions, as expressible in the Hanoi Omega-Automata format. Besides being useful to researchers who have automata to process, its Python bindings can also be used in interactive environments to teach ω-automata and model checking.
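For reference, a minimal use of Spot's Python bindings to translate an LTL formula into an ω-automaton and print it in the HOA format; this assumes Spot 2.x with the Python bindings installed, and the defaults of translate may differ between versions.

```python
import spot  # Spot's Python bindings

# Translate an LTL formula into an omega-automaton.
formula = spot.formula('G(request -> F grant)')
automaton = spot.translate(formula)

# Print the automaton in the Hanoi Omega-Automata (HOA) format.
print(automaton.to_str('hoa'))
```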
Conference Paper
Resource management problems in systems and networking often manifest as difficult online decision making tasks where appropriate solutions depend on understanding the workload and environment. Inspired by recent advances in deep reinforcement learning for AI problems, we consider building systems that learn to manage resources directly from experience. We present DeepRM, an example solution that translates the problem of packing tasks with multiple resource demands into a learning problem. Our initial results show that DeepRM performs comparably to state-of-the-art heuristics, adapts to different conditions, converges quickly, and learns strategies that are sensible in hindsight.
Article
In this paper, we develop a method to automatically generate a control policy for a dynamical system modeled as a Markov Decision Process (MDP). The control specification is given as a Linear Temporal Logic (LTL) formula over a set of propositions defined on the states of the MDP. Motivated by robotic applications requiring persistent tasks, such as environmental monitoring and data gathering, we synthesize a control policy that minimizes the expected cost between satisfying instances of a particular proposition over all policies that maximize the probability of satisfying the given LTL specification. Our approach is based on the definition of a novel optimization problem that extends the existing average cost per stage problem. We propose a sufficient condition for a policy to be optimal, and develop a dynamic programming algorithm that synthesizes a policy that is optimal for a set of LTL specifications.
Article
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Article
Shows how a system consisting of 2 neuronlike adaptive elements can solve a difficult control problem in which it is assumed that the equations of the system are not known and that the only feedback evaluating performance is a failure signal. This evaluative feedback is of much lower quality than is required by standard adaptive control techniques. It is argued that the learning problems faced by adaptive elements that are components of adaptive networks are at least as difficult as this problem. The learning system consists of a single associative search element (ASE) and a single adaptive critic element (ACE). In the course of learning to balance the pole, the ASE constructs associations between input and output by searching under the influence of reinforcement feedback, and the ACE constructs a more informative evaluation function than reinforcement feedback alone can provide. The differences between this approach and other attempts to solve problems using neuronlike elements are discussed, as is the relation of the ACE/ASE system to classical and instrumental conditioning in animal learning studies. Implications for research in the neurosciences are noted.
Article
Chaotic dynamical systems are often transitive, although this transitivity is sometimes very weak. It is of interest to divide the phase space into large regions, between which there is relatively little communication of trajectories. We present fast, simple algorithms to find such divisions. The present work builds on the results of Froyland and Dellnitz [G. Froyland, M. Dellnitz, Detecting and locating near-optimal almost-invariant sets and cycles, SIAM J. Sci. Comput. 24 (6) (2003) 1839–1863], focussing on a statistical description of transitivity that takes into account the fact that trajectories tend to visit different regions of phase space with different frequencies. The new work takes advantage of theoretical results from the theory of reversible Markov chains. A new adaptive algorithm is put forward to efficiently deal with situations where the boundaries of the weakly communicating regions are complicated. This algorithm is illustrated with the standard map. Relevant convergence results are proven.
Article
Chapter 6: Approximate Dynamic Programming. This is an updated version of the research-oriented Chapter 6 on Approximate Dynamic Programming. It will be periodically updated as new research becomes available, and will replace the current Chapter 6 in the book's next printing. In addition to editorial revisions, rearrangements, and new exercises, the chapter includes an account of new research, which is collected mostly in Sections 6.3 and 6.8. Furthermore, a lot of new material has been added, such as an account of post-decision state simplifications (Section 6.1), regression-based TD methods (Section 6.3), exploration schemes and optimistic policy iteration (Section 6.3), convergence analysis of Q-learning (Section 6.4), aggregation methods (Section 6.5), and Monte Carlo linear algebra (Section 6.8). This chapter represents "work in progress." It more than likely contains errors (hopefully not serious ones). Furthermore, its references to the literature are incomplete. Your comments and suggestions to the author at dimitrib@mit.edu are welcome. When quoting, please refer to the date of last revision given below.
Article
Missions with high combinatorial complexity involving several logical and temporal constraints often arise in cooperative control of multiple Uninhabited Aerial Vehicles. In this paper, we propose a new class of problems that generalizes the standard Vehicle Routing Problem (VRP) by addressing complex tasks and constraints on the mission, called the ‘mission specifications’, expressed in a high-level specification language. In the generalized problem setup, these mission specifications are naturally specified using the Linear Temporal Logic language LTL−X. Using a novel systematic procedure, the LTL−X specification is converted to a set of constraints suitable to a Mixed-Integer Linear Programming (MILP) formulation, which in turn can be incorporated into two widely-used MILP formulations of the standard VRP. Solving the resulting MILP provides an optimal plan that satisfies the given mission specification. The paper also presents two mission planning applications. Copyright © 2011 John Wiley & Sons, Ltd.
Conference Paper
While the use of naturally-occurring features is a central focus of machine perception, artificial features (fiducials) play an important role in creating controllable experiments, ground truthing, and in simplifying the development of systems where perception is not the central objective. We describe a new visual fiducial system that uses a 2D bar code style "tag", allowing full 6 DOF localization of features from a single image. Our system improves upon previous systems, incorporating a fast and robust line detection system, a stronger digital coding system, and greater robustness to occlusion, warping, and lens distortion. While similar in concept to the ARTag system, our method is fully open and the algorithms are documented in detail.
Conference Paper
Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR).
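As background for the controller-synthesis step mentioned above, here is a minimal finite-horizon LQR computed by the backward Riccati recursion, the building block that DDP extends to nonlinear models; the double-integrator model at the end is purely an illustrative assumption and is unrelated to the helicopter dynamics used in the paper.

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, Qf, T):
    """Backward Riccati recursion for x_{t+1} = A x_t + B u_t with cost
    sum_t (x_t' Q x_t + u_t' R u_t) + x_T' Qf x_T.
    Returns gains K_0..K_{T-1} for the feedback law u_t = -K_t x_t."""
    P = Qf
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return list(reversed(gains))

# Illustrative double-integrator model (an assumption, not helicopter dynamics).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
gains = finite_horizon_lqr(A, B, Q=np.eye(2), R=np.eye(1), Qf=10 * np.eye(2), T=50)
print(gains[0])  # feedback gain applied at the first step
```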
Conference Paper
Recently, linear temporal logic (LTL) has been employed as a tool for formal specification in dynamical control systems. With this formal approach, control systems can be designed to provably accomplish a large class of complex tasks specified via LTL. For this purpose, language generating Buchi automata with finite abstractions of dynamical systems have been used in the literature. In this paper, we take a mathematical programming-based approach to control of a broad class of discrete-time dynamical systems, called mixed logic dynamical (MLD) systems, with LTL specifications. MLDs include discontinuous and hybrid piecewise discrete-time linear systems. We apply these tools for model checking and optimal control of MLD systems with LTL specifications. Our algorithms exploit mixed integer linear programming (MILP) as well as, in the appropriate setting, mixed integer quadratic programming (MIQP) techniques. Our solution approach introduces a general technique useful in representing LTL constraints as mixed-integer linear constraints.
Conference Paper
We consider the synthesis of a reactive module with input x and output y, which is specified by the linear temporal formula φ(x, y). We show that there exists a program satisfying φ iff the branching time formula (∀x)(∃y)Aφ(x, y) is valid over all tree models.
Article
The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach "fits" a linear combination of pre-selected basis functions to the dynamic programming cost-to-go function. We develop error bounds that offer performance guarantees and also guide the selection of both basis functions and "state-relevance weights" that influence quality of the approximation. Experimental results in the domain of queueing network control provide empirical support for the methodology.
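In the standard presentation that this abstract summarizes, the method fits the weight vector r of the basis-function combination Φr by solving an approximate linear program. The formulation below uses the usual discounted-cost notation (c for state-relevance weights, g for per-stage cost, α for the discount factor, P_a for transition probabilities) and is not quoted from the article itself.

```latex
\begin{aligned}
\max_{r} \quad & c^{\top} \Phi r \\
\text{s.t.} \quad & (\Phi r)(s) \;\le\; g(s,a) + \alpha \sum_{s'} P_a(s,s')\,(\Phi r)(s')
  \qquad \text{for all states } s \text{ and actions } a.
\end{aligned}
```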
Article
Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911) The idea of learning to make appropriate responses based on reinforcing events has its roots in early psychological theories such as Thorndike's "law of effect" (quoted above). Although several important contributions were made in the 1950s, 1960s and 1970s by illustrious luminaries such as Bellman, Minsky, Klopf and others (Farley and Clark, 1954; Bellman, 1957; Minsky, 1961; Samuel, 1963; Michie and Chambers, 1968; Grossberg, 1975; Klopf, 1982), the last two decades have witnessed perhaps the strongest advances in the mathematical foundations of reinforcement learning, in addition to several impressive demonstrations of the performance of reinforcement learning algorithms in real world tasks. The introductory book by Sutton and Barto, two of the most influential and recognized leaders in the field, is therefore both timely and welcome. The book is divided into three parts. In the first part, the authors introduce and elaborate on the essential characteristics of the reinforcement learning problem, namely, the problem of learning "policies" or mappings from environmental states to actions so as to maximize the amount of "reward"
C. A. Belta, R. Majumdar, M. Zamani, and M. Rungger, "Formal Synthesis of Cyber-Physical Systems (Dagstuhl Seminar 17201)," Dagstuhl Reports, vol. 7, no. 5, pp. 84-96, 2017. [Online]. Available: http://drops.dagstuhl.de/opus/volltexte/2017/8281
H. Mao, Y. Chen, M. Jaeger, T. D. Nielsen, K. G. Larsen, and B. Nielsen, "Learning Markov decision processes for model checking," arXiv preprint arXiv:1212.3873, 2012.
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning. PMLR, 2018, pp. 1861-1870.
O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, "Bridging the gap between value and policy based reinforcement learning," arXiv preprint arXiv:1702.08892, 2017.
T. Degris, M. White, and R. S. Sutton, "Off-policy actor-critic," arXiv preprint arXiv:1205.4839, 2012.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning. PMLR, 2016, pp. 1928-1937.
J. Ashton, M. E. Spencer, and S. Hunter, "Autonomous RC car platform," Ph.D. dissertation, Worcester Polytechnic Institute, 2019.
X. C. D. Ding, S. L. Smith, C. Belta, and D. Rus, "LTL control in uncertain environments with probabilistic satisfaction guarantees," IFAC Proceedings Volumes, vol. 44, no. 1, pp. 3515-3520, 2011.
E. M. Wolff, U. Topcu, and R. M. Murray, "Optimization-based trajectory generation with linear temporal logic specifications," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 5319-5325.
P. Schillinger, M. Bürger, and D. V. Dimarogonas, "Hierarchical LTL-task MDPs for multi-agent coordination through auctioning and learning," The International Journal of Robotics Research, 2019.