Chapter

On Normative Reinforcement Learning via Safe Reinforcement Learning


Abstract

Reinforcement learning (RL) has proven a successful technique for teaching autonomous agents goal-directed behaviour. As RL agents further integrate with our society, they must learn to comply with ethical, social, or legal norms. Defeasible deontic logics are natural formal frameworks to specify and reason about such norms in a transparent way. However, their effective and efficient integration in RL agents remains an open problem. On the other hand, linear temporal logic (LTL) has been successfully employed to synthesize RL policies satisfying, e.g., safety requirements. In this paper, we investigate the extent to which the established machinery for safe reinforcement learning can be leveraged for directing normative behaviour for RL agents. We analyze some of the difficulties that arise from attempting to represent norms with LTL, provide an algorithm for synthesizing LTL specifications from certain normative systems, and analyze its power and limits with a case study.
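To make the connection between norms and the safe-RL machinery concrete, the following minimal sketch (ours, not the paper's implementation; the labelling function, proposition names, and the norm-to-formula mapping are illustrative assumptions) shows how a simple prohibition can be read as an LTL safety invariant of the form G ¬p and enforced by filtering the agent's available actions at every step.

```python
# Minimal illustrative sketch (not the authors' algorithm): a prohibition
# "it is forbidden that p" is read as the safety formula G(not p) and
# enforced by filtering the agent's actions at every step.
# The labelling function and proposition names are hypothetical.

def forbids(prohibited_props):
    """Step invariant for G(not (p1 and ... and pn))."""
    prohibited = set(prohibited_props)
    return lambda labels: not prohibited <= labels

def safe_actions(state, actions, labeling, invariants):
    """Keep only actions whose labels satisfy every step invariant."""
    return [a for a in actions
            if all(inv(labeling(state, a)) for inv in invariants)]

# Toy usage: forbid entering the market on a holiday.
labeling = lambda s, a: {"market", "holiday"} if (s, a) == ("gate", "enter") else set()
invariants = [forbids({"market", "holiday"})]
print(safe_actions("gate", ["enter", "wait"], labeling, invariants))  # ['wait']
```

Filtering actions against such invariants corresponds to the shielding style of enforcement discussed in the references below; norms whose violation depends on other norms are precisely the cases where this direct mapping becomes problematic.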


... However, this technique is specifically defined for model-based RL, where the MDP the agent operates in is already known. It is known that there are limitations to LTL as a language for representing norms (see [13,36]), and to get around some of these, this technique specifically employs what is called "implicit representation" of norms in [36]. However, the remaining limitations include an inability to represent strong permission naturally, to cope with obligations/permissions that are conditional on other obligations/permissions, and to account for counts-as (or constitutive) norms. ...
... Finally, we propose an algorithm for constructing a "normative filter" which can be used to replace the continual running of the normative supervisor, allowing us to use techniques such as NGRL without calling a theorem prover at every step during training or operation; we will show that using this normative filter dramatically improves training time. Throughout the paper, we make use of a case study, the "Travelling Merchant" (introduced in [36]), to demonstrate the inadequacy of the notion of norm violation implicitly espoused by [35,37,38], how regular NGRL fails when faced with contrary-to-duty obligations, how violation counting remedies this issue, and how this technique manages trade-offs between immediate and delayed violations more generally. ...
Article
Full-text available
Reinforcement learning (RL) is a powerful tool for teaching agents goal-directed behaviour in stochastic environments, and many proposed applications involve adopting societal roles which have ethical, legal, or social norms attached to them. Though multiple approaches exist for teaching RL agents norm-compliant behaviour, there are limitations on what normative systems they can accommodate. In this paper we analyse and improve the techniques proposed for use with the Normative Supervisor (Neufeld et al., 2021)—a module which uses conclusions gleaned from a defeasible deontic logic theorem prover to restrict the behaviour of RL agents. First, we propose a supplementary technique we call violation counting to broaden the range of normative systems we can learn from, thus covering normative conflicts and contrary-to-duty norms. Additionally, we propose an algorithm for constructing a “normative filter”, a function that can be used to implement the addressed techniques without requiring the theorem prover to be run at each step during training or operation, significantly decreasing the overall computational overhead of using the normative supervisor. To demonstrate these contributions, we use a computer game-based case study, and conclude by discussing the problems that remain to be solved.
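As a reading aid, here is a minimal sketch of the "normative filter" idea as we understand it (the prover interface, state encoding, and fallback behaviour are placeholders, not the paper's API): compliance verdicts are precomputed once, e.g. by querying a theorem prover offline, and stored in a lookup table, so neither training nor operation requires a prover call per step.

```python
# Sketch of a precomputed "normative filter" (illustrative; the prover
# interface and state encoding are placeholders, not the paper's API).

def build_normative_filter(states, actions, compliant):
    """Precompute, for every labelled state, the set of compliant actions.
    `compliant(state, action)` stands in for an offline theorem-prover query."""
    return {s: {a for a in actions if compliant(s, a)} for s in states}

class FilteredAgent:
    def __init__(self, base_policy, norm_filter):
        self.base_policy = base_policy    # e.g., argmax over learned Q-values
        self.norm_filter = norm_filter    # dict: state -> allowed actions
    def act(self, state):
        allowed = self.norm_filter.get(state, set())
        # Fall back to the unconstrained policy if nothing is compliant.
        return self.base_policy(state, allowed) if allowed else self.base_policy(state, None)

# Toy usage: the prover call is replaced by a hard-coded predicate.
states, actions = ["s0", "s1"], ["left", "right"]
norm_filter = build_normative_filter(states, actions,
                                     compliant=lambda s, a: not (s == "s1" and a == "left"))
agent = FilteredAgent(lambda s, allowed: sorted(allowed or actions)[0], norm_filter)
print(agent.act("s1"))  # 'right'
```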
... Linear temporal logic [19] is another high-level formal language focusing on time-related reasoning. Significant work has been done on using LTL to express safety constraints [24,25] and ethical norms [26,27]. Furthermore, Governatori et al. [28,29] pointed out that CTL [20] and its advanced variant, CTL*, could be a better option for modelling the concept of permissions. ...
... Furthermore, Governatori et al. [28,29] pointed out that CTL [20] and its advanced variant, CTL*, could be a better option for modelling the concept of permissions. The relative expressiveness and computational feasibility of DDL, LTL, and CTL* in decision-making problems with ethical constraints remain an ongoing research topic, with diverging opinions in the literature, e.g., [26,28,29]. ...
... In addition, Svegliato et al. [9] proposed an ethically compliant autonomous system to simplify and standardize the implementation of ethical principles in the Markov Decision Process (MDP) framework while optimizing task completion. Neufeld et al. [23,26] took distinct approaches to RL problems by integrating a normative supervisor module into an RL agent and utilizing DL and LTL for ethical norm representation. Furthermore, Abel et al. [34] extended the decision-making model to a Partially Observable Markov Decision Process (POMDP) [35], in which the rewards are designed explicitly for moral objectives. ...
Article
Full-text available
Designing autonomous agents that follow moral norms presents a significant challenge in addressing AI decision-making under ethical constraints, especially when involving motion planning for complex tasks in partially observable environments. This paper proposes a model-free reinforcement learning approach to address these challenges. We formulate the motion planning problem as a Probabilistic-Labeled Partially Observable Markov Decision Process (PL-POMDP) model and express complex tasks using Linear Temporal Logic (LTL). To handle ethical norms, we categorize them into ‘hard’ and ‘soft’ ethical constraints. LTL is again employed to formulate ‘hard’ constraints, while a reward redesign method is applied to enforce ‘soft’ ethical constraints. Our approach also involves generating a product of PL-POMDP and an LTL-induced automaton. This transformation allows us to find an optimal policy on the product, ensuring both task completion and ethics satisfaction through model checking. To synthesize desired policies, we utilize a state-of-the-art Recurrent Neural Network (RNN)-based deep Q learning method, in which Q networks take into account observation history and task recognition as input features. We demonstrate the effectiveness and flexibility of the proposed approach through two simulation examples, which showcase its potential applicability to various scenarios and challenges in ethically guided AI decision-making.
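The following toy sketch illustrates the product construction described in the abstract, simplified to a fully observable setting with a hand-written two-state automaton (all names, the automaton, and the penalty values are assumptions, not the paper's model): the learner's state is the pair (environment state, automaton state), the automaton tracks the 'hard' constraint, and a 'soft' penalty reshapes the task reward.

```python
# Illustrative product construction (our simplification of the idea to a
# fully observable toy setting; names and the automaton are assumptions).
# The learner sees (env_state, automaton_state); the automaton tracks a
# 'hard' constraint, and a 'soft' ethical penalty reshapes the reward.

def automaton_step(q, labels):
    """Tiny safety automaton: q='ok' stays 'ok' unless 'unsafe' is observed."""
    return "bad" if (q == "bad" or "unsafe" in labels) else "ok"

def product_step(env_step, labeling, soft_penalty):
    def step(state, q, action):
        next_state, task_reward = env_step(state, action)
        labels = labeling(next_state)
        next_q = automaton_step(q, labels)
        reward = task_reward - soft_penalty(labels)
        if next_q == "bad":
            reward -= 100.0               # hard-constraint violation
        return (next_state, next_q), reward
    return step

# Toy usage.
env_step = lambda s, a: (s + a, 1.0)
labeling = lambda s: {"unsafe"} if s > 3 else ({"rude"} if s == 2 else set())
step = product_step(env_step, labeling, soft_penalty=lambda L: 0.5 if "rude" in L else 0.0)
print(step(1, "ok", 1))   # ((2, 'ok'), 0.5)
print(step(3, "ok", 1))   # ((4, 'bad'), -99.0)
```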
Article
Full-text available
Recent years have yielded many discussions on how to endow autonomous agents with the ability to make ethical decisions, and the need for explicit ethical reasoning and transparency is a persistent theme in this literature. We present a modular and transparent approach to equip autonomous agents with the ability to comply with ethical prescriptions, while still enacting pre-learned optimal behaviour. Our approach relies on a normative supervisor module, that integrates a theorem prover for defeasible deontic logic within the control loop of a reinforcement learning agent. The supervisor operates as both an event recorder and an on-the-fly compliance checker w.r.t. an external norm base. We successfully evaluated our approach with several tests using variations of the game Pac-Man, subject to a variety of “ethical” constraints.
Conference Paper
Full-text available
AI research is being challenged with ensuring that autonomous agents learn to behave ethically, namely in alignment with moral values. A common approach, founded on the exploitation of Reinforcement Learning techniques, is to design environments that incentivise agents to behave ethically. However, to the best of our knowledge, current approaches do not theoretically guarantee that an agent will learn to behave ethically. Here, we make headway in this direction by proposing a novel way of designing environments wherein it is formally guaranteed that an agent learns to behave ethically while pursuing its individual objectives. Our theoretical results develop within the formal framework of Multi-Objective Reinforcement Learning to ease the handling of an agent's individual and ethical objectives. As a further contribution, we leverage our theoretical results to introduce an algorithm that automates the design of ethical environments.
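A minimal sketch of the underlying multi-objective idea, assuming a simple weighted scalarisation (the weight and reward functions below are invented for illustration; the paper derives the guarantee formally rather than hand-picking a weight):

```python
# Sketch of the multi-objective "ethical embedding" idea (weights and reward
# functions are invented for illustration).

def ethical_scalarisation(individual_reward, ethical_reward, ethical_weight):
    """Scalarise the two objectives; a sufficiently large ethical_weight makes
    every ethical policy preferable to every unethical one."""
    def reward(state, action, next_state):
        return (individual_reward(state, action, next_state)
                + ethical_weight * ethical_reward(state, action, next_state))
    return reward

# Toy usage: stealing yields high individual reward but is penalised ethically.
individual = lambda s, a, ns: 10.0 if a == "steal" else 1.0
ethical    = lambda s, a, ns: -1.0 if a == "steal" else 0.0
reward = ethical_scalarisation(individual, ethical, ethical_weight=20.0)
print(reward(None, "steal", None), reward(None, "work", None))  # -10.0 1.0
```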
Chapter
Full-text available
We introduce a modular and transparent approach for augmenting the ability of reinforcement learning agents to comply with a given norm base. The normative supervisor module functions as both an event recorder and real-time compliance checker w.r.t. an external norm base. We have implemented this module with a theorem prover for defeasible deontic logic, in a reinforcement learning agent that we task with playing a “vegan” version of the arcade game Pac-Man.
Article
Full-text available
This paper proposes a low-cost, easily realizable strategy to equip a reinforcement learning (RL) agent with the capability of behaving ethically. Our model allows the designers of RL agents to focus solely on the task to be achieved, without having to worry about implementing multiple trivial ethical patterns to follow. Based on the assumption that the majority of human behavior is ethical, regardless of the goals being pursued, our design integrates a human policy with the RL policy to achieve the target objective with less chance of violating the ethical code that human beings normally obey.
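A minimal sketch of the policy-mixing idea, under the assumption of a fixed mixing probability (the probability and both policies below are invented for illustration, not the paper's integration scheme):

```python
# Sketch of mixing a human (demonstration-derived) policy with the learned RL
# policy (the mixing rule and probability are invented for illustration).
import random

def mixed_policy(human_policy, rl_policy, human_prob=0.3):
    """With probability human_prob follow the human policy, assumed to be
    ethical by default; otherwise follow the task-optimising RL policy."""
    def act(state):
        return human_policy(state) if random.random() < human_prob else rl_policy(state)
    return act

# Toy usage.
act = mixed_policy(lambda s: "yield", lambda s: "overtake", human_prob=0.5)
print(act(state=None))
```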
Article
Full-text available
Reinforcement learning algorithms discover policies that maximize reward, but do not necessarily guarantee safety during learning or execution phases. We introduce a new approach to learn optimal policies while enforcing properties expressed in temporal logic. To this end, given the temporal logic specification that is to be obeyed by the learning system, we propose to synthesize a reactive system called a shield. The shield is introduced in the traditional learning process in two alternative ways, depending on the location at which the shield is implemented. In the first one, the shield acts each time the learning agent is about to make a decision and provides a list of safe actions. In the second way, the shield is introduced after the learning agent. The shield monitors the actions from the learner and corrects them only if the chosen action causes a violation of the specification. We discuss which requirements a shield must meet to preserve the convergence guarantees of the learner. Finally, we demonstrate the versatility of our approach on several challenging reinforcement learning scenarios.
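A minimal sketch of the post-shielding variant described above (the is_safe check below stands in for the reactive system synthesized from the temporal-logic specification; the corridor example is invented):

```python
# Sketch of post-shielding (the `is_safe` check stands in for the reactive
# system synthesized from the temporal-logic specification).

def post_shield(state, proposed_action, candidate_actions, is_safe):
    """Pass the learner's action through unless it violates the spec,
    in which case substitute some safe alternative."""
    if is_safe(state, proposed_action):
        return proposed_action
    safe = [a for a in candidate_actions if is_safe(state, a)]
    return safe[0] if safe else proposed_action   # no safe action: unchanged

# Toy usage: moving right off the edge of a corridor of length 5 is unsafe.
is_safe = lambda s, a: 0 <= s + a <= 4
print(post_shield(4, +1, [-1, 0, +1], is_safe))  # -1 (corrected)
print(post_shield(2, +1, [-1, 0, +1], is_safe))  # 1 (unchanged)
```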
Conference Paper
Full-text available
In the past few years several business process compliance frameworks based on temporal logic have been proposed. In this paper we investigate whether the use of temporal logic is suitable for the task at hand: namely to check whether the specifications of a business process are compatible with the formalisation of the norms regulating the business process. We provide an example inspired by real life norms where the use of linear temporal logic produces a result that is not compatible with the legal understanding of the norms in the example.
Conference Paper
Full-text available
In this paper we discuss some reasons why temporal logic might not be suitable to model real life norms. To show this, we present a novel deontic logic contrary-to-duty/derived permission paradox based on the interaction of obligations, permissions and contrary-to-duty obligations. The paradox is inspired by real life norms.
Article
Full-text available
We consider synthesis of control policies that maximize the probability of satisfying given temporal logic specifications in unknown, stochastic environments. We model the interaction between the system and its environment as a Markov decision process (MDP) with initially unknown transition probabilities. The solution we develop builds on the so-called model-based probably approximately correct Markov decision process (PAC-MDP) methodology. The algorithm attains an ε-approximately optimal policy with probability 1−δ using samples (i.e. observations), time and space that grow polynomially with the size of the MDP, the size of the automaton expressing the temporal logic specification, 1/ε, 1/δ and a finite time horizon. In this approach, the system maintains a model of the initially unknown MDP, and constructs a product MDP based on its learned model and the specification automaton that expresses the temporal logic constraints. During execution, the policy is iteratively updated using observation of the transitions taken by the system. The iteration terminates in finitely many steps. With high probability, the resulting policy is such that, for any state, the difference between the probability of satisfying the specification under this policy and the optimal one is within a predefined bound.
Conference Paper
Full-text available
We present a new algorithm to construct a deterministic Rabin automaton for an LTL formula φ. The automaton is the product of a master automaton and an array of slave automata, one for each G-subformula of φ. The slave automaton for Gψ is in charge of recognizing whether FGψ holds. As opposed to standard determinization procedures, the states of all our automata have a clear logical structure, which allows various optimizations to be applied. Our construction subsumes former algorithms for fragments of LTL. Experimental results show improvement in the sizes of the resulting automata compared to existing methods.
Article
Full-text available
In this paper we follow the BOID (Belief, Obligation, Intention, Desire) architecture to describe agents and agent types in Defeasible Logic. We argue, in particular, that the introduction of obligations can provide a new reading of the concepts of intention and intentionality. Then we examine the notion of social agent (i.e., an agent where obligations prevail over intentions) and discuss some computational and philosophical issues related to it. We show that the notion of social agent either requires more complex computations or has some philosophical drawbacks.
Article
Full-text available
In this paper we propose an extension of Defeasible Logic to represent and compute three concepts of defeasible permission. In particular, we discuss different types of explicit permissive norms that work as exceptions to opposite obligations. Moreover, we show how strong permissions can be represented both with, and without introducing a new consequence relation for inferring conclusions from explicit permissive norms. Finally, we illustrate how a preference operator applicable to contrary-to-duty obligations can be combined with a new operator representing ordered sequences of strong permissions which derogate from prohibitions. The logical system is studied from a computational standpoint and is shown to have linear computational complexity.
Conference Paper
Full-text available
We present the design and implementation of SPINdle – an open source Java based defeasible logic reasoner capable of performing efficient and scalable reasoning on defeasible logic theories (including theories with over 1 million rules). The implementation covers both the standard and modal extensions to defeasible logics. It can be used as a standalone theory prover and can be embedded into any application as a defeasible logic rule engine. It allows users or agents to issue queries on a given knowledge base, or on a theory generated on the fly by other applications, and automatically produces the resulting conclusions. The theory can also be represented using XML.
Conference Paper
Full-text available
We provide a conceptual analysis of several kinds of deadlines, represented in Temporal Modal Defeasible Logic. The paper presents a typology of deadlines, based on the following parameters: deontic operator, maintenance or achievement, presence or absence of sanctions, and persistence after the deadline. The deadline types are illustrated by a set of examples.
Article
Full-text available
Photocopy. Supplied by British Library. Thesis (Ph. D.)--King's College, Cambridge, 1989.
Article
Full-text available
In this paper we introduce a formal framework for the construction of normative multiagent systems, based on Searle's notion of the construction of social reality. Within the structure of normative multiagent systems we distinguish between regulative norms that describe obligations, prohibitions and permissions, and constitutive norms that regulate the creation of institutional facts as well as the modification of the normative system itself. Using the metaphor of normative systems as agents, we attribute mental attitudes to the normative system.
Article
Full-text available
In this paper we discuss different types of permissions and their roles in deontic logic. We study the distinction between weak and strong permissions in the context of input/output logic, combining the logic with constraints, priorities and hierarchies of normative authorities. In this setting we observe that the notion of prohibition immunity no longer applies, and we introduce a new notion of permission as exception and a new distinction between static and dynamic norms. We show that strong permissions can dynamically change a normative system by adding exceptions to obligations, provide an explicit representation of what is permitted to the subjects of the normative system and allow higher level authorities to limit the changes that lower level authorities can do to the normative system.
Article
In this work we investigate the concept of “restraining bolt”, envisioned in Science Fiction. Specifically, we introduce a novel problem in AI. We have two distinct sets of features extracted from the world, one by the agent and one by the authority imposing restraining specifications (the “restraining bolt”). The two sets are apparently unrelated, since they are of interest to independent parties; however, they both account for (aspects of) the same world. We consider the case in which the agent is a reinforcement learning agent on the first set of features, while the restraining bolt is specified logically using linear time logic on finite traces LTLf/LDLf over the second set of features. We show formally, and illustrate with examples, that, under general circumstances, the agent can learn while shaping its goals to suitably conform (as much as possible) to the restraining bolt specifications.
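A toy rendering of the restraining-bolt coupling, assuming a hand-written monitor in place of the LTLf/LDLf machinery (feature extractors, the monitor, and the bonus/penalty values are all illustrative):

```python
# Sketch of the "restraining bolt" coupling (our toy rendering): the agent
# learns over its own features, while a separate monitor evaluates the bolt's
# features against a temporal specification and contributes extra reward.
# The monitor automaton and feature names are hypothetical.

def bolt_monitor_step(q, bolt_features):
    """Toy monitor for the spec 'eventually deliver, and never litter'."""
    if "litter" in bolt_features or q == "violated":
        return "violated"
    return "done" if ("deliver" in bolt_features or q == "done") else "pending"

def shaped_reward(task_reward, q, next_q, bonus=5.0, penalty=5.0):
    if next_q == "done" and q != "done":
        return task_reward + bonus          # spec progressed to satisfaction
    if next_q == "violated" and q != "violated":
        return task_reward - penalty        # spec violated
    return task_reward

# Toy usage.
q = "pending"
next_q = bolt_monitor_step(q, {"deliver"})
print(next_q, shaped_reward(1.0, q, next_q))  # done 6.0
```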
Chapter
We present Tempest, a synthesis tool to automatically create correct-by-construction reactive systems and shields from qualitative or quantitative specifications in probabilistic environments. A shield is a special type of reactive system used for run-time enforcement; i.e., a shield enforces a given qualitative or quantitative specification of a running system while interfering with its operation as little as possible. Shields that enforce a qualitative or quantitative specification are called safety-shields or optimal-shields, respectively. Safety-shields can be implemented as pre-shields or as post-shields, optimal-shields are implemented as post-shields. Pre-shields are placed before the system and restrict the choices of the system. Post-shields are implemented after the system and are able to overwrite the system’s output. Tempest is based on the probabilistic model checker Storm, adding model checking algorithms for stochastic games with safety and mean-payoff objectives. To the best of our knowledge, Tempest is the only synthesis tool able to solve 2-player games with mean-payoff objectives without restrictions on the state space. Furthermore, Tempest adds the functionality to synthesize safe and optimal strategies that implement reactive systems and shields.
Article
Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent on which policy is being employed at each time step. We test our algorithms using Pac-Man and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.
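A simplified sketch of the orchestration idea, using a plain epsilon-greedy bandit over the two policies (the original work uses a contextual bandit with richer features; the policies and update rule here are illustrative):

```python
# Sketch of the policy orchestrator (a simplified epsilon-greedy bandit over
# two arms; the policies and the update rule are illustrative).
import random

class Orchestrator:
    def __init__(self, policies, epsilon=0.1):
        self.policies = policies                 # {'constrained': f, 'reward': g}
        self.epsilon = epsilon
        self.value = {name: 0.0 for name in policies}
        self.count = {name: 0 for name in policies}
    def choose(self, state):
        if random.random() < self.epsilon:
            name = random.choice(list(self.policies))
        else:
            name = max(self.value, key=self.value.get)
        return name, self.policies[name](state)   # transparent: the arm is exposed
    def update(self, name, reward):
        self.count[name] += 1
        self.value[name] += (reward - self.value[name]) / self.count[name]

# Toy usage.
orch = Orchestrator({"constrained": lambda s: "wait", "reward": lambda s: "dash"})
name, action = orch.choose(state=None)
orch.update(name, reward=1.0)
```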
Chapter
We discuss some essential issues for the formal representation of norms to implement normative reasoning, show how to capture those requirements in a computationally oriented formalism, Defeasible Deontic Logic, provide a description of this logic, and illustrate its use to model and reason about norms with the help of legal examples.
Conference Paper
In the agents’ literature, norms have been studied from multiple perspectives, but formalisations tend to be disconnected from possible implementations due to the lack of differentiation between abstract norms and norm instantiations, while implementations tend to be weak groundings of deontic logics, tightly coupled to one particular implementation domain. Furthermore, different formalisations are typically used for norm enforcement and norm reasoning. In this paper we report on our attempt to bridge this gap by reducing deontic statements to structural operational semantics (for norm monitoring) and to planning control rules (for practical normative reasoning). We hint at the feasibility of translating these semantics to actual implementation languages (Clojure and Drools for norm monitoring and TLPlan for norm-aware planning). Finally we discuss the limitations of our approach and suggest some improvements and future lines of research.
Conference Paper
Limit-deterministic Büchi automata can replace deterministic Rabin automata in probabilistic model checking algorithms, and can be significantly smaller. We present a direct construction from an LTL formula φ to a limit-deterministic Büchi automaton. The automaton is the combination of a non-deterministic component, guessing the set of eventually true G-subformulas of φ, and a deterministic component verifying this guess and using this information to decide on acceptance. Contrary to the indirect approach of constructing a non-deterministic automaton for φ and then applying a semi-determinisation algorithm, our translation is compositional and has a clear logical structure. Moreover, due to its special structure, the resulting automaton can be used not only for qualitative, but also for quantitative verification of MDPs, using the same model checking algorithm as for deterministic automata. This allows one to reuse existing efficient implementations of this algorithm without any modification. Our construction yields much smaller automata for formulas with deep nesting of modal operators and performs at least as well as the existing approaches on general formulas.
Article
We consider a problem on the synthesis of reactive controllers that optimize some a priori unknown performance criterion while interacting with an uncontrolled environment such that the system satisfies a given temporal logic specification. We decouple the problem into two subproblems. First, we extract a (maximally) permissive strategy for the system, which encodes multiple (possibly all) ways in which the system can react to the adversarial environment and satisfy the specifications. Then, we quantify the a priori unknown performance criterion as a (still unknown) reward function and compute an optimal strategy for the system within the operating envelope allowed by the permissive strategy by using the so-called maximin-Q learning algorithm. We establish both correctness (with respect to the temporal logic specifications) and optimality (with respect to the a priori unknown performance criterion) of this two-step technique for a fragment of temporal logic specifications. For specifications beyond this fragment, correctness can still be preserved, but the learned strategy may be sub-optimal. We present an algorithm to the overall problem, and demonstrate its use and computational requirements on a set of robot motion planning examples.
Conference Paper
Runtime monitoring is one of the central tasks to provide operational decision support to running business processes, and check on-the-fly whether they comply with constraints and rules. We study runtime monitoring of properties expressed in LTL on finite traces (LTLf) and its extension LDLf. LDLf is a powerful logic that captures all of monadic second-order logic on finite traces; it is obtained by combining regular expressions with LTLf, adopting the syntax of propositional dynamic logic (PDL). Interestingly, in spite of its greater expressivity, LDLf has exactly the same computational complexity as LTLf. We show that LDLf is able to capture, in the logic itself, not only the constraints to be monitored, but also the de-facto standard RV-LTL monitors. This makes it possible to declaratively capture monitoring metaconstraints, i.e., constraints about the evolution of other constraints, and check them by relying on usual logical services for temporal logics instead of ad-hoc algorithms. This, in turn, enables flexible monitoring of constraints depending on the monitoring state of other constraints, e.g., “compensation” constraints that are only checked when others are detected to be violated. In addition, we devise a direct translation of LDLf formulas into nondeterministic automata, avoiding a detour through Büchi automata or alternating automata, and we use it to implement a monitoring plug-in for the ProM suite.
Article
We propose to synthesize a control policy for a Markov decision process (MDP) such that the resulting traces of the MDP satisfy a linear temporal logic (LTL) property. We construct a product MDP that incorporates a deterministic Rabin automaton generated from the desired LTL property. The reward function of the product MDP is defined from the acceptance condition of the Rabin automaton. This construction allows us to apply techniques from learning theory to the problem of synthesis for LTL specifications even when the transition probabilities are not known a priori. We prove that our method is guaranteed to find a controller that satisfies the LTL property with probability one if such a policy exists, and we suggest empirically with a case study in traffic control that our method produces reasonable control strategies even when the LTL property cannot be satisfied with probability one.
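A minimal sketch of how a reward can be derived from a Rabin acceptance condition on the product MDP, simplified to a single Rabin pair (Avoid, Reach) with illustrative reward values (the actual construction in the paper is more involved):

```python
# Sketch of rewarding Rabin acceptance on the product MDP (illustrative; we
# treat one Rabin pair (Avoid, Reach), rewarding Reach states and heavily
# penalising Avoid states, a common simplification of the scheme).

def product_reward(avoid_states, reach_states, r_accept=1.0, r_reject=-10.0):
    def reward(automaton_state):
        if automaton_state in avoid_states:
            return r_reject
        if automaton_state in reach_states:
            return r_accept
        return 0.0
    return reward

def product_transition(mdp_step, automaton_delta, labeling):
    """Advance the MDP and the automaton jointly."""
    def step(s, q, a):
        next_s = mdp_step(s, a)
        next_q = automaton_delta(q, labeling(next_s))
        return next_s, next_q
    return step

# Toy usage.
reward = product_reward(avoid_states={"q_bad"}, reach_states={"q_acc"})
print(reward("q_acc"), reward("q_bad"), reward("q0"))  # 1.0 -10.0 0.0
step = product_transition(lambda s, a: s + a,
                          lambda q, labels: "q_acc" if "goal" in labels else q,
                          lambda s: {"goal"} if s >= 3 else set())
print(step(2, "q0", 1))  # (3, 'q_acc')
```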
Article
Part I. A Theory of Speech Acts: 1. Methods and scope; 2. Expressions, meaning and speech acts; 3. The structure of illocutionary acts; 4. Reference as a speech act; 5. Predication. Part II. Some Applications of the Theory: 6. Three fallacies in contemporary philosophy; 7. Problems of reference; 8. Deriving 'ought' from 'is'. Index.
Article
Temporal logic is one of the classic branches of modal logic. It is remarkably fruitful in the issues it has raised, the results it has given rise to, and as an applied tool. This chapter focuses on the key issues related to temporal logic and examines some topics in temporal logic that are considered both in computer science and in other fields. A basic round-up of the semantic options for handling time is described, and some logics (syntax and evaluation) that can be used are explained. The expressivity of classical and modal-style logics is compared. Kamp's famous 1968 expressive completeness theorem, temporal reasoning, tableaux, resolution, filtration and the finite model property, and other methods are discussed. Temporal logics come in many forms, and motivations from computing or linguistic applications, as well as philosophical, theoretical or mathematical interests, have driven temporal logic research in many disparate directions. The structures supporting varying granularity of focus and the options when propositions depend on several time points are considered.
Conference Paper
A unified approach to program verification is suggested, which applies to both sequential and parallel programs. The main proof method suggested is that of temporal reasoning in which the time dependence of events is the basic concept. Two formal systems are presented for providing a basis for temporal reasoning. One forms a formalization of the method of intermittent assertions, while the other is an adaptation of the tense logic system Kb, and is particularly suitable for reasoning about concurrent programs.
Safe Reinforcement Learning Using Probabilistic Shields
  • N Jansen
  • B Könighofer
  • S Junges
  • A Serban
  • R Bloem
Norm specification and verification in multi-agent systems
  • N Alechina
  • M Dastani
  • B Logan
Cautious reinforcement learning with logical constraints
  • M Hasanbeig
  • A Abate
  • D Kroening