Article

Bayesian Decomposition of Multi-Modal Dynamical Systems for Reinforcement Learning

Authors: Markus Kaiser, Clemens Otte, Thomas A. Runkler, Carl Henrik Ek

Abstract

In this paper, we present a model-based reinforcement learning system where the transition model is treated in a Bayesian manner. The approach naturally lends itself to exploit expert knowledge by introducing priors to impose structure on the underlying learning task. The additional information introduced to the system means that we can learn from small amounts of data, recover an interpretable model and, importantly, provide predictions with an associated uncertainty. To show the benefits of the approach, we use a challenging data set where the dynamics of the underlying system exhibit both operational phase shifts and heteroscedastic noise. Comparing our model to NFQ and BNN+LV, we show how our approach yields human-interpretable insight about the underlying dynamics while also increasing data-efficiency.
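To make the abstract's claim about uncertainty-aware predictions concrete, the following minimal sketch combines mode-specific Gaussian predictions of a multi-modal transition model into a single predictive mean and variance via the law of total variance. The mode means, variances, and probabilities below are illustrative placeholders, not outputs of the paper's actual model.

import numpy as np

# Combine mode-specific predictions of a multi-modal transition model into a
# single predictive mean and variance via the law of total variance. The two
# Gaussian mode predictions and the mode probabilities are made-up stand-ins
# for what mode-specific Bayesian models (e.g. GPs) would return.
mode_means = np.array([1.2, -0.4])   # predicted next state per operating mode
mode_vars = np.array([0.05, 0.30])   # per-mode predictive variance (heteroscedastic)
mode_probs = np.array([0.8, 0.2])    # inferred probability of each mode

mean = mode_probs @ mode_means
variance = mode_probs @ (mode_vars + mode_means**2) - mean**2
print(f"predictive mean {mean:.3f}, variance {variance:.3f}")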


... Schaefer et al. (2007) was among the first model-based offline RL methods; however, the considered datasets were still collected largely at random, i.e. with very good exploration. Other, more recent approaches may also address the offline setting, but their focus is often elsewhere: Hein et al. (2016, 2018) focus on finding interpretable policies that increase the trust practitioners place in them, while Depeweg et al. (2016) and Kaiser et al. (2020) put their emphasis on modeling the complicated uncertainties in the transition dynamics of the environments. While theoretically offline, these algorithms also assume randomly collected datasets. ...
Article
Full-text available
Offline reinforcement learning (RL) algorithms are often designed with environments such as MuJoCo in mind, in which the planning horizon is extremely long and no noise exists. We compare model-free, model-based, and hybrid offline RL approaches on various Industrial Benchmark (IB) datasets to test the algorithms in settings closer to real-world problems, including complex noise and partially observable states. We find that on the IB, hybrid approaches face severe difficulties and that simpler algorithms, such as rollout-based algorithms or model-free algorithms with simpler regularizers, perform best on the datasets.
... Early works in this field, such as FQI and NFQ (Ernst et al., 2005; Riedmiller, 2005), termed the problem "batch" rather than offline, and did not explicitly address the additional challenge that the batch mode brought to the table. Many other batch RL algorithms have since been proposed (Depeweg et al., 2016; Hein et al., 2018; Kaiser et al., 2020), which, despite being offline in the sense that they do not interact with the environment, do not regularize their policies accordingly and instead assume a random data collection that makes generalization rather easy. Among the first to explicitly address the limitations in the offline setting were SPIBB(-DQN) (Laroche et al., 2019) in the discrete-action and BCQ (Fujimoto et al., 2019) in the continuous-action case. ...
Preprint
Full-text available
Offline reinforcement learning algorithms still lack trust in practice due to the risk that the learned policy performs worse than the original policy that generated the dataset, or behaves in an unexpected way that is unfamiliar to the user. At the same time, offline RL algorithms are not able to tune their most important hyperparameter: the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby overcoming both of the above-mentioned issues simultaneously. This allows users to start with the original behavior and grant successively greater deviation, as well as to stop at any time when the policy deteriorates or the behavior strays too far from the familiar one.
... Policies trained on imperfect models can diverge by accumulating transition errors or by extrapolating falsely, leading policies to favor parts of the state-action space in which the models are incorrect and overly optimistic. It has been shown, however, that this issue can usually be circumvented, even in the offline RL setting, by penalizing model uncertainty in the policy training process, e.g. by employing Bayesian models (Depeweg et al. 2016; Kaiser et al. 2020) or by ensembling, as in MOPO and MOReL (Yu et al. 2020; Kidambi et al. 2020). MOPO uses two-head models that predict the mean and variance of successor states and subtracts the maximum variance across a model ensemble from the predicted mean reward. ...
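The penalty described in the excerpt amounts to subtracting a scaled, worst-case uncertainty estimate from the model's predicted reward. A minimal sketch, using a toy bootstrapped ensemble of linear-Gaussian dynamics models in place of MOPO's two-head networks; all data and parameters below are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def fit_toy_ensemble(X, Y, k=5):
    """Bootstrap an ensemble of linear-Gaussian dynamics models (stand-ins
    for MOPO's two-head networks that predict mean and variance)."""
    ensemble = []
    for _ in range(k):
        idx = rng.integers(0, len(X), len(X))             # bootstrap resample
        W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        noise_std = (Y[idx] - X[idx] @ W).std(axis=0) + 1e-6
        ensemble.append((W, noise_std))
    return ensemble

def pessimistic_reward(ensemble, sa, reward, lam=1.0):
    """Subtract the largest predictive uncertainty across the ensemble from
    the predicted reward, as in MOPO's penalty. Here each model's uncertainty
    is its noise std plus its disagreement with the ensemble mean at the query
    point; MOPO's networks predict the variance directly."""
    preds = np.array([sa @ W for W, _ in ensemble])
    centre = preds.mean(axis=0)
    uncertainty = [np.linalg.norm(p - centre) + np.linalg.norm(s)
                   for p, (_, s) in zip(preds, ensemble)]
    return reward - lam * max(uncertainty)

# Toy usage: state-action inputs X, next-state targets Y.
X = rng.normal(size=(200, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(200, 2))
ensemble = fit_toy_ensemble(X, Y)
print(pessimistic_reward(ensemble, X[0], reward=1.0, lam=0.5))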
Article
Full-text available
State-of-the-art reinforcement learning algorithms mostly rely on being allowed to directly interact with their environment to collect millions of observations. This makes it hard to transfer their success to industrial control problems, where simulations are often very costly or do not exist, and exploring in the real environment can potentially lead to catastrophic events. Recently developed model-free offline RL algorithms can learn from a single dataset (containing limited exploration) by mitigating extrapolation error in value functions. However, the robustness of the training process is still comparatively low, a problem known from methods using value functions. To improve the robustness and stability of the learning process, we use dynamics models to assess policy performance instead of value functions, resulting in MOOSE (MOdel-based Offline policy Search with Ensembles), an algorithm which ensures low model bias by keeping the policy within the support of the data. We compare MOOSE with the state-of-the-art model-free offline RL algorithms BRAC, BEAR, and BCQ on the Industrial Benchmark and MuJoCo continuous control tasks in terms of robust performance, and find that MOOSE outperforms its model-free counterparts in almost all considered cases, often even by far.
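As a rough illustration of assessing a policy with learned dynamics models instead of a value function, the sketch below rolls a policy out in a small ensemble of toy transition models and averages the returns. MOOSE additionally penalizes actions that leave the support of the data, which is omitted here; all models, policies, and rewards are made-up placeholders.

import numpy as np

rng = np.random.default_rng(1)

def rollout_score(policy, models, start_states, reward_fn, horizon=20):
    """Score a policy by Monte Carlo rollouts in an ensemble of learned
    dynamics models instead of a learned value function."""
    returns = []
    for model in models:                          # average over the ensemble
        for s0 in start_states:
            state, total = s0, 0.0
            for _ in range(horizon):
                action = policy(state)
                state = model(state, action)      # predicted next state
                total += reward_fn(state, action)
            returns.append(total)
    return np.mean(returns)

# Hypothetical linear models, a proportional policy, and a quadratic cost.
models = [lambda s, a, W=rng.normal(scale=0.1, size=(2, 2)): s + a + s @ W
          for _ in range(3)]
policy = lambda s: -0.5 * s
reward_fn = lambda s, a: -float(s @ s)            # keep the state near the origin
starts = rng.normal(size=(5, 2))
print(rollout_score(policy, models, starts, reward_fn))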
Article
Full-text available
Gaussian process classification is a popular method with a number of appealing properties. We show how to scale the model within a variational inducing point framework, outperforming the state of the art on benchmark datasets. Importantly, the variational formulation can be exploited to allow classification in problems with millions of data points, as we demonstrate in experiments.
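A minimal sketch of sparse variational GP classification with inducing points, written against the GPflow library rather than the paper's original implementation; the toy data, kernel choice, and inducing-point initialization are assumptions.

import gpflow
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data; large problems use the same objective
# with mini-batching to scale to millions of points.
X = rng.uniform(size=(1000, 1))
Y = (np.sin(12.0 * X) > 0).astype(float)

model = gpflow.models.SVGP(
    kernel=gpflow.kernels.SquaredExponential(),
    likelihood=gpflow.likelihoods.Bernoulli(),
    inducing_variable=X[::50].copy(),   # 20 inducing points taken from the data
    num_data=X.shape[0],
)

# Maximize the variational lower bound (ELBO) over kernel, likelihood,
# and variational parameters.
gpflow.optimizers.Scipy().minimize(
    model.training_loss_closure((X, Y)), model.trainable_variables
)

mean, var = model.predict_y(X[:3])      # predictive class probabilities
print(mean.numpy().ravel())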
Conference Paper
Full-text available
In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.
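The following sketch captures the core PILCO idea of learning a probabilistic dynamics model and propagating its uncertainty through a multi-step rollout, but uses a scikit-learn GP and particle sampling instead of PILCO's closed-form moment matching; the toy system and policy are illustrative.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# A handful of transitions (s, a) -> s' from a toy 1-D system.
S = rng.uniform(-1, 1, size=(30, 1))
A = rng.uniform(-1, 1, size=(30, 1))
S_next = 0.9 * S + 0.2 * A + 0.01 * rng.normal(size=(30, 1))

# Probabilistic dynamics model: a GP over (s, a) -> s'.
gp = GaussianProcessRegressor(RBF() + WhiteKernel(), normalize_y=True)
gp.fit(np.hstack([S, A]), S_next.ravel())

# Propagate model uncertainty through a rollout with particles. (PILCO does
# this analytically via moment matching; sampling is a simpler stand-in that
# still lets planning "see" the model's confidence.)
def rollout(policy, s0, horizon=10, n_particles=50):
    particles = np.full(n_particles, s0)
    for _ in range(horizon):
        a = policy(particles)
        mu, std = gp.predict(np.column_stack([particles, a]), return_std=True)
        particles = mu + std * rng.normal(size=n_particles)
    return particles.mean(), particles.std()

print(rollout(lambda s: -0.5 * s, s0=0.8))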
Conference Paper
Full-text available
In a typical reinforcement learning (RL) setting, details of the environment are not given explicitly but have to be estimated from observations. Most RL approaches only optimize the expected value. However, if the number of observations is limited, considering expected values only can lead to false conclusions. Instead, it is crucial to also account for the estimator's uncertainties. In this paper, we present a method to incorporate those uncertainties and propagate them to the conclusions. By being only approximate, the method is computationally feasible. Furthermore, we describe a Bayesian approach to design the estimators. Our experiments show that the method considerably increases the robustness of the derived policies compared to the standard approach.
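A simple way to see the effect of propagating estimator uncertainty is to place a Dirichlet posterior over the transition probabilities of a small MDP, sample models from it, and solve each sample; the spread of the resulting Q-values shows how strongly the data supports each action. This Monte Carlo sketch is a stand-in for the paper's approximate propagation method, and the toy MDP below is made up.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Observed transition counts and mean rewards from a fixed batch of data.
counts = rng.integers(1, 10, size=(n_states, n_actions, n_states))
rewards = rng.normal(size=(n_states, n_actions))

def value_iteration(P, R, iters=200):
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        Q = R + gamma * np.einsum("san,n->sa", P, Q.max(axis=1))
    return Q

# Sample transition models from the Dirichlet posterior over the counts and
# solve each sampled MDP; the spread of Q-values reflects estimator uncertainty.
q_samples = []
for _ in range(100):
    P = np.stack([[rng.dirichlet(counts[s, a]) for a in range(n_actions)]
                  for s in range(n_states)])
    q_samples.append(value_iteration(P, rewards))
q_samples = np.array(q_samples)
print("mean Q:\n", q_samples.mean(axis=0))
print("std  Q:\n", q_samples.std(axis=0))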
Article
Batch reinforcement learning is a subfield of dynamic programming-based reinforcement learning. Originally defined as the task of learning the best possible policy from a fixed set of a priori-known transition samples, the (batch) algorithms developed in this field can be easily adapted to the classical online case, where the agent interacts with the environment while learning. Due to the efficient use of collected data and the stability of the learning process, this research area has attracted a lot of attention recently. In this chapter, we introduce the basic principles and the theory behind batch reinforcement learning, describe the most important algorithms, exemplarily discuss ongoing research within this field, and briefly survey real-world applications of batch reinforcement learning.
Conference Paper
This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron. Based on the principle of storing and reusing transition experiences, a model-free, neural-network-based reinforcement learning algorithm is proposed. The method is evaluated on three benchmark problems. It is shown empirically that reasonably few interactions with the plant are needed to generate control policies of high quality.
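A simplified fitted-Q-iteration loop in the spirit of NFQ: Q-targets are repeatedly computed from a fixed batch of stored transitions and regressed onto a multi-layer perceptron. The original algorithm trains with Rprop and grows its pattern set between episodes; the toy plant, reward, and network below are assumptions.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
gamma, n_actions = 0.95, 2

# A fixed batch of stored transitions (s, a, r, s') from a toy 1-D plant.
S = rng.uniform(-1, 1, size=(500, 1))
A = rng.integers(0, n_actions, size=500)
S_next = np.clip(S + np.where(A[:, None] == 1, 0.1, -0.1), -1, 1)
R = -np.abs(S_next).ravel()                      # reward: stay near the origin

def q_values(net, states):
    """Evaluate Q(s, a) for all actions using the action as an extra input."""
    return np.column_stack([
        net.predict(np.hstack([states, np.full((len(states), 1), a)]))
        for a in range(n_actions)
    ])

# Fitted Q iteration: regress the current Q-targets onto the network, then
# recompute targets from the stored batch with the updated network.
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500)
targets = R.copy()                               # first pass: immediate reward
for _ in range(10):
    net.fit(np.column_stack([S, A]), targets)
    targets = R + gamma * q_values(net, S_next).max(axis=1)

print("greedy action at s=0.5:", q_values(net, np.array([[0.5]])).argmax())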
Article
Motion correspondence is a fundamental problem in computer vision and many other disciplines. This article describes statistical data association techniques originally developed in the context of target tracking and surveillance and now beginning to be used in dynamic motion analysis by the computer vision community. The Mahalanobis distance measure is first introduced before discussing the limitations of nearest-neighbor algorithms. Then the track-splitting, joint-likelihood, and multiple-hypothesis algorithms are described, each method solving an increasingly more complicated optimization. Real-time constraints may prohibit the application of these optimal methods. The suboptimal joint probabilistic data association algorithm is therefore described. The advantages, limitations, and relationships between the approaches are discussed.
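A minimal gating example with the Mahalanobis distance: measurements whose squared distance to a track's predicted measurement falls below a chi-square threshold are kept as association candidates, and the nearest neighbour is the smallest of these. The predicted measurement, innovation covariance, and gate probability below are illustrative values.

import numpy as np
from scipy.stats import chi2

# Gate candidate measurements for one track using the Mahalanobis distance
# between the track's predicted measurement and each observation.
predicted = np.array([2.0, 1.0])                 # predicted measurement
S = np.array([[0.5, 0.1], [0.1, 0.3]])           # innovation covariance
S_inv = np.linalg.inv(S)
measurements = np.array([[2.2, 1.1], [4.0, 0.2], [1.8, 0.9]])

d2 = np.array([(m - predicted) @ S_inv @ (m - predicted) for m in measurements])
gate = chi2.ppf(0.99, df=2)                      # 99% gate for 2-D measurements

print("squared distances:", d2.round(2))
print("inside gate:", np.where(d2 <= gate)[0])   # candidate associations
print("nearest neighbour:", d2.argmin())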
Article
In this work we introduce a mixture of GPs to address the data association problem, i.e. to label a group of observations according to the sources that generated them. Unlike several previously proposed GP mixtures, the novel mixture has the distinct characteristic of using no gating function to determine the association of samples and mixture components. Instead, all the GPs in the mixture are global and samples are clustered following "trajectories" across input space. We use a non-standard variational Bayesian algorithm to efficiently recover sample labels and learn the hyperparameters. We show how multi-object tracking problems can be disambiguated and also explore the characteristics of the model in traditional regression settings.
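A crude sketch of associating observations with global GPs by alternating between fitting one GP per source and re-assigning each point to the GP that explains it best. This hard-assignment loop stands in for the paper's variational Bayesian algorithm, which infers soft labels and hyperparameters jointly; the data-generating sources below are invented.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Two overlapping "sources" generate the observations; the labels are unknown.
X = rng.uniform(0, 5, size=(120, 1))
true_src = rng.integers(0, 2, size=120)
Y = np.where(true_src == 0, np.sin(X.ravel()), 0.5 * X.ravel() - 1)
Y = Y + 0.1 * rng.normal(size=120)

# Alternate between (1) fitting one global GP per source on its currently
# assigned points and (2) re-assigning each point to the GP that fits it best.
labels = rng.integers(0, 2, size=120)            # random initial association
for _ in range(10):
    gps = []
    for k in range(2):
        gp = GaussianProcessRegressor(RBF() + WhiteKernel(), normalize_y=True)
        gp.fit(X[labels == k], Y[labels == k])
        gps.append(gp)
    # Re-associate by squared residual under each GP's predictive mean.
    resid = np.column_stack([(Y - gp.predict(X)) ** 2 for gp in gps])
    labels = resid.argmin(axis=1)

print("recovered cluster sizes:", np.bincount(labels))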
References

V. Tresp, The wet game of chicken, Siemens AG, CT IC 4, Technical Report (1994).
S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, S. Udluft, Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks, arXiv:1605.07127 [cs, stat] (2016).
S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, S. Udluft, Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning, in: International Conference on Machine Learning, 2018, pp. 1192-1201.
E. Bodin, N. D. F. Campbell, C. H. Ek, Latent Gaussian Process Regression, arXiv:1707.05534 [cs, stat] (2017).
M. Kaiser, C. Otte, T. Runkler, C. H. Ek, Data Association with Gaussian Processes, arXiv:1810.07158 [cs, stat] (2018).
A. Damianou, N. Lawrence, Deep Gaussian Processes, in: Artificial Intelligence and Statistics, 2013, pp. 207-215.
C. J. Maddison, A. Mnih, Y. W. Teh, The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, arXiv:1611.00712 [cs, stat] (2016).
T. P. Minka, Expectation Propagation for approximate Bayesian inference, in: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 2001, pp. 362-369. arXiv:1301.2294.