Policy Gradient Semi-Markov Decision Process
DOI: 10.1109/ICTAI.2008.51 Conference: 20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2008), November 3-5, 2008, Dayton, Ohio, USA, Volume 2
This paper proposes a simulation-based algorithm for optimizing the average reward in a parameterized continuous-time, finite-state semi-Markov decision process (SMDP). Our contributions are twofold. First, we compute an approximate gradient of the average reward with respect to the parameters of an SMDP controlled by parameterized stochastic policies; a stochastic gradient ascent method then uses this gradient to adjust the parameters and optimize the average reward. Second, we present a simulation-based algorithm (GSMDP) that estimates the approximate gradient of the average reward using only a single sample path of the underlying Markov chain. We prove that this estimate converges almost surely to the true gradient of the average reward as the number of iterations goes to infinity.
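The abstract does not spell out the GSMDP estimator itself, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a softmax policy, exponential sojourn times, and a GPOMDP-style eligibility trace with a discount factor `beta` (which trades bias for variance); it writes the average reward per unit time as (reward per decision) / (time per decision) and applies the quotient rule. All names (`estimate_gradient`, `theta`, `P`, `reward`, `mean_time`) and all numbers in the toy model are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Action probabilities from a list of preferences (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def estimate_gradient(theta, P, reward, mean_time, steps=20000, beta=0.9, seed=0):
    """Single-sample-path estimate of the gradient of reward per unit time.

    theta[s][a]     : policy preferences (softmax policy, an assumption)
    P[s][a][s2]     : transition probabilities of the embedded chain
    reward[s][a]    : reward earned per decision
    mean_time[s][a] : mean of the exponential sojourn time (an assumption)
    """
    rng = random.Random(seed)
    n_s, n_a = len(theta), len(theta[0])
    z = [[0.0] * n_a for _ in range(n_s)]    # eligibility trace
    g_r = [[0.0] * n_a for _ in range(n_s)]  # running sum of reward * trace
    g_t = [[0.0] * n_a for _ in range(n_s)]  # running sum of sojourn time * trace
    total_r = total_t = 0.0
    s = 0
    for _ in range(steps):
        probs = softmax(theta[s])
        a = rng.choices(range(n_a), weights=probs)[0]
        tau = rng.expovariate(1.0 / mean_time[s][a])  # sampled sojourn time
        r = reward[s][a]
        # Discount the trace, then add the score:
        # grad log mu(a|s) = onehot(a) - probs for a softmax policy.
        for i in range(n_s):
            for j in range(n_a):
                z[i][j] *= beta
        for j in range(n_a):
            z[s][j] += (1.0 if j == a else 0.0) - probs[j]
        for i in range(n_s):
            for j in range(n_a):
                g_r[i][j] += r * z[i][j]
                g_t[i][j] += tau * z[i][j]
        total_r += r
        total_t += tau
        s = rng.choices(range(n_s), weights=P[s][a])[0]
    eta = total_r / total_t  # estimated average reward per unit time
    # Quotient rule on eta = (reward per decision) / (time per decision).
    grad = [[(g_r[i][j] - eta * g_t[i][j]) / total_t for j in range(n_a)]
            for i in range(n_s)]
    return eta, grad


# Toy two-state, two-action SMDP (all numbers are made up for illustration).
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]
reward = [[1.0, 2.0], [0.5, 3.0]]
mean_time = [[1.0, 2.5], [0.5, 2.0]]
theta = [[0.0, 0.0], [0.0, 0.0]]

eta, grad = estimate_gradient(theta, P, reward, mean_time)
step = 0.1  # hypothetical step size for one stochastic gradient ascent update
theta = [[theta[i][j] + step * grad[i][j] for j in range(2)] for i in range(2)]
```

Repeating the estimate-then-update pair with a decreasing step size gives a basic stochastic gradient ascent loop. With `beta` below 1 the estimate is biased, which is the kind of approximation error the follow-up abstract below sets out to bound.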
ABSTRACT: In earlier work we presented a simulation-based algorithm for optimizing the average reward in a parameterized continuous-time, finite-state semi-Markov decision process (SMDP). We approximated the gradient of the average reward and then proposed a simulation-based algorithm, called GSMDP, that estimates this approximate gradient using only a single sample path of the underlying Markov chain. GSMDP was proved to converge with probability 1. In this paper, we give bounds on the approximation and estimation errors of the GSMDP algorithm. The approximation error is the size of the difference between the true gradient and the approximate gradient. The estimation error is the size of the difference between the output of the algorithm and its asymptotic output; it arises because the algorithm sees only a finite data sequence.
ABSTRACT: In this paper, we propose to use hierarchical action decomposition to make Bayesian model-based reinforcement learning more efficient and feasible for larger problems. We formulate Bayesian hierarchical reinforcement learning as a partially observable semi-Markov decision process (POSMDP). The main POSMDP task is partitioned into a hierarchy of POSMDP subtasks. Each subtask may consist only of primitive actions or may hierarchically call other subtasks’ policies, since the policies of lower-level subtasks are treated as macro actions in higher-level subtasks. This hierarchical action decomposition is solved bottom-up: lower-level subtasks first, then higher-level ones. Because each formulated POSMDP has a continuous state space, we sample from a prior belief to build an approximate model of it, then solve that model using the recently introduced Monte Carlo Value Iteration with Macro-Actions solver. We name this method Monte Carlo Bayesian Hierarchical Reinforcement Learning. Simulation results show that our algorithm, by exploiting the action hierarchy, performs significantly better than flat Bayesian reinforcement learning in terms of both reward and, especially, solving time, which improves by at least one order of magnitude.
Article: Bayes-adaptive hierarchical MDPs
ABSTRACT: Reinforcement learning (RL) is an area of machine learning concerned with how an agent learns to make sequential decisions in order to optimize a particular performance measure. To achieve this goal, the agent has to choose between 1) exploiting existing knowledge, which might lead to a local optimum, and 2) exploring to gather new knowledge that is expected to improve the current performance. Among RL algorithms, Bayesian model-based RL (BRL) is well known for being able to trade off exploitation and exploration optimally via belief planning, i.e. by solving a partially observable Markov decision process (POMDP). However, solving that POMDP often suffers from the curse of dimensionality and the curse of history. In this paper, we make two major contributions: 1) a framework for integrating temporal abstraction into BRL, which results in a hierarchical POMDP formulation that can be solved online using a hierarchical sample-based planning solver; and 2) a subgoal discovery method for hierarchical BRL that automatically discovers useful macro actions to accelerate learning. In the experiment section, we demonstrate that the proposed approach can scale up to much larger problems and that the agent is able to discover useful subgoals that speed up Bayesian reinforcement learning.