Policy Gradient for Semi-Markov Decision Processes.
ABSTRACT This paper proposes a simulation-based algorithm for optimizing the average reward in a parameterized continuous-time, finite-state semi-Markov decision process (SMDP). Our contributions are twofold. First, we compute an approximate gradient of the average reward with respect to the parameters of SMDPs controlled by parameterized stochastic policies; a stochastic gradient ascent method is then used to adjust the parameters so as to optimize the average reward. Second, we present a simulation-based algorithm (GSMDP) that estimates this gradient of the average reward using only a single sample path of the underlying Markov chain. We prove that this estimate converges almost surely to the true gradient of the average reward as the number of iterations goes to infinity.
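The single-sample-path gradient estimation described above can be illustrated with a minimal sketch. This is not the paper's exact GSMDP algorithm (which handles continuous-time holding times); it is a GPOMDP-style estimator on a hypothetical two-state, two-action discrete-time MDP with a softmax policy. The transition matrix `P`, reward table `R`, and the bias-variance parameter `beta` are all illustrative assumptions.

```python
import numpy as np

# Hypothetical example MDP (not from the paper):
# P[a, s, s'] = transition probability s -> s' under action a
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])  # R[s, a]: reward for action a in state s
N_STATES, N_ACTIONS = 2, 2
rng = np.random.default_rng(0)

def policy_probs(theta, s):
    """Softmax policy over actions; parameters theta[s, a]."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_policy(theta, s, a):
    """Score function: gradient of log pi(a|s) w.r.t. theta (softmax)."""
    g = np.zeros_like(theta)
    g[s] = -policy_probs(theta, s)
    g[s, a] += 1.0
    return g

def estimate_gradient(theta, beta=0.9, T=200_000):
    """GPOMDP-style average-reward gradient estimate from one sample path.

    beta in [0, 1) trades bias against variance: the eligibility trace z
    discounts the influence of past actions on the current reward.
    """
    z = np.zeros_like(theta)      # eligibility trace
    delta = np.zeros_like(theta)  # running average of r_t * z_t
    s = 0
    for t in range(T):
        a = rng.choice(N_ACTIONS, p=policy_probs(theta, s))
        r = R[s, a]
        z = beta * z + grad_log_policy(theta, s, a)
        delta += (r * z - delta) / (t + 1)
        s = rng.choice(N_STATES, p=P[a, s])
    return delta

theta = np.zeros((N_STATES, N_ACTIONS))
g = estimate_gradient(theta)
```

Stochastic gradient ascent then amounts to repeating `theta += step_size * estimate_gradient(theta)`; the almost-sure convergence of the estimate to the true gradient (as the path length grows) is what justifies this update.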