Conference Proceeding

# Integrating Sample-Based Planning and Model-Based Reinforcement Learning

01/2010; In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010
Source: DBLP

ABSTRACT Recent advancements in model-based reinforcement learning have shown that the dynamics of many structured domains (e.g. DBNs) can be learned with tractable sample complexity, despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes a near optimal policy, and while many traditional MDP algorithms make this guarantee, their computation time grows with the number of states. We show how to replace these over-matched planners with a class of sample-based planners—whose computation time is independent of the number of states—without sacrificing the sample-efficiency guarantees of the overall learning algorithms. To do so, we define sufficient criteria for a sample-based planner to be used in such a learning system and analyze two popular sample-based approaches from the literature. We also introduce our own sample-based planner, which combines the strategies from these algorithms and still meets the criteria for integration into our learning system. In doing so, we define the first complete RL solution for compactly represented (exponentially sized) state spaces with efficiently learnable dynamics that is both sample efficient and whose computation time does not grow rapidly with the number of states.
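The key object the abstract describes is a planner whose cost depends on search depth and sampling width rather than on the number of states. Below is a minimal sketch in the spirit of sparse sampling, not the paper's actual algorithm; the `model(s, a)` generative interface and all names are assumptions for illustration.

```python
def sparse_sampling_plan(model, state, actions, depth, width, gamma=0.95):
    """Estimate Q-values at `state` by recursively sampling a learned
    generative model, so cost scales with depth and width rather than
    with the size of the state space. `model(s, a)` is assumed to
    return one sampled (next_state, reward) pair from the dynamics."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):  # sample `width` next states per action
            next_state, reward = model(state, a)
            future = sparse_sampling_plan(model, next_state, actions,
                                          depth - 1, width, gamma)
            total += reward + gamma * max(future.values())
        q[a] = total / width
    return q

# A learner could call this in place of a full-width MDP solver:
# q = sparse_sampling_plan(model, s, A, depth=3, width=5)
# best_action = max(q, key=q.get)
```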

##### Article: Bandit Algorithms for Tree Search
ABSTRACT: Bandit based methods for tree search have recently gained popularity when applied to huge trees, e.g. in the game of go (Gelly et al., 2006). The UCT algorithm (Kocsis and Szepesvari, 2006), a tree search method based on Upper Confidence Bounds (UCB) (Auer et al., 2002), is believed to adapt locally to the effective smoothness of the tree. However, we show that UCT is "too optimistic" in some cases, leading to a regret O(exp(exp(D))) where D is the depth of the tree. We propose alternative bandit algorithms for tree search. First, a modification of UCT using a confidence sequence that scales exponentially with the horizon depth is proven to have a regret O(2^D √n), but does not adapt to possible smoothness in the tree. We then analyze Flat-UCB performed on the leaves and provide a finite regret bound with high probability. Then, we introduce a UCB-based Bandit Algorithm for Smooth Trees which takes into account actual smoothness of the rewards for performing efficient "cuts" of sub-optimal branches with high confidence. Finally, we present an incremental tree search version which applies when the full tree is too big (possibly infinite) to be entirely represented and show that with high probability, essentially only the optimal branches are indefinitely developed. We illustrate these methods on a global optimization problem of a Lipschitz function, given noisy data.
04/2007;
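The UCB selection rule at the heart of UCT fits in a few lines. A minimal sketch follows, assuming per-child (visit count, total value) statistics; the function name is hypothetical and the exploration constant is one common choice, not the confidence sequence analyzed in this paper.

```python
import math

def ucb1_select(node_visits, child_stats, c=math.sqrt(2)):
    """Pick the child maximizing the UCB1 score used by UCT:
    empirical mean value plus an exploration bonus that shrinks
    as a child is visited more. `child_stats` maps each child to
    (visit_count, total_value); unvisited children are tried first."""
    best, best_score = None, float("-inf")
    for child, (n, total) in child_stats.items():
        if n == 0:
            return child  # always expand unvisited children first
        score = total / n + c * math.sqrt(math.log(node_visits) / n)
        if score > best_score:
            best, best_score = child, score
    return best
```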
##### Article: Reinforcement Learning in Finite MDPs: PAC Analysis.
ABSTRACT: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. We also present a more refined analysis that yields insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Finally, we conclude with open problems.
Journal of Machine Learning Research. 01/2009; 10:2413-2444.
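To make the model-based side concrete, here is a hedged sketch of the optimistic Bellman backup that drives R-MAX-style exploration: state-action pairs with fewer than `m` samples are treated as "unknown" and given the maximum achievable value. The dictionary-based data structures are assumptions for illustration, not the paper's presentation.

```python
def rmax_q_backup(counts, rewards, transitions, m, r_max, gamma, V):
    """One optimistic Bellman backup in the spirit of R-MAX.
    `counts[(s, a)]` is the number of samples of (s, a),
    `rewards[(s, a)]` the cumulative observed reward,
    `transitions[(s, a)]` a dict of next-state counts, and
    `V` the current state-value estimates."""
    q = {}
    for (s, a), n in counts.items():
        if n < m:
            # Unknown pair: assign the largest possible discounted
            # return, which drives the agent to explore it.
            q[(s, a)] = r_max / (1.0 - gamma)
        else:
            # Known pair: back up through the empirical model.
            r_hat = rewards[(s, a)] / n
            expected_v = sum(cnt / n * V[s2]
                             for s2, cnt in transitions[(s, a)].items())
            q[(s, a)] = r_hat + gamma * expected_v
    return q
```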
##### Article: Combining Online and Offline Knowledge in UCT
ABSTRACT: The UCT algorithm learns a value function online using sample-based search. The TD(λ) algorithm can learn a value function offline for the on-policy distribution. We consider three approaches for combining offline and online value functions in the UCT algorithm. First, the offline value function is used as a default policy during Monte-Carlo simulation. Second, the UCT value function is combined with a rapid online estimate of action values. Third, the offline value function is used as prior knowledge in the UCT search tree. We evaluate these algorithms in 9 × 9 Go against GnuGo 3.7.10. The first algorithm performs better than UCT with a random simulation policy, but surprisingly, worse than UCT with a weaker, handcrafted simulation policy. The second algorithm outperforms UCT altogether. The third algorithm outperforms UCT with handcrafted prior knowledge. We combine these algorithms in MoGo, the world's strongest 9 × 9 Go program. Each technique significantly improves MoGo's playing strength.
01/2007;
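The third combination the abstract describes, using the offline value function as prior knowledge in the UCT search tree, is often realized by seeding each node with virtual visits. A minimal sketch under that assumption follows; the function name and the `n_prior` parameter are hypothetical.

```python
def node_value_with_prior(n_visits, total_value, prior_value, n_prior):
    """Blend an offline value estimate into a UCT node as if it had
    already been backed up `n_prior` times. Early in the search the
    prior dominates; as real simulations accumulate, their weight
    overtakes the prior's. Assumes n_prior > 0."""
    return (n_prior * prior_value + total_value) / (n_prior + n_visits)
```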