Preprint

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes


Abstract

We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$ attains an $\tilde O(dH\sqrt{T})$ regret, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde O(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in[0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved in Zhou et al. (2020) up to logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
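For quick reference, the bounds stated in the abstract can be lined up against the corresponding lower bounds as follows (this simply restates the figures above; $d$ is the feature dimension, $H$ the episode length, $T$ the number of interactions, and $\gamma$ the discount factor):
\[
\text{episodic:}\quad \underbrace{\tilde O\big(dH\sqrt{T}\big)}_{\text{UCRL-VTR}^{+}} \ \text{vs.}\ \Omega\big(dH\sqrt{T}\big),
\qquad
\text{discounted:}\quad \underbrace{\tilde O\big(d\sqrt{T}/(1-\gamma)^{1.5}\big)}_{\text{UCLK}^{+}} \ \text{vs.}\ \Omega\big(d\sqrt{T}/(1-\gamma)^{1.5}\big),
\]
so both algorithms match their lower bounds up to logarithmic factors.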

References
Conference Paper
The generalized linear bandit framework has attracted a lot of attention in recent years by extending the well-understood linear setting and allowing richer reward structures to be modeled. It notably covers the logistic model, widely used when rewards are binary. For logistic bandits, the frequentist regret guarantees of existing algorithms are $\tilde O(\kappa\sqrt{T})$, where κ is a problem-dependent constant. Unfortunately, κ can be arbitrarily large as it scales exponentially with the size of the decision set. This may lead to significantly loose regret bounds and poor empirical performance. In this work, we study the logistic bandit with a focus on the prohibitive dependencies introduced by κ. We propose a new optimistic algorithm based on a finer examination of the non-linearities of the reward function. We show that it enjoys a $\tilde O(\sqrt{T})$ regret with no dependency on κ, except in a second-order term. Our analysis is based on a new tail-inequality for self-normalized martingales, of independent interest.
Article
In this paper we study a model-based approach to calculating approximately optimal policies in Markovian Decision Processes. In particular, we derive novel bounds on the loss of using a policy derived from a factored linear model, a class of models which generalize virtually all previous models that come with strong computational guarantees. For the first time in the literature, we derive performance bounds for model-based techniques where the model inaccuracy is measured in weighted norms. Moreover, our bounds show a decreased sensitivity to the discount factor and, unlike similar bounds derived for other approaches, they are insensitive to measure mismatch. Similarly to previous works, our proofs are also based on contraction arguments, but with the main differences that we use carefully constructed norms building on Banach lattices, and the contraction property is only assumed for operators acting on "compressed" spaces, thus weakening previous assumptions, while strengthening previous results.
Article
We consider a sequential learning problem with Gaussian payoffs and side information: after selecting an action i, the learner receives information about the payoff of every action j in the form of Gaussian observations whose mean is the same as the mean payoff, but the variance depends on the pair (i,j) (and may be infinite). The setup allows a more refined information transfer from one action to another than previous partial monitoring setups, including the recently introduced graph-structured feedback case. For the first time in the literature, we provide non-asymptotic problem-dependent lower bounds on the regret of any algorithm, which recover existing asymptotic problem-dependent lower bounds and finite-time minimax lower bounds available in the literature. We also provide algorithms that achieve the problem-dependent lower bound (up to some universal constant factor) or the minimax lower bounds (up to logarithmic factors).
Conference Paper
We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer's UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales.
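The tail inequality for vector-valued martingales mentioned here is commonly stated in roughly the following form (notation is mine: $\eta_s$ is conditionally $R$-sub-Gaussian noise, $X_s$ is the feature vector chosen at round $s$, and $\bar V_t = \lambda I + \sum_{s\le t} X_s X_s^\top$ is the regularized Gram matrix): with probability at least $1-\delta$, simultaneously for all $t\ge 0$,
\[
\Big\| \sum_{s=1}^{t} \eta_s X_s \Big\|_{\bar V_t^{-1}}^2 \;\le\; 2R^2 \log\!\Big( \frac{\det(\bar V_t)^{1/2}\,\det(\lambda I)^{-1/2}}{\delta} \Big).
\]
This Hoeffding-type, self-normalized bound is what yields the smaller confidence sets described above, and it is the kind of inequality that the preprint summarized earlier sharpens into a Bernstein-type, variance-aware version.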
Conference Paper
We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). For the upper bound we make the assumption that each action leads to at most two possible next-states and prove a new bound for a UCRL-style algorithm on the number of time-steps when it is not Probably Approximately Correct (PAC). The new lower bound strengthens previous work by being both more general (it applies to all policies) and tighter. The upper and lower bounds match up to logarithmic factors.
Article
Algorithms based on upper confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. This paper considers a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. We provide the first analysis of the expected regret for such algorithms. As expected, our results show that the algorithm that uses the variance estimates has a major advantage over its alternatives that do not use such estimates provided that the variances of the payoffs of the suboptimal arms are low. We also prove that the regret concentrates only at a polynomial rate. This holds for all the upper confidence bound based algorithms and for all bandit problems except those special ones where with probability one the payoff obtained by pulling the optimal arm is larger than the expected payoff for the second best arm. Hence, although upper confidence bound bandit algorithms achieve logarithmic expected regret rates, they might not be suitable for a risk-averse decision maker. We illustrate some of the results by computer simulations.
Article
Motivated by the wide adoption of reinforcement learning (RL) in real-world personalized services, where users' sensitive and private information needs to be protected, we study regret minimization in finite-horizon Markov decision processes (MDPs) under the constraints of differential privacy (DP). Compared to existing private RL algorithms that work only on tabular finite-state, finite-action MDPs, we take the first step towards privacy-preserving learning in MDPs with large state and action spaces. Specifically, we consider MDPs with linear function approximation (in particular linear mixture MDPs) under the notion of joint differential privacy (JDP), where the RL agent is responsible for protecting users' sensitive data. We design two private RL algorithms that are based on value iteration and policy optimization, respectively, and show that they enjoy sub-linear regret performance while guaranteeing privacy protection. Moreover, the regret bounds are independent of the number of states, and scale at most logarithmically with the number of actions, making the algorithms suitable for privacy protection in today's large-scale personalized services. Our results are achieved via a general procedure for learning in linear mixture MDPs under changing regularizers, which not only generalizes previous results for non-private learning, but also serves as a building block for general private reinforcement learning.
Conference Paper
We study an idealised sequential resource allocation problem. In each time step the learner chooses an allocation of several resource types between a number of tasks. Assigning more resources to a task increases the probability that it is completed. The problem is challenging because the alignment of the tasks to the resource types is unknown and the feedback is noisy. Our main contribution is the new setting and an algorithm with nearly-optimal regret analysis. Along the way we draw connections to the problem of minimising regret for stochastic linear bandits with heteroscedastic noise. We also present some new results for stochastic linear bandits on the hypercube that significantly improve on existing work, especially in the sparse case.
Article
Recently, there has been significant progress in understanding reinforcement learning in discounted infinite-horizon Markov decision processes (MDPs) by deriving tight sample complexity bounds. However, in many real-world applications, an interactive learning agent operates for a fixed or bounded period of time, for example tutoring students for exams or handling customer service requests. Such scenarios can often be better treated as episodic fixed-horizon MDPs, for which only looser bounds on the sample complexity exist. A natural notion of sample complexity in this setting is the number of episodes required to guarantee a certain performance with high probability (PAC guarantee). In this paper, we derive an upper PAC bound $\tilde O\big(\frac{|\mathcal S|^2 |\mathcal A| H^2}{\epsilon^2} \ln\frac{1}{\delta}\big)$ and a lower PAC bound $\tilde \Omega\big(\frac{|\mathcal S| |\mathcal A| H^2}{\epsilon^2} \ln \frac{1}{\delta + c}\big)$ that match up to log-terms and an additional linear dependency on the number of states $|\mathcal S|$. The lower bound is the first of its kind for this setting. Our upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs which have a time-horizon dependency of at least $H^3$.
Article
We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample-complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with $N$ state-action pairs and discount factor $\gamma\in[0,1)$, only $O\big(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ state-transition samples are required to find an $\varepsilon$-optimal estimation of the action-value function with the probability (w.p.) $1-\delta$. Further, we prove that, for small values of $\varepsilon$, an order of $O\big(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ samples is required to find an $\varepsilon$-optimal policy w.p. $1-\delta$. We also prove a matching lower bound of $\Theta\big(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ on the sample complexity of estimating the optimal action-value function with $\varepsilon$ accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of $N$, $\varepsilon$, $\delta$ and $1/(1-\gamma)$ up to a constant factor. Also, both our lower bound and upper bound improve on the state-of-the-art in terms of their dependence on $1/(1-\gamma)$.
Article
Fitting the value function in a Markovian decision process by a linear superposition of M basis functions reduces the problem dimensionality from the number of states down to M, with good accuracy retained if the value function is a smooth function of its argument, the state vector. This paper provides, for both the discounted and undiscounted cases, three algorithms for computing the coefficients in the linear superposition: linear programming, policy iteration, and least squares.
Conference Paper
In the classical stochastic k-armed bandit problem, in each of a sequence of T rounds, a decision maker chooses one of k arms and incurs a cost chosen from an unknown distribution associated with that arm. The goal is to minimize regret, defined as the difference between the cost incurred by the algorithm and the optimal cost. In the linear optimization version of this problem (first considered by Auer (2002)), we view the arms as vectors in $\mathbb{R}^n$ and require that the costs be linear functions of the chosen vector. As before, it is assumed that the cost functions are sampled independently from an unknown distribution. In this setting, the goal is to find algorithms whose running time and regret behave well as functions of the number of rounds T and the dimensionality n (rather than the number of arms, k, which may be exponential in n or even infinite). We give a nearly complete characterization of this problem in terms of both upper and lower bounds for the regret. In certain special cases (such as when the decision region is a polytope), the regret is polylog(T). In general, though, the optimal regret is of order $\sqrt{T}$; our lower bounds rule out the possibility of obtaining polylog(T) rates in general. We present two variants of an algorithm based on the idea of "upper confidence bounds." The first, due to Auer (2002), but not fully analyzed, obtains regret whose dependence on n and T are both essentially optimal, but which may be computationally intractable when the decision set is a polytope. The second version can be efficiently implemented when the decision set is a polytope (given as an intersection of half-spaces), but gives up a factor of $\sqrt{n}$ in the regret bound. Our results also extend to the setting where the set of allowed decisions may change over time.
Article
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret $\tilde O(DS\sqrt{AT})$ after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of $\ell$ times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of $\tilde O(\ell^{1/3} T^{2/3} DS\sqrt{A})$.
Article
We show how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off. Our technique for designing and analyzing algorithms for such situations is general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process. We apply our technique to two models with such an exploitation-exploration trade-off. For the adversarial bandit problem with shifting, our new algorithm suffers only $\tilde O((ST)^{1/2})$ regret with high probability over T trials with S shifts. Such a regret bound was previously known only in expectation. The second model we consider is associative reinforcement learning with linear value functions. For this model our technique improves the regret from $\tilde O(T^{3/4})$ to $\tilde O(T^{1/2})$.
Article
We give improved constants for data dependent and variance sensitive confidence bounds, called empirical Bernstein bounds, and extend these inequalities to hold uniformly over classes of functions whose growth function is polynomial in the sample size n. The bounds lead us to consider sample variance penalization, a novel learning method which takes into account the empirical variance of the loss function. We give conditions under which sample variance penalization is effective. In particular, we present a bound on the excess risk incurred by the method. Using this, we argue that there are situations in which the excess risk of our method is of order $1/n$, while the excess risk of empirical risk minimization is of order $1/\sqrt{n}$. We show some experimental results, which confirm the theory. Finally, we discuss the potential application of our results to sample compression schemes.
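For context, empirical Bernstein bounds of the kind discussed here are typically stated roughly as follows (for i.i.d. $X_1,\dots,X_n$ with values in $[0,1]$, sample mean $\bar X_n$ and sample variance $V_n$; the constants vary slightly across statements): with probability at least $1-\delta$,
\[
\mathbb{E}[X] \;\le\; \bar X_n + \sqrt{\frac{2 V_n \ln(2/\delta)}{n}} + \frac{7\ln(2/\delta)}{3(n-1)}.
\]
The key feature is that the slack scales with the empirical variance rather than with the worst-case range, which is what makes sample variance penalization attractive when the variance is small.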
Article
Watch a martingale with uniformly bounded increments until it first crosses the horizontal line of height a. The sum of the conditional variances of the increments given the past, up to the crossing, is an intrinsic measure of the crossing time. Simple and fairly sharp upper and lower bounds are given for the Laplace transform of this crossing time, which show that the distribution is virtually the same as that for the crossing time of Brownian motion, even in the tail. The argument can be adapted to extend inequalities of Bernstein and Kolmogorov to the dependent case, proving the law of the iterated logarithm for martingales. The argument can also be adapted to prove Levy's central limit theorem for martingales. The results can be extended to martingales whose increments satisfy a growth condition.
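The Bernstein-type extension to the dependent case referred to here is often cited as Freedman's inequality, roughly in the following form: if $\{X_i\}$ is a martingale difference sequence with $|X_i|\le M$ and $V_t = \sum_{i\le t}\mathbb{E}[X_i^2\mid\mathcal{F}_{i-1}]$, then for all $a, v > 0$,
\[
\Pr\Big(\exists t:\ \sum_{i\le t} X_i \ge a \ \text{and}\ V_t \le v\Big) \;\le\; \exp\!\Big(-\frac{a^2}{2v + 2Ma/3}\Big).
\]
Inequalities of this type are the starting point for Bernstein-style self-normalized bounds such as the one proposed in the preprint above.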
Article
We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an r-dimensional random vector $\mathbf{Z} \in \mathbb{R}^r$, where $r \geq 2$. The objective is to minimize the cumulative regret and Bayes risk. When the set of arms corresponds to the unit sphere, we prove that the regret and Bayes risk is of order $\Theta(r\sqrt{T})$, by establishing a lower bound for an arbitrary policy, and showing that a matching upper bound is obtained through a policy that alternates between exploration and exploitation phases. The phase-based policy is also shown to be effective if the set of arms satisfies a strong convexity condition. For the case of a general set of arms, we describe a near-optimal policy whose regret and Bayes risk admit upper bounds of the form $O(r\sqrt{T}\log^{3/2} T)$.
Article
Mixed linear models are assumed in most animal breeding applications. Convenient methods for computing BLUE of the estimable linear functions of the fixed elements of the model and for computing best linear unbiased predictions of the random elements of the model have been available. Most data available to animal breeders, however, do not meet the usual requirements of random sampling, the problem being that the data arise either from selection experiments or from breeders' herds which are undergoing selection. Consequently, the usual methods are likely to yield biased estimates and predictions. Methods for dealing with such data are presented in this paper.
Agarwal, A., Kakade, S. and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory.
Jia, Z., Yang, L., Szepesvari, C. and Wang, M. (2020). Model-based reinforcement learning with value-targeted regression. In L4DC.
Azar, M. G., Osband, I. and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org.
Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2019). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
Chu, W., Li, L., Reyzin, L. and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
Dann, C., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J. and Schapire, R. E. (2018). On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems.
Du, S. S., Kakade, S. M., Wang, R. and Yang, L. F. (2019). Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations.
He, J., Zhou, D. and Gu, Q. (2020b). Minimax optimal reinforcement learning for discounted MDPs. arXiv preprint arXiv:2010.00587.
Jiang, N. and Agarwal, A. (2018). Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference on Learning Theory.
Jin, C., Allen-Zhu, Z., Bubeck, S. and Jordan, M. I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems.
Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory.
Kirschner, J. and Krause, A. (2018). Information directed sampling and bandits with heteroscedastic noise. In Conference on Learning Theory.
Lattimore, T., Szepesvari, C. and Weisz, G. (2020). Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning. PMLR.
Li, L., Chu, W., Langford, J. and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web.
Li, Y., Wang, Y. and Zhou, Y. (2019a). Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory.
Li, Y., Wang, Y. and Zhou, Y. (2019b). Tight regret bounds for infinite-armed linear contextual bandits. arXiv preprint arXiv:1905.01435.
Liu, S. and Su, H. (2020). Regret bounds for discounted MDPs. arXiv preprint arXiv:2002.05138.
Modi, A., Jiang, N., Tewari, A. and Singh, S. (2020). Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics. PMLR.
Neu, G. and Pike-Burke, C. (2020). A unifying view of optimism in episodic reinforcement learning. In Advances in Neural Information Processing Systems.
Sidford, A., Wang, M., Wu, X., Yang, L. F. and Ye, Y. (2018). Near-optimal time and sample complexities for solving discounted Markov decision process with a generative model. arXiv preprint arXiv:1806.01492.
Simchowitz, M. and Jamieson, K. G. (2019). Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems.
Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A. and Langford, J. (2019). Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory. PMLR.
Tossou, A., Basu, D. and Dimitrakakis, C. (2019). Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. arXiv preprint arXiv:1905.12425.
Wang, R., Du, S. S., Yang, L. F. and Kakade, S. M. (2020a). Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527.
Wang, R., Salakhutdinov, R. R. and Yang, L. (2020b). Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. In Advances in Neural Information Processing Systems 33.
Wang, Y., Wang, R., Du, S. S. and Krishnamurthy, A. (2019). Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136.
Weisz, G., Amortila, P. and Szepesvári, C. (2020). Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. arXiv preprint arXiv:2010.01374.
Yang, K., Yang, L. F. and Du, S. S. (2020). Q-learning with logarithmic regret. arXiv preprint arXiv:2006.09118.