Daniel Russo’s research while affiliated with Columbia University and other places


Publications (11)


Satisficing in Time-Sensitive Bandit Learning
  • Preprint

March 2018 · 1 Citation

Daniel Russo · Benjamin Van Roy

Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action. One shortcoming is that this orientation does not account for time sensitivity, which can play a crucial role when learning an optimal action requires much more information than near-optimal ones. Indeed, popular approaches such as upper-confidence-bound methods and Thompson sampling can fare poorly in such situations. We consider instead learning a satisficing action, which is near-optimal while requiring less information, and propose satisficing Thompson sampling, an algorithm that serves this purpose. We establish a general bound on expected discounted regret and study the application of satisficing Thompson sampling to linear and infinite-armed bandits, demonstrating arbitrarily large benefits over Thompson sampling. We also discuss the relation between the notion of satisficing and the theory of rate distortion, which offers guidance on the selection of satisficing actions.
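
To make the idea concrete, here is a minimal sketch of a satisficing-style variant of Thompson sampling on a Bernoulli bandit. It illustrates the satisficing principle rather than the paper's exact algorithm: the tolerance `epsilon`, the Beta(1, 1) priors, and the rule of picking the best-understood near-optimal arm are all illustrative assumptions.

```python
import numpy as np

def satisficing_thompson_sampling(arm_means, horizon=1000, epsilon=0.05, seed=0):
    """Illustrative satisficing-style Thompson sampling on a Bernoulli bandit.

    At each step we sample mean rewards from Beta posteriors and then, instead
    of playing the sampled argmax, play the epsilon-near-optimal arm we have
    already pulled most often (a crude proxy for 'requiring less information').
    This is a sketch, not the paper's exact satisficing rule.
    """
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    successes = np.ones(k)   # Beta(1, 1) priors
    failures = np.ones(k)
    pulls = np.zeros(k)
    total_reward = 0.0

    for _ in range(horizon):
        theta = rng.beta(successes, failures)                 # posterior sample per arm
        near_optimal = np.flatnonzero(theta >= theta.max() - epsilon)
        arm = near_optimal[np.argmax(pulls[near_optimal])]    # best-understood satisficing arm
        reward = float(rng.random() < arm_means[arm])         # Bernoulli feedback
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        pulls[arm] += 1
        total_reward += reward
    return total_reward

# Example: two arms whose means differ by less than epsilon are treated as interchangeable.
print(satisficing_thompson_sampling([0.48, 0.50, 0.30], horizon=2000))
```

The point of the tie-breaking rule is that, once several arms are within `epsilon` of the sampled optimum, the agent stops paying to distinguish between them.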


Satisficing in Time-Sensitive Bandit Learning

March 2018 · 24 Reads · 18 Citations

Mathematics of Operations Research

Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action. One shortcoming is that this orientation does not account for time sensitivity, which can play a crucial role when learning an optimal action requires much more information than near-optimal ones. Indeed, popular approaches such as upper-confidence-bound methods and Thompson sampling can fare poorly in such situations. We consider instead learning a satisficing action, which is near-optimal while requiring less information, and propose satisficing Thompson sampling, an algorithm that serves this purpose. We establish a general bound on expected discounted regret and study the application of satisficing Thompson sampling to linear and infinite-armed bandits, demonstrating arbitrarily large benefits over Thompson sampling. We also discuss the relation between the notion of satisficing and the theory of rate distortion, which offers guidance on the selection of satisficing actions.


A Tutorial on Thompson Sampling

January 2018 · 230 Reads · 210 Citations

Daniel J. Russo · Benjamin Van Roy · Abbas Kazerouni · [...]

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. A Tutorial on Thompson Sampling covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. It also discusses when and why Thompson sampling is or is not effective and relations to alternative algorithms.
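
For readers who want the core loop in code, the following is a minimal Beta-Bernoulli Thompson sampling sketch in the spirit of the tutorial's Bernoulli bandit example; the simulated arm means and the uniform Beta(1, 1) priors are placeholder choices.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon=1000, seed=0):
    """Beta-Bernoulli Thompson sampling: sample a mean for each arm from its
    Beta posterior, play the argmax, then update that arm's posterior."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)   # Beta(1, 1) = uniform prior on each arm's mean
    beta = np.ones(k)
    total = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)               # one posterior sample per arm
        arm = int(np.argmax(theta))                 # act greedily w.r.t. the sample
        r = float(rng.random() < true_means[arm])   # observe a Bernoulli reward
        alpha[arm] += r                             # conjugate posterior update
        beta[arm] += 1.0 - r
        total += r
    return total

print(thompson_sampling_bernoulli([0.6, 0.5, 0.4]))
```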


A Tutorial on Thompson Sampling

July 2017 · 620 Reads · 433 Citations

Foundations and Trends® in Machine Learning

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, dynamic pricing, recommendation, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.


A Tutorial on Thompson Sampling

July 2017 · 2 Citations

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.


Time-Sensitive Bandit Learning and Satisficing Thompson Sampling

April 2017 · 35 Reads · 5 Citations

The literature on bandit learning and regret analysis has focused on contexts where the goal is to converge on an optimal action in a manner that limits exploration costs. One shortcoming imposed by this orientation is that it does not treat time preference in a coherent manner. Time preference plays an important role when the optimal action is costly to learn relative to near-optimal actions. This limitation has not only restricted the relevance of theoretical results but has also influenced the design of algorithms. Indeed, popular approaches such as Thompson sampling and UCB can fare poorly in such situations. In this paper, we consider discounted rather than cumulative regret, where a discount factor encodes time preference. We propose satisficing Thompson sampling -- a variation of Thompson sampling -- and establish a strong discounted regret bound for this new algorithm.
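
As a point of reference, the discounted objective described above can be written as follows; the notation is a plausible reconstruction rather than a quotation from the paper, with the discount factor γ ∈ (0, 1) encoding time preference.

```latex
% Expected discounted regret with discount factor \gamma \in (0,1);
% R^*(\theta) denotes the optimal mean reward under the unknown parameter \theta,
% and R_t the mean reward of the action chosen at time t.
\mathrm{Regret}_\gamma \;=\; \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \bigl( R^*(\theta) - R_t \bigr) \right].
```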


Deep Exploration via Randomized Value Functions

March 2017 · 102 Reads · 233 Citations

We study the use of randomized value functions to guide deep exploration in reinforcement learning. This offers an elegant means for synthesizing statistically and computationally efficient exploration with common practical approaches to value function learning. We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through computational studies. We also prove a regret bound that establishes statistical efficiency with a tabular representation.
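
A toy, tabular sketch of the flavor of this approach appears below: the agent draws a randomly perturbed Q-table whose noise shrinks with visit counts, acts greedily with respect to that draw, and updates its estimate by ordinary Q-learning. This is a loose illustration under assumed interfaces (the chain environment, noise scale, and learning rate are invented for the example), not the paper's RLSVI algorithm.

```python
import numpy as np

def randomized_value_episode(Q_hat, counts, step, reset, horizon,
                             noise_scale=1.0, alpha=0.1, gamma=0.99, rng=None):
    """One episode of a crude 'randomized value function' agent (a sketch in
    the spirit of the paper, not its exact algorithm): draw a noisy Q-table
    whose perturbation shrinks with visit counts, act greedily w.r.t. that
    sample, and update Q_hat by ordinary Q-learning along the way."""
    rng = rng or np.random.default_rng()
    # Sampled value function: more noise where we have less data.
    Q_tilde = Q_hat + noise_scale * rng.standard_normal(Q_hat.shape) / np.sqrt(counts)
    s = reset()
    for _ in range(horizon):
        a = int(np.argmax(Q_tilde[s]))                  # greedy w.r.t. the random sample
        s_next, r, done = step(s, a)
        target = r + (0.0 if done else gamma * Q_hat[s_next].max())
        Q_hat[s, a] += alpha * (target - Q_hat[s, a])   # standard Q-learning update
        counts[s, a] += 1
        if done:
            break
        s = s_next
    return Q_hat, counts

# Toy chain MDP: move right (action 1) N times to reach the only rewarding state.
N = 10
def reset():
    return 0
def step(s, a):
    s_next = min(s + 1, N) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == N), s_next == N

Q_hat, counts = np.zeros((N + 1, 2)), np.ones((N + 1, 2))
for _ in range(200):
    Q_hat, counts = randomized_value_episode(Q_hat, counts, step, reset, horizon=2 * N)
```

Because the perturbation is correlated across an entire episode, the agent can commit to a temporally extended exploratory plan, which is the "deep exploration" property that per-step dithering schemes lack.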


An Information-Theoretic Analysis of Thompson Sampling

March 2014 · 93 Reads · 279 Citations

We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and yields new insight into how information improves performance.
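
Up to notational choices, the entropy-scaled guarantee referenced here has the following shape (stated informally, with Γ̄ denoting a uniform bound on the per-period information ratio).

```latex
% If the information ratio
%   \Gamma_t = \frac{\bigl(\mathbb{E}_t[\,R_{A^*} - R_{A_t}\,]\bigr)^2}{I_t\!\bigl(A^*;\,(A_t, Y_t)\bigr)}
% is bounded by \overline{\Gamma} in every period, then the Bayesian regret satisfies
\mathbb{E}\bigl[\mathrm{Regret}(T)\bigr] \;\le\; \sqrt{\overline{\Gamma}\, H(A^*)\, T},
% where H(A^*) is the entropy of the prior distribution of the optimal action.
```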


Learning to Optimize Via Information Directed Sampling

March 2014 · 126 Reads · 211 Citations

Advances in Neural Information Processing Systems

This paper proposes information directed sampling--a new algorithm for balancing between exploration and exploitation in online optimization problems in which a decision-maker must learn from partial feedback. The algorithm quantifies the amount learned by selecting an action through an information theoretic measure: the mutual information between the true optimal action and the algorithm's next observation. Actions are then selected by optimizing a myopic objective that balances earning high immediate reward and acquiring information. We show this algorithm is provably efficient and is empirically efficient in simulation trials. We provide novel and general regret bounds that scale with the entropy of the optimal action distribution. Furthermore, as we highlight through several examples, information directed sampling sometimes dramatically outperforms popular approaches like UCB algorithms and Thompson sampling which don't quantify the information provided by different actions.
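
The decision rule can be sketched compactly for a Beta-Bernoulli bandit. The code below uses a variance-based surrogate for the information term and, for simplicity, picks a single arm; full IDS uses mutual information and optimizes over action distributions (supported on at most two actions). The Monte Carlo sample size and priors are illustrative assumptions.

```python
import numpy as np

def variance_based_ids_action(alpha, beta, n_samples=1000, rng=None):
    """One decision of a simplified, deterministic variance-based IDS rule for
    a Beta-Bernoulli bandit: estimate each arm's expected regret and a
    variance-based proxy for how informative playing it would be about the
    identity of the optimal arm, then minimize regret^2 / information."""
    rng = rng or np.random.default_rng()
    theta = rng.beta(alpha, beta, size=(n_samples, len(alpha)))  # posterior samples
    opt = theta.argmax(axis=1)                                   # optimal arm per sample

    mean = theta.mean(axis=0)                                    # E[theta_a]
    best = theta.max(axis=1).mean()                              # E[max_a theta_a]
    regret = best - mean                                         # estimated shortfall per arm

    # Variance of E[theta_a | A*]: how much arm a's mean shifts with the optimum's identity.
    info = np.zeros(len(alpha))
    for j in range(len(alpha)):
        mask = opt == j
        if mask.any():
            info += mask.mean() * (theta[mask].mean(axis=0) - mean) ** 2

    ratio = regret ** 2 / np.maximum(info, 1e-12)                # information ratio per arm
    return int(np.argmin(ratio))

# Example call with Beta(1, 1) priors on three arms (illustrative values only).
print(variance_based_ids_action(np.ones(3), np.ones(3)))
```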


(More) Efficient Reinforcement Learning via Posterior Sampling

June 2013 · 179 Reads · 318 Citations

Advances in Neural Information Processing Systems

Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length, and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
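
A compact tabular sketch of the episodic loop described above is given below. The Dirichlet posterior over transitions and the simple Gaussian-style reward posterior are common modelling defaults rather than necessarily the paper's exact priors, and `step`/`reset` stand for an assumed environment interface returning `(next_state, reward, done)` (the same toy interface as in the earlier chain-MDP sketch). Counts initialized to ones play the role of uninformative priors.

```python
import numpy as np

def psrl_episode(trans_counts, reward_sums, reward_counts, step, reset, horizon, rng=None):
    """One episode of a tabular PSRL sketch: sample an MDP from the posterior,
    solve it by finite-horizon value iteration, follow the resulting policy,
    and update the sufficient statistics."""
    rng = rng or np.random.default_rng()
    S, A, _ = trans_counts.shape

    # Sample transition probabilities (Dirichlet) and mean rewards (crude Gaussian posterior).
    P = np.stack([[rng.dirichlet(trans_counts[s, a]) for a in range(A)] for s in range(S)])
    R = reward_sums / reward_counts + rng.standard_normal((S, A)) / np.sqrt(reward_counts)

    # Finite-horizon value iteration on the *sampled* MDP.
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = R + P @ V                    # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] * V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)

    # Follow the sampled policy for one episode and update posterior counts.
    s = reset()
    for h in range(horizon):
        a = int(policy[h, s])
        s_next, r, done = step(s, a)
        trans_counts[s, a, s_next] += 1
        reward_sums[s, a] += r
        reward_counts[s, a] += 1
        if done:
            break
        s = s_next
    return trans_counts, reward_sums, reward_counts
```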


Citations (10)


... Due to the absence of a dynamic parameter adjustment mechanism in the UCB1 algorithm [27], it struggles to effectively adapt to rapidly changing environments. Consequently, this study employs the Thompson Sampling algorithm [39] as a more suitable alternative. ...

Reference:

ChatHTTPFuzz: large language model-assisted IoT HTTP fuzzing
A Tutorial on Thompson Sampling
  • Citing Preprint
  • July 2017

... There are a number of algorithms which adapt stationary stochastic bandit algorithms to incorporate ideas such as windowing or discounting. Notable examples include Sliding Window/Discounted UCB [Garivier and Moulines, 2008], as well as Thompson Sampling-based variants [Russo et al., 2017, Trovò et al., 2020]. These algorithms perform well when a good problem-dependent choice of hyperparameters is known in advance. ...

A Tutorial on Thompson Sampling
  • Citing Article
  • July 2017

Foundations and Trends® in Machine Learning

... Similarly to UCB-based methods, Thompson sampling is guaranteed to converge to an optimal policy in multi-armed bandit problems [38,39], and has shown strong empirical performance [40,41]. For a discussion of known shortcomings of Thompson sampling, we refer to [42][43][44]. ...

Time-Sensitive Bandit Learning and Satisficing Thompson Sampling
  • Citing Article
  • April 2017

... See Bellemare et al. (2023) for a recent textbook introduction. The other focuses on accounting for the epistemic uncertainty inherent in inferring a value function, usually relying on methods from Bayesian inference (Ghavamzadeh et al., 2015; Luis et al., 2024), e.g., to use it as a guide for exploration (e.g., Deisenroth & Rasmussen, 2011; Osband et al., 2019). Evidential value learning belongs to this second area of research: it uses an evidential model over the value function to induce a distribution over an advantage function that incorporates aleatoric and epistemic uncertainty for regularization and optimistic exploration. ...

Deep Exploration via Randomized Value Functions
  • Citing Article
  • March 2017

... Another valid Bayesian approach (Orbanz, 2009; Ghosal & van der Vaart, 2017) for obtaining posteriors is to leverage the property of conjugacy, as discussed in Sec. 3. In particular, most nonparametric priors do not satisfy the necessary conditions for Bayes' rule (see A.1), and one must rely on their conjugacy property to derive the corresponding posteriors. Therefore, all probability matching algorithms which derive P_t(R_{t,a}) (and hence P_t(A^* = a)) using a valid Bayesian approach are admissible in the information-theoretic analysis of Russo & Van Roy (2014). Additionally, these admissible algorithms would enjoy similar bounds as parametric Thompson sampling on their information ratio (and consequently Bayesian regret), if they satisfy the auxiliary conditions required by the original analysis. ...

Learning to Optimize Via Information Directed Sampling
  • Citing Article
  • March 2014

Advances in Neural Information Processing Systems

... 3. We extend the information-theoretic analysis of Thompson sampling in Russo & Van Roy (2016) to a wider class of probability-matching algorithms that derive their posterior probability of the optimal action using a valid Bayesian approach, and use this extension to establish a σ√(2TK log K) non-asymptotic upper bound on the Bayesian regret of DPPS in bandit environments with σ-sub-Gaussian reward noise, where T is the time horizon and K is the number of arms. Context: We are unaware of any Bootstrap-based bandit algorithm that enjoys the order-optimal σ√(2TK log K) non-asymptotic regret bound in the wide class of σ-sub-Gaussian bandit environments. ...

An Information-Theoretic Analysis of Thompson Sampling
  • Citing Article
  • March 2014

... This can be done in both model-based and model-free scenarios. In the former, a Posterior Sampling Reinforcement Learning (PSRL) (Osband et al., 2013; Fan & Ming, 2021) algorithm based on Dirichlet Process posteriors is definitely a promising direction of research. For the model-free scenario, one can extend Randomized Least-Squares Value Iteration (RLSVI) from its current Bayesian-Bootstrap based implementations (Osband et al., 2016; ...) to a full-fledged DP implementation to account for uncertainty that does not come from the observed data, in a principled manner similar to that shown in this paper. ...

(More) Efficient Reinforcement Learning via Posterior Sampling
  • Citing Article
  • June 2013

Advances in Neural Information Processing Systems

... Since existing noisy GP bandit theory often leverages the upper bound on min_{t ∈ [T]} σ(x_t; X_{t−1}) or ∑_{t=1}^{T} σ(x_t; X_{t−1}) from [Srinivas et al., 2010], we expect that the various existing theory for noisy-setting algorithms can be extended to the corresponding noise-free setting by directly replacing the existing noisy upper bounds of [Srinivas et al., 2010] with Lemma 3. For example, the analysis for GP-TS [Chowdhury and Gopalan, 2017], GP-UCB and GP-TS under the Bayesian setting [Srinivas et al., 2010, Russo and Van Roy, 2014], the contextual setting [Krause and Ong, 2011], GP-based level-set estimation [Gotovos et al., 2013], the multi-objective setting [Zuluaga et al., 2016], the robust formulation [Bogunovic et al., 2018], and so on. ...

Learning to Optimize Via Posterior Sampling
  • Citing Article
  • January 2013

Mathematics of Operations Research