Philip S. Thomas’s research while affiliated with University of Massachusetts Amherst and other places


Publications (66)


Abstract Reward Processes: Leveraging State Abstraction for Consistent Off-Policy Evaluation
  • Preprint

October 2024 · Ameet Deshpande · [...] · Philip S. Thomas

Evaluating policies using off-policy data is crucial for applying reinforcement learning to real-world problems such as healthcare and autonomous driving. Previous methods for off-policy evaluation (OPE) generally suffer from high variance or irreducible bias, leading to unacceptably high prediction errors. In this work, we introduce STAR, a framework for OPE that encompasses a broad range of estimators -- which include existing OPE methods as special cases -- that achieve lower mean squared prediction errors. STAR leverages state abstraction to distill complex, potentially continuous problems into compact, discrete models which we call abstract reward processes (ARPs). Predictions from ARPs estimated from off-policy data are provably consistent (asymptotically correct). Rather than proposing a specific estimator, we present a new framework for OPE and empirically demonstrate that estimators within STAR outperform existing methods. The best STAR estimator outperforms baselines in all twelve cases studied, and even the median STAR estimator surpasses the baselines in seven out of the twelve cases.
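
The abstract describes distilling off-policy data into a compact discrete model and then evaluating that model. The sketch below illustrates that general idea only: the abstraction function `phi`, the maximum-likelihood model, and the omission of any importance-weighting correction for the behavior/evaluation policy mismatch are simplifying assumptions for illustration, not the STAR estimators themselves.

```python
import numpy as np

def build_abstract_reward_process(trajectories, phi, n_abstract_states):
    """Aggregate logged transitions into a small tabular reward process.

    trajectories: list of episodes, each a list of (s, r, s_next, done) tuples
                  (assumed here to already reflect the evaluation policy).
    phi:          abstraction function mapping a raw state to an integer in
                  [0, n_abstract_states).
    """
    counts = np.zeros((n_abstract_states, n_abstract_states))
    reward_sum = np.zeros(n_abstract_states)
    visits = np.zeros(n_abstract_states)
    start = np.zeros(n_abstract_states)

    for episode in trajectories:
        start[phi(episode[0][0])] += 1
        for (s, r, s_next, done) in episode:
            z = phi(s)
            visits[z] += 1
            reward_sum[z] += r
            if not done:
                counts[z, phi(s_next)] += 1

    # Maximum-likelihood transition matrix and mean reward per abstract state.
    P = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    r_bar = reward_sum / np.maximum(visits, 1)
    d0 = start / start.sum()
    return d0, P, r_bar

def evaluate_arp(d0, P, r_bar, gamma=0.99):
    """Expected discounted return of the abstract model, in closed form."""
    v = np.linalg.solve(np.eye(len(r_bar)) - gamma * P, r_bar)
    return d0 @ v
```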



Figure previews (captions only; images not included):
  • Figure 6: Average width of the bootstrapped 95% confidence intervals versus the number of samples of each Xi,j, by aggregate weighting method (left: performance-ratio normalization; right: performance percentile), averaged over 1,000 independent trials per sample size.
  • Figure 7: Coverage probability of the bootstrapped 95% confidence intervals at each sample size, with Clopper-Pearson 95% intervals (Clopper & Pearson, 1934) and the target failure rate of 0.05 shown as a dotted line.
  • Figure 11: Average number of algorithms whose confidence intervals overlap the best algorithm's, under adversarial versus uniform weightings, grouped by number of environments (top) or number of algorithms (bottom); 3 (Sep) and 3 (Sim) denote well-separated and similar-performance algorithm sets.
  • Figure 14: Return (left) and proportion of time in the "Far" group of states (right) for Intrinsic Motivation (IM) and restart distribution with intrinsic motivation (µ + IM), across random transition probabilities, averaged over 100 trials, with β = 1.5 (top) and β = 5.0 (bottom).
Position: Benchmarking is Limited in Reinforcement Learning Research
  • Preprint
  • File available

June 2024 · 22 Reads · [...] · Philip S. Thomas

Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and are compared to an ever-changing set of standard algorithms. However, despite numerous calls for improvements, experimental practices continue to produce misleading or unsupported claims. One reason for the ongoing substandard practices is that conducting rigorous benchmarking experiments requires substantial computational time. This work investigates the sources of increased computation costs in rigorous experiment designs. We show that conducting rigorous performance benchmarks will likely have computational costs that are often prohibitive. As a result, we argue for using an additional experimentation paradigm to overcome the limitations of benchmarking.
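
As a rough illustration of why rigorous benchmarking can become computationally prohibitive, consider the back-of-the-envelope calculation below. All of the numbers are assumptions chosen for the example, not figures taken from the paper.

```python
# Illustrative cost of a "rigorous" benchmark; every number below is an
# assumption made for this example.
n_algorithms = 5
n_environments = 10
n_hyperparam_configs = 50   # configurations tried per (algorithm, environment)
n_tuning_runs = 5           # runs per configuration during tuning
n_evaluation_runs = 100     # independent runs of the tuned configuration
hours_per_run = 2.0         # compute-hours per training run

runs = n_algorithms * n_environments * (
    n_hyperparam_configs * n_tuning_runs + n_evaluation_runs
)
print(f"{runs:,} runs, ~{runs * hours_per_run:,.0f} compute-hours")
# 5 * 10 * (50*5 + 100) = 17,500 runs -> ~35,000 compute-hours
```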


ICU-Sepsis: A Benchmark MDP Built from Real Medical Data

June 2024 · 17 Reads

We present ICU-Sepsis, an environment that can be used in benchmarks for evaluating reinforcement learning (RL) algorithms. Sepsis management is a complex task that has been an important topic in applied RL research in recent years. Therefore, MDPs that model sepsis management can serve as part of a benchmark to evaluate RL algorithms on a challenging real-world problem. However, creating usable MDPs that simulate sepsis care in the ICU remains a challenge due to the complexities involved in acquiring and processing patient data. ICU-Sepsis is a lightweight environment that models personalized care of sepsis patients in the ICU. The environment is a tabular MDP that is widely compatible and is challenging even for state-of-the-art RL algorithms, making it a valuable tool for benchmarking their performance. However, we emphasize that while ICU-Sepsis provides a standardized environment for evaluating RL algorithms, it should not be used to draw conclusions that guide medical practice.
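
Because ICU-Sepsis is described as a tabular MDP, evaluating an RL algorithm on it reduces to the standard episodic interaction loop sketched below. The `TabularMDP` class and its methods are hypothetical stand-ins used for illustration, not the actual ICU-Sepsis API; see the paper's accompanying code for the real interface.

```python
import numpy as np

class TabularMDP:
    """Generic tabular MDP of the kind a benchmark like ICU-Sepsis exposes.
    (Hypothetical sketch, not the real ICU-Sepsis environment class.)"""
    def __init__(self, P, R, d0, terminal_states):
        self.P, self.R, self.d0 = P, R, d0   # P: (S, A, S), R: (S, A), d0: (S,)
        self.terminal = set(terminal_states)

    def reset(self, rng):
        self.s = rng.choice(len(self.d0), p=self.d0)
        return self.s

    def step(self, a, rng):
        s_next = rng.choice(self.P.shape[2], p=self.P[self.s, a])
        r = self.R[self.s, a]
        self.s = s_next
        return s_next, r, s_next in self.terminal

def evaluate_random_policy(env, n_actions, episodes=100, seed=0):
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(episodes):
        s, done, g, t = env.reset(rng), False, 0.0, 0
        while not done and t < 1000:   # cap episode length for safety
            s, r, done = env.step(rng.integers(n_actions), rng)
            g += r
            t += 1
        returns.append(g)
    return np.mean(returns)
```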



From Past to Future: Rethinking Eligibility Traces

March 2024 · 2 Reads · 1 Citation

Proceedings of the AAAI Conference on Artificial Intelligence

In this paper, we introduce a fresh perspective on the challenges of credit assignment and policy evaluation. First, we delve into the nuances of eligibility traces and explore instances where their updates may result in unexpected credit assignment to preceding states. From this investigation emerges the concept of a novel value function, which we refer to as the bidirectional value function. Unlike traditional state value functions, bidirectional value functions account for both future expected returns (rewards anticipated from the current state onward) and past expected returns (cumulative rewards from the episode's start to the present). We derive principled update equations to learn this value function and, through experimentation, demonstrate its efficacy in enhancing the process of policy evaluation. In particular, our results indicate that the proposed learning approach can, in certain challenging contexts, perform policy evaluation more rapidly than TD(λ), a method that learns forward value functions, v^π, directly. Overall, our findings present a new perspective on eligibility traces and potential advantages associated with the novel value function it inspires, especially for policy evaluation.
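
To make the two quantities in the abstract concrete, one natural formalization (under standard episodic notation; the paper's exact definitions may differ) is the following sketch.

```latex
% Assumed notation, for illustration only.
% Forward (standard) value: expected return from state s onward.
v^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right].
% Past expected return: expected (discounted) reward accumulated from the
% episode's start up to reaching state s at time t.
\overleftarrow{v}^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{t-1} \gamma^{k} R_{k+1} \,\middle|\, S_t = s\right].
% A bidirectional value function couples the two, e.g.
\tilde{v}^{\pi}(s) \;=\; \overleftarrow{v}^{\pi}(s) + v^{\pi}(s).
```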



Coagent Networks: Generalized and Scaled

May 2023 · 13 Reads

Coagent networks for reinforcement learning (RL) [Thomas and Barto, 2011] provide a powerful and flexible framework for deriving principled learning rules for arbitrary stochastic neural networks. The coagent framework offers an alternative to backpropagation-based deep learning (BDL) that overcomes some of backpropagation's main limitations. For example, coagent networks can compute different parts of the network asynchronously (at different rates or at different times), can incorporate non-differentiable components that cannot be used with backpropagation, and can explore at levels higher than their action spaces (that is, they can be designed as hierarchical networks for exploration and/or temporal abstraction). However, the coagent framework is not just an alternative to BDL; the two approaches can be blended: BDL can be combined with coagent learning rules to create architectures with the advantages of both approaches. This work generalizes the coagent theory and learning rules provided by previous works; this generalization provides more flexibility for network architecture design within the coagent framework. This work also studies one of the chief disadvantages of coagent networks: high variance updates for networks that have many coagents and do not use backpropagation. We show that a coagent algorithm with a policy network that does not use backpropagation can scale to a challenging RL domain with a high-dimensional state and action space (the MuJoCo Ant environment), learning reasonable (although not state-of-the-art) policies. These contributions motivate and provide a more general theoretical foundation for future work that studies coagent networks.
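
The core mechanism behind coagent learning rules is that each stochastic unit (coagent) performs its own local policy-gradient update, treating the rest of the network as part of its environment, with no gradient passed between units. The two-unit, single-decision-per-episode sketch below is an illustrative toy under those ideas, not the algorithm evaluated in the paper; the names and the linear-softmax parameterization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class Coagent:
    """One stochastic unit with a linear-softmax policy and a local
    REINFORCE-style update (no backpropagation between coagents)."""
    def __init__(self, n_inputs, n_outputs, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_outputs, n_inputs))
        self.lr = lr

    def act(self, x):
        p = softmax(self.W @ x)
        a = rng.choice(len(p), p=p)
        # Gradient of log p(a | x) w.r.t. W for a linear-softmax policy.
        self.grad = -np.outer(p, x)
        self.grad[a] += x
        return a

    def update(self, G):
        # Each coagent reinforces its own last action with the shared return G.
        self.W += self.lr * G * self.grad

# Two coagents in series: the first emits a discrete "message" from the
# observation, the second picks the external action from that message.
# Both are updated with the same return; one decision per episode for brevity.
c1 = Coagent(n_inputs=4, n_outputs=3)
c2 = Coagent(n_inputs=3, n_outputs=2)

def select_action(obs):
    message = np.eye(3)[c1.act(obs)]
    return c2.act(message)

def end_of_episode(G):
    c1.update(G)
    c2.update(G)
```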



Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments

February 2023 · 1 Read

In this work, we consider the off-policy policy evaluation problem for contextual bandits and finite horizon reinforcement learning in the nonstationary setting. Reusing old data is critical for policy evaluation, but existing estimators that reuse old data introduce large bias such that we cannot obtain a valid confidence interval. Inspired by a related field called survey sampling, we introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias. The estimator unifies several existing off-policy policy evaluation methods and improves on them with the use of auxiliary information and a regression approach. We prove that the new estimator is asymptotically unbiased, and provide a consistent variance estimator to construct a large-sample confidence interval. Finally, we empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.
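
For context, the standard doubly robust estimator that the regression-assisted variant builds on can be written, for contextual bandits, as in the sketch below. This is the baseline DR form, not the paper's estimator; the array layout and names are assumptions made for the example.

```python
import numpy as np

def dr_estimate(actions, rewards, pi_b_probs, pi_e_probs, q_hat):
    """Baseline doubly robust off-policy value estimate for contextual bandits.

    actions:    (n,) logged actions
    rewards:    (n,) logged rewards
    pi_b_probs: (n, A) behavior-policy probabilities for every action
    pi_e_probs: (n, A) evaluation-policy probabilities for every action
    q_hat:      (n, A) fitted reward-model predictions
    """
    idx = np.arange(len(actions))
    direct = (pi_e_probs * q_hat).sum(axis=1)                  # model-based term
    rho = pi_e_probs[idx, actions] / pi_b_probs[idx, actions]  # importance weights
    correction = rho * (rewards - q_hat[idx, actions])         # IS correction
    return float(np.mean(direct + correction))
```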


Citations (33)


... Doubly robust (DR) estimators (e.g., Jiang & Li (2016); Farajtabar et al. (2018)) combine model-based DM and model-free IS for OPE but may fail to reduce variance when both DM and IS have high variance. Various methods have been developed to refine estimation accuracy in IS, such as truncating importance weights and estimating weights from steady-state visitation distributions (Liu et al., 2018a; Xie et al., 2019; Doroudi et al., 2017; Bossens & Thomas, 2024). ...

Reference:

Concept-driven Off Policy Evaluation
Low Variance Off-policy Evaluation with State-based Importance Sampling
  • Citing Conference Paper
  • June 2024

... It's common in the fairness-aware machine learning literature for fairness measures to be defined such that the optimization goal is a ratio with a value of 1.0 or a difference with a value of 0.0, and previous work on bias in content moderation has used the difference [10]. Following recent work that demonstrates that the ratio is more appropriate for most fairness contexts [28], we define speech suppression accordingly: ...

Analyzing the Relationship Between Difference and Ratio-Based Fairness Metrics
  • Citing Conference Paper
  • June 2024
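
A small worked example of why the two parameterizations behave differently (the rates below are invented for illustration):

```python
# Two pairs of per-group positive rates with the same difference but very
# different ratios (invented numbers, for illustration only).
for p_a, p_b in [(0.30, 0.10), (0.80, 0.60)]:
    diff = p_a - p_b
    ratio = min(p_a, p_b) / max(p_a, p_b)
    print(f"rates ({p_a}, {p_b}): difference = {diff:.2f}, ratio = {ratio:.2f}")
# Both pairs have a difference of 0.20, but ratios of 0.33 vs. 0.75, so a
# ratio-based criterion distinguishes cases that a difference-based one cannot.
```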

... Formal languages can be extended to be more expressive, to capture privacy properties [83], data-based properties [59], [60], fairness properties [12], [27], among others. Some of these kinds of properties can be automatically verified probabilistically [4], [29], [33], [53], [81]. ...

Seldonian Toolkit: Building Software with Safe and Fair Machine Learning
  • Citing Conference Paper
  • May 2023

... It is widely used to solve differential equations analytically [32,33] or numerically [34], and approximate functions [35,36] or time-dependent parameters [37]. It also plays an important role in the fields of control [38,39], signal processing [40,41], image analysis [42], and reinforcement learning [43,44]. In this paper, we adopt the Fourier series to design a method for discovering dynamical systems without the prior information and customized design. ...

Value Function Approximation in Reinforcement Learning Using the Fourier Basis
  • Citing Article
  • August 2011

Proceedings of the AAAI Conference on Artificial Intelligence
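
The Fourier basis referenced in the passage above builds value-function features from cosines of integer-weighted combinations of the (normalized) state variables. A minimal sketch of that construction follows; the variable names are illustrative.

```python
import itertools
import numpy as np

def fourier_features(state, order):
    """Fourier-basis features for a state normalized to [0, 1]^d:
    phi_c(s) = cos(pi * c . s) for every multi-index c in {0, ..., order}^d."""
    d = len(state)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=d)))
    return np.cos(np.pi * coeffs @ np.asarray(state))

# A linear value estimate is then v(s) ~ w . phi(s), with w learned by TD
# or least-squares regression.
```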

... Several recent works have aimed to estimate risk functionals from off-policy datasets. Of these, Chandak et al. [2021b] estimates the variance, while more recent works [Huang et al., 2021b, Chandak et al., 2021a] tackle the estimation of more general risks and are the closest works of comparison. Both Huang et al. [2021b] and Chandak et al. [2021a] take a two-step approach of first estimating the off-policy CDF of returns, and then estimating their risks via a plug-in approach. ...

High-Confidence Off-Policy (or Counterfactual) Variance Estimation
  • Citing Article
  • May 2021

Proceedings of the AAAI Conference on Artificial Intelligence

... Reinforcement learning (RL) has been shown to be a promising approach to complex real-world decision-making problems [1], [2], [3], [4]. However, unconstrained online trial-and-error in the training of RL agents prevents further applications of RL in safety-critical scenarios since it might result in large economic losses [11], [20], [21], [22]. Many studies propose to overcome this problem with offline (batch) RL algorithms [23]. ...

Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing
  • Citing Article
  • February 2017

Proceedings of the AAAI Conference on Artificial Intelligence

... Our approach is orthogonal to the solution methods in that it uses an existing strategy representation and learns a new, potentially more concise finite-state controller representation. Furthermore, our modifications of learned strategy representations share similarities with approaches for strategy improvement [36,13,33]. ...

High-Confidence Off-Policy Evaluation
  • Citing Article
  • February 2015

Proceedings of the AAAI Conference on Artificial Intelligence

... This poses a challenge as the system struggles to fully grasp user intent and preferences. In such situations, the system may need to ask clarifying questions to obtain additional context and disambiguate user queries [89,90]. However, if the system fails to address this bias effectively, it may lead to the recommendation of low-quality items that do not align with the user's preferences. ...

Large-scale Interactive Conversational Recommendation System using Actor-Critic Framework
  • Citing Conference Paper
  • September 2021

... Alternatively, offline reinforcement learning using expert or trace replay [3] is another possible approach to improve neural schedulers. Moreover, leveraging the structure of the underlying action space to parameterize the policy is a candidate approach to tackle a varying action set [10]. We also plan to leverage GNNs to bestow the structural knowledge from job DAGs [50], and demonstrate the performance gain of the improved neural schedulers by using the Compiler Integrated Extensible DSSoC Runtime (CEDR) tool, a successor to the DS3 emulator, as it enables the gathering of low-level and fine-grained timing and performance counter characteristics [29]. ...

Lifelong Learning with a Changing Action Set
  • Citing Article
  • April 2020

Proceedings of the AAAI Conference on Artificial Intelligence

... They showed that the invalid action mask method scales better than the penalty method, but it would still suffer from Kullback-Leibler (KL) divergence explosion when dealing with more challenging tasks. The challenge cannot be underestimated in the deterministic case, and it would be even more difficult in the stochastic case, where the problem becomes a stochastic action set Markov decision process (SAS-MDP) [10], [16]. In ATFM, the weather is of a stochastic nature that can largely influence flights [27], which is reflected in the changing sector capacity in our problem; moreover, current methods still require knowledge of all available actions, which could cause huge computational complexity with the combinatorial action space. ...

Reinforcement Learning When All Actions Are Not Always Available
  • Citing Article
  • April 2020

Proceedings of the AAAI Conference on Artificial Intelligence