Philip S. Thomas’s research while affiliated with University of Massachusetts Amherst and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (67)


Figure 1: The Acrobot. We employed Sarsa(λ) (γ = 1.0, λ = 0.9, ε = 0) with Fourier bases of orders 3 (256 basis functions), 5 (1296 basis functions), and 7 (4096 basis functions), and RBF, Polynomial, and PVF bases of equivalent sizes, to empirically compare their performance on the Acrobot task. (We did not run PVFs with 4096 basis functions because the nearest-neighbour calculations for a graph of that size proved too expensive.) We systematically varied α (the gradient-descent step size) to obtain the best performance for each combination of basis-function type and order. The resulting α values are shown in Table 1.
Figure 2: Learning curves for agents using (a) order 3, (b) order 5, and (c) order 7 Fourier bases, and RBFs and PVFs with the corresponding numbers of basis functions.
Value Function Approximation in Reinforcement Learning Using the Fourier Basis.
  • Conference Paper
  • Full-text available

January 2011 · 837 Reads · 245 Citations

Sarah Osentoski · Philip Thomas

We describe the Fourier Basis, a linear value function approximation scheme based on the Fourier Series. We empirically evaluate its properties, and demonstrate that it performs well compared to Radial Basis Functions and the Polynomial Basis, the two most popular fixed bases for linear value function approximation, and is competitive with learned Proto-Value Functions even though no extra experience or computation is required.
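To make the construction concrete, here is a minimal Python sketch of a Fourier basis of a given order over states rescaled to [0, 1]^d; the function names and the example state are illustrative, not code from the paper. For the 4-dimensional Acrobot state, this reproduces the feature counts quoted in Figure 1 (order 3 gives 4^4 = 256 features).

```python
import itertools
import numpy as np

def fourier_basis(order, dim):
    """Return a function mapping a state in [0, 1]^dim to Fourier basis features.

    Each feature is cos(pi * c . x) for an integer coefficient vector c with
    entries in {0, ..., order}, giving (order + 1) ** dim features in total.
    """
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=dim)))

    def features(x):
        x = np.asarray(x, dtype=float)  # state assumed rescaled to [0, 1]^dim
        return np.cos(np.pi * coeffs @ x)

    return features

# Linear value-function approximation on top of the resulting features:
phi = fourier_basis(order=3, dim=4)          # 256 features for the Acrobot
weights = np.zeros((3 + 1) ** 4)
value = weights @ phi([0.5, 0.2, 0.8, 0.1])  # V(s) ≈ w . phi(s)
```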


Conjugate Markov Decision Processes.

January 2011 · 30 Reads · 22 Citations

Many open problems involve the search for a mapping that is used by an algorithm solving an MDP. Useful mappings are often from the state set to some other set; examples include representation discovery (a mapping to a feature space) and skill discovery (a mapping to skill-termination probabilities). Different mappings result in algorithms achieving different expected returns. In this paper we present a novel approach that, for any mapping used by any algorithm attempting to solve an MDP, searches for the mapping that results in maximum expected return.


Application of the Actor-Critic Architecture to Functional Electrical Stimulation Control of a Human Arm

January 2009 · 122 Reads · 22 Citations

IEEE Int Conf Robot Autom

Clinical tests have shown that the dynamics of a human arm, controlled using Functional Electrical Stimulation (FES), can vary significantly between and during trials. In this paper, we study the application of the actor-critic architecture, with neural networks for both the actor and the critic, as a controller that can adapt to these changing dynamics of a human arm. Development and tests were done in simulation using a planar arm model and Hill-based muscle dynamics. We begin by training the controller using a Proportional Derivative (PD) controller as a supervisor. We then make clinically relevant changes to the dynamics of the arm and test the actor-critic's ability to adapt without supervision in a reasonable number of episodes. Finally, we devise methods for achieving both rapid learning and long-term stability.
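As a rough illustration of the architecture described above (not the authors' controller), the sketch below shows a generic one-step actor-critic update with a Gaussian policy over a continuous stimulation command; the linear features, `features` function, and hyperparameters are placeholders.

```python
import numpy as np

def actor_critic_step(features, s, a, r, s_next, theta, w, sigma=0.1,
                      alpha_actor=1e-3, alpha_critic=1e-2, gamma=0.99):
    """One generic actor-critic update; `features` is a hypothetical featurizer."""
    phi, phi_next = features(s), features(s_next)
    td_error = r + gamma * (w @ phi_next) - (w @ phi)   # critic's TD error
    w += alpha_critic * td_error * phi                   # critic update
    mu = theta @ phi                                     # Gaussian policy mean
    grad_log_pi = (a - mu) / (sigma ** 2) * phi          # score of the Gaussian policy
    theta += alpha_actor * td_error * grad_log_pi        # actor update
    return theta, w
```

In a supervised phase like the one described in the abstract, the PD controller's command could simply be substituted for the sampled action `a` while the critic learns.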


Fig. 1. Two-joint, six-muscle biomechanical arm model used. Antagonistic muscle pairs are as follows, listed as (flexor, extensor): monoarticular shoulder muscles (a: anterior deltoid, b: posterior deltoid); monoarticular elbow muscles (c: brachialis, d: triceps brachii (short head)); biarticular muscles (e: biceps brachii, f: triceps brachii (long head)).  
Fig. 9. Plot of the sum of the squared error in approximating the critic's utility function for an actor-critic with the PD controller as the actor in the simulated arm environment.
Creating a Reinforcement Learning Controller for Functional Electrical Stimulation of a Human Arm

January 2008 · 250 Reads · 12 Citations

Clinical tests have shown that the dynamics of a human arm, controlled using Functional Electrical Stimulation (FES), can vary significantly between and during trials. In this paper, we study the application of Reinforcement Learning to create a controller that can adapt to these changing dynamics of a human arm. Development and tests were done in simulation using a two-dimensional arm model and Hill-based muscle dynamics. An actor-critic architecture is used with artificial neural networks for both the actor and the critic. We begin by training it using a Proportional Derivative (PD) controller as a supervisor. We then make clinically relevant changes to the dynamics of the arm and test the actor-critic's ability to adapt without supervision in a reasonable number of episodes.



Policy Gradient Coagent Networks

11 Reads · 13 Citations

We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module's input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult, and are also desirable to increase the biological plausibility of reinforcement learning methods.
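A hedged sketch of such a local update, assuming each module is a simple Bernoulli unit with a sigmoid firing probability (a simplification; the paper's modules need not take this form):

```python
import numpy as np

def coagent_local_update(theta, module_input, module_output, td_error, alpha=1e-2):
    """Update one module using only its input, its output, and the broadcast TD error."""
    p = 1.0 / (1.0 + np.exp(-theta @ module_input))   # P(output = 1 | input)
    grad_log_p = (module_output - p) * module_input   # d/dtheta of log P(output | input)
    return theta + alpha * td_error * grad_log_p      # local policy-gradient step
```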



Citations (34)


... Doubly robust (DR) estimators (e.g., Jiang & Li (2016); Farajtabar et al. (2018)) combine model-based DM and model-free IS for OPE but may fail to reduce variance when both DM and IS have high variance. Various methods have been developed to refine estimation accuracy in IS, such as truncating importance weights and estimating weights from steady-state visitation distributions (Liu et al., 2018a; Xie et al., 2019; Doroudi et al., 2017; Bossens & Thomas, 2024). ...

Reference:

Concept-driven Off Policy Evaluation
Low Variance Off-policy Evaluation with State-based Importance Sampling
  • Citing Conference Paper
  • June 2024
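For context on the weight-truncation idea mentioned in the excerpt above, here is an illustrative per-trajectory importance sampling estimator with clipped weights; the data layout and the clipping threshold are assumptions made for the example.

```python
import numpy as np

def clipped_is_estimate(trajectories, clip=10.0, gamma=1.0):
    """Each trajectory is a list of (pi_e_prob, pi_b_prob, reward) tuples."""
    estimates = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for pi_e, pi_b, reward in traj:
            weight *= pi_e / pi_b                  # cumulative importance weight
            ret += discount * reward
            discount *= gamma
        estimates.append(min(weight, clip) * ret)  # truncate the weight to limit variance
    return float(np.mean(estimates))
```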

... It's common in the fairness-aware machine learning literature for fairness measures to be defined such that the optimization goal is a ratio with a value of 1.0 or a difference with a value of 0.0, and previous work on bias in content moderation has used the difference [10]. Following recent work that demonstrates that the ratio is more appropriate for most fairness contexts [28], we define speech suppression accordingly: ...

Analyzing the Relationship Between Difference and Ratio-Based Fairness Metrics
  • Citing Conference Paper
  • June 2024
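A toy numerical illustration of why the excerpt above distinguishes difference- and ratio-based measures (the group rates are invented):

```python
def fairness_difference(rate_a, rate_b):
    return abs(rate_a - rate_b)                        # ideal value: 0.0

def fairness_ratio(rate_a, rate_b):
    return min(rate_a, rate_b) / max(rate_a, rate_b)   # ideal value: 1.0

print(fairness_difference(0.02, 0.04))  # ≈ 0.02: looks small in absolute terms
print(fairness_ratio(0.02, 0.04))       # 0.5: one group is affected twice as often
```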

... These findings align with and extend the previous research of Daley and Amato [44], which identified λ values between 0.6 and 0.7 as optimal for complex environments. A recent work by Gupta et al. [45] on bidirectional value functions further supports the importance of effective temporal credit assignment. The approach demonstrated that these novel methods outperform traditional TD(λ) techniques in complex, noisy environments, highlighting the balance required between credit-assignment depth and noise amplification. ...

From Past to Future: Rethinking Eligibility Traces
  • Citing Article
  • March 2024

Proceedings of the AAAI Conference on Artificial Intelligence
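For reference, a minimal tabular TD(λ) sketch with accumulating eligibility traces, the credit-assignment mechanism the excerpt discusses; `env_step` and `reset` are hypothetical environment callables, and the hyperparameters are placeholders rather than values from the cited works.

```python
import numpy as np

def td_lambda_episode(env_step, reset, n_states, lam=0.65, gamma=0.99, alpha=0.1):
    """Run one episode of tabular TD(lambda) with accumulating traces."""
    V = np.zeros(n_states)
    e = np.zeros(n_states)                 # eligibility trace per state
    s, done = reset(), False
    while not done:
        s_next, r, done = env_step(s)
        td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e[s] += 1.0                        # accumulate trace for the visited state
        V += alpha * td_error * e          # credit flows back along the trace
        e *= gamma * lam                   # decay all traces
        s = s_next
    return V
```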

... Formal languages can be extended to be more expressive, to capture privacy properties [83], data-based properties [59], [60], fairness properties [12], [27], among others. Some of these kinds of properties can be automatically verified probabilistically [4], [29], [33], [53], [81]. ...

Seldonian Toolkit: Building Software with Safe and Fair Machine Learning
  • Citing Conference Paper
  • May 2023

... It is widely used to solve differential equations analytically [32,33] or numerically [34], and to approximate functions [35,36] or time-dependent parameters [37]. It also plays an important role in the fields of control [38,39], signal processing [40,41], image analysis [42], and reinforcement learning [43,44]. In this paper, we adopt the Fourier series to design a method for discovering dynamical systems without prior information or customized design. ...

Value Function Approximation in Reinforcement Learning Using the Fourier Basis
  • Citing Article
  • August 2011

Proceedings of the AAAI Conference on Artificial Intelligence

... Several recent works have aimed to estimate risk functionals from off-policy datasets. Of these, Chandak et al. [2021b] estimates the variance, while more recent works [Huang et al., 2021b; Chandak et al., 2021a] tackle the estimation of more general risks and are the closest works of comparison. Both Huang et al. [2021b] and Chandak et al. [2021a] take a two-step approach of first estimating the off-policy CDF of returns, and then estimating their risks via a plug-in approach. ...

High-Confidence Off-Policy (or Counterfactual) Variance Estimation
  • Citing Article
  • May 2021

Proceedings of the AAAI Conference on Artificial Intelligence
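A hedged sketch of the two-step plug-in idea mentioned in the excerpt: build a weighted empirical CDF of returns from importance weights, then read a risk functional (here CVaR at level alpha) off that CDF. Variable names are illustrative; this is not the cited papers' exact estimator.

```python
import numpy as np

def offpolicy_cvar(returns, is_weights, alpha=0.1):
    """Plug-in CVaR of the return under the evaluation policy, from IS-weighted data."""
    returns = np.asarray(returns, dtype=float)
    is_weights = np.asarray(is_weights, dtype=float)
    order = np.argsort(returns)                         # sort returns ascending
    returns, probs = returns[order], is_weights[order] / is_weights.sum()
    cdf = np.cumsum(probs)                              # weighted empirical CDF
    tail = cdf <= alpha                                 # worst-case alpha tail
    if not tail.any():
        return float(returns[0])
    tail_probs = probs[tail]
    return float((tail_probs / tail_probs.sum()) @ returns[tail])  # plug-in CVaR
```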

... Reinforcement learning (RL) has been shown to be a promising approach to complex real-world decision-making problems [1], [2], [3], [4]. However, unconstrained online trial-and-error during the training of RL agents prevents further application of RL in safety-critical scenarios, since it might result in large economic losses [11], [20], [21], [22]. Many studies propose to overcome this problem with offline (batch) RL algorithms [23]. ...

Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing
  • Citing Article
  • February 2017

Proceedings of the AAAI Conference on Artificial Intelligence

... This poses a challenge as the system struggles to fully grasp user intent and preferences. In such situations, the system may need to ask clarifying questions to obtain additional context and disambiguate user queries [89,90]. However, if the system fails to address this bias effectively, it may lead to the recommendation of low-quality items that do not align with the user's preferences. ...

Large-scale Interactive Conversational Recommendation System using Actor-Critic Framework
  • Citing Conference Paper
  • September 2021

... Alternatively, offline reinforcement learning using expert or trace replay [3] is another possible approach to improving neural schedulers. Moreover, leveraging the structure of the underlying action space to parameterize the policy is a candidate approach to tackling a varying action set [10]. We also plan to leverage GNNs to bestow the structural knowledge from job DAGs [50], and to demonstrate the performance gain of the improved neural schedulers by using the Compiler Integrated Extensible DSSoC Runtime (CEDR) tool, a successor to the DS3 emulator, as it enables the gathering of low-level and fine-grain timing and performance-counter characteristics [29]. ...

Lifelong Learning with a Changing Action Set
  • Citing Article
  • April 2020

Proceedings of the AAAI Conference on Artificial Intelligence