Yiming Zhang’s scientific contributions


Publications (5)


Aggressive Q-Learning with Ensembles: Achieving Both High Sample Efficiency and High Asymptotic Performance
  • Preprint

November 2021 · 14 Reads · Che Wang · [...] · Keith W. Ross

Recently, Truncated Quantile Critics (TQC), which uses a distributional representation of critics, was shown to provide state-of-the-art asymptotic training performance on all environments from the MuJoCo continuous control benchmark suite. Also recently, Randomized Ensemble Double Q-Learning (REDQ), which uses a high update-to-data ratio and target randomization, was shown to achieve sample efficiency competitive with state-of-the-art model-based methods. In this paper, we propose a novel model-free algorithm, Aggressive Q-Learning with Ensembles (AQE), which improves on the sample efficiency of REDQ and the asymptotic performance of TQC, thereby providing state-of-the-art performance during all stages of training. Moreover, AQE is very simple, requiring neither a distributional representation of critics nor target randomization.
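The abstract does not spell out AQE's update rule, but its description (ensembling without a distributional critic or target randomization) suggests a Bellman target built from an ensemble of critics. Below is a minimal sketch of one such target, assuming the target averages the lowest few of N critic estimates; the function name, the `keep_lowest` parameter, and the exact formula are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def ensemble_q_target(q_next, rewards, dones, keep_lowest=2, gamma=0.99):
    """Illustrative ensemble Q-learning target.

    q_next:  (num_critics, batch) critic estimates of Q(s', a').
    rewards, dones: (batch,) arrays.
    Averaging the `keep_lowest` smallest estimates per sample is less
    pessimistic than a hard minimum but still curbs overestimation;
    this is a hypothetical stand-in for AQE's exact target.
    """
    lowest = np.sort(q_next, axis=0)[:keep_lowest]  # smallest estimates per sample
    next_value = lowest.mean(axis=0)                # aggressive ensemble value
    return rewards + gamma * (1.0 - dones) * next_value
```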


Figures (from "On-Policy Deep Reinforcement Learning for the Average-Reward Criterion", below):
Figure 1. Performance of ATRPO vs. TRPO with different discount factors. X-axis: number of agent-environment interactions; y-axis: total return averaged over 10 seeds. Solid lines show performance on evaluation trajectories of maximum length 1,000 (top row) and 10,000 (bottom row); shaded regions show one standard deviation.
Figure 2. Speed-time plot of a single trajectory (maximum length 10,000) for ATRPO and discounted TRPO in the Humanoid-v3 environment. Solid lines show the agent's speed at the corresponding timesteps.
Figure 3. Performance of ACPO, with unconstrained ATRPO plotted for comparison. Solid lines show average total return (top row) and average cost (bottom row) on evaluation trajectories; axes and shading as in Figure 1.
Figure 4. Performance of ATRPO vs. TRPO with different discount factors, with TRPO trained without the reset scheme; axes, evaluation lengths, and shading as in Figure 1.
Figure 5. Performance of ATRPO and TRPO trained with and without reset costs; the TRPO curves use the best discount factor for each environment; axes, evaluation lengths, and shading as in Figure 1.
On-Policy Deep Reinforcement Learning for the Average-Reward Criterion
  • Preprint
  • File available

June 2021 · 203 Reads · 1 Citation

We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and on Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improving policies for the average-reward criterion. This iterative procedure can then be combined with classic Deep Reinforcement Learning (DRL) methods, resulting in practical DRL algorithms that target the long-run average reward. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
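Under the average-reward criterion, the usual discounted TD error r + γV(s') − V(s) is replaced by a differential form that subtracts the long-run average reward instead of discounting. A minimal sketch of that substitution follows, assuming a simple batch-mean estimate of the average reward; the estimator choice is illustrative, not ATRPO's exact construction.

```python
import numpy as np

def differential_td_errors(rewards, values, next_values):
    """Average-reward (differential) TD errors:
        delta_t = r_t - rho + V(s_{t+1}) - V(s_t)
    where rho is the long-run average reward. No discount factor
    appears: the rho baseline replaces gamma under the
    average-reward criterion.
    """
    rho = rewards.mean()  # illustrative batch-mean estimate of the average reward
    return rewards - rho + next_values - values
```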


First Order Optimization in Policy Space for Constrained Deep Reinforcement Learning

February 2020 · 43 Reads

In reinforcement learning, an agent attempts to learn high-performing behaviors through interacting with the environment; such behaviors are often quantified in the form of a reward function. However, some aspects of behavior, such as those deemed unsafe and to be avoided, are best captured through constraints. We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS) which maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints. Using data generated from the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. FOCOPS then projects the update policy back into the parametric policy space. Our approach provides a guarantee for constraint satisfaction throughout training and is first-order in nature, and therefore extremely simple to implement. We provide empirical evidence that our algorithm achieves better performance on a set of constrained robotic locomotion tasks compared with current state-of-the-art approaches.
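The abstract describes a two-step scheme: solve for an optimal update policy in nonparameterized policy space, then project it back with a first-order fit. For KL-regularized problems of this kind, the nonparametric solution is typically the current policy reweighted by exponentiated advantages, so the projection reduces to a weighted log-likelihood objective. The sketch below assumes that form; `nu`, `lam`, and the clipping are illustrative hyperparameters, not the paper's values.

```python
import numpy as np

def focops_style_weights(adv, cost_adv, nu=0.5, lam=1.0, clip=20.0):
    """Projection weights for a KL-regularized nonparametric solution
        pi*(a|s)  proportional to  pi_k(a|s) * exp((A - nu * A_C) / lam),
    under which fitting pi_theta reduces to maximizing
        mean( w_i * log pi_theta(a_i | s_i) )
    over on-policy samples. All names and constants here are
    illustrative assumptions, not FOCOPS's published formulation.
    """
    logits = np.clip((adv - nu * cost_adv) / lam, -clip, clip)  # numerical safety
    w = np.exp(logits)
    return w / w.mean()  # normalize so the gradient scale stays stable
```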


Efficient Entropy for Policy Gradient with Multidimensional Action Space

June 2018 · 144 Reads

In recent years, deep reinforcement learning has been shown to be adept at solving sequential decision processes with high-dimensional state spaces, such as in the Atari games. Many reinforcement learning problems, however, involve high-dimensional discrete action spaces as well as high-dimensional state spaces. This paper considers the entropy bonus, which is used to encourage exploration in policy gradient methods. In the case of high-dimensional action spaces, calculating the entropy and its gradient requires enumerating all the actions in the action space and running a forward and backward pass for each action, which may be computationally infeasible. We develop several novel unbiased estimators for the entropy bonus and its gradient. We apply these estimators to several models for the parameterized policies, including Independent Sampling, CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM. Finally, we test our algorithms on two environments: a multi-hunter multi-rabbit grid game and a multi-agent multi-arm bandit problem. The results show that our entropy estimators substantially improve performance at marginal additional computational cost.
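The simplest unbiased estimator of the entropy bonus replaces enumeration over the action space with a Monte Carlo average over sampled actions, since H(π(·|s)) = −E_{a∼π}[log π(a|s)]. A self-contained sketch of that idea, assuming a hypothetical `policy_logprob` interface that samples an action and returns its log-probability (the paper's own estimators and policy models differ):

```python
import numpy as np

def sampled_entropy_estimate(policy_logprob, state, num_samples=32, rng=None):
    """Unbiased Monte Carlo estimate of H(pi(.|s)) = -E_{a~pi}[log pi(a|s)].

    `policy_logprob(state, rng)` is a hypothetical interface that samples
    an action a ~ pi(.|state) and returns log pi(a|state); it stands in
    for whatever parameterized policy model is in use.
    """
    rng = rng or np.random.default_rng()
    logps = np.array([policy_logprob(state, rng) for _ in range(num_samples)])
    return -logps.mean()  # unbiased since the actions are drawn from pi itself
```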


Supervised Policy Update

May 2018 · 15 Reads

We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.
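The projection step the abstract describes is a supervised regression from non-parameterized target distributions (the "labels") to the parameterized policy. A minimal sketch of one such regression loss for discrete actions follows, assuming the fit minimizes a forward KL; the KL direction and the discrete-action setting are illustrative assumptions.

```python
import numpy as np

def spu_regression_loss(target_probs, policy_probs, eps=1e-8):
    """Supervised regression objective for the projection step:
        mean over states of KL( pi*(.|s) || pi_theta(.|s) ),
    with both arguments given as (batch, num_actions) probability
    arrays. As the abstract notes, different label choices correspond
    to different underlying optimization problems; this forward-KL
    form is just one instance.
    """
    ratio = np.log((target_probs + eps) / (policy_probs + eps))
    return np.sum(target_probs * ratio, axis=1).mean()
```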