Luobao Zou’s research while affiliated with Shanghai Jiao Tong University and other places


Publications (9)


Soft Policy Optimization using Dual-Track Advantage Estimator
  • Conference Paper

  • November 2020 · 19 Reads · 2 Citations
  • Luobao Zou · [...]


Fig. 1: Training curves on the MuJoCo continuous control tasks. The curves show the change in cumulative return over 1 million time steps. The solid line denotes the average of 10 trials generated with random seeds, and the shaded region is bounded by the maximum and minimum of the 10 trials.
Fig. 2: Comparison of different TD update coefficients α on the performance of SPOD. α = 0.1 indicates that the value function is updated slowly in the TD method and the agent is conservative in adopting the newly explored policy; α = 0.4 indicates that the agent adopts a compromise between the new and old policies; and α = 0.9 means that the agent tends to adopt the newly explored policy completely.
Fig. 3: Comparison of different methods for combining GAE and TDAE in SPOD (Eq. 34). mean, max, and min denote the mean, maximum, and minimum of GAE and TDAE, respectively; beta = 0.99 denotes the weight β = 0.99 in Eq. 34.
Fig. 4: Comparison of different scales of the temperature parameter η on the performance of SPOD in three high-dimensional control tasks. A larger η indicates that the agent is more likely to explore new, potentially higher-return policies in the early training stage, and vice versa.
Fig. 5: Ablation analysis of SPOD.
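The combination step referenced in Fig. 3 can be sketched directly: the two advantage tracks, the GAE track and the TD-based track (TDAE), are merged element-wise by a mean, maximum, minimum, or β-weighted sum. The snippet below is a minimal illustration of that step only; the function name and the exact form of Eq. 34 are assumptions, since the equation itself is not reproduced on this page.

```python
import numpy as np

def combine_advantages(adv_gae, adv_tdae, method="mean", beta=0.99):
    """Merge the GAE track and the TD-based track into one advantage estimate.

    Sketch of the combination step compared in Fig. 3; the exact form of
    Eq. 34 in the paper may differ (assumption).
    """
    adv_gae = np.asarray(adv_gae, dtype=np.float64)
    adv_tdae = np.asarray(adv_tdae, dtype=np.float64)
    if method == "mean":
        return 0.5 * (adv_gae + adv_tdae)
    if method == "max":
        return np.maximum(adv_gae, adv_tdae)
    if method == "min":
        return np.minimum(adv_gae, adv_tdae)
    if method == "beta":
        # beta-weighted combination, e.g. beta = 0.99 as in the Fig. 3 caption.
        return beta * adv_gae + (1.0 - beta) * adv_tdae
    raise ValueError(f"unknown combination method: {method}")
```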
Soft policy optimization using dual-track advantage estimator
  • Preprint
  • File available

  • September 2020 · 219 Reads

In reinforcement learning (RL), the agent is expected to explore as many states as possible in the initial stage of training and to exploit the gathered information in the subsequent stage to find the trajectory with the highest return. Based on this principle, in this paper, we soften proximal policy optimization by introducing an entropy term and dynamically setting the temperature coefficient to balance exploration and exploitation. While maximizing the expected reward, the agent also considers alternative trajectories to avoid converging to a locally optimal policy. However, the additional randomness induced by the entropy term slows training in the early stage. By integrating the temporal-difference (TD) method and the generalized advantage estimator (GAE), we propose the dual-track advantage estimator (DTAE) to accelerate the convergence of the value function and further enhance the performance of the algorithm. Compared with other on-policy RL algorithms in the MuJoCo environment, the proposed method not only significantly speeds up training but also achieves state-of-the-art cumulative returns.
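As a rough illustration of the "softened" objective described in the abstract, the sketch below adds an entropy bonus to the clipped PPO surrogate with a temperature coefficient that decays over training, so that exploration dominates early and exploitation later. The clipping constant, the linear decay schedule, and the function signature are assumptions; the abstract does not specify them.

```python
import torch

def soft_ppo_loss(log_prob_new, log_prob_old, advantages, entropy,
                  progress, clip_eps=0.2, eta_init=0.01):
    """Clipped PPO surrogate plus an entropy bonus with a decaying temperature.

    progress is the fraction of training completed, in [0, 1]; the linear
    decay of the temperature eta is an assumption made for this sketch.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    eta = eta_init * (1.0 - progress)   # larger temperature early in training
    return policy_loss - eta * entropy.mean()
```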


Separated Trust Regions Policy Optimization Method

  • Conference Paper
  • July 2019 · 72 Reads · 4 Citations

In this work, we propose a moderate policy update method for reinforcement learning, which encourages the agent to explore more boldly in early episodes but to update the policy more cautiously. Based on the maximum entropy framework, we propose a softer objective with more conservative constraints and build separated trust regions for optimization. To reduce the variance of the expected entropy return, the policy entropy of the Gaussian distribution is computed analytically for each state rather than estimated from sampled log probabilities. The new method, which we call separated trust regions for policy mean and variance (STRMV), can be viewed as an extension of proximal policy optimization (PPO), but it is gentler in its policy updates and more active in exploration. We test our approach on a wide variety of continuous control benchmark tasks in the MuJoCo environment. The experiments demonstrate that STRMV outperforms previous state-of-the-art on-policy methods, not only achieving higher rewards but also improving sample efficiency.
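One concrete detail in the abstract is the variance-reduction choice: for a Gaussian policy the per-state entropy has a closed form, so it can be computed from the predicted standard deviations instead of being estimated from sampled log probabilities. A minimal sketch, assuming a diagonal Gaussian parameterized by log standard deviations (the parameterization is an assumption):

```python
import math
import torch

def gaussian_policy_entropy(log_std):
    """Closed-form entropy of a diagonal Gaussian policy, one value per state.

    log_std: tensor of shape (batch, action_dim) of log standard deviations.
    H = sum_i (log sigma_i + 0.5 * log(2 * pi * e))
    """
    const = 0.5 * math.log(2.0 * math.pi * math.e)
    return (log_std + const).sum(dim=-1)
```

Given the policy parameters, this quantity has zero variance, whereas averaging -log π(a|s) over sampled actions does not, which is the point made in the abstract.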



An Extensible Approach for Real-time Bidding with Model-free Reinforcement Learning

  • Article
  • June 2019 · 72 Reads · 4 Citations
  • Neurocomputing

In this paper, we propose an extensible framework for model-free reinforcement learning (RL) for real-time bidding (RTB) in display advertising. The framework applies to simple environments and extends to the comprehensive setting in which the DSP bids for multiple advertisers at the same time. To process new information collected through real-time interaction with the environment, an extensible model is first introduced, based on the distribution of the recharging probability. Substantial effort is devoted to alleviating the sparsity of the click signal through the design of the reward function. The proposed scheme is highly feasible and can handle dynamic environments, in contrast to prior works, which assumed that the distribution of the feature vectors and the deal prices were already known. Furthermore, a fund-recharging mechanism is introduced to transform the RTB model into an endless task, which allows the policy to be optimized in a farsighted rather than a myopic manner. Illustrative experiments on both small- and large-scale real datasets demonstrate the state-of-the-art performance of the proposed framework on the problem of interest.
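To make the fund-recharging idea concrete, the sketch below shows one way a bidding loop can be turned into a continuing task: when the remaining budget is exhausted it is topped up instead of ending the episode, so the return is optimized over an effectively infinite horizon. The environment and policy interfaces (reset, step, observe) and the recharge rule are hypothetical and only illustrate the idea; the paper's actual formulation may differ.

```python
def run_bidding(env, policy, total_steps, init_budget, recharge_amount):
    """Continuing (non-episodic) RTB loop with a simple fund-recharging rule.

    env and policy are hypothetical objects used only for illustration:
    env.reset() -> bid_request, env.step(bid) -> (bid_request, reward, cost),
    policy(bid_request, budget) -> bid, policy.observe(reward) -> None.
    """
    budget = init_budget
    bid_request = env.reset()
    for _ in range(total_steps):
        bid = policy(bid_request, budget)          # bid price for this request
        bid_request, reward, cost = env.step(bid)  # reward reflects clicks won
        budget -= cost
        if budget <= 0.0:
            # Recharge instead of terminating, so the task never ends and the
            # policy can be optimized in a farsighted rather than myopic manner.
            budget += recharge_amount
        policy.observe(reward)
    return policy
```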





Citations (5)


... It learns the mapping between agent states and actions to provide end-to-end control. A variety of DRL models such as Deep Q-Network (DQN) [40], deep deterministic policy gradient (DDPG) [41], asynchronous advantage actor-critic (A3C) [42] and proximal policy optimization (PPO) [43] have been applied and provided promising results. Long et al. [44] used laser data as input and proposed a PPO-based framework to avoid obstacles between multiple robots. ...

Reference:

Learning Reward Function with Matching Network for Mapless Navigation
An accelerated asynchronous advantage actor-critic algorithm applied in papermaking
  • Citing Conference Paper
  • July 2019

... Truly Proximal Policy Optimization (Wang, He, and Tan 2020) improves the sample efficiency of PPO by adopting a new clipping function to restrict the policy ratio, and substituting the triggering condition for clipping by a trust region-based one. Separated Trust Regions Policy Optimization (Zou et al. 2019) improves the sample efficiency of PPO by proposing a softer objective with more conservative constraints and building the separated trust-region for optimization. However, these methods ignore the perspective of directly utilizing off-policy data to improve the sample efficiency of PPO (Wang, He, and Tan 2020; Zou et al. 2019). ...

Separated Trust Regions Policy Optimization Method
  • Citing Conference Paper
  • July 2019

... The premise is to learn the optimal strategy by maximizing agents' cumulative rewards from the environment. For instance, Cheng et al. [16] proposed a model-free reinforcement learning model with a fund-recharging mechanism for RTB; Liu et al. [9] employ a stochastic reinforcement learning (RL) algorithm and design a bidding function to calculate the bidding price, which can learn the optimal bidding adjustment factor changing with the RTB environment. Despite their contributions, these studies neither addressed the user identification problem nor considered users' heterogeneity. ...

An Extensible Approach for Real-time Bidding with Model-free Reinforcement Learning
  • Citing Article
  • June 2019

Neurocomputing

... The actor networks are configured to approximate the internal policy parameters included in (12) and (13). Figure 6 shows the structure for the feedforward control action parameters θ_ff,i,j and the PI control action parameters θ_P,i,j and θ_I,i,j. ...

A new thermal power generation control in reinforcement learning
  • Citing Conference Paper
  • November 2018

... Since these methods have only been developed recently, they have not been investigated extensively for combustion control. Cheng et al. [651] used a synchronous neural episodic control approach that employed CNNs and LSTM networks to consider 40 operating points in order to control air volume, fuel content, oxygen, and feedwater flow in a coal-fired boiler. Henry de Frahan et al. [652] presented the first work to apply deep RL for optimizing efficiency and emissions in an internal combustion engine. ...

Deep Reinforcement Learning Combustion Optimization System Using Synchronous Neural Episodic Control
  • Citing Conference Paper
  • July 2018