November 2020 · 19 Reads · 2 Citations
September 2020 · 219 Reads
In reinforcement learning (RL), we expect the agent to explore as many states as possible in the early stage of training and to exploit the gathered information in the later stage to discover the trajectory with the highest return. Following this principle, in this paper we soften proximal policy optimization (PPO) by introducing an entropy term and dynamically setting the temperature coefficient to balance exploration and exploitation. While maximizing the expected reward, the agent also seeks other trajectories to avoid converging to a locally optimal policy. Nevertheless, the increased randomness induced by the entropy term slows training in the early stage. Integrating the temporal-difference (TD) method and the generalized advantage estimator (GAE), we propose the dual-track advantage estimator (DTAE) to accelerate the convergence of the value function and further enhance the performance of the algorithm. Compared with other on-policy RL algorithms in the MuJoCo environment, the proposed method not only significantly speeds up training but also achieves state-of-the-art cumulative returns.
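The abstract does not spell out the softened objective, but a common form adds a temperature-weighted entropy bonus to PPO's clipped surrogate. The sketch below assumes that form; the function name, the fixed temperature default, and the PyTorch interface are illustrative, and the paper's dynamic temperature schedule and the DTAE combination are not reproduced here.

import torch

def softened_ppo_loss(new_log_probs, old_log_probs, advantages, entropy,
                      clip_eps=0.2, temperature=0.01):
    # Clipped PPO surrogate plus a temperature-weighted entropy bonus.
    # `temperature` stands in for the dynamically set coefficient described
    # in the abstract; the paper's actual schedule is not reproduced.
    ratio = torch.exp(new_log_probs - old_log_probs)                      # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()                      # standard PPO term
    return -(surrogate + temperature * entropy.mean())                    # minimize the negative objective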
July 2019 · 72 Reads · 4 Citations
In this work, we propose a moderate policy update method for reinforcement learning which encourages the agent to explore more boldly in early episodes while updating the policy more cautiously. Based on the maximum entropy framework, we propose a softer objective with more conservative constraints and build separated trust regions for optimization. To reduce the variance of the expected entropy return, the analytically computed entropy of the Gaussian state policy is preferred over estimates collected from sampled log probabilities. This new method, which we call separated trust regions for policy mean and variance (STRMV), can be viewed as an extension of proximal policy optimization (PPO), but it is gentler in its policy updates and more active in exploration. We test our approach on a wide variety of continuous control benchmark tasks in the MuJoCo environment. The experiments demonstrate that STRMV outperforms previous state-of-the-art on-policy methods, not only achieving higher rewards but also improving sample efficiency.
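The analytically computed Gaussian entropy mentioned in the abstract has a standard closed form; a minimal sketch follows, assuming a diagonal Gaussian policy with a log-standard-deviation parameterization (the function name and PyTorch interface are illustrative, not taken from the paper).

import math
import torch

def diag_gaussian_entropy(log_std):
    # Closed-form entropy of a diagonal Gaussian: sum_i [log sigma_i + 0.5 * (1 + log 2*pi)].
    # Using the analytic value avoids the extra variance of estimating entropy
    # from sampled log probabilities, as the abstract suggests.
    return (log_std + 0.5 * (1.0 + math.log(2.0 * math.pi))).sum(dim=-1)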
July 2019 · 12 Reads · 1 Citation
June 2019 · 72 Reads · 4 Citations
Neurocomputing
In this paper, we propose an extensible framework for model-free reinforcement learning (RL) for real-time bidding (RTB) in display advertising. The framework can be applied to simple environments and extended to the comprehensive setting in which the DSP bids for multiple advertisers at the same time. To process new information collected via real-time interaction with the environment, an extensible model based on the distribution of the recharging probability is first introduced. Substantial effort is devoted to alleviating the sparsity of the click signal through the reward function. In contrast to prior works, which assumed that the distribution of the feature vectors and the dealing price were already known, the proposed scheme is highly feasible and can handle dynamic environments. Furthermore, a fund-recharging mechanism is introduced to transform the RTB model into an endless task, which allows the policy to be optimized in a farsighted rather than a myopic manner. Illustrative experiments on both small- and large-scale real datasets demonstrate the state-of-the-art performance of the proposed framework.
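As a rough illustration of the fund-recharging idea, the sketch below runs bidding as a continuing task in which the budget is topped up instead of ending the episode. The env/agent interface, the recharge rule, and all names are hypothetical placeholders, not the paper's implementation.

def run_continuing_rtb(env, agent, recharge_amount, num_steps):
    # Hypothetical interface: env.reset() -> (state, budget);
    # env.step(bid) -> (next_state, cost, reward, info).
    state, budget = env.reset()
    for _ in range(num_steps):
        bid = agent.act(state, budget)
        next_state, cost, reward, _ = env.step(bid)    # reward: e.g. a shaped click signal
        budget -= cost
        if budget <= 0:
            budget += recharge_amount                  # recharge instead of terminating the task
        agent.observe(state, bid, reward, next_state)  # model-free update from the transition
        state = next_state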
November 2018 · 34 Reads · 1 Citation
November 2018 · 40 Reads · 1 Citation
July 2018 · 62 Reads · 5 Citations
July 2018 · 43 Reads · 8 Citations
... It learns the mapping between agent states and actions to provide end-to-end control. A variety of DRL models such as Deep Q-Network (DQN) [40], deep deterministic policy gradient (DDPG) [41], asynchronous advantage actor-critic (A3C) [42] and proximal policy optimization (PPO) [43] have been applied and provided promising results. Long et al. [44] used laser data as input and proposed a PPO-based framework to avoid obstacles between multiple robots. ...
July 2019
... Truly Proximal Policy Optimization (Wang, He, and Tan 2020) improves the sample efficiency of PPO by adopting a new clipping function to restrict the policy ratio, and substituting the triggering condition for clipping by a trust region-based one. Separated Trust Regions Policy Optimization (Zou et al. 2019) improves the sample efficiency of PPO by proposing a softer objective with more conservative constraints and building the separated trust region for optimization. However, these methods ignore the perspective of directly utilizing off-policy data to improve the sample efficiency of PPO (Wang, He, and Tan 2020), (Zou et al. 2019). ...
Reference:
Off-Policy Proximal Policy Optimization
July 2019
... The premise is to learn the optimal strategy by maximizing agents' cumulative rewards from the environment. For instance, Cheng et al. [16] proposed a model-free reinforcement learning model with a fund-recharging mechanism for RTB; Liu et al. [9] employed a stochastic reinforcement learning (RL) algorithm and designed a bidding function to calculate the bidding price, which can learn the optimal bidding adjustment factor as the RTB environment changes. Despite their contributions, these studies neither addressed the user identification problem nor considered users' heterogeneity. ...
June 2019
Neurocomputing
... The actor networks are configured to approximate the internal policy parameters included in (12) and (13). Figure 6 shows the structure for the feedforward control action parameters θ_ff,i,j and the PI control action parameters θ_P,i,j and θ_I,i,j. ...
November 2018
... Since these methods have only been developed recently, they have not been investigated extensively for combustion control. Cheng et al. [651] used a synchronous neural episodic control approach that employed CNNs and LSTM networks to consider 40 operating points in order to control air volume, fuel content, oxygen, and feedwater flow in a coal-fired boiler. Henry de Frahan et al. [652] presented the first work to apply deep RL for optimizing efficiency and emissions in an internal combustion engine. ...
July 2018