2014 In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
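For reference, the "expected gradient of the action-value function" described above is usually written as follows. The abstract gives no symbols, so the notation is an assumption taken from common usage: $\mu_\theta$ is the deterministic policy, $\rho^\mu$ its discounted state distribution, and $Q^\mu$ its action-value function.

```latex
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^\mu}\!\left[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)}
    \right]
```

The expectation runs over states only, not over actions, which is why fewer samples are typically needed than for a stochastic policy gradient in large action spaces.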

... The policy function $\pi(s)$ returns the action $a_t$ for a given state $s_t$. Many algorithms, such as DDPG [27], use a neural network to predict $a_t$ given $s_t$, and add noise to this prediction to enhance exploration. When the agent executes action $a_t$ according to $\pi$, it receives the appropriate reward $r_t$, eventually obtaining state-action sequences $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, r_T)$. ...
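The idea of adding noise to a deterministic prediction can be sketched as below. This is a minimal illustration, not the cited paper's code: the linear `policy` stands in for the neural network, and Gaussian noise with clipping to the action bounds is one common choice among several.

```python
import random


def noisy_action(policy, state, sigma, low, high, rng):
    """Return policy(state) plus Gaussian noise, clipped to [low, high]."""
    action = policy(state)             # deterministic prediction a_t
    noise = rng.gauss(0.0, sigma)      # exploration noise (assumed Gaussian)
    return max(low, min(high, action + noise))


def policy(state):
    # Hypothetical stand-in for a trained actor network.
    return 0.5 * state


rng = random.Random(0)
exploratory = noisy_action(policy, 1.0, sigma=0.1, low=-1.0, high=1.0, rng=rng)
greedy = noisy_action(policy, 1.0, sigma=0.0, low=-1.0, high=1.0, rng=rng)
```

Setting `sigma=0.0` recovers the deterministic action, which is how such a policy would typically be evaluated after training.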

... Deep reinforcement learning In recent years, numerous RL methods with impressive performance have evolved, such as TRPO [28], PPO [29], ACER [30], and SAC [31]. We use DDPG [27], an off-policy actor-critic algorithm that approximates the Q-function, which we will use to assess the actions of the actor and the baseline. In deep RL, the policy π is modeled by a neural network, parameterized by its parameters θ, which are then updated to maximize the expected reward J (π). ...

Reinforcement learning (RL) is not yet competitive for many cyber-physical systems, such as robotics, process automation, and power systems, as training on a system with physical components cannot be accelerated, and simulation models do not exist or suffer from a large simulation-to-reality gap. During the long training time, expensive equipment cannot be used and might even be damaged due to inappropriate actions of the reinforcement learning agent. Our novel approach addresses exactly this problem: We train the reinforcement agent in a so-called shadow mode with the assistance of an existing conventional controller, which does not have to be trained and instantaneously performs reasonably well. In shadow mode, the agent relies on the controller to provide action samples and guidance towards favourable states to learn the task, while simultaneously estimating for which states the learned agent will receive a higher reward than the conventional controller. The RL agent will then control the system for these states and all other regions remain under the control of the existing controller. Over time, the RL agent will take over for an increasing amount of states, while leaving control to the baseline, where it cannot surpass its performance. Thus, we keep regret during training low and improve the performance compared to only using conventional controllers or reinforcement learning. We present and evaluate two mechanisms for deciding whether to use the RL agent or the conventional controller. The usefulness of our approach is demonstrated for a reach-avoid task, for which we are able to effectively train an agent, where standard approaches fail.
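One way to picture the hand-over between agent and controller is a per-state comparison of estimated values. The sketch below is an assumption for illustration only: the abstract does not specify its two decision mechanisms, and `select_controller`, `q_value`, and `margin` are hypothetical names.

```python
def select_controller(q_value, state, agent_action, baseline_action, margin=0.0):
    """Pick the agent's action only if the critic rates it above the baseline's.

    q_value: callable (state, action) -> estimated return (hypothetical critic).
    margin:  extra advantage the agent must show before taking over (assumed).
    """
    q_agent = q_value(state, agent_action)
    q_base = q_value(state, baseline_action)
    if q_agent > q_base + margin:
        return "agent", agent_action
    return "baseline", baseline_action


def q_value(state, action):
    # Toy critic for the example: the best action is -state.
    return -(action + state) ** 2


who, action = select_controller(q_value, 1.0, agent_action=-1.0, baseline_action=0.0)
```

A positive `margin` makes the switch more conservative, so the existing controller stays in charge wherever the estimated advantage is small.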

... The term 'deterministic' indicates that the action is chosen based on a deterministic policy $\mu_\theta : S \to A$ with parameter $\theta \in \mathbb{R}^n$, rather than a policy represented by a parametric probability distribution, as in the stochastic policy gradient (SPG). The existence of the DPG is demonstrated in Reference [45] and the DPG theorem is established. ...

... These variables are defined with regard to a deterministic policy µ and parameter θ. Empirical evidence shows that the DPG algorithm can achieve superior performance compared to stochastic algorithms when dealing with high-dimensional action spaces [45]. Additionally, DPG is able to circumvent the challenges associated with integrating throughout the entire action space. ...

Distributed parameter systems (DPSs) frequently appear in industrial manufacturing processes, with complex characteristics such as time–space coupling, nonlinearity, infinite dimension, uncertainty and so on, which is full of challenges to the modeling of the system. At present, most DPS modeling methods are offline. When the internal parameters or external environment of DPS change, the offline model is incapable of accurately representing the dynamic attributes of the real system. Establishing an online model for DPS that accurately reflects the real-time dynamics of the system is very important. In this paper, the idea of reinforcement learning is creatively integrated into the three-dimensional (3D) fuzzy model and a reinforcement learning-based 3D fuzzy modeling method is proposed. The agent improves the strategy by continuously interacting with the environment, so that the 3D fuzzy model can adaptively establish the online model from scratch. Specifically, this paper combines the deterministic strategy gradient reinforcement learning algorithm based on an actor critic framework with a 3D fuzzy system. The actor function and critic function are represented by two 3D fuzzy systems and the critic function and actor function are updated alternately. The critic function uses a TD (0) target and is updated via the semi-gradient method; the actor function is updated by using the chain derivation rule on the behavior value function and the actor function is the established DPS online model. Since DPS modeling is a continuous problem, this paper proposes a TD (0) target based on average reward, which can effectively realize online modeling. The suggested methodology is implemented on a three-zone rapid thermal chemical vapor deposition reactor system and the simulation results demonstrate the efficacy of the methodology.
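A TD(0) update with an average-reward target and a semi-gradient step can be sketched as follows. This uses the textbook differential TD(0) form with a linear critic as a stand-in; the paper's critic is a 3D fuzzy system and its exact target may differ, so the function name, step sizes, and feature vectors here are all assumptions.

```python
def differential_td0_step(w, avg_reward, phi_s, phi_next, reward,
                          alpha=0.1, beta=0.01):
    """One semi-gradient differential TD(0) update for a linear critic.

    w:          weight list of the critic, V(s) = w . phi(s)
    avg_reward: running estimate of the average reward
    """
    v_s = sum(wi * xi for wi, xi in zip(w, phi_s))
    v_next = sum(wi * xi for wi, xi in zip(w, phi_next))
    # Average-reward TD error: no discount factor, reward is centred instead.
    delta = reward - avg_reward + v_next - v_s
    new_avg = avg_reward + beta * delta
    # Semi-gradient: differentiate only V(s), not the target.
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, phi_s)]
    return new_w, new_avg, delta


w, avg, delta = differential_td0_step(
    [0.0, 0.0], 0.0, phi_s=[1.0, 0.0], phi_next=[0.0, 1.0], reward=1.0
)
```

Centring the reward instead of discounting it suits continuing tasks with no natural episode end, which matches the abstract's remark that DPS modeling is a continuous problem.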

... After the above warm initialization process for the high-level policy, we also train the low-level policy $\pi_l$ in a supervised manner using the trajectories $\tau$ from the learner demonstrations in the seen tasks from the distribution $D$. As opposed to the supervised learning approach to warm initialize the policy $\pi_h$, here we do not incorporate noise as the expert trajectories contain no errors with regard to which action should be taken given a state and desired subgoal. Finally, we train both high-level and low-level policies using any existing RL algorithm as shown in Algorithm 2 (e.g., DPG in [32] or PPO in [33]). ...

... In this section, we warm initialize the learner high-level policy with the mapping's predicted subgoals as discussed in Section III-B. Then, we use deterministic policy gradient (DPG) [32] to train high-level and low-level policy functions $\pi_h$ and $\pi_l$. For the knight agent, we define its subgoal set at a specific position in Figure 3. Specifically, the knight can select any square at most 2 squares away from its current position. ...

In this paper, we consider a transfer reinforcement learning problem involving agents with different action spaces. Specifically, for any new unseen task, the goal is to use a successful demonstration of this task by an expert agent in its action space to enable a learner agent learn an optimal policy in its own different action space with fewer samples than those required if the learner was learning on its own. Existing transfer learning methods across different action spaces either require handcrafted mappings between those action spaces provided by human experts, which can induce bias in the learning procedure, or require the expert agent to share its policy parameters with the learner agent, which does not generalize well to unseen tasks. In this work, we propose a method that learns a subgoal mapping between the expert agent policy and the learner agent policy. Since the expert agent and the learner agent have different action spaces, their optimal policies can have different subgoal trajectories. We learn this subgoal mapping by training a Long Short Term Memory (LSTM) network for a distribution of tasks and then use this mapping to predict the learner subgoal sequence for unseen tasks, thereby improving the speed of learning by biasing the agent's policy towards the predicted learner subgoal sequence. Through numerical experiments, we demonstrate that the proposed learning scheme can effectively find the subgoal mapping underlying the given distribution of tasks. Moreover, letting the learner agent imitate the expert agent's policy with the learnt subgoal mapping can significantly improve the sample efficiency and training time of the learner agent in unseen new tasks.

... There have been attempts to extend policy gradients to off-policy data (Degris et al., 2012). The most common approach in this direction is to use deterministic policy gradients (DPG; Silver et al., 2014): ...

... To our knowledge, surprisingly few algorithms make use of the generalized MC estimator (14), with AAPG (Petit et al., 2019) and MPO being our only references. On the flip side, methods that perform exact summation or integration over the action space are either limited to small and finite action spaces (Sutton et al., 2001; Allen et al., 2017) or restricted to specific distribution classes that enable closed-form integration (Silver et al., 2014; Ciosek & Whiteson, 2018; 2020). ...

Conventional wisdom suggests that policy gradient methods are better suited to complex action spaces than action-value methods. However, foundational studies have shown equivalences between these paradigms in small and finite action spaces (O'Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases. We hypothesize that the apparent superiority of policy gradients in such settings stems not from intrinsic qualities of the paradigm, but from universal principles that can also be applied to action-value methods to serve similar functionality. We identify three such principles and provide a framework for incorporating them into action-value methods. To support our hypothesis, we instantiate this framework in what we term QMLE, for Q-learning with maximum likelihood estimation. Our results show that QMLE can be applied to complex action spaces with a controllable computational cost that is comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE demonstrates strong performance on the DeepMind Control Suite, even when compared to the state-of-the-art methods such as DMPO and D4PG.

... Artificial intelligence-based reinforcement repetition [6], [7], and deep reinforcement learning [8], [9] can obtain learning information and update model parameters by receiving rewards from the environment for the actions, which can effectively utilize historical information, has strong learning ability, and is widely used in A-GC systems. For example, literature [6] introduced the Q(λ) algorithm to solve the delayed return problem due to the large time lag link of thermal power units in the interconnected grid dominated by thermal power. Literature [7] proposed a full-process R(λ) algorithm based on the average payoff model to improve the 10-min average regulation performance standard index qualification rate during the assessment period and obtain the optimal regulation method for CPS of interconnected power systems. ...

This paper introduces a novel approach to address the issue of overestimation of state-action values in reinforcement learning algorithms based on the Q-framework. It integrates a dual-layer Q-learning algorithm with a grey wolf intelligent optimization method, enabling rapid search for optimal allocations in unknown search spaces. This integration results in the development of a multi-agent collaborative Area-Grid Coordination (A-GC) strategy, termed the grey wolf double Q (GWDQ) strategy, tailored for multi-area energy interconnection scenarios. The proposed GWDQ strategy is evaluated through simulation experiments on comprehensive energy system models, including mixed gas turbine systems, Combined Cooling, Heating, and Power (CCHP) systems, and a multi-area energy-interconnected Northeast power grid model. A centralized architecture is established to analyze the optimization effects of digital advertising. The performance of the GWDQ strategy is compared with traditional reinforcement learning algorithms through simulation and empirical data validation. Results indicate that the GWDQ strategy exhibits stronger learning capabilities, improved stability, and enhanced control performance compared to traditional methods. It demonstrates superior optimization for digital advertising and enables swift acquisition of optimal coordination in A-GC processes across multiple regions. Additionally, the paper analyzes the environmental and economic impacts of the proposed strategy.

... Silver et al. [26] introduced the deterministic policy gradient (DPG) algorithm, which uses deterministic policies instead of stochastic ones. Let $\pi_\phi$ refer to the policy with parameters $\phi$. ...

... In the second step, (24) applies where $F_t(s_t, a_t)$ always takes the value given in (25), rather than (26). Consequently, the calculations for $E[F_t(s_t, a_t)]$ and $\mathrm{Var}[F_t(s_t, a_t)]$ are slightly modified as follows: ...

To obtain better value estimation in reinforcement learning, we propose a novel algorithm based on the double actor-critic framework with temporal difference error-driven regularization, abbreviated as TDDR. TDDR employs double actors, with each actor paired with a critic, thereby fully leveraging the advantages of double critics. Additionally, TDDR introduces an innovative critic regularization architecture. Compared to classical deterministic policy gradient-based algorithms that lack a double actor-critic structure, TDDR provides superior estimation. Moreover, unlike existing algorithms with double actor-critic frameworks, TDDR does not introduce any additional hyperparameters, significantly simplifying the design and implementation process. Experiments demonstrate that TDDR exhibits strong competitiveness compared to benchmark algorithms in challenging continuous control tasks.

... We begin with an overview of the theoretical foundations of reinforcement learning and evolutionary computation. Next, we explain the principles of the Deep Deterministic Policy Gradient (DDPG) [37], [38] algorithm used in reinforcement learning. Following this, we provide a detailed description of the components and core mechanisms of the standard ERL framework. ...

Deep reinforcement learning (DRL) has achieved significant success in continuous control tasks. However, it encounters challenges that restrict its applicability to a wider array of tasks, including sparse rewards and limited exploration. In recent years, the integration of evolutionary algorithms (EAs) with deep reinforcement learning has emerged as a significant area of research. Evolutionary reinforcement learning (ERL) methods can address certain challenges inherent in conventional reinforcement learning algorithms. However, the introduction of evolutionary computation algorithms increases the number of hyperparameters, and sensitivity to these hyperparameters continues to pose a significant challenge. This paper proposes an evolutionary reinforcement learning method that incorporates evolutionary mutation rates. This method integrates a self-adaptive mutation rate mechanism within the ERL framework, which maintains two populations: one consisting of individuals (agents) and the other comprising mutation rates. This is our original contribution to this research. The actor population is categorized into several groups, each assigned a specific mutation rate. After the actor population undergoes mutation, the mutation rate of the population evolves based on the performance of the mutations within the actor population. This approach tackles the issue of hyperparameter selection for mutation rates in ERL. Experimental results demonstrate superior performance compared to the standard ERL framework across various continuous control tasks.

... Reinforcement learning (RL) (Sutton and Barto 2018) is a sampling-based approach to sequential optimization in which an agent seeks to learn an optimal policy guided by a reward signal. Recent RL algorithms use function approximators (deep neural networks) and build upon the policy gradient theorem (Sutton et al. 2000) to directly optimize the parameterized controllers (Silver et al. 2014; Haarnoja et al. 2018). The results of the theorem allow for the computation of an approximate gradient over the parameter space of the neural network, without having to explicitly compute the expected rewards or Q-function by enumerating the state-action pairs. ...

An (artificial cardiac) pacemaker is an implantable electronic device that sends electrical impulses to the heart to regulate the heartbeat. As the number of pacemaker users continues to rise, so does the demand for features with additional sensors, adaptability, and improved battery performance. Reinforcement learning (RL) has recently been proposed as a performant algorithm for creative design space exploration, adaptation, and statistical verification of cardiac pacemakers. The design of correct reward functions, expressed as a reward machine, is a key programming activity in this process. In 2007, Boston Scientific published a detailed description of their pacemaker specifications. This document has since formed the basis for several formal characterizations of pacemaker specifications using real-time automata and logic. However, because these translations are done manually, they are challenging to verify. Moreover, capturing requirements in automata or logic is notoriously difficult. We posit that it is significantly easier for domain experts, such as electrophysiologists, to observe and identify abnormalities in electrocardiograms that correspond to patient-pacemaker interactions. Therefore, we explore the possibility of learning correctness specifications from such labeled demonstrations in the form of a reward machine and training an RL agent to synthesize a cardiac pacemaker based on the resulting reward machine. We leverage advances in machine learning to extract signals from labeled demonstrations as reward machines using recurrent neural networks and transformer architectures. These reward machines are then used to design a simple pacemaker with RL. Finally, we validate the resulting pacemaker using properties extracted from the Boston Scientific document.

... Since the stochastic trading environment induces a hard exploration problem, learning with a pure RL objective is extremely difficult. To promote policy learning in such a complex trading environment, we propose to augment the RL method to imitate the quoting behavior in the expert dataset $D_E$, and the policy $\pi_h$ is updated with the deterministic policy gradient [30] as: ...

Order execution is an extremely important problem in the financial domain, and recently, more and more researchers have tried to employ reinforcement learning (RL) techniques to solve this challenging problem. There are a lot of difficulties for conventional RL methods to tackle the order execution problem, such as the large action space including price and quantity, and the long-horizon property. As naturally order execution is composed of a low-frequency volume scheduling stage and a high-frequency order placement stage, most existing RL-based order execution methods treat these stages as two distinct tasks and offer a partial solution by addressing either one individually. However, the current literature fails to model the non-negligible mutual influence between these two tasks, leading to impractical order execution solutions. To address these limitations, we propose a novel automatic order execution approach based on the hierarchical RL framework (OEHRL), which jointly learns the policies for volume scheduling and order placement. OEHRL first extracts the state embeddings at both the macro and micro levels with a sequential variational auto-encoder model. Based on the effective embeddings, OEHRL generates a hindsight expert dataset, which is used to train a hierarchical order execution policy. In the hierarchical structure, the high-level policy is in charge of the target volume and the low-level learns to determine the prices for a series of the allocated sub-orders from the high level. These two levels collaborate seamlessly and contribute to the optimal order execution policy. Extensive experiment results on 200 stocks across the US and China A-share markets validate the effectiveness of the proposed approach.

... This problem is similar to the optimization problem solved by policy gradient methods in reinforcement learning [21]. ...

In this paper, we propose a method for constructing a neural network viscosity in order to reduce the non-physical oscillations generated by high-order Discontinuous Galerkin methods on uniform Cartesian grids. To this end, the problem is reformulated as an optimal control problem for which the control is the viscosity function and the cost function involves comparison with a reference solution after several compositions of the scheme. The learning process is strongly based on gradient backpropagation tools. Numerical simulations show that the artificial viscosities, with a convolutional architecture, constructed in this way are just as good or better than those used in the literature.

... The DDPG algorithm, as described by Silver et al. [24], concurrently learns the Q-function and the policy using off-policy data and the Bellman equation. It maintains consistency with the optimal action-value function, $a^* = \arg\max_a Q^*(s, a)$. ...
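Learning the Q-function from off-policy data with the Bellman equation usually comes down to regressing the critic onto a bootstrapped target. The sketch below assumes the usual DDPG arrangement with separate target networks; `target_actor`, `target_critic`, and the toy lambdas are hypothetical placeholders rather than anything from the cited article.

```python
def td_target(reward, next_state, done, target_actor, target_critic, gamma=0.99):
    """Bellman target y = r + gamma * Q'(s', mu'(s')), with no bootstrap at episode end."""
    if done:
        return reward
    next_action = target_actor(next_state)   # stands in for the arg max over actions
    return reward + gamma * target_critic(next_state, next_action)


def critic_loss(critic, batch, targets):
    """Mean squared Bellman error over a batch of (state, action) pairs."""
    errors = [(critic(s, a) - y) ** 2 for (s, a), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Toy stand-ins for the learned networks (assumptions for the example).
target_actor = lambda s: 2.0 * s
target_critic = lambda s, a: s + a
critic = lambda s, a: s + a

y = td_target(1.0, 1.0, False, target_actor, target_critic, gamma=0.5)
loss = critic_loss(critic, [(1.0, 1.0), (0.0, 0.0)], [2.0, 1.0])
```

Because the actor supplies the next action directly, no explicit maximisation over a continuous action space is needed when forming the target.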

This article introduces an approach aimed at enabling self-driving cars to emulate human-learned driving behavior. We propose a method where the navigation challenge of autonomous vehicles, from starting to ending positions, is framed as a series of decision-making problems encountered in various states negating the requirement for high-precision maps and routing systems. Utilizing high-quality images and sensor-derived state information, we design rewards to guide an agent’s movement from the initial to the final destination. The soft actor-critic algorithm is employed to learn the optimal policy from the interaction between the agent and the environment, informed by these states and rewards. In an innovative approach, we apply the variational autoencoder technique to extract latent vectors from high-quality images, reconstructing a new state space with vehicle state vectors. This method reduces hardware requirements and enhances training efficiency and task success rates. Simulation tests conducted in the CARLA simulator demonstrate the superiority of our method over others. It enhances the intelligence of autonomous vehicles without the need for intermediate processes such as target detection, while concurrently reducing the hardware footprint, even though it may not perform as well as the currently available mature techniques.

... Similar to the DQN, the critic network is updated by minimizing the loss function in (12). An actor function $\mu(s \mid \theta^\mu)$ specifies the current policy by mapping states to the specific action. The actor network is updated with the equation below [32]:

$$\nabla_{\theta^\mu} J = \mathbb{E}_{s_t \sim \rho^\beta}\left[ \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s = s_t} \, \nabla_a Q^\pi(s, a \mid \theta^Q)\big|_{s = s_t,\, a = \mu(s_t)} \right] \tag{14}$$

where $s_t \sim \rho^\beta$ denotes that the state follows the distribution $\rho^\beta$. Then the target networks of the critic and actor networks are updated with (13). ...
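The actor update in (14) can be checked by hand on a toy problem. In this sketch the actor is linear, $\mu(s \mid \theta) = \theta s$, and the critic is fixed to $Q(s, a) = -(a - 2s)^2$; both are assumptions chosen so the derivatives are trivial and the best parameter is known to be $\theta = 2$.

```python
def actor_gradient(theta, states):
    """Sample estimate of (14): mean over states of d(mu)/d(theta) * dQ/da at a = mu(s)."""
    total = 0.0
    for s in states:
        a = theta * s                    # a = mu(s | theta)
        dmu_dtheta = s                   # derivative of the linear actor
        dq_da = -2.0 * (a - 2.0 * s)     # derivative of the toy critic w.r.t. a
        total += dmu_dtheta * dq_da
    return total / len(states)


theta = 0.0
states = [-1.0, -0.5, 0.5, 1.0]          # stands in for samples from rho^beta
for _ in range(200):
    theta += 0.1 * actor_gradient(theta, states)   # gradient ascent on J
```

In a real implementation both factors would come from automatic differentiation through the actor and critic networks, and the critic would be learned alongside the actor instead of being fixed.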

Finite-set model predictive control (FS-MPC) appears to be a promising and effective control method for power electronic converters. Conventional FS-MPC suffers from the time-consuming process of weighting factor selection, which significantly impacts control performance. Another ongoing challenge of FS-MPC is its dependence on the prediction model for desirable control performance. To overcome the above issues, we propose to apply reinforcement learning (RL) to FS-MPC for power converters. The RL algorithm is first employed for the automatic weighting factor design of the FS-MPC, aiming to minimize the total harmonic distortion (THD) or reduce the average switching frequency. Furthermore, by formulating the incentive for the RL agent with the cost function of the predictive algorithm, the agent learns autonomously to find the optimal switching policy for the power converter by imitating the predictive controller without prior knowledge of the system model. Finally, a deployment framework that allows for experimental validation of the proposed RL-based methods on a practical FS-MPC regulated stand-alone converter configuration is presented. Two exemplary control objectives are demonstrated to show the effectiveness of the proposed RL-aided weighting factor tuning method. Moreover, the results show a good match between the model-free RL-based controller and the FS-MPC performance.

... We used ACRL-NGN or DDPG as the learning algorithm and FCM-ML or FCM-JA as the feedback control model to determine the maximum value of each muscle activation level. We used a DDPG algorithm with the actor-critic method [24], implemented by modifying the Python code of Morvanzhou (https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG.py, ...

Simultaneous and cooperative muscle activation results in involuntary posture stabilization in vertebrates. However, the mechanism through which more muscles than joints contribute to this stabilization remains unclear. We developed a computational human body model with 949 muscle action lines and 22 joints and examined muscle activation patterns for stabilizing right upper or lower extremity motions at a neutral body posture (NBP) under gravity using actor–critic reinforcement learning (ACRL). Two feedback control models (FCM), muscle length change (FCM–ML) and joint angle differences, were applied to ACRL with a normalized Gaussian network (ACRL–NGN) or deep deterministic policy gradient. Our findings indicate that among the six control methods, ACRL–NGN with FCM–ML, utilizing solely antagonistic feedback control of muscle length change without relying on synergy pattern control or categorizing muscles as flexors, extensors, agonists, or synergists, achieved the most efficient involuntary NBP stabilization. This finding suggests that vertebrate muscles are fundamentally controlled without categorization of muscles for targeted joint motion and are involuntarily controlled to achieve the NBP, which is the most comfortable posture under gravity. Thus, ACRL–NGN with FCM–ML is suitable for controlling humanoid muscles and enables the development of a comfortable seat design.

... We employ deep deterministic policy gradient (DDPG) [30], [31] for our RL algorithm, given our continuous action space, which is, for example, the length and width of transistors. Algorithms such as DDPG offer flexibility in collecting training data and are often more sample-efficient than on-policy algorithms [32]. ...

This paper presents a fully open-sourced AMS integrated circuit optimization framework based on reinforcement learning (RL). Specifically, given a certain circuit topology and target specifications, this framework optimizes the circuit in both schematic and post-layout phases. We propose using the heterogeneous graph neural network as the function approximator for RL. Optimization results suggest that it can achieve higher reward values with fewer iterations than the homogeneous graph neural networks. We demonstrate the applications of transfer learning (TL) in optimizing circuits in a different technology node. Furthermore, we show that by transferring the knowledge of schematic-level optimization, the trained RL agent can optimize the post-layout performance more efficiently than optimizing post-layout performance from scratch. To showcase the workflow of our approach, we extended our prior work to optimize latched comparators in the SKY130 and GF180MCU processes. Simulation results demonstrate that our framework can satisfy various target specifications and generate LVS/DRC clean circuit layouts.

... The Soft Actor-Critic (SAC) algorithm is a model-free, online, off-policy, actor-critic reinforcement learning method designed to compute an optimal policy that not only maximizes the expected long-term reward but also the entropy of the policy defined by the paper from [12]. The policy entropy serves as a measure of the policy's uncertainty given the current state. ...
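To make "entropy as a measure of the policy's uncertainty" concrete, the sketch below computes the entropy of a one-dimensional Gaussian policy and adds it to the reward as an entropy-regularised objective would. The Gaussian form and the temperature `alpha` are assumptions for illustration; the snippet above does not specify them.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy of N(mean, sigma^2): 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def soft_reward(reward, sigma, alpha=0.2):
    """Reward plus an entropy bonus weighted by the temperature alpha (assumed)."""
    return reward + alpha * gaussian_entropy(sigma)


narrow = gaussian_entropy(0.5)   # more certain policy, lower entropy
wide = gaussian_entropy(2.0)     # less certain policy, higher entropy
```

A larger `alpha` rewards wider action distributions more strongly and so encourages exploration, while `alpha = 0` reduces the objective to the plain expected reward.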

This research investigates the performance of three recent reinforcement learning (RL) algorithms as advanced control strategies for water level control in a single-tank system and a quadruple-tank system, contrasting them with conventional PID (proportional-integral-derivative) control within the same framework. RL, which autonomously learns by interacting with its environment, is becoming increasingly popular for developing optimal controllers for complex, dynamic, and nonlinear processes. Unlike most RL studies that use open-source platforms like Python and OpenAI Gym, this research utilizes MATLAB's Reinforcement Learning Toolbox (introduced in R2024a) to design a water tank model using a transfer function and a state-space equation. The controller is trained using the Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO), and Deep Deterministic Policy Gradient (DDPG) algorithms, with Simulink employed to simulate the water tank system and establish an experimental test bench for comparison between them.
The findings indicate that the Soft Actor-Critic (SAC) algorithm delivers the best result in signal tracking, achieving high speed and low error relative to the reference signal when compared to the other RL algorithms. However, algorithms such as Proximal Policy Optimization (PPO) demonstrate superior performance in maintaining minimal steady-state error, despite requiring a significantly longer training period in the context of a single-tank system. In more complex scenarios, such as the quadruple-tank system, SAC proves to be superior due to its advantage in handling continuous action spaces, where PPO tends to diverge significantly. Achieving success in machine learning necessitates the tuning of numerous hyperparameters, a process that is both time-consuming and labor-intensive. The practical insights derived from this research are corroborated by existing literature, underscoring the robustness and applicability of the findings.

... Reinforcement Learning Reinforcement Learning (RL) (Sutton & Barto, 2018) has two main approaches: model-free methods (Silver et al., 2014; Fujimoto et al., 2018; Haarnoja et al., 2018a; Schulman et al., 2015; Kalashnikov et al., 2021; Kalashnikov et al., 2018; Mnih et al., 2015; Hessel et al., 2018; Yarats et al., 2021; Laskin et al., 2020) and model-based methods (Sutton, 1991; Hafner et al., 2020; Luo et al., 2019; Janner et al., 2019; Chua et al., 2018; Schrittwieser et al., 2019; Wang & Ba, 2020). While model-free methods focus on learning the value function and policy, model-based methods aim to learn the underlying model of the environment, using this learned model to compute optimal actions. ...

Model-based reinforcement learning has shown promise for improving sample efficiency and decision-making in complex environments. However, existing methods face challenges in training stability, robustness to noise, and computational efficiency. In this paper, we propose Bisimulation Metric for Model Predictive Control (BS-MPC), a novel approach that incorporates bisimulation metric loss in its objective function to directly optimize the encoder. This time-step-wise direct optimization enables the learned encoder to extract intrinsic information from the original state space while discarding irrelevant details and preventing the gradients and errors from diverging. BS-MPC improves training stability, robustness against input noise, and computational efficiency by reducing training time. We evaluate BS-MPC on both continuous control and image-based tasks from the DeepMind Control Suite, demonstrating superior performance and robustness compared to state-of-the-art baseline methods.

... DDPG [25] is based on DPG [26] and is trained using the actor-critic framework. The actor and critic are fitted using a neural network and parameters are updated using gradient descent. ...
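The actor update that DDPG inherits from DPG can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the actor is a single scalar weight rather than a neural network, and the critic is a fixed hand-written function `Q(s, a) = -(a - 2s)^2` rather than a learned one. The names `actor`, `dq_da`, and `dpg_actor_step` are hypothetical.

```python
# Toy deterministic policy gradient step (illustrative assumptions only):
#   actor  mu_theta(s) = theta * s         (scalar linear policy)
#   critic Q(s, a)     = -(a - 2 * s)**2   (fixed by hand, not learned)

def actor(theta, s):
    """Deterministic policy: maps a state to a single action."""
    return theta * s

def dq_da(s, a):
    """Gradient of the toy critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)

def dpg_actor_step(theta, s, lr):
    """Ascend the critic: theta += lr * dmu/dtheta * dQ/da at a = mu(s)."""
    a = actor(theta, s)
    dmu_dtheta = s  # derivative of theta * s with respect to theta
    return theta + lr * dmu_dtheta * dq_da(s, a)

theta = 0.0
for _ in range(20):
    for s in (1.0, -0.5, 2.0):  # a small fixed batch of states
        theta = dpg_actor_step(theta, s, lr=0.1)
```

Under this toy critic the best action is `a = 2s`, so the weight moves toward 2. In DDPG itself both the actor and the critic are neural networks and the critic is trained alongside the actor.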

Machine learning has been applied by more and more scholars in the field of quantitative investment, but traditional machine learning methods cannot provide high returns and strong stability at the same time. In this paper, a multimodal model based on reinforcement learning (RL) is constructed for the stock investment portfolio management task. Most previous RL-based methods have chosen value-based RL. A growing body of research has shown policy gradient-based RL methods to be superior to value-based RL methods. Commonly used policy gradient-based reinforcement learning methods are DDPG, TD3, SAC, and PPO. We conducted comparative experiments to select the most suitable method for the dataset in this paper. The final choice was DDPG. Furthermore, previous methods rarely refine the raw data before training the agent. The stock market has a large amount of data, and the data are complex. If the raw stock market data are fed directly to the agent, the agent cannot learn the information in the data efficiently and quickly. We use state representation learning (SRL) to process the raw stock data and then feed the processed data to the agent. It is not enough to train the agent using only stock data; we also added comment text data and image data. The comment text data come from investors’ comments on stock bars. Image data are derived from pictures that can represent the overall direction of the market. We conducted experiments on three datasets and compared our proposed model with 11 other methods. We set up three evaluation indicators in the paper. Taken together, our proposed model works best.

... In policy-based methods, instead, a policy is learned directly, typically through gradient-based optimization of the weights of a neural network, to maximize the expected reward. The policy may be deterministic (e.g., deterministic policy gradient (DPG) (Silver et al., 2014)) or, in the case of probabilistically selected actions, stochastic (e.g., proximal policy optimization (PPO) (Schulman et al., 2017b)). Actor-critic methods combine the benefits of value-based and policy-based approaches (Wang et al., 2017. ...

This study provides a systematic analysis of the resource-consuming training of deep reinforcement-learning (DRL) agents for simulated low-speed automated driving (AD). In Unity, this study established two case studies: garage parking and navigating an obstacle-dense area. Our analysis involves training a path-planning agent with real-time-only sensor information. This study addresses research questions insufficiently covered in the literature, exploring curriculum learning (CL), agent generalization (knowledge transfer), computation distribution (CPU vs. GPU), and mapless navigation. CL proved necessary for the garage scenario and beneficial for obstacle avoidance. It involved adjustments at different stages, including terminal conditions, environment complexity, and reward function hyperparameters, guided by their evolution in multiple training attempts. Fine-tuning the simulation tick and decision period parameters was crucial for effective training. The abstraction of high-level concepts (e.g., obstacle avoidance) necessitates training the agent in sufficiently complex environments in terms of the number of obstacles. While blogs and forums discuss training machine learning models in Unity, a lack of scientific articles on DRL agents for AD persists. However, since agent development requires considerable training time and difficult procedures, there is a growing need to support such research through scientific means. In addition to our findings, we contribute to the R&D community by releasing our environment as open source.

... As continuous reinforcement learning methods have been proposed successively [6], not only can simple physical systems in classical control obtain the expected motion, but more complex continuous tasks such as quadruped gait policies have also been trained successfully [4] [7]. In particular, reinforcement learning methods such as DDPG [8], A3C [9], PPO [10], and SAC [11] have been demonstrated to be applicable to continuous systems. ...

Reinforcement learning is extremely competitive among gait generation techniques for quadrupedal robots, mainly because stochastic exploration during reinforcement training helps to achieve an autonomous gait. Nevertheless, although incremental reinforcement learning is employed to improve training success and movement smoothness by relying on the continuity inherent in limb movements, challenges remain in adapting the gait policy to diverse terrain and external disturbances. Inspired by the association between reinforcement learning and the evolution of animal motion behavior, a self-improvement mechanism for the reference gait is introduced in this paper, so that incremental learning of actions and self-improvement of the reference action proceed together to imitate the evolution of animal motion behavior. Further, a new framework for reinforcement training of quadruped gait is proposed. In this framework, a genetic algorithm is adopted to perform a global probabilistic search for the initial value of the arbitrary foot trajectory, updating the reference trajectory with better fitness. Subsequently, the improved reference gait is used for incremental reinforcement learning of the gait. The above process is executed repeatedly and alternately to finally train the gait policy. An analysis considering terrain, model dimensions, and locomotion conditions is presented in detail based on simulation, and the results show that the framework is significantly more adaptive to terrain than regular incremental reinforcement learning.

... The authors in [22] show that the gradient of this objective function with respect to the parameters θ^µ is given by: ...

Next-generation mobile networks, such as those beyond the 5th generation (B5G) and 6th generation (6G), have diverse network resource demands. Network slicing (NS) and device-to-device (D2D) communication have emerged as promising solutions for network operators. NS is a candidate technology for this scenario, where a single network infrastructure is divided into multiple (virtual) slices to meet different service requirements. Combining D2D and NS can improve spectrum utilization, providing better performance and scalability. This paper addresses the challenging problem of dynamic resource allocation with wireless network slices and D2D communications using deep reinforcement learning (DRL) techniques. More specifically, we propose an approach named DDPG-KRP based on deep deterministic policy gradient (DDPG) with K-nearest neighbors (KNNs) and reward penalization (RP) for undesirable action elimination to determine the resource allocation policy maximizing long-term rewards. The simulation results show that the DDPG-KRP is an efficient solution for resource allocation in wireless networks with slicing, outperforming other considered DRL algorithms.

This chapter investigates the transformative potential of Peer-to-Peer (P2P) trading in local energy markets, emphasizing the role of distributed energy resources in facilitating efficient market operations and fostering sustainable energy practices. The text first provides a foundational understanding of local energy markets, highlighting their definition and significance, and introduces the emergent concept of P2P energy trading among prosumers, with a focus on the microgrid and nanogrid levels. It underscores the advantages of such trading, including improved energy efficiency, enhanced grid reliability, and the promotion of renewable energy sources. Further, the chapter delves into the specifics of P2P ancillary service trading, with particular attention to frequency regulation support among microgrid and nanogrid prosumers, exploring its operational benefits and contribution to grid stability. Advancing the discussion, the text introduces a novel aspect of P2P markets—the carbon emission auction trading within local energy spheres. This section probes the theoretical and practical implications of integrating carbon emission considerations into energy trading and examines the market mechanisms through which microgrid prosumers might interact within this innovative paradigm. The chapter is structured into two principal sections. The first addresses P2P energy and ancillary service trading among nanogrid prosumers within a microgrid setting, focusing on real-time market operations for energy balancing and frequency regulation. The second section examines the interconnections between P2P energy, ancillary service, and carbon emission quota trading among multiple microgrid prosumers, presenting advanced modeling techniques and algorithms such as the multi-agent deep deterministic policy gradient for strategy optimization and risk mitigation. 
The chapter concludes with a synthesis of the explored concepts, reinforcing the significance of P2P trading in advancing the decarbonization of local energy markets and its potential for incentivizing the adoption of green technologies. It offers insights into the market structures, strategic behaviors of prosumers, and the envisioned impact on the overarching energy landscape.

This chapter introduces the importance of spectrum sharing in wireless edge networks. With the rapid development of wireless communications, global mobile data traffic has explosively grown, resulting in the congestion in spectrum resources. Therefore, spectrum sharing has emerged as a key technique to address this issue. In this chapter, we introduce cognitive radio (CR) and artificial intelligence (AI) as two representative techniques for spectrum sharing. Particularly, we cover general reinforcement learning (RL), deep Q-network (DQN) and deep deterministic policy gradient (DDPG) as essential preliminaries for AI-enabled spectrum sharing. Eventually, the structure of the Brief is outlined.

A hallmark of intelligence is the ability to exhibit a wide range of effective behaviors. Inspired by this principle, Quality-Diversity algorithms, such as, are evolutionary methods designed to generate a set of diverse and high-fitness solutions. However, as a genetic algorithm, relies on random mutations, which can become inefficient in high-dimensional search spaces, thus limiting its scalability to more complex domains, such as learning to control agents directly from high-dimensional inputs. To address this limitation, advanced methods like and have been developed, which combine actor-critic techniques from Reinforcement Learning with, significantly enhancing the performance and efficiency of Quality-Diversity algorithms in complex, high-dimensional tasks. While these methods have successfully leveraged the trained critic to guide more effective mutations, the potential of the trained actor remains underutilized in improving both the quality and diversity of the evolved population. In this work, we introduce, an extension of that utilizes the descriptor-conditioned actor as a generative model to produce diverse solutions, which are then injected into the offspring batch at each generation. Additionally, we present an empirical analysis of the fitness and descriptor reproducibility of the solutions discovered by each algorithm. Finally, we present a second empirical analysis shedding light on the synergies between the different variations operators and explaining the performance improvement from to.

Reinforcement learning methods are often considered as a potential solution to enable a robot to adapt in real time to changes in an unpredictable environment. However, with continuous action, only a few existing algorithms are practical for real-time learning. In such a setting, most effective methods have used a parameterized policy structure, often with a separate parameterized value function. The goal of this paper is to assess such actor–critic methods to form a fully specified practical algorithm. Our specific contributions include 1) developing the extension of existing incremental policy-gradient algorithms to use eligibility traces, 2) an empirical comparison of the resulting algorithms using continuous actions, 3) the evaluation of a gradient-scaling technique that can significantly improve performance. Finally, we apply our actor–critic algorithm to learn on a robotic platform with a fast sensorimotor cycle (10ms). Overall, these results constitute an important step towards practical real-time learning control with continuous action.

We present a series of formal and empirical results comparing the efficiency of various policy-gradient methods, that is, methods for reinforcement learning that directly update a parameterized policy according to an approximation of the gradient of performance with respect to the policy parameter. Such methods have recently become of interest as an alternative to value-function-based methods because of superior convergence guarantees, ability to find stochastic policies, and ability to handle large and continuous action spaces. Our results include: 1) formal and empirical demonstrations that a policy-gradient method suggested by Sutton et al. (2000) and Konda and Tsitsiklis (2000) is no better than REINFORCE, 2) derivation of the optimal baseline for policy-gradient methods, which differs from the widely used V^π(s) previously thought to be optimal, 3) introduction of a new all-action policy-gradient algorithm that is unbiased and requires no baseline, and demonstrating empirically and semi-formally that it is more efficient than the methods mentioned above, and 4) an overall comparison of methods on the mountain-car problem including value-function-based methods and bootstrapping actor-critic methods. One general conclusion we draw is that the bias of conventional value functions is a feature, not a bug; it seems required in order for the value function to significantly accelerate learning.

This paper presents the first actor-critic algorithm for off-policy
reinforcement learning. Our algorithm is online and incremental, and its
per-time-step complexity scales linearly with the number of learned weights.
Previous work on actor-critic algorithms is limited to the on-policy setting
and does not take advantage of the recent advances in off-policy gradient
temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable
a target policy to be learned while following and obtaining data from another
(behavior) policy. For many problems, however, actor-critic methods are more
practical than action value methods (like Greedy-GQ) because they explicitly
represent the policy; consequently, the policy can be stochastic and utilize a
large action space. In this paper, we illustrate how to practically combine the
generality and learning potential of off-policy learning with the flexibility
in action selection given by actor-critic methods. We derive an incremental,
linear time and space complexity algorithm that includes eligibility traces,
prove convergence under assumptions similar to previous off-policy algorithms,
and empirically show better or comparable performance to existing algorithms on
standard reinforcement-learning benchmark problems.

We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.

The Octopus arm is a highly versatile and complex limb. How the Octopus controls such a hyper-redundant arm (not to mention eight of them!) is as yet unknown. Robotic arms based on the same mechanical principles may render present day robotic arms obsolete. In this paper, we tackle this control problem using an online reinforcement learning algorithm, based on a Bayesian approach to policy evaluation known as Gaussian process temporal difference (GPTD) learning. Our substitute for the real arm is a computer simulation of a 2-dimensional model of an Octopus arm. Even with the simplifications inherent to this model, the state space we face is a high-dimensional one. We apply a GPTD-based algorithm to this domain, and demonstrate its operation on several learning tasks of varying degrees of difficulty.

We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation. In reinforcement learning, the term “off-policy learning” refers to learning about one way of behaving, called the target policy, from data generated by another way of selecting actions, called the behavior policy. The target policy is often an approximation to the optimal policy, which is typically deterministic, whereas the behavior policy is often stochastic, exploring all possible actions in each state as part of finding the optimal policy. Freeing the behavior policy from the target policy enables a greater variety of exploration strategies to be used. It also enables learning from training data generated by unrelated controllers, including manual human control, and from previously collected data. A third reason for interest in off-policy learning is that it permits learning about multiple target policies (e.g., optimal policies for multiple subgoals) from a single stream of data generated by a

Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
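The remark that TDC matches conventional TD except for a correction term that is initially zero can be made concrete with a minimal sketch of one update step. It assumes linear features, on-policy sampling (no importance weighting), and secondary weights `w` initialised to zero. The function name `tdc_update` and the toy feature vectors are hypothetical.

```python
import numpy as np

def tdc_update(theta, w, phi, reward, phi_next, gamma, alpha, beta):
    """One step of linear TD with gradient correction (TDC).

    theta: primary value-function weights
    w:     secondary weights that estimate the expected TD error
    """
    # Conventional TD error under the current linear value estimate.
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    # Extra term relative to conventional TD; it vanishes while w == 0.
    correction = gamma * phi_next * (phi @ w)
    theta_new = theta + alpha * (delta * phi - correction)
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new

theta0 = np.zeros(2)
w0 = np.zeros(2)
phi = np.array([1.0, 0.0])       # features of the current state
phi_next = np.array([0.0, 1.0])  # features of the next state
theta1, w1 = tdc_update(theta0, w0, phi, reward=1.0, phi_next=phi_next,
                        gamma=0.9, alpha=0.1, beta=0.5)
```

Because `w0` is zero, the first step is exactly the conventional TD update; the correction only takes effect once `w` has moved away from zero. The two weight vectors are also what accounts for the roughly doubled computation mentioned in the abstract.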

Reinforcement learning offers a promising framework to take planning for real-world systems towards true autonomy and versatility. However, applying reinforcement learning to high dimensional movement systems (such as real-world robots) in the presence of uncertainty and continuous state-action spaces remains an unsolved problem. In order to make progress towards solving this issue, we focus on a particular type of reinforcement learning methods, i.e., policy gradient methods. These methods are particularly interesting to the robotics community as they seem to scale better to continuous state-action problems and have been successfully applied on a variety of high-dimensional robots. However, the main disadvantages of these methods have been the high variance in the gradient estimate, the very slow convergence, and the dependence on baseline functions. In this poster, we show how these policy gradients can be improved in respect to each of these problems. Our approach to policy gradients focuses on the natural policy gradient instead of the regular policy gradient. Natural policy gradients for reinforcement learning have first been suggested by Kakade (2) as 'average natural policy gradients', and subsequently been shown to be the true natural policy gradient by Bagnell & Schneider (1), and Peters et al. (3). As shown by Kakade, natural policy gradients are particularly interesting due to the fact that they equal the parameters of the compatible function approximation. We present a general algorithm for estimating the natural gradient, the Natural Actor-Critic algorithm. This algorithm uses the fact that the compatible function approximation represents an advantage function which can be em-

Policy gradient methods are a type of reinforcement learning techniques that rely upon optimizing parametrized policies with respect to the expected return (long-term cumulative reward) by gradient descent. They do not suffer from many of the problems that have been marring traditional reinforcement learning approaches such as the lack of guarantees of a value function, the intractability problem resulting from uncertain state information and the complexity arising from continuous states & actions.

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable.

This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
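As a rough illustration of a REINFORCE-style weight adjustment, the sketch below updates a single Bernoulli-logistic stochastic unit. The adjustment is the learning rate times (reward minus baseline) times the characteristic eligibility, which for this unit is `(y - p) * x`. The function names and the numbers in the usage lines are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reinforce_update(w, x, y, reward, baseline, lr):
    """REINFORCE step for one Bernoulli-logistic unit.

    w: weights, x: input vector, y: sampled binary output (0 or 1).
    The unit fires (y = 1) with probability p = sigmoid(w . x).
    """
    p = sigmoid(w @ x)
    # Characteristic eligibility: gradient of log P(y | x, w) w.r.t. w.
    eligibility = (y - p) * x
    return w + lr * (reward - baseline) * eligibility

w0 = np.zeros(2)
x = np.array([1.0, 2.0])
# Suppose the unit emitted y = 1 and the environment returned reward 1.
w1 = reinforce_update(w0, x, y=1, reward=1.0, baseline=0.0, lr=0.5)
```

A rewarded output becomes more likely on the same input. With `reward == baseline` the weights are left unchanged, which is one way to see the baseline's role as a reference level for reinforcement.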

We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.

Technical process control is a highly interesting area of application serving a high practical impact. Since classical controller design is, in general, a demanding job, this area constitutes a highly attractive domain for the application of learning approaches—in particular, reinforcement learning (RL) methods. RL provides concepts for learning controllers that, by cleverly exploiting information from interactions with the process, can acquire high-quality control behaviour from scratch.
This article focuses on the presentation of four typical benchmark problems whilst highlighting important and challenging aspects of technical process control: nonlinear dynamics; varying set-points; long-term dynamic effects; influence of external variables; and the primacy of precision. We propose performance measures for controller quality that apply both to classical control design and learning controllers, measuring precision, speed, and stability of the controller. A second set of key-figures describes the performance from the perspective of a learning approach while providing information about the efficiency of the method with respect to the learning effort needed. For all four benchmark problems, extensive and detailed information is provided with which to carry out the evaluations outlined in this article.
A close evaluation of our own RL learning scheme, NFQCA (Neural Fitted Q Iteration with Continuous Actions), in accordance with the proposed scheme on all four benchmarks, thereby provides performance figures on both control quality and learning behavior.

We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach is motivated by the Least-Squares Temporal-Difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference algorithms. Heretofore, LSTD has not had a straightforward application to control problems mainly because LSTD learns the state value function of a fixed policy which cannot be used for action selection and control without a model of the underlying process. Our new algorithm, Least-Squares Policy Iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework. LSPI is a model-free, off-policy method which can use efficiently (and reuse in each iteration) sample experiences collected in any manner. By separating the sample collection method, the choice of the linear approximation architecture, and the solution method, LSPI allows for focused attention on the distinct elements that contribute to practical reinforcement learning. LSPI is tested on the simple task of balancing an inverted pendulum and the harder task of balancing and riding a bicycle to a target location. In both cases, LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. LSPI is also compared against Q-learning (both with and without experience replay) using the same value function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning variants were rarely able to balance for more than a small fraction of the time needed to reach the target location.

Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments.

We investigate the problem of non-covariant behavior of policy gradient reinforcement learning algorithms.

Robust control theory is used to design stable controllers in the presence of uncertainties. This provides powerful closed-loop robustness guarantees, but can result in controllers that are conservative with regard to performance.

Heess, N., Silver, D., and Teh, Y. (2012). Actor-critic reinforcement learning with energy-based policies. JMLR Workshop and Conference Proceedings: EWRL 2012, 24:43-58.

Toussaint, M. (2012). Some notes on gradient descent. http://ipvs.informatik.uni-stuttgart.

Degris, T., White, M., and Sutton, R. S. (2012b). Linear off-policy actor-critic. In 29th International Conference on Machine Learning.