Figure - available from: AI Magazine
Actor‐Critic as an Artificial Neural Network and a Hypothetical Neural Implementation. Adapted from Takahashi, Schoenbaum, and Niv (2008).
Source publication
The idea of implementing reinforcement learning in a computer was one of the earliest ideas about the possibility of AI, but reinforcement learning remained on the margin of AI until relatively recently. Today we see reinforcement learning playing essential roles in some of the most impressive AI applications. This article presents observations fro...
Citations
... The book by Sutton and Barto [18] has become a foundational resource in the field of reinforcement learning. Temporal difference (TD) learning algorithms are widely used in various reinforcement learning (RL) applications, including game playing [1], robotics, and finance, due to their efficiency and ability to learn directly from interactions with the environment. ...
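As a minimal illustration of the interaction-driven learning described in the excerpt above, the Python sketch below performs tabular TD(0) value estimation. The environment interface (env.reset, env.step), the policy function, and the step-size and discount parameters are illustrative assumptions, not details taken from the cited works.

    from collections import defaultdict

    def td0_value_estimation(env, policy, episodes=500, alpha=0.1, gamma=0.99):
        """Tabular TD(0): update V(s) from individual transitions, no model of the environment needed."""
        V = defaultdict(float)  # state-value estimates, default 0.0
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                # TD(0) update: move V(s) toward the bootstrapped target r + gamma * V(s')
                target = reward + (0.0 if done else gamma * V[next_state])
                V[state] += alpha * (target - V[state])
                state = next_state
        return V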
... Over the course of that year, Arthur Samuel wrote a checkers program, the first software of its kind in the United States to incorporate Artificial Intelligence [22]; and, in 1955, he extended the capabilities of the game developed by Strachey, allowing it to learn on its own from experience [22]. In 1954, Belmont Farley and Wesley Clark simulated, for the first time, reinforcement learning in a 128-neuron neural network on a digital computer, with the goal of recognizing simple patterns in a data set [25]. ...
The development of Artificial Intelligence has significantly impacted various sectors of society, including business, academia, and government. Recent advances have been made possible by the contributions of the past. This historical review explores developments in Artificial Intelligence from its conception in Greek mythology to the latest developments of 2023, organized into four phases. The first phase, Ancient Conceptions, covers early notions of Artificial Intelligence in Greek mythology and the first automata. The second phase, Beginnings of Modern Artificial Intelligence, examines the first advances of formal scientific research on Artificial Intelligence. The third phase, Expansion and Setbacks, is marked by expansion in key areas such as Expert Systems. The fourth phase, Resurgence of Artificial Intelligence, corresponds to the revitalization of the field, driven by deep learning. Through a chronological analysis of more than 150 sources, including scientific articles, books, and historical documents, this review provides a comprehensive view of the evolution of Artificial Intelligence. In addition, the work describes some Artificial Intelligence solutions applied in Peruvian business and government settings.
... Reinforcement learning is another powerful technique, particularly useful for evaluating the long-term effects of policy changes through simulation models. By simulating various policy scenarios, reinforcement learning algorithms can optimize strategies to achieve desired environmental outcomes over time [33]. This approach is beneficial for dynamic and complex systems where policies need to adapt based on ongoing feedback. ...
... Afterward, Sutton [42] suggested the current SARSA designation. SARSA and Q-learning, two well-known reinforcement learning algorithms based on temporal difference (TD) learning, are well suited to building a learning process that ultimately supports future decision-making. ...
Industrial control systems are often used to assist and manage an industrial operation. These systems’ weaknesses in the various hierarchical structures of the system components and communication backbones make them vulnerable to cyberattacks that jeopardize their security. In this paper, the security of these systems is studied by employing a reinforcement-learning-extended attack graph to efficiently reveal the subsystems’ flaws. Specifically, an attack graph that mimics the environment is constructed for the system using the state–action–reward–state–action (SARSA) technique, in which the agent is regarded as the attacker. An attacker that attains the highest cumulative reward can cause the greatest system damage with the fewest possible actions. The results demonstrated the worst-case attack scheme, with a total reward of 42.9, and identified the most severely affected subsystems.
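The state–action–reward–state–action (SARSA) update that the excerpt and abstract above refer to can be sketched as a generic tabular control loop. The sketch below is not the attack-graph environment from the paper; the epsilon-greedy policy, the env.actions/env.step interface, and the learning parameters are illustrative assumptions.

    import random
    from collections import defaultdict

    def sarsa(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
        """Tabular SARSA: on-policy TD control driven by (s, a, r, s', a') tuples."""
        Q = defaultdict(float)  # Q[(state, action)] -> estimated return

        def epsilon_greedy(state):
            if random.random() < epsilon:
                return random.choice(env.actions(state))          # explore
            return max(env.actions(state), key=lambda a: Q[(state, a)])  # exploit

        for _ in range(episodes):
            state = env.reset()
            action = epsilon_greedy(state)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                next_action = None if done else epsilon_greedy(next_state)
                # SARSA target uses the action actually selected in the next state
                target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
        return Q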
... Reinforcement learning (RL) has emerged as a popular approach for tackling NZS games. RL is grounded in the principle of trial and error [8], enabling agents to acquire optimal behavioural policies by leveraging feedback responses derived from the environment [9,10]. RL can be broadly categorized into two types: model-free and model-based, with the key distinction being the requirement of dynamic model information [11]. ...
To reduce learning time and memory requirements, this study presents a novel model‐free algorithm for obtaining the Nash equilibrium solution of continuous‐time nonlinear non‐zero‐sum games. Based on the integral reinforcement learning method, a new integral HJ equation that can quickly and cooperatively determine the Nash equilibrium strategies of all players is proposed. By leveraging the neural network approximation and gradient descent method, simultaneous continuous‐time adaptive tuning laws are provided for both critic and actor neural network weights. These laws facilitate the estimation of the optimal value function and optimal policy without requiring knowledge or identification of the system's dynamics. The closed‐loop system stability and convergence of weights are guaranteed through the Lyapunov analysis. Additionally, the algorithm is enhanced to reduce the number of auxiliary NNs used in the critic. The simulation results for a two‐player non‐zero‐sum game validate the effectiveness of the proposed algorithm.
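The paper's continuous-time integral tuning laws are specific to its non-zero-sum setting, but the general critic/actor pattern they build on can be illustrated with a much simpler discrete-time, single-agent analogue. Everything below (the linear feature map, Gaussian policy, and step sizes) is an illustrative assumption and is not the algorithm proposed in the paper.

    import numpy as np

    def actor_critic_step(w, theta, phi, s, a, r, s_next, done,
                          alpha_w=0.05, alpha_theta=0.01, gamma=0.99, sigma=0.5):
        """One discrete-time actor-critic update with linear function approximation.

        w      -- critic weights, value estimate V(s) ~= w @ phi(s)
        theta  -- actor weights, Gaussian policy with mean mu(s) = theta @ phi(s)
        phi    -- feature map from states to feature vectors
        """
        v_s = w @ phi(s)
        v_next = 0.0 if done else w @ phi(s_next)
        td_error = r + gamma * v_next - v_s          # critic's one-step TD error
        w = w + alpha_w * td_error * phi(s)          # critic: step toward the TD target
        # actor: policy-gradient step, scaled by the TD error used as an advantage estimate
        grad_log_pi = (a - theta @ phi(s)) / (sigma ** 2) * phi(s)
        theta = theta + alpha_theta * td_error * grad_log_pi
        return w, theta

In this simplified analogue, the critic and actor weights are updated simultaneously from each transition, mirroring (in discrete time) the simultaneous tuning of critic and actor networks described in the abstract.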
... From learning to play backgammon at near the level of the world's best players (Barto, 2019), through landing unmanned aerial vehicles (UAVs) (Polvara et al., 2019), beating the highest ranked players in Jeopardy! (Ferrucci et al., 2013), and human-level performance in Atari games (Mnih et al., 2015), RL has been successful in applications with uncertain environments. ...
Critical chain buffer management (CCBM) has been extensively studied in recent years. This paper investigates a new formulation of CCBM, the multimode chance-constrained CCBM problem. A flow-based mixed-integer linear programming model is described and the chance constraints are tackled using a scenario approach. A reinforcement learning (RL)-based algorithm is proposed to solve the problem. A factorial experiment is conducted and the results of this study indicate that solving the chance-constrained problem produces shorter project durations than the traditional approach that inserts time buffers into a baseline schedule generated by solving the deterministic problem. This paper also demonstrates that our RL method produces competitive schedules compared to established benchmarks. The importance of solving the chance-constrained problem and obtaining a project buffer tailored to the desired probability of completing the project on schedule directly from the solution is highlighted. Because of its potential for generating shorter schedules with the same on-time probabilities as the traditional approach, this research can be a useful aid for decision makers.
... RL has demonstrated remarkable achievements in various domains, ranging from mastering backgammon at a level comparable to the world's best players [94] to successfully landing unmanned aerial vehicles (UAVs) [95], defeating top-ranked contestants in Jeopardy! [96], and achieving human-level performance in Atari games [97]. ...
Industrial projects are plagued by uncertainties, often resulting in both time and cost overruns. This research introduces an innovative approach, employing Reinforcement Learning (RL), to address three distinct project management challenges within a setting of uncertain activity durations. The primary objective is to identify stable baseline schedules. The first challenge encompasses the multimode lean project management problem, wherein the goal is to maximize a project’s value function while adhering to both due date and budget chance constraints. The second challenge involves the chance-constrained critical chain buffer management problem in a multimode context. Here, the aim is to minimize the project delivery date while considering resource constraints and duration-chance constraints. The third challenge revolves around striking a balance between the project value and its net present value (NPV) within a resource-constrained multimode environment. To tackle these three challenges, we devised mathematical programming models, some of which were solved optimally. Additionally, we developed competitive RL-based algorithms and verified their performance against established benchmarks. Our RL algorithms consistently generated schedules that compared favorably with the benchmarks, leading to higher project values and NPVs and shorter schedules while staying within the stakeholders’ risk thresholds. The potential beneficiaries of this research are project managers and decision-makers who can use this approach to generate an efficient frontier of optimal project plans.
... From learning to play backgammon at near the level of the world's best players [69], through landing unmanned aerial vehicles (UAVs) [70], beating the highest ranked players in Jeopardy! [71], and human-level performance in Atari games [72], RL has been successful in applications for uncertain environments. ...
Two important goals in project management are the maximization of the net present value (NPV) and project value, a more recent target. The former is a well-documented objective in project scheduling, and both are project evaluation tools used by decision makers. The literature has focused on the maximization of the project NPV problem and on project value as separate research tracks, but consideration of the tradeoff between both goals offers decision makers a more thorough evaluation of a project when weighing project alternatives. This paper introduces a novel formulation of the maximization problem that includes both a robust formulation of NPV and project value, develops algorithms to solve it, and illustrates the tradeoff between both objectives. The proposed mixed integer program (MIP) features a multimode setting, in which the selection of an activity mode affects cost, duration, resource usage, and project value, together with stochastic activity durations. To solve the problem, this study offers an innovative reinforcement learning (RL) based algorithm. The solution can be used to plot the efficient frontier between the robust NPV and the project value. Computational experiments revealed that the algorithm performs well compared to tabu search and an MIP solution using a commercial solver, and that the RL actions can be leveraged for coping with positive and negative cash flows. The utility of our work lies in its ability to respond to decision makers’ information needs, providing a framework for tradeoff analysis to select the most adequate project plan that satisfies stakeholders’ requirements.
... While it has been proven that a Nash equilibrium solution always exists for stochastic games with complete information [24], in the real world, parameter uncertainty in games is pervasive. For example, in many realistic applications of reinforcement learning, the payoff functions of agents usually need to be designed or learned from interactions, and their properties strongly affect the success of the targeted tasks [25]; likewise, the transition probability distribution of the game-environmental states is generally estimated from historical data and is therefore subject to statistical errors [26], [27]. In particular, this issue of data uncertainty has recently given rise to well-grounded concerns, for example in AI safety [28], [29] and uncertain robotic systems [30], and has accordingly shaped research priorities in robust AI [31]. ...
... Suppose that the algorithm terminates with output policy (d_ε)^∞, and denote its corresponding value function under the worst-case transition probability matrix by ṽ_ε. From (25), one can see that ṽ_ε can be regarded as the value ṽ_{N+1} generated by Algorithm 1 at t = N + 1 by setting M_N → ∞ and δ = 0 when t = N. Therefore, from the result in part (a), one can get ...
In stochastic dynamic environments, team stochastic games have emerged as a versatile paradigm for studying sequential decision-making problems of fully cooperative multi-agent systems. However, the optimality of the derived policies is usually sensitive to the model parameters, which are typically unknown and required to be estimated from noisy data in practice. To mitigate the sensitivity of the optimal policy to these uncertain parameters, in this paper, we propose a model of "robust" team stochastic games, where players utilize a robust optimization approach to make decisions. This model extends team stochastic games to the scenario of incomplete information and meanwhile provides an alternative solution concept of robust team optimality. To seek such a solution, we develop a learning algorithm in the form of a Gauss-Seidel modified policy iteration and prove its convergence. This algorithm, compared with robust dynamic programming, not only possesses a faster convergence rate, but also allows for using approximation calculations to alleviate the curse of dimensionality. Moreover, some numerical simulations are presented to demonstrate the effectiveness of the algorithm by generalizing the game model of social dilemmas to sequential robust scenarios.
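A rough way to illustrate the "robust" decision rule underlying such models is a Bellman backup that maximizes over actions while minimizing over a set of candidate transition models. The sketch below assumes a finite uncertainty set of transition matrices and is not the Gauss-Seidel modified policy iteration developed in the paper.

    import numpy as np

    def robust_bellman_backup(V, rewards, transition_models, gamma=0.95):
        """One robust value-iteration sweep: max over actions, min over transition models.

        V                 -- current value estimates, shape (n_states,)
        rewards           -- rewards[s, a] for each state-action pair
        transition_models -- list of candidate matrices P[s, a, s'] (the uncertainty set)
        """
        n_states, n_actions = rewards.shape
        V_new = np.zeros(n_states)
        for s in range(n_states):
            action_values = []
            for a in range(n_actions):
                # evaluate action a under the worst-case model in the uncertainty set
                worst = min(P[s, a] @ (rewards[s, a] + gamma * V) for P in transition_models)
                action_values.append(worst)
            V_new[s] = max(action_values)
        return V_new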
... Also, Ning et al. [67] introduced a multi-objective EA in which RL served as a generic parameter controller. Further possible directions for RL-integrated EAs are highlighted in the review papers of Drugan [58], Cunha et al. [68] and Barto [69]. ...
Although several multi-operator and multi-method approaches for solving optimization problems have been proposed, their performances are not consistent for a wide range of optimization problems. Also, the task of ensuring the appropriate selection of algorithms and operators may be inefficient since their designs are undertaken mainly through trial and error. This research proposes an improved optimization framework that uses the benefits of multiple algorithms, namely, a multi-operator differential evolution algorithm and a co-variance matrix adaptation evolution strategy. In the former, reinforcement learning is used to automatically choose the best differential evolution operator. To judge the performance of the proposed framework, three benchmark sets of bound-constrained optimization problems (73 problems) with 10, 30 and 50 dimensions are solved. Further, the proposed algorithm has been tested by solving optimization problems with 100 dimensions taken from CEC2014 and CEC2017 benchmark problems. A real-world application data set has also been solved. Several experiments are designed to analyze the effects of different components of the proposed framework, with the best variant compared with a number of state-of-the-art algorithms. The experimental results show that the proposed algorithm is able to outperform all the others considered.
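One common way to realize the RL-driven operator choice described in the abstract above is a simple value-based (bandit-style) controller. The sketch below assumes a list of candidate mutation operators, an epsilon-greedy selection rule, and a reward based on fitness improvement; these are illustrative assumptions rather than the exact mechanism of the proposed framework.

    import random

    class OperatorSelector:
        """Epsilon-greedy controller for choosing a DE mutation operator."""

        def __init__(self, operators, epsilon=0.1, step=0.1):
            self.operators = operators                     # e.g. ["rand/1", "best/2", "current-to-best/1"]
            self.values = {op: 0.0 for op in operators}    # running estimate of each operator's reward
            self.epsilon = epsilon
            self.step = step

        def select(self):
            if random.random() < self.epsilon:
                return random.choice(self.operators)            # explore
            return max(self.operators, key=self.values.get)     # exploit the best estimate

        def update(self, operator, reward):
            # reward could be, e.g., the relative fitness improvement produced by the operator
            self.values[operator] += self.step * (reward - self.values[operator])

In a differential evolution loop, select() would be called before generating each trial vector and update() after evaluating it, with the reward set to something like the normalized fitness gain of the trial over its parent.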