Andrew G. Barto’s research while affiliated with University of Massachusetts Amherst and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (244)


Adaptive Real-Time Dynamic Programming
  • Chapter

September 2023 · 2 Reads

Andrew G. Barto

TD-DeltaPi: A Model-Free Algorithm for Efficient Exploration

September 2021 · 5 Reads

Proceedings of the AAAI Conference on Artificial Intelligence

We study the problem of finding efficient exploration policies for the case in which an agent is momentarily not concerned with exploiting, and instead tries to compute a policy for later use. We first formally define the Optimal Exploration Problem as one of sequential sampling and show that its solutions correspond to paths of minimum expected length in the space of policies. We derive a model-free, local linear approximation to such solutions and use it to construct efficient exploration policies. We compare our model-free approach to other exploration techniques, including one with the best known PAC bounds, and show that ours is both based on a well-defined optimization problem and empirically efficient.
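For contrast, the sketch below shows what a pure exploration policy can look like in code: a count-based agent that ignores reward and always tries the action it has sampled least in its current state. This is a generic illustrative baseline, not the paper's TD-DeltaPi algorithm, and the ring environment, step budget, and other parameters are invented for the example.

```python
import numpy as np

# Count-based exploration sketch (an illustrative baseline, NOT the paper's
# TD-DeltaPi algorithm). During a pure exploration phase the agent ignores
# reward and repeatedly tries the action it has sampled least in its state.
def explore(env_step, n_states, n_actions, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.zeros((n_states, n_actions), dtype=int)
    state = 0
    for _ in range(n_steps):
        action = int(np.argmin(counts[state]))  # least-tried action in this state
        counts[state, action] += 1
        state = env_step(state, action, rng)
    return counts

# Hypothetical toy environment: a 5-state ring; action 0 moves left, action 1 moves right.
def ring_step(state, action, rng, n_states=5):
    return (state + (1 if action == 1 else -1)) % n_states

print(explore(ring_step, n_states=5, n_actions=2))  # visit counts cover the whole state-action space
```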


Fig. 2. Updated representation of the actor-critic architecture configured for the pole-balancing task. The input r is now labeled reward, and the dashed lines depict the TD error as the reinforcement signal for adjusting the input connection weights of both the ASE and the ACE.
Looking Back on the Actor-Critic Architecture
  • Article
  • Full-text available

December 2020 · 217 Reads · 29 Citations

IEEE Transactions on Systems, Man, and Cybernetics: Systems

This retrospective describes the overall research project that gave rise to the authors' paper "Neuronlike adaptive elements that can solve difficult learning control problems" that was published in the 1983 Neural and Sensory Information Processing special issue of the IEEE Transactions on Systems, Man, and Cybernetics. This look back explains how this project came about, presents the ideas and previous publications that influenced it, and describes our most closely related subsequent research. It concludes by pointing out some noteworthy aspects of this article that have been eclipsed by its main contributions, followed by commenting on some of the directions and cautions that should inform future research.
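The architecture the retrospective revisits couples a policy-learning element (the ASE, the actor) with a value-estimating element (the ACE, the critic), both driven by the same TD error. The following minimal tabular sketch illustrates that coupling; it is not the 1983 neuronlike implementation, and the two-state toy environment, learning rates, and preference-style actor update are illustrative assumptions.

```python
import numpy as np

# Minimal tabular actor-critic sketch (illustrative; not the 1983 ASE/ACE network).
# Toy two-state chain: taking action 1 in state 0 yields reward +1 and moves to state 1;
# every other transition yields 0 reward and returns to state 0.
N_STATES, N_ACTIONS = 2, 2
theta = np.zeros((N_STATES, N_ACTIONS))   # actor: action preferences
v = np.zeros(N_STATES)                    # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.95

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

rng = np.random.default_rng(0)
state = 0
for step in range(2000):
    probs = softmax(theta[state])
    action = rng.choice(N_ACTIONS, p=probs)
    if state == 0 and action == 1:
        reward, next_state = 1.0, 1
    else:
        reward, next_state = 0.0, 0
    # The same TD error serves as the reinforcement signal for both elements.
    td_error = reward + gamma * v[next_state] - v[state]
    v[state] += alpha_critic * td_error               # critic (ACE-like) update
    theta[state, action] += alpha_actor * td_error    # actor (ASE-like) preference update
    state = next_state

print(softmax(theta[0]))  # the preference for action 1 in state 0 comes to dominate
```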



Preventing undesirable behavior of intelligent machines

November 2019 · 425 Reads · 151 Citations

Science

Philip S Thomas · Andrew G Barto · [...] · Emma Brunskill

Making well-behaved algorithms
Machine learning algorithms are being used in an ever-increasing number of applications, and many of these applications affect quality of life. Yet such algorithms often exhibit undesirable behavior, from various types of bias to causing financial loss or delaying medical diagnoses. In standard machine learning approaches, the burden of avoiding this harmful behavior is placed on the user of the algorithm, who most often is not a computer scientist. Thomas et al. introduce a general framework for algorithm design in which this burden is shifted from the user to the designer of the algorithm. The researchers illustrate the benefits of their approach using examples in gender fairness and diabetes management.
Science, this issue p. 999
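The framework summarized above moves the safety check inside the algorithm: a candidate solution is returned only if a high-confidence test on held-out data certifies that the undesirable behavior is sufficiently rare, and otherwise the algorithm reports No Solution Found. The sketch below illustrates that interface with a one-sided Hoeffding-style bound; the bound, the 10% threshold, and the synthetic violation data are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

# Sketch of a Seldonian-style safety test (illustrative; the bound, the 10% threshold,
# and the synthetic violation data are assumptions, not the paper's exact procedure).

def hoeffding_upper_bound(samples, delta):
    """One-sided (1 - delta)-confidence upper bound on the mean of values in [0, 1]."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean() + np.sqrt(np.log(1.0 / delta) / (2.0 * len(samples)))

def safety_test(per_example_violation, delta=0.05, threshold=0.10):
    """Accept a candidate solution only if its rate of undesirable behavior is certified
    to be below `threshold` with probability at least 1 - delta."""
    if hoeffding_upper_bound(per_example_violation, delta) <= threshold:
        return "solution accepted"
    return "No Solution Found"

# Hypothetical usage: 0/1 indicators of undesirable behavior on a held-out safety set.
rng = np.random.default_rng(1)
violations = rng.binomial(1, 0.02, size=2000)   # a candidate that misbehaves about 2% of the time
print(safety_test(violations))                  # accepted: the upper bound stays below 10%
```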


Figure: Comparison of Three Prediction Methods. Left: work versus number of states n for reducing the initial error by a factor of ξ = 0.01. Right: error reduction versus work for n = 100. From Barto and Duff (1994).
Figure: Four Elevators in a Ten-Story Building. From Sutton and Barto (1998).
Figure: Actor-Critic as an Artificial Neural Network and a Hypothetical Neural Implementation. Adapted from Takahashi, Schoenbaum, and Niv (2008).
Reinforcement Learning: Connections, Surprises, and Challenge

March 2019 · 74 Reads · 17 Citations

The idea of implementing reinforcement learning in a computer was one of the earliest ideas about the possibility of AI, but reinforcement learning remained on the margin of AI until relatively recently. Today we see reinforcement learning playing essential roles in some of the most impressive AI applications. This article presents observations from the author’s personal experience with reinforcement learning over the most recent 40 years of its history in AI, focusing on striking connections that emerged between largely separate disciplines and on some of the findings that surprised him along the way. These connections and surprises place reinforcement learning in a historical context, and they help explain the success it is finding in modern AI. The article concludes by discussing some of the challenges that need to be faced as reinforcement learning moves out into the real world.


Figure: Memristor synapse array and programming scheme, showing the 128 × 64 1T1R array partitioned into the three-layer network (8 × 48, 96 × 48 and 48 × 4 differential-pair subarrays, 5,184 memristors in total), the transistor-assisted weight-update scheme, and a conductance map after programming the letters 'UMAS'.
Figure: Scheme of the hybrid analogue–digital reinforcement learning: the memristor array performs the vector-matrix multiplications of a three-layer fully connected Q-network (ReLU hidden layers, linear output), while the digital components handle experience replay, the Bellman-equation loss, and RMSprop weight updates.
Figure: In-memristor reinforcement learning in the cart–pole environment: the experimental reward curve follows the 4 µS programming-noise simulation and the noise-free floating-point simulation, while 8 µS noise degrades learning.
Figure: In-memristor reinforcement learning in the mountain car environment: the experimental reward curve again matches the 4 µS noise simulation, and the learned value and action maps show the agent rocking the car back and forth to build momentum.
Reinforcement learning with analogue memristor arrays

March 2019 · 1,278 Reads · 322 Citations

Reinforcement learning algorithms that use deep neural networks are a promising approach for the development of machines that can acquire knowledge and solve problems without human input or supervision. At present, however, these algorithms are implemented in software running on relatively standard complementary metal–oxide–semiconductor digital platforms, where performance will be constrained by the limits of Moore’s law and von Neumann architecture. Here, we report an experimental demonstration of reinforcement learning on a three-layer 1-transistor 1-memristor (1T1R) network using a modified learning algorithm tailored for our hybrid analogue–digital platform. To illustrate the capability of our approach for robust in situ training without the need for a model, we applied it to two classic control problems: the cart–pole and mountain car simulations. We also show that, compared with conventional digital systems in real-world reinforcement learning tasks, our hybrid analogue–digital computing system has the potential to achieve a significant boost in speed and energy efficiency. A reinforcement learning algorithm can be implemented on a hybrid analogue–digital platform based on memristive arrays for parallel and energy-efficient in situ training.
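A rough software picture of the division of labor described above: the analogue array carries out the Q-network's vector-matrix multiplications with imperfectly programmed weights, while the digital side supplies targets and weight updates. The sketch below only simulates the analogue part as Gaussian programming noise added to the intended weights; the network sizes, noise level, and noise model are illustrative assumptions rather than the paper's hardware procedure.

```python
import numpy as np

# Software sketch of the hybrid idea (illustrative; the network sizes, the additive
# Gaussian noise model, and the noise level below are assumptions, not the paper's
# hardware procedure).
rng = np.random.default_rng(0)

def program_weights(target_w, noise_std=0.01):
    """Writing a weight into an analogue array is imprecise: model programming noise
    as an additive Gaussian perturbation of the intended value."""
    return target_w + rng.normal(0.0, noise_std, size=target_w.shape)

def q_forward(state, weights):
    """Three-layer Q-network with ReLU hidden layers and a linear output layer; the
    matrix multiplies are the operations the memristor array performs in parallel."""
    w1, w2, w3 = weights
    h1 = np.maximum(state @ w1, 0.0)
    h2 = np.maximum(h1 @ w2, 0.0)
    return h2 @ w3  # one Q-value per allowed action

# Hypothetical cart-pole-sized network: 4 state inputs, two 48-unit hidden layers, 2 actions.
ideal = [rng.normal(0.0, 0.1, size=s) for s in [(4, 48), (48, 48), (48, 2)]]
noisy = [program_weights(w) for w in ideal]

state = rng.normal(size=4)
print(q_forward(state, ideal))   # Q-values with the intended weights
print(q_forward(state, noisy))   # programming noise perturbs, but does not destroy, the Q-values
```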


Enriching Behavioral Ecology with Reinforcement Learning Methods

February 2018 · 161 Reads · 52 Citations

Behavioural Processes

This article focuses on the division of labor between evolution and development in solving sequential, state-dependent decision problems. Currently, behavioral ecologists tend to use dynamic programming methods to study such problems. These methods are successful at predicting animal behavior in a variety of contexts. However, they depend on a distinct set of assumptions. Here, we argue that behavioral ecology will benefit from drawing more than it currently does on a complementary collection of tools, called reinforcement learning methods. These methods allow for the study of behavior in highly complex environments, which conventional dynamic programming methods do not feasibly address. In addition, reinforcement learning methods are well-suited to studying how biological mechanisms solve developmental and learning problems. For instance, we can use them to study simple rules that perform well in complex environments, or to investigate under what conditions natural selection favors fixed, non-plastic traits (which do not vary across individuals), cue-driven-switch plasticity (innate instructions for adaptive behavioral development based on experience), or developmental selection (the incremental acquisition of adaptive behavior based on experience). If natural selection favors developmental selection, which includes learning from environmental feedback, we can also make predictions about the design of reward systems. Our paper is written in an accessible manner for a broad audience, though we believe some novel insights can be drawn from our discussion. We hope our paper will help advance the emerging bridge connecting the fields of behavioral ecology and reinforcement learning.
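As a concrete illustration of the reinforcement learning methods the article advocates, the sketch below applies tabular Q-learning to a toy state-dependent foraging problem in which an animal chooses between resting and risky foraging as a function of its energy reserves. The environment, rewards, and parameters are invented for illustration and are not taken from the article.

```python
import numpy as np

# Tabular Q-learning on a toy state-dependent foraging problem (the states, rewards,
# and parameters here are invented for illustration; they are not from the article).
N_ENERGY, ACTIONS = 6, ("rest", "forage")
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((N_ENERGY, len(ACTIONS)))
rng = np.random.default_rng(0)

def step(energy, action):
    """Rest: lose 1 unit of energy, no risk. Forage: 60% chance to gain 2, 40% to lose 2."""
    if action == 0:
        energy -= 1
    else:
        energy += 2 if rng.random() < 0.6 else -2
    energy = min(energy, N_ENERGY - 1)
    if energy <= 0:
        return 0, -1.0, True          # starvation ends the episode with a penalty
    return energy, 0.1, False         # small reward for surviving another time step

for episode in range(2000):
    energy = N_ENERGY // 2
    for t in range(100):              # cap episode length so training always terminates
        a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[energy]))
        nxt, r, done = step(energy, a)
        target = r if done else r + gamma * Q[nxt].max()
        Q[energy, a] += alpha * (target - Q[energy, a])
        energy = nxt
        if done:
            break

# The learned policy is state-dependent: the preferred action varies with energy reserves.
print([ACTIONS[int(np.argmax(Q[e]))] for e in range(1, N_ENERGY)])
```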


Figure 5: Distributions for people of types A (left) and B (right) such that independent per-type linear regressors, trained on data from five people of each type, yield an average discriminatory statistic of 0.42 over 1,000 trials.
Figure 7: Results of applying various linear regression algorithms to the illustrative example of §4.1, averaged over 2,000 trials (LR denotes ordinary least squares linear regression).
Figure 14: Behavior of the standard and quasi-Seldonian algorithms as the number of days of data grows: the probability of returning a solution other than No Solution Found (NSF), the probability of returning a distribution over policies that increases the prevalence of hypoglycemia, and box plots of the expected returns of the returned solutions.
On Ensuring that Intelligent Machines Are Well-Behaved

August 2017 · 61 Reads · 5 Citations

Machine learning algorithms are everywhere, ranging from simple data analysis and pattern recognition tools used across the sciences to complex systems that achieve super-human performance on various tasks. Ensuring that they are well-behaved (that they do not, for example, cause harm to humans or act in a racist or sexist way) is therefore not a hypothetical problem to be dealt with in the future, but a pressing one that we address here. We propose a new framework for designing machine learning algorithms that simplifies the problem of specifying and regulating undesirable behaviors. To show the viability of this new framework, we use it to create new machine learning algorithms that preclude the sexist and harmful behaviors exhibited by standard machine learning algorithms in our experiments. Our framework for designing machine learning algorithms simplifies the safe and responsible application of machine learning.
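One simple way to make an undesirable behavior measurable, in the spirit of the experiments referenced in the figures listed above, is to define a group-disparity statistic for a trained regressor. The sketch below uses the difference in mean residuals between two groups; this particular statistic, and the synthetic data, are assumptions for illustration and may differ from the paper's definition.

```python
import numpy as np

# Illustrative group-disparity statistic for a regressor (this particular definition,
# the difference of mean residuals between two groups, is an assumption; the paper's
# discriminatory statistic may be defined differently).
def mean_residual_gap(y_true, y_pred, group):
    """Difference between the mean prediction error for group A and for group B."""
    resid = y_pred - y_true
    return float(resid[group == "A"].mean() - resid[group == "B"].mean())

# Hypothetical data: a regressor that systematically over-predicts for group A.
rng = np.random.default_rng(0)
group = np.array(["A"] * 500 + ["B"] * 500)
y_true = rng.normal(size=1000)
y_pred = y_true + np.where(group == "A", 0.4, 0.0) + rng.normal(0.0, 0.1, size=1000)
print(mean_residual_gap(y_true, y_pred, group))   # about 0.4: a sizeable gap between groups
```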



Citations (94)


... This approach allows them to use deep neural networks to model the uncertainties of the environment, which leads to a more robust controller compared to traditional ones. Later, Konidaris et al. [22] propose using RL to automatise skill acquisition on a mobile manipulator. Unlike DL, RL makes it possible to automatically obtain the experience needed to learn robotic skills through trial and error and to learn complex decision-making policies. ...

Reference:

Learning positioning policies for mobile manipulation operations with deep reinforcement learning
Autonomous Skill Acquisition on a Mobile Manipulator

Proceedings of the AAAI Conference on Artificial Intelligence

... Examples include REINFORCE [24], Advantage Actor-Critic (A2C) [25], and Proximal Policy Optimization (PPO) [26]. • Actor-Critic Architectures [27]: These architectures consist of two neural networks: one for the policy (actor) and one for the value function (critic). The actor chooses actions, while the critic evaluates them to guide the actor's learning. ...

Looking Back on the Actor-Critic Architecture

IEEE Transactions on Systems, Man, and Cybernetics: Systems

... Indeed, studies on animals [13][14][15] and humans [16][17][18] have explored the inherent inclination towards novelty, which is further supported by neuroscience experiments [19][20][21]. The field of intrinsically motivated open-ended learning (IMOL [22]) tackles the problem of developing agents that aim at improving their capabilities to interact with the environment without any specific assigned task. More precisely, Intrinsic Motivations (IMs [23,24]) are a class of self-generated signals that have been used to provide robots with autonomous guidance for several different processes, from state-and-action space exploration [25,26], to the autonomous discovery, selection and learning of multiple goals [27][28][29]. ...

Editorial: Intrinsically Motivated Open-Ended Learning in Autonomous Robots

Frontiers in Neurorobotics

... To assess and achieve anti-discrimination fairness, discrimination statistics that measure the average similarity of decisions across groups are used [30]. In addition to consistency and anti-discrimination, a third concept is counterfactual fairness, which ensures an algorithm's decisions remain consistent across hypothetical scenarios where individuals' protected attributes are altered [19]. Typically, causal models that describe how changes in protected attributes affect decisions and other attributes of individuals are used to assess and achieve counterfactual fairness. ...

Preventing undesirable behavior of intelligent machines
  • Citing Article
  • November 2019

Science

... In the course of that year, Arthur Samuel wrote a checkers program, the first software of its kind in the United States to incorporate Artificial Intelligence [22]; and, in 1955, he extended the capabilities of the game developed by Strachey, allowing it to learn on its own from experience [22]. In 1954, Belmont Farley and Wesley Clark simulated, for the first time, reinforcement learning in a 128-neuron neural network on a digital computer, with the goal of recognizing simple patterns in a data set [25]. ...

Reinforcement Learning: Connections, Surprises, and Challenge

... One solution here is memristor-based analogue computing [9][10][11] . This approach improves energy efficiency by eliminating the emulation layers required to operate an artificial neural network on von Neumann architecture-based computers [12][13][14] . Memristors can achieve a higher density in crossbar arrays compared to conventional digital memory devices due to their simple two-terminal structure and ability to achieve multi-level conductance states 15 . ...

Reinforcement learning with analogue memristor arrays

... Reinforcement learning is a widely used model for learning mechanisms characterized by an algorithm in which an agent learns to choose the optimal behavior in an environment by acquiring rewards through interactions with it [14,15]. Standard models include the temporal difference (TD) learning model [16], the Rescorla-Wagner (RW) model [17,18], and the Q-learning model [18]. ...

Enriching Behavioral Ecology with Reinforcement Learning Methods

Behavioural Processes

... This D4PG variant learns the learning rate of the Lagrange multiplier in a soft-constrained optimization procedure. Thomas et al. (2017) propose a new framework for designing machine learning algorithms that simplifies the problem of specifying and regulating undesired behaviours. There have also been approaches to learn a policy that satisfies constraints in the presence of perturbations to the dynamics of an environment. ...

On Ensuring that Intelligent Machines Are Well-Behaved

... Predicted classes are presented in columns and actual classes in rows. That orientation of the matrix is used in many sources [19][20][21][22][23][24][25][26]; however, other sources [27][28][29] use the opposite. This means that adding the right headings in this kind of presentation is very important to avoid misunderstanding. ...

Adaptive Real-Time Dynamic Programming
  • Citing Chapter
  • January 2017