Andrew G. Barto’s research while affiliated with University of Massachusetts Amherst and other places


Publications (244)


Reinforcement learning, efficient coding, and the statistics of natural tasks
  • Article

August 2015 · 142 Reads · 57 Citations · Current Opinion in Behavioral Sciences

Matthew Botvinick · [...] · Alec Solway · Andrew Barto

The application of ideas from computational reinforcement learning has recently enabled dramatic advances in behavioral and neuroscientific research. For the most part, these advances have involved insights concerning the algorithms underlying learning and decision making. In the present article, we call attention to the equally important but relatively neglected question of how problems in learning and decision making are internally represented. To articulate the significance of representation for reinforcement learning we draw on the concept of efficient coding, as developed in perception research. The resulting perspective exposes a range of novel goals for behavioral and neuroscientific research, highlighting in particular the need for research into the statistical structure of naturalistic tasks.
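To give the representational question a concrete flavor, the toy sketch below (not from the article; all names and numbers are illustrative) applies a simple efficient-coding step, PCA/ZCA whitening, to raw task observations before they would be used as features for learning. The point is only that the same learning rule faces very different statistics depending on how observations are encoded.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observations: 2-D "task states" with correlated, unequal-variance axes.
raw = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [2.5, 0.5]])

# Efficient-coding step (illustrative): decorrelate and equalize variance (ZCA whitening).
mean = raw.mean(axis=0)
cov = np.cov(raw - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
whiten = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

def encode(x):
    """Map raw observations to whitened feature vectors."""
    return (x - mean) @ whiten

features = encode(raw)
print("raw covariance:\n", np.round(cov, 2))
print("whitened covariance:\n", np.round(np.cov(features, rowvar=False), 2))
```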


Online Bayesian changepoint detection for articulated motion models

June 2015 · 49 Reads · 31 Citations · Proceedings - IEEE International Conference on Robotics and Automation

We introduce CHAMP, an algorithm for online Bayesian changepoint detection in settings where it is difficult or undesirable to integrate over the parameters of candidate models. CHAMP is used in combination with several articulation models to detect changes in articulated motion of objects in the world, allowing a robot to infer physically-grounded task information. We focus on three settings where a changepoint model is appropriate: objects with intrinsic articulation relationships that can change over time, object-object contact that results in quasi-static articulated motion, and assembly tasks where each step changes articulation relationships. We experimentally demonstrate that this system can be used to infer various types of information from demonstration data including causal manipulation models, human-robot grasp correspondences, and skill verification tests.
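For readers unfamiliar with online Bayesian changepoint detection, the sketch below implements the standard run-length recursion (Adams & MacKay style) for scalar Gaussian data with known noise variance. It is not the CHAMP algorithm itself, which targets settings where integrating over model parameters is impractical; the hazard rate, priors, and data are illustrative assumptions.

```python
import numpy as np

def bocpd_gaussian(data, hazard=1/50, mu0=0.0, var0=4.0, noise_var=1.0):
    """Online Bayesian changepoint detection for scalar Gaussian observations
    with known noise variance. Returns the run-length posterior after each step."""
    log_r = np.array([0.0])            # log P(run length | data so far)
    post_mean = np.array([mu0])        # mean posterior per run length
    post_var = np.array([var0])
    posteriors = []

    for x in data:
        # Predictive log-probability of x under each current run length.
        pred_var = post_var + noise_var
        log_pred = -0.5 * (np.log(2 * np.pi * pred_var) + (x - post_mean) ** 2 / pred_var)

        # Growth (no changepoint) and changepoint probabilities.
        log_growth = log_r + log_pred + np.log(1 - hazard)
        log_cp = np.logaddexp.reduce(log_r + log_pred + np.log(hazard))
        log_r = np.concatenate(([log_cp], log_growth))
        log_r -= np.logaddexp.reduce(log_r)            # normalize

        # Conjugate update of the mean posterior for each (shifted) run length.
        new_var = 1.0 / (1.0 / post_var + 1.0 / noise_var)
        new_mean = new_var * (post_mean / post_var + x / noise_var)
        post_mean = np.concatenate(([mu0], new_mean))
        post_var = np.concatenate(([var0], new_var))

        posteriors.append(np.exp(log_r))
    return posteriors

# Synthetic stream with a mean shift halfway through.
rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0, 1, 60), rng.normal(4, 1, 60)])
posteriors = bocpd_gaussian(stream)
print("most probable run length at the final step:", int(np.argmax(posteriors[-1])))
```

On the synthetic stream, the most probable run length at the end is roughly the number of observations since the injected changepoint.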


Intrinsic motivations and open-ended development in animals, humans, and robots: an overview
  • Article
  • Full-text available

September 2014 · 680 Reads · 71 Citations

[...] · Andrew Barto

Figure 1.  A. Rooms domain.
Vertices represent states (green = start, red = goal), and edges feasible transitions. B. Mean performance of three hierarchical reinforcement learning agents in the rooms task. Inset: Results based on four graph decompositions. Blue: decomposition from panel C. Purple: decomposition from panel D. Black: entire graph treated as one region. Orange: decomposition with orange vertices in panel A segregated out as singleton regions. Model evidence is on a log scale (data range  to ). Search time denotes the expected number of trial-and-error attempts to discover the solution to a randomly drawn task or subtask (geometric mean; range 685 to 65947; tick mark indicates the origin). Codelength signifies the number of bits required to encode the entire data-set under a Shannon code (range  to ). Note that the abscissa refers both to model evidence and codelength. Model evidence increases left to right, and codelength increases right to left. C. Optimal decomposition. D. An alternative decomposition.
Figure 2.  A. Graph studied by Schapiro et al. [41], showing the optimal decomposition.
B. Task display from Experiment 1. Participants used the computer mouse to select three locations adjacent to the probe location. C. Graph employed in Experiment 1, showing the optimal decomposition. Width of each gray ring indicates mean proportion of cases in which the relevant location was chosen. D. Graph studied in Experiments 2 and 3, showing the optimal decomposition (two regions, with central vertex grouped either to left or right). Top: Illustration of a “delivery” assignment from Experiment 3 (green = start, red = goal), where bottleneck (purple) and non-bottleneck (blue) probes called for a positive response. Bottom: An assignment where bottleneck and non-bottleneck probes called for a negative response. E. Mean correct response times from Experiment 3. Affirm: trials where the probe fell on the shortest path between the specified start and goal locations. Reject: trials where it did not. Purple: bottleneck probes. Blue: non-bottleneck probes. F. State-transition graph for the Tower of Hanoi puzzle, showing the optimal decomposition and indicating the start and goal configurations of the kind studied in Experiment 4. A different set of colors was used for the beads in the actual experiment. Furthermore, as explained under Methods, the beads were the same size. The changes were made here for display purposes.
Optimal Behavioral Hierarchy

August 2014 · 314 Reads · 138 Citations

Author Summary In order to accomplish everyday tasks, we often divide them up into subtasks: to make spaghetti, we (1) get out a pot, (2) fill it with water, (3) bring the water to a boil, and so forth. But how do we learn to subdivide our goals in this way? Work from computer science suggests that the way a task is subdivided or decomposed can have a dramatic impact on how easy the task is to accomplish: certain decompositions speed learning and planning compared to others. Moreover, some decompositions allow behaviors to be represented more simply. Despite this general insight, little work has been done to formalize these ideas. We outline a mathematical framework to address this question, based on methods for comparing between statistical models. We then present four behavioral experiments, showing that human learners spontaneously discover optimal task decompositions.
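To make the claim that some decompositions speed learning concrete, here is a toy calculation (not from the paper; the numbers are made up): if solving a task requires discovering one particular sequence of k choices from b options, blind trial and error needs on the order of b^k attempts, whereas reaching m intermediate subgoals of length k/m each needs only about m * b^(k/m).

```python
# Toy illustration (assumed numbers): expected blind-search effort for a flat
# task versus the same task split into subgoals.
b, k, m = 4, 6, 3                        # 4 options per step, 6 steps, 3 subgoals of 2 steps each

flat_attempts = b ** k                   # one monolithic search problem
decomposed_attempts = m * b ** (k // m)  # m smaller search problems

print(f"flat search:      ~{flat_attempts} candidate sequences")
print(f"with {m} subgoals: ~{decomposed_attempts} candidate sequences")
```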


Learning parameterized motor skills on a humanoid robot

May 2014 · 66 Reads · 42 Citations · Proceedings - IEEE International Conference on Robotics and Automation

We demonstrate a sample-efficient method for constructing reusable parameterized skills that can solve families of related motor tasks. Our method uses learned policies to analyze the policy space topology and learn a set of regression models which, given a novel task, appropriately parameterizes an underlying low-level controller. By identifying the disjoint charts that compose the policy manifold, the method can separately model the qualitatively different sub-skills required for solving distinct classes of tasks. Such sub-skills are useful because they can be treated as new discrete, specialized actions by higher-level planning processes. We also propose a method for reusing seemingly unsuccessful policies as additional, valid training samples for synthesizing the skill, thus accelerating learning. We evaluate our method on a humanoid iCub robot tasked with learning to accurately throw plastic balls at parameterized target locations.
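The sketch below is a deliberately minimal, hypothetical version of the idea in the abstract: given a few task parameters (say, target distances) and the low-level policy parameters that solved them, fit a regression and use it to propose policy parameters for a novel task. It omits the paper's chart identification and manifold analysis entirely; all names and data are invented.

```python
import numpy as np

# Hypothetical training data: task parameter (target distance, m) and the
# low-level policy parameter (throw velocity) learned for each task instance.
task_params = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
policy_params = np.array([1.9, 2.8, 3.5, 4.1, 4.6])

# Fit a simple polynomial regression from task space to policy space.
coeffs = np.polyfit(task_params, policy_params, deg=2)

def parameterized_skill(target_distance):
    """Map a novel task parameter to a proposed low-level policy parameter."""
    return np.polyval(coeffs, target_distance)

print("proposed velocity for a 1.75 m target:",
      round(float(parameterized_skill(1.75)), 2))
```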



Learning grounded finite-state representations from unstructured demonstrations

January 2014 · 71 Reads · 231 Citations · The International Journal of Robotics Research

Robots exhibit flexible behavior largely in proportion to their degree of knowledge about the world. Such knowledge is often meticulously hand-coded for a narrow class of tasks, limiting the scope of possible robot competencies. Thus, the primary limiting factor of robot capabilities is often not the physical attributes of the robot, but the limited time and skill of expert programmers. One way to deal with the vast number of situations and environments that robots face outside the laboratory is to provide users with simple methods for programming robots that do not require the skill of an expert. For this reason, learning from demonstration (LfD) has become a popular alternative to traditional robot programming methods, aiming to provide a natural mechanism for quickly teaching robots. By simply showing a robot how to perform a task, users can easily demonstrate new tasks as needed, without any special knowledge about the robot. Unfortunately, LfD often yields little knowledge about the world, and thus lacks robust generalization capabilities, especially for complex, multi-step tasks. We present a series of algorithms that draw from recent advances in Bayesian non-parametric statistics and control theory to automatically detect and leverage repeated structure at multiple levels of abstraction in demonstration data. The discovery of repeated structure provides critical insights into task invariants, features of importance, high-level task structure, and appropriate skills for the task. This culminates in the discovery of a finite-state representation of the task, composed of grounded skills that are flexible and reusable, providing robust generalization and transfer in complex, multi-step robotic tasks. These algorithms are tested and evaluated using a PR2 mobile manipulator, showing success on several complex real-world tasks, such as furniture assembly.
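The following is a deliberately simplified stand-in for the pipeline the abstract describes: split demonstrations at changepoints, cluster the segments into "skills," and record which skill follows which, yielding a small finite-state description of the task. The actual algorithms use Bayesian nonparametric models; here segmentation is a fixed threshold and clustering is a tiny k-means, purely for illustration.

```python
import numpy as np

def segment(demo, threshold=1.0):
    """Split a 1-D demonstration wherever the step change exceeds a threshold
    (a crude stand-in for Bayesian changepoint detection)."""
    cuts = [0] + [i for i in range(1, len(demo))
                  if abs(demo[i] - demo[i - 1]) > threshold] + [len(demo)]
    return [demo[a:b] for a, b in zip(cuts[:-1], cuts[1:])]

# Two toy demonstrations of a "move low, then move high" task.
demos = [np.concatenate([np.full(5, 0.1), np.full(5, 2.0)]),
         np.concatenate([np.full(4, 0.2), np.full(6, 2.1)])]

# Segment every demonstration and describe each segment by its mean value.
segments_per_demo = [segment(d) for d in demos]
all_feats = np.array([s.mean() for segs in segments_per_demo for s in segs])

# Cluster segments into "skills" (stand-in for nonparametric skill discovery).
centers = np.array([all_feats.min(), all_feats.max()])
for _ in range(10):
    labels = np.argmin(np.abs(all_feats[:, None] - centers[None, :]), axis=1)
    centers = np.array([all_feats[labels == k].mean() for k in range(2)])

# Rebuild each demonstration as a sequence of skill labels and record the
# observed transitions, giving a small finite-state description of the task.
transitions, i = set(), 0
for segs in segments_per_demo:
    seq = labels[i:i + len(segs)]
    i += len(segs)
    transitions.update(zip(seq[:-1].tolist(), seq[1:].tolist()))

print("discovered skill transitions:", transitions)
```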


A Computational Hypothesis for Allostasis: Delineation of Substance Dependence, Conventional Therapies, and Alternative Treatments

December 2013 · 100 Reads · 8 Citations

The allostatic theory of drug abuse describes the brain’s reward system alterations as substance misuse progresses. Neural adaptations arising from the reward system itself and from the antireward system provide the subject with functional stability, while affecting the person’s mood. We propose a computational hypothesis describing how a virtual subject’s drug consumption, cognitive substrate, and mood interface with reward and antireward systems. Reward system adaptations are assumed interrelated with the ongoing neural activity defining behavior toward drug intake, including activity in the nucleus accumbens, ventral tegmental area, and prefrontal cortex (PFC). Antireward system adaptations are assumed to mutually connect with higher-order cognitive processes occurring within PFC, orbitofrontal cortex, and anterior cingulate cortex. The subject’s mood estimation is a provisional function of reward components. The presented knowledge repository model incorporates pharmacokinetic, pharmacodynamic, neuropsychological, cognitive, and behavioral components. Patterns of tobacco smoking exemplify the framework’s predictive properties: escalation of cigarette consumption, conventional treatments similar to nicotine patches, and alternative medical practices comparable to meditation. The primary outcomes include an estimate of the virtual subject’s mood and the daily account of drug intakes. The main limitation of this study resides in the 21 time-dependent processes which partially describe the complex phenomena of drug addiction and involve a large number of parameters which may underconstrain the framework. Our model predicts that reward system adaptations account for mood stabilization, whereas antireward system adaptations delineate mood improvement and reduction in drug consumption. This investigation provides formal arguments encouraging current rehabilitation therapies to include meditation-like practices along with pharmaceutical drugs and behavioral counseling.


Table 1 | The typical features of novelty and surprise. 
Novelty or Surprise?

December 2013 · 830 Reads · 338 Citations

Novelty and surprise play significant roles in animal behavior and in attempts to understand the neural mechanisms underlying it. They also play important roles in technology, where detecting observations that are novel or surprising is central to many applications, such as medical diagnosis, text processing, surveillance, and security. Theories of motivation, particularly of intrinsic motivation, place novelty and surprise among the primary factors that arouse interest, motivate exploratory or avoidance behavior, and drive learning. In many of these studies, novelty and surprise are not distinguished from one another: the words are used more-or-less interchangeably. However, while undeniably closely related, novelty and surprise are very different. The purpose of this article is first to highlight the differences between novelty and surprise and to discuss how they are related by presenting an extensive review of mathematical and computational proposals related to them, and then to explore the implications of this for understanding behavioral and neuroscience data. We argue that opportunities for improved understanding of behavior and its neural basis are likely being missed by failing to distinguish between novelty and surprise.
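A compact way to see the distinction the article draws is to compute both quantities for the same stream of observations. In the hypothetical sketch below, "novelty" is based on how rarely a symbol has been encountered before, while "surprise" is the Shannon surprise of the observation under the agent's current predictive model; a familiar symbol can still be surprising if the model has come to expect something else. The stream and smoothing parameter are illustrative.

```python
import numpy as np

def run(stream, alpha=1.0):
    """Track a count-based novelty signal and a Shannon-surprise signal
    (-log2 predicted probability) over a stream of discrete observations."""
    symbols = sorted(set(stream))
    counts = {s: 0 for s in symbols}
    for x in stream:
        total = sum(counts.values())
        # Predictive probability of x under a Dirichlet-smoothed count model.
        p = (counts[x] + alpha) / (total + alpha * len(symbols))
        novelty = 1.0 / (1 + counts[x])   # high when x has rarely been seen
        surprise = -np.log2(p)            # high when x was not expected now
        print(f"obs={x}  novelty={novelty:.2f}  surprise={surprise:.2f} bits")
        counts[x] += 1

# The final 'A' is not novel (it has been seen before) but remains surprising,
# because by then the model strongly expects 'B'.
run(list("ABBBBBBBBA"))
```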


FIGURE 1. Schematic of the simulated robot used in the experiments. The robot is planar and has a total of 10 degrees of freedom. The base is mobile and can move vertically and horizontally. Each arm has four rotational joints, and no joint limits are imposed. The hand and joints corresponding to the left arm are marked with an open square; those of the right arm are marked with a closed circle. Each arm link is one unit in length, and the base is a rectangular box that is one unit wide and 0.2 units high. 
FIGURE 3. Summary of algorithm presented in this article. Symbols used are defined in the text. Also, G is the set of all goals; the function argmax_a Q(s, a) returns the action a corresponding to the maximum Q(s, a) for state s; and initial values of each Q(s, a) are set to −∞.
FIGURE 4. Schematic representations of the four tasks described in this article. In each panel, the starting configuration (q0) of the robot is drawn. As in Figure 1, the hand and four joints of the left arm are marked with open squares, while those of the right arm are marked with closed circles. The left two panels refer to the one-armed tasks, in which only the right arm is used, so the left arm is not drawn. In addition, goals that must be hit are drawn as open circles, the radii of which are θg (= 0.1). The number above each goal indicates the spatial order in which they must be hit.
FIGURE 7. Follows same conventions as in Figure 6. Left column, initial solution; middle, exploration condition 3 (EC3); right, exploration condition 4 (EC4).
A Dual Process Account of Coarticulation in Motor Skill Acquisition

November 2013 · 254 Reads · 8 Citations · Journal of Motor Behavior

Many tasks, such as typing a password, are decomposed into a sequence of subtasks that can be accomplished in many ways. Behavior that accomplishes subtasks in ways that are influenced by the overall task is often described as "skilled" and exhibits coarticulation. Many accounts of coarticulation use search methods that are informed by representations of objectives that define "skilled." While they aid in describing the strategies the nervous system may follow, they are computationally complex and may be difficult to attribute to brain structures. Here, we present a biologically inspired account whereby skilled behavior is developed through two simple processes: (1) a corrective process that ensures that each subtask is accomplished, but does not do so skillfully, and (2) a reinforcement learning process that finds better movements using "trial and error" search that is not informed by representations of any objectives. We implement our account as a computational model controlling a simulated two-armed kinematic "robot" that must hit a sequence of goals with its hands. Behavior displays coarticulation in terms of which hand was chosen, how the corresponding arm was used, and how the other arm was used, suggesting that our account can participate in the development of skilled behavior.
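The toy sketch below captures the flavor of the two processes in one dimension (it is not the article's arm model; all quantities are made up): a corrective controller always closes whatever error remains, guaranteeing the goal is reached, while an unguided trial-and-error process perturbs the open-loop command and keeps perturbations that reduce how much correction was needed.

```python
import numpy as np

rng = np.random.default_rng(0)
goal = 1.0      # target position (hypothetical units)
command = 0.0   # open-loop command shaped by trial and error
gain = 0.6      # effectiveness of the open-loop command

def trial(cmd):
    """Execute one movement: the open-loop command acts first, then a
    corrective process closes the remaining error so the goal is always met."""
    reached = gain * cmd
    correction = goal - reached            # effort supplied by the corrective process
    return abs(correction)

for episode in range(200):
    # Trial-and-error process: try a random perturbation of the command and keep
    # it only if less corrective effort was needed (no gradient, no model).
    candidate = command + rng.normal(scale=0.1)
    if trial(candidate) < trial(command):
        command = candidate

print(f"learned open-loop command: {command:.2f} (ideal ~{goal / gain:.2f})")
print(f"corrective effort remaining: {trial(command):.3f}")
```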


Citations (94)


... This approach allows them to use deep neural networks to model the uncertainties of the environment, which leads to a more robust controller compared to traditional ones. Later, Konidaris et al. [22] propose using RL to automate skill acquisition on a mobile manipulator. Unlike DL, RL makes it possible to automatically obtain the experience needed to learn robotic skills through trial and error and to learn complex decision-making policies. ...

Reference: Learning positioning policies for mobile manipulation operations with deep reinforcement learning
Autonomous Skill Acquisition on a Mobile Manipulator

Proceedings of the AAAI Conference on Artificial Intelligence

... Examples include REINFORCE [27], Advantage Actor-Critic (A2C) [28], and Proximal Policy Optimization (PPO) [29]. Actor-critic architectures [30] consist of two neural networks: one for the policy (actor) and one for the value function (critic). The actor chooses actions, while the critic evaluates them to guide the actor's learning. ...
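Since this excerpt summarizes the actor-critic idea, here is a minimal tabular sketch (illustrative, not the architecture reviewed in the cited article): a softmax actor over action preferences and a TD-learning critic whose one-step TD error both updates the value estimate and reinforces the action just taken, on a small made-up corridor task.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2           # small corridor: action 0 = left, 1 = right
V = np.zeros(n_states)               # critic: state-value estimates
H = np.zeros((n_states, n_actions))  # actor: action preferences
alpha_v, alpha_pi, gamma = 0.1, 0.1, 0.95

def step(state, action):
    """Corridor dynamics: reward 1 for stepping past the right end, else 0."""
    nxt = state + (1 if action == 1 else -1)
    if nxt >= n_states:
        return None, 1.0             # goal reached, episode ends
    return max(nxt, 0), 0.0

for episode in range(500):
    s = 0
    while s is not None:
        # Actor: sample an action from a softmax over preferences.
        prefs = H[s] - H[s].max()
        probs = np.exp(prefs) / np.exp(prefs).sum()
        a = rng.choice(n_actions, p=probs)

        s_next, r = step(s, a)
        v_next = 0.0 if s_next is None else V[s_next]

        # Critic: one-step TD error drives both the value and the policy updates.
        delta = r + gamma * v_next - V[s]
        V[s] += alpha_v * delta
        grad = -probs
        grad[a] += 1.0               # gradient of log softmax w.r.t. preferences
        H[s] += alpha_pi * delta * grad
        s = s_next

print("state values:", np.round(V, 2))
print("P(move right):", np.round(np.exp(H[:, 1]) / np.exp(H).sum(axis=1), 2))
```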

Looking Back on the Actor-Critic Architecture

IEEE Transactions on Systems Man and Cybernetics Systems

... In recent years, research in AI and cognitive robotics has pushed the boundaries of what we understand by autonomy in machines [42,43], moving beyond mere automation to embrace a form of autonomy that includes decision-making, motivational capabilities and open-ended learning processes [44,45,46]. This progression is particularly visible in the domain of intrinsically motivated learning [47], where artificial agents are endowed with general goals, such as curiosity, allowing them to explore their environment and autonomously set tasks aimed at increasing their knowledge and skills [48,49]. These developments open the door to agents that can operate in dynamic and unpredictable environments with minimal supervision, discovering solutions to novel problems along the way. ...

Editorial: Intrinsically Motivated Open-Ended Learning in Autonomous Robots

Frontiers in Neurorobotics

... To assess and achieve anti-discrimination fairness, discrimination statistics that measure the average similarity of decisions across groups are used [30]. In addition to consistency and anti-discrimination, a third concept is counterfactual fairness, which ensures an algorithm's decisions remain consistent across hypothetical scenarios where individuals' protected attributes are altered [19]. Typically, causal models that describe how changes in protected attributes affect decisions and other attributes of individuals are used to assess and achieve counterfactual fairness. ...
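As a small illustration of the kind of discrimination statistic the excerpt mentions, the hypothetical sketch below compares average decision rates across two groups; the decisions and group labels are entirely made up.

```python
import numpy as np

# Hypothetical decisions (1 = approve) and protected-group membership.
decisions = np.array([1, 0, 1, 1, 0, 1, 0, 0, 0, 1])
group     = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Demographic-parity-style discrimination statistic: difference in average
# decision rates between the two groups (0 would mean identical rates).
rate_a = decisions[group == 0].mean()
rate_b = decisions[group == 1].mean()
print(f"group 0 rate: {rate_a:.2f}, group 1 rate: {rate_b:.2f}, "
      f"gap: {abs(rate_a - rate_b):.2f}")
```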

Preventing undesirable behavior of intelligent machines
  • Citing Article
  • November 2019

Science

... The book by Sutton and Barto [18] has become a foundational resource in the field of reinforcement learning. Temporal difference (TD) learning algorithms are widely used in various reinforcement learning (RL) applications, including game playing [1], robotics, and finance, due to their efficiency and ability to learn directly from interactions with the environment. ...
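For reference, here is the basic tabular TD(0) value update the excerpt alludes to, applied to a tiny random-walk example in the style of Sutton and Barto's book; the environment size and step sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 5                  # random walk: terminate left (reward 0) or right (reward 1)
V = np.full(n_states, 0.5)    # value estimates for the non-terminal states
alpha, gamma = 0.1, 1.0

for episode in range(2000):
    s = n_states // 2         # start in the middle
    while True:
        s2 = s + rng.choice([-1, 1])
        if s2 < 0 or s2 >= n_states:                 # terminal step
            r = 1.0 if s2 >= n_states else 0.0
            V[s] += alpha * (r - V[s])               # TD(0) with V(terminal) = 0
            break
        V[s] += alpha * (0.0 + gamma * V[s2] - V[s]) # TD(0) update
        s = s2

print("estimated values:", np.round(V, 2))           # true values: 1/6, 2/6, ..., 5/6
```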

Reinforcement Learning: Connections, Surprises, and Challenges

... Meanwhile, in 2017, researchers from UMass Amherst and Hewlett Packard Labs, for the first time, demonstrated analog input, analog weight, and analog output on an integrated 128×64 1T1M array with discrete off-chip peripheral circuits for analog signal processing and image compression tasks [10]. Soon after, various ML algorithms were experimentally implemented on the same platform, including in-situ training of multilayer perceptrons (MLP) [11], convolutional neural networks (CNN) [12], long short-term memory (LSTM) [13], and reinforcement learning (RL) [14]. ...

Reinforcement learning with analogue memristor arrays

... Reinforcement learning is a widely used model of learning mechanisms, characterized by an algorithm in which an agent learns to choose the optimal behavior in an environment by acquiring rewards through interactions with it [14,15]. Standard models include the temporal difference (TD) learning model [16], the Rescorla-Wagner (RW) model [17,18], and the Q-learning model [18]. ...
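Since the excerpt names the Rescorla-Wagner model, here is its one-line update written out in code on a toy conditioning schedule; the stimuli, saliences, and reward asymptote are illustrative.

```python
# Rescorla-Wagner: dV_i = alpha_i * beta * (lambda - sum of V over stimuli present)
alphas = {"light": 0.3, "tone": 0.3}   # stimulus saliences
beta, lam = 1.0, 1.0                   # learning rate for reward, asymptote of learning
V = {"light": 0.0, "tone": 0.0}

# Phase 1: light alone is paired with reward; Phase 2: light+tone compound is rewarded.
schedule = [("light",)] * 20 + [("light", "tone")] * 20
for stimuli in schedule:
    prediction = sum(V[s] for s in stimuli)
    error = lam - prediction                       # prediction error drives learning
    for s in stimuli:
        V[s] += alphas[s] * beta * error

print({k: round(v, 2) for k, v in V.items()})      # tone is "blocked": it stays near 0
```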

Enriching Behavioral Ecology with Reinforcement Learning Methods

Behavioural Processes

... This D4PG variant learns the learning rate of the Lagrange multiplier in a soft-constrained optimization procedure. Thomas et al. (2017) propose a new framework for designing machine learning algorithms that simplifies the problem of specifying and regulating undesired behaviours. There have also been approaches to learn a policy that satisfies constraints in the presence of perturbations to the dynamics of an environment. ...
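The framework referenced here can be illustrated with a very small sketch of its "candidate selection plus safety test" structure: pick a candidate on one data split, then refuse to return it unless a high-confidence bound on held-out data says the behavioral constraint holds. The bound, data, and constraint below are hypothetical stand-ins, not the paper's algorithm.

```python
import numpy as np

def hoeffding_upper_bound(samples, delta=0.05):
    """One-sided (1 - delta) upper confidence bound on the mean of values in [0, 1]."""
    return samples.mean() + np.sqrt(np.log(1 / delta) / (2 * len(samples)))

def train_with_safety_test(candidate_errors, safety_errors, max_error=0.2):
    """Seldonian-style structure: choose a candidate, then run a safety test.
    Returns the candidate index, or None ("no solution found") if the test fails."""
    candidate = int(np.argmin([e.mean() for e in candidate_errors]))
    if hoeffding_upper_bound(safety_errors[candidate]) <= max_error:
        return candidate
    return None   # refuse to deploy rather than risk violating the constraint

rng = np.random.default_rng(0)
# Hypothetical per-example error indicators (1 = undesirable behavior) for two models,
# split into candidate-selection data and held-out safety-test data.
candidate_errors = [rng.binomial(1, p, 200).astype(float) for p in (0.25, 0.10)]
safety_errors    = [rng.binomial(1, p, 200).astype(float) for p in (0.25, 0.10)]

print("deployed model:", train_with_safety_test(candidate_errors, safety_errors))
```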

On Ensuring that Intelligent Machines Are Well-Behaved

... Columns present the predicted classes and rows the actual classes. This orientation of the matrix is used in many sources [19][20][21][22][23][24][25][26]; however, other sources [27][28][29] use the opposite convention. This means that adding the right headings in this kind of presentation is very important to avoid misunderstanding. ...

Adaptive Real-Time Dynamic Programming
  • Citing Chapter
  • January 2017