Towards Learning from Implicit Human Reward
(Extended Abstract)
Guangliang Li, Hamdi Dibeklioğlu, Shimon Whiteson, and Hayley Hung
Ocean University of China, Qingdao, China & University of Amsterdam, Amsterdam, The Netherlands
Delft University of Technology, Delft, The Netherlands
University of Oxford, Oxford, UK
g.li@uva.nl, {h.dibeklioglu, h.hung}@tudelft.nl, shimon.whiteson@cs.ox.ac.uk
ABSTRACT
The TAMER framework provides a way for agents to learn to solve tasks using human-generated rewards. Previous research showed that humans give copious feedback early in training but very sparsely thereafter, and that an agent's competitive feedback — informing the trainer about its performance relative to other trainers — can greatly affect the trainer's engagement and the agent's learning. In this paper, we present the first large-scale study of TAMER, involving 561 subjects, which investigates the effect of the agent's competitive feedback in a new setting as well as the potential for learning from trainers' facial expressions. Our results show for the first time that a TAMER agent can successfully learn to play Infinite Mario, a challenging reinforcement-learning benchmark problem. In addition, our study supports prior results demonstrating the importance of bi-directional feedback and competitive elements in the training interface. Finally, our results shed light on the potential for using trainers' facial expressions as reward signals, as well as the role of age and gender in trainer behavior and agent performance.
Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning
General Terms
Performance, Human Factors, Experimentation
Keywords
Reinforcement learning; human-agent interaction
1. INTRODUCTION
Socially intelligent autonomous agents have the potential to become our high-tech companions in the family of the future. The ability of these intelligent agents to efficiently learn from non-technical users to perform a task in a natural way will be key to their success. Therefore, it is critical to develop methods that facilitate the interaction between these non-technical users and agents, through which they can transfer task knowledge effectively to such agents.
Learning from human reward, i.e., evaluations of the quality of the agent's behavior, has proven to be a powerful technique for facilitating the teaching of artificial agents by their human users [2, 8, 4].
Compared to learning from demonstration [1], learning from human reward does not require the human to be able to perform the task well herself; she needs only to be a good judge of performance. Nonetheless, agent learning from human reward is limited by the quality of the interaction between the human trainer and the agent.
Previous research shows that the interaction between the agent and the trainer should ideally be bi-directional [5, 6, 7] and that if an agent informs the trainer about its past and current performance and its performance relative to others, the trainer will provide more feedback and the agent will ultimately perform better. This paper presents the results of the first large-scale study of TAMER—a popular method for enabling autonomous agents to learn from human reward [4]—by implementing it in the Infinite Mario domain. Our study was conducted at a science museum in Amsterdam using 561 museum visitors as subjects and investigates the effect of the agent's socio-competitive feedback in a new setting. In addition, we study the potential of using facial expressions as reward signals, since several TAMER studies have shown that humans give copious feedback early in training but very sparsely thereafter [3, 5].
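To make the learning setup concrete, the following sketch illustrates the core TAMER idea from [4]: the agent fits a model of the trainer's reward over state-action features and greedily selects the action its model predicts will be most reinforced. This is a minimal illustrative sketch in Python, assuming a linear per-action reward model and a generic feature vector; it is not the implementation used in our Infinite Mario study.

```python
import numpy as np

class TamerAgent:
    """Minimal TAMER-style learner (illustrative sketch, not the study's implementation).

    The agent maintains one linear model of human reward per action and acts
    greedily with respect to the predicted human reward, as described in [4].
    """

    def __init__(self, n_features: int, n_actions: int, lr: float = 0.05):
        self.w = np.zeros((n_actions, n_features))  # per-action reward models
        self.lr = lr
        self.n_actions = n_actions

    def predict_reward(self, features: np.ndarray, action: int) -> float:
        return float(self.w[action] @ features)

    def act(self, features: np.ndarray) -> int:
        # Exploit the learned model: pick the action expected to be most reinforced.
        return int(np.argmax([self.predict_reward(features, a)
                              for a in range(self.n_actions)]))

    def update(self, features: np.ndarray, action: int, human_reward: float) -> None:
        # Incremental gradient step toward the trainer's reward signal.
        error = human_reward - self.predict_reward(features, action)
        self.w[action] += self.lr * error * features
```

In the study itself, trainers deliver reward through key presses; the ‘facial expression’ conditions only change what trainers are told about how their facial expressions will be used.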
Our experimental results show for the first time that a TAMER agent can successfully learn to play Infinite Mario, a challenging reinforcement-learning benchmark problem. Moreover, our study provides large-scale support of the results of Li et al. [5, 6] demonstrating the importance of bi-directional feedback and competitive elements in the training interface and sheds light on the potential for using trainers' facial expressions as reward signals, as well as the role of age and gender in trainer behavior and agent performance.
2. EXPERIMENT CONDITIONS
In our study at the science museum in Amsterdam involving 561 subjects, we test two independent variables: ‘competition’—whether the agent gives the trainer competitive feedback about its performance relative to other trainers' agents—and ‘facial expression’—whether trainers were told that their facial expressions would be used in addition to key presses to train the agent. The main idea of the facial expression condition is to examine the effect that the additional modality of facial expressions could have on the cognitive load of trainers and whether this varies depending on age or gender.
We investigate how ‘competition’ and ‘facial expression’ affect the agent's learning performance and the trainer's facial expressiveness in four experimental conditions: the control condition—without ‘competition’ or ‘facial expression’; the facial expression condition—with ‘facial expression’ but without ‘competition’; the competitive condition—with ‘competition’ but without ‘facial expression’; and the competitive facial expression condition—with both. We hypothesize that ‘competition’ will result in better performing agents and that ‘facial expression’ will result in worse agent performance. In addition, we expect that both ‘competition’ and ‘facial expression’ will increase the trainer's facial expressiveness.

Figure 1: Mean number of time steps with feedback per 200 time steps for all four conditions (control, facial expression, competitive, competitive facial expression) during the training process.
3. EXPERIMENTAL RESULTS
Figure 2: Distribution of final offline performance across the four conditions (Control, FE, Competitive, Competitive FE); each panel is a histogram of the number of subjects over final offline performance. FE = Facial Expression.
Figure 1 shows how feedback was distributed per 200 time steps over the learning process for the four conditions. The number of time steps with feedback received by agents in all four conditions increased in the early training stage and decreased dramatically afterwards, which supports previous studies [3, 5] and our motivation for investigating methods of enabling agents to learn from the trainer's facial expressions. In addition, Figure 1 shows that the agent's competitive feedback can increase the amount of feedback given by the trainer before 1000 time steps.
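The per-bin counts underlying a curve like the one in Figure 1 can be computed with a short routine of the following form; the log format (one boolean per time step indicating whether feedback was given) is an assumption made for illustration.

```python
import numpy as np

def feedback_per_bin(feedback_given, bin_size=200):
    """Count time steps with feedback in consecutive bins of `bin_size` steps.

    `feedback_given` is a 1-D boolean array with one entry per time step,
    True when the trainer gave feedback at that step (assumed log format).
    """
    feedback_given = np.asarray(feedback_given, dtype=bool)
    n_bins = int(np.ceil(len(feedback_given) / bin_size))
    counts = np.zeros(n_bins, dtype=int)
    for i in range(n_bins):
        counts[i] = feedback_given[i * bin_size:(i + 1) * bin_size].sum()
    return counts

# Averaging the per-bin counts over all trainers in one condition yields one
# curve of the kind plotted in Figure 1.
```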
Figure 2 shows histograms of the distribution of the final offline performance for the four conditions. Further analysis with an n-way ANOVA shows that ‘competition’ significantly improves agent learning (p = 0.035) and helps the best trainers the most (p = 0.01). In addition, our results suggest that ‘facial expression’ has a significantly negative effect on agent training by female subjects, especially those who are less than 13 years old (p = 0.008) and those who cannot train agents to perform well (p = 0.01).
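An analysis of this kind can be reproduced with an n-way (here, two-way) ANOVA; the sketch below uses statsmodels, and the data frame and column names are hypothetical placeholders rather than the study data.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: one row per subject, with the two condition factors and
# the final offline performance of the agent that subject trained.
df = pd.DataFrame({
    "competition":       [0, 0, 1, 1, 0, 1, 1, 0],
    "facial_expression": [0, 1, 0, 1, 1, 0, 1, 0],
    "performance":       [120, 80, 210, 150, 60, 190, 170, 100],
})

# Two-way ANOVA with interaction: performance ~ competition * facial expression.
model = ols("performance ~ C(competition) * C(facial_expression)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```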
Furthermore, our analysis shows that telling trainers to use facial expressions inclines them to exaggerate their expressions, resulting in higher accuracy when predicting positive and negative feedback from facial expressions. The competitive conditions also elevated facial expressiveness and further increased prediction accuracy. This has significant consequences for the design of agent learning systems that wish to take a trainer's spontaneous facial expressions into account as a reward signal. Further investigation into the nature of spontaneous and posed facial expressions is needed, in particular in terms of their relation to feedback quality and quantity.
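To illustrate how facial responses could serve as a reward signal, the sketch below trains a simple classifier to predict the sign of the trainer's explicit feedback from facial features; the feature representation, labels, and classifier are illustrative assumptions and not the model used in our analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical data: one facial-feature vector per feedback event (e.g., facial
# landmark or action-unit descriptors) and the sign of the key-press feedback
# given around that moment (+1 positive, -1 negative).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))                       # placeholder facial features
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=500))    # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("prediction accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

A predictor of this kind could, in principle, stand in for explicit feedback late in training, when key presses become sparse.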
Acknowledgments
This research was part of Science Live, the innovative research programme of Science Center NEMO in Amsterdam that enables scientists to carry out real, publishable, peer-reviewed research using NEMO visitors as volunteers. Science Live is partially funded by KNAW and NWO.
REFERENCES
[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[2] C. Isbell, C. R. Shelton, M. Kearns, S. Singh, and P. Stone. A social reinforcement learning agent. In Proceedings of the Fifth International Conference on Autonomous Agents, pages 377–384. ACM, 2001.
[3] W. B. Knox, B. D. Glass, B. C. Love, W. T. Maddox, and P. Stone. How humans teach agents. International Journal of Social Robotics, 4(4):409–421, 2012.
[4] W. B. Knox and P. Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, pages 9–16. ACM, 2009.
[5] G. Li, H. Hung, S. Whiteson, and W. B. Knox. Using informative behavior to increase engagement in the TAMER framework. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 909–916, 2013.
[6] G. Li, H. Hung, S. Whiteson, and W. B. Knox. Learning from human reward benefits from socio-competitive feedback. In Proceedings of the Fourth Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics, pages 93–100, 2014.
[7] G. Li, S. Whiteson, W. B. Knox, and H. Hung. Using informative behavior to increase engagement while learning from human reward. Autonomous Agents and Multi-Agent Systems, pages 1–23, 2015.
[8] A. L. Thomaz and C. Breazeal. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6):716–737, 2008.