
Human Natural Instruction of a Simulated Electronic Student.


Humans naturally use multiple modes of instruction while teaching one another. We would like our robots and artificial agents to be instructed in the same way, rather than programmed. In this paper, we review prior work on human instruction of autonomous agents and present observations from two exploratory pilot studies and the results of a full study investigating how multiple instruction modes are used by humans. We describe our Bootstrapped Learning User Interface, a prototype multi-instruction interface informed by our human-user studies. Copyright © 2011, Association for the Advancement of Artificial Intelligence. All rights reserved.
Tasneem Kaochar, Raquel Torres Peralta, Clayton T. Morrison, Thomas J. Walsh,
Ian R. Fasel, Sumin Beyon, Anh Tran, Jeremy Wright and Paul R. Cohen
The University of Arizona, Department of Computer Science, Tucson, AZ 85721-0077
{tkaochar,clayton,twalsh,ianfasel}@cs.arizona.edu
Abstract
Humans naturally use multiple modes of instruction while
teaching one another. We would like our robots and arti-
ficial agents to be instructed in the same way, rather than
programmed. In this paper, we review prior work on hu-
man instruction of autonomous agents and present the results
of three studies investigating how multiple instruction modes
are used by humans. We describe our Bootstrapped Learn-
ing User Interface, a prototype multi-instruction interface in-
formed by our human-user studies.
Introduction
Humans are remarkably facile at imparting knowledge to
one another. Through the lens of the various kinds of state of
the art machine learning algorithms we can identify multiple
modes of natural human instruction: we define concepts, we
describe and provide demonstrations of procedures, we give
examples of rules and conditions, and we provide various
kinds of feedback to student behavior. These methods are
used depending on the kind of concept that is being taught
(concept definitions, conditions, rules, procedures) and the
conditions under which the teacher and student find them-
selves. Just as important, the student is an equal participant
in the lesson, able to learn and recognize conventions, and
actively observing and constructing a model of the situation
the teacher is presenting in the lesson.
The familiar and readily available systems of instruction
that are used in human-to-human teaching stand in sharp
contrast with how we currently get computers to do what we
want, despite the fact that computers appear to share with
humans a kind of universal flexibility: we have been able
to make computers perform an enormous variety of com-
plex tasks. These capabilities are currently only achieved as
the result of often costly and labor intensive programming
by human engineers—we get computers to do what we want
through a process that is closer to brain surgery than instruc-
tion. This is true even for state of the art machine learning,
where algorithms are capable of extracting patterns, classifying
noisy instances, and learning complex procedures. But
human engineers must provide data in just the right form,
with the correct training data, and even then must often explore
through trial and error to get the agent to learn what is
desired.
Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
The goal of human-instructable computing is to build
an “electronic student” that can be taught using the
same natural instruction methods humans (specifically non-
programmers) use. An electronic student architecture will
naturally include a variety of state of the art machine learn-
ing algorithms, but the key challenge is to provide the in-
terface between them and the natural instruction methods
used by humans. There is now a growing body of literature
from researchers studying the intersection between human-
computer interfaces and machine learning. However, to
date, the focus of this work has been on particular individ-
ual modes for human instruction rather than the full gamut.
The next step is to understand how to provide several modes
of instruction in the interface, and to better understand how
humans might use such an interface to teach.
In this paper, we describe a series of three studies intended
to uncover how humans might naturally instruct a capa-
ble electronic student, using a broad spectrum of instruction
types. We aim to design an interface that can accommodate
the variety of natural instruction modes that humans appear
to use, and understand when and how these modes are used.
Prior Work
Many prior efforts have described agents that interact with
human teachers. However, most of these works used only
one mode of teacher-student interaction (e.g. teaching by
demonstration) over the agent’s lifetime. We can roughly
classify this existing work into three categories based on the
kind of feedback the teacher can pass to the student: teach-
ing by demonstration, teaching concepts by example, and
teaching through reinforcement. We now describe these in-
teractions in more detail and provide examples from the lit-
erature.
In teaching by demonstration, a teacher has the same con-
trol interface to a dynamical system as the student does, and
is able to provide traces of proper behavior. The learning in
this case does not need to be pure mimicry, and instead en-
ables the acquisition of a higher-level policy from the traces,
allowing the student to perform correctly in new situations.
For instance, after seeing the teacher navigate around a crate,
the agent may be able to navigate around similar obstacles in
different locations. Prior work in learning from demonstra-
tion has appeared in several branches of the reinforcement
learning literature, including using traces in a batch setting
to bootstrap robot behaviors (Smart and Kaelbling 2002;
Argall et al. 2009) and in extended apprenticeship learning
over the agent’s lifetime (Walsh et al. 2010). Often these
works espoused the use of autonomous learning in concert
with the teacher-provided traces. In our case, we will be
accompanying the demonstration traces by other forms of
teacher interaction that will allow for the same fine grained
tuning of behavior.
Another general form of teacher-student interaction is
teaching by examples of concepts. This protocol shad-
ows the standard supervised-learning interaction in that the
teacher is responsible for providing labeled examples of a
concept to be learned. However, the teacher may also pro-
vide explanations or hints as to why a specific example was
classified a certain way. An example of this framework
is the WILL system (Natarajan et al. 2010) for Inductive
Logic Programming, which combines traditional concept
learning machinery with users’ indications of important def-
initional components. Other systems for more traditional
classification problems, such as training support vector ma-
chines (Chernova and Veloso 2009) and email classification
(Stumpf et al. 2009) have used this paradigm, with the lat-
ter focussing explicitly on natural ways that humans provide
reasons for categorization decisions. Similar techniques can
be helpful in learning conditional dynamics (e.g. you cannot
walk through walls), as was the case in work on learning ob-
ject categories and affordances (Thomaz and Cakmak 2009).
While learning concepts is often an essential component of
a task, this framework does not allow for specific actions
(other than labeling) to be designated as good or bad, and
while demonstration provides an indirect channel for such
guidance (by only showing good behavior), we now consider
a third channel of teacher interaction for more fine-grained
behavioral refinement.
In teaching through reinforcement, the teacher is able to
give a feedback signal indicating a degree of happiness (or
unhappiness) with the agent’s behavior, either at specific
timesteps or when an episode has ended. This form of feed-
back has been shown to be moderately successful in com-
plex tasks such as a simulated cooking scenario (Thomaz,
Hoffman, and Breazeal 2006) and tactical battles in a real-
time strategy game (Judah et al. 2010). However, recent
work (Knox and Stone 2010; Thomaz and Breazeal 2008)
has indicated that many “natural” ways of incorporating nu-
meric or ordinal feedback from humans into a reinforcement
learning problem can be perilous, as humans often provide
incompatible feedback or do not follow “standard” defini-
tions of reward and value. As such, this channel is usually
best suited for fine-grained refinements that are not easily
teachable through the protocols discussed above.
Methodology
In contrast to these works, the interface we are designing
is built around allowing the human teacher to use instantiations
of all of these instruction types during learning. This
allows both the teaching of high-level concepts and
fine-grained adjustments of undesired behavior
based on direct reinforcement or demonstrations of specific
situations. In doing so, we allow the human teacher to make
use of techniques that have been proven effective in studies
of specific environments like the ones above, without pre-
cluding domains where single techniques may be of no avail.
However, the design of such an interface poses a method-
ological challenge: we do not yet understand how humans
might naturally teach an automated agent using a multi-
modal instruction interface. The ideal situation would be
to take an existing electronic student and see how humans
instruct it. However, such a student does not yet exist. Fur-
thermore, prior work (Perry 2008), in which transcripts were
collected of the interaction between a human teacher and
a human serving as the interface between the teacher and
an early version of the Mable electronic student (Mailler et
al. 2009), found that as soon as the teacher recognized that
Mable is limited in the kind of interaction it can accommo-
date, the participant tended to reduce instruction to a style
more like direct programming than natural instruction. To
avoid this, in the following studies we employed a Wizard
of Oz (WOZ) protocol in which the student is actually controlled
by a human without the teacher’s knowledge. This
allowed us to provide human teachers a high degree of free-
dom in how they choose to instruct the student while believ-
ing they are interacting with a capable student.
Study 1 - Wubble World:
Free Text Interaction
In our first set of experiments, we wanted to elicit as close
to fully natural instruction behavior as possible without con-
straining how the Teacher might teach, but also under the
condition that the Teacher believed they were interacting
with a capable automated agent. We used Wubble World
(hereafter, WW) (Hewlett et al. 2007), a three dimensional
simulated environment in which agents called wubbles can
move around and manipulate objects. We adapted WW so
that humans can control the wubbles and communicate with
one another directly using a peer-to-peer chat facility. Figure 1
shows the WW interface view.
Figure 1: Student interface for Wubble World.
In each experiment session, one human participant was
given the role of the Teacher, the other was the Student.
Neither the Teacher nor the Student knew ahead of time what
the teaching/learning task was, and the Teacher was led to
believe that they were interacting with a computer agent
rather than a human Student, while the Student was told they
were interacting with a human Teacher. The two participants
were placed in separate rooms. Both the Teacher and Stu-
dent were trained to use the WW interface and given a short
amount of practice. The interface included controls for mov-
ing the wubbles, highlighting objects, and picking up, carry-
ing and putting down objects. The Teacher and Student were
not constrained in what they could write to one another.
Once the participants were ready, the Teacher was pre-
sented with the teaching task. The teaching task required
multiple concepts to be taught, some depending on first mas-
tering others. Specifically, the WW environment has two
kinds of blocks: cube-shaped red blocks and elongated blue
blocks (see Fig. 1). The task was to have 5 smaller red boxes
placed in a row at the base of a wall, and an elongated blue
block placed on top. This required the Teacher to teach the
difference between the blocks, how they were to be posi-
tioned, as well as the concept of line of sight; in this way,
the teaching task involved concept definitions (line of sight,
distinguishing the different block types), rules and condi-
tions (the red blocks must form the base, what line of sight
means), and a procedure (how to construct the wall). The
Teacher was also asked to verify that the Student had learned
each of the concepts taught.
We collected transcripts from six participant pairs (see footnote 1). The
following enumerates a set of observations based on the tran-
scripts:
1. Modes of instruction are tightly interleaved: while step-
ping through the procedure for building the wall, telling
the Student what to do at each step, the Teacher also de-
fined new concepts (“this is a red block”) and rules sur-
rounding their use (“put red boxes first”), all the while
providing feedback (“that was good”). One reason for
the interleaving is that the teaching environment allows
for both teacher and student to engage in interaction together
within the environment. In general, Teachers did
not explicitly denote the beginning or end of a procedure,
instead relying on other contextual cues, such as asking
questions, or asking the Student to try again.
2. Teachers sometimes demonstrated the action themselves,
instructing the Student to watch. Other times they told the
student what to do, step by step, with the assumption that
the student understood this.
3. Teacher feedback ranged from “good” or “no” to more
elaborate explanations: “No, the blocks need to be on the
ground, not the top”.
4. It was common for Teachers using free text to use multi-
ple terms for the same object without explicitly noting the
different terms. E.g., “block” versus “box” versus “cube”.
5. Students asked questions of clarification, and Teachers
gave definitions and reasons for actions or for labeling
conditions. This helped both Teacher and Student establish
a shared frame of reference.
T: “the blocks should be placed close to each other”
S: “all the red blocks?”
T: “yes”
T: “The line of sight is blocked because the blocks are
between you and the tree.”
6. In some cases, Teachers provided background summaries
of what they were going to teach before providing demonstrations
or examples, e.g., “Let’s imagine that the sign
post is an observer to hide from.”
Footnote 1: Additional details and transcripts are available in (Morrison,
Fasel, and Cohen 2010). Our volunteer participants did not
have background knowledge about the state of the art in machine
learning or artificial agents, so they did not have advanced knowledge
about whether an electronic student is currently possible;
this made it possible to maintain the illusion that the Teacher was
interacting with an intelligent computer agent.
Study 2 - Charlie the Pirate:
Constrained Text Interaction
The lessons from the WW study achieved the goal of allow-
ing the Teacher to be expressive, but free text entry from
both Student and Teacher sometimes led to quite complex
linguistic constructions. Also, we anticipate that in most
cases interaction with an electronic student will either be
with a physical robot, or with a system in which the Teacher
will not be represented in a simulated world as an avatar
along with the Student. For this reason, we changed the
teaching domain and interaction protocol. In this study, we
had Teachers interact with a small physical robot, and limited
what the Student could say, in the hope of balancing
expressivity for the Teacher against the kinds of interactions
that might be handled by current artificial agents.
The robot, named Charlie, was built from a Bioloid
robotics kit (see footnote 2) and consists of a 4-wheeled, independent drive
chassis, an arm with a simple two-fingered (pincer) grip-
per, and an arm with an RFID sensor attached at the end
(see Fig. 2, right). While each arm has multiple degrees of
freedom, the arm controllers were simplified to discrete ac-
tions to deploy and retract, open and close, rotate the gripper,
and scan with the RFID sensor (this would sweep the sensor
back and forth after it was deployed).
Figure 2: Left: A sample arrangement of blocks of different
shapes and colors used in the robot experiment; numbered
blocks contain RFID tags, where the number indicates the
value of a scan. Right: A picture of Charlie the Robot.
Footnote 2: http://www.trossenrobotics.com/bioloid-robot-kits.aspx
The robot was placed on a table along with small foam
blocks of various shapes and colors (see Fig. 2, left). The
Teacher stood at one end of the table and used a terminal
to enter text to communicate directly with the Student. The
Student was located in a partitioned region of the lab, away
from the table and Teacher. The Teacher could not see the
Student, and we set up conditions so that the Teacher did not
know there was an additional person in the room; however,
the Student could view the table workspace and the Teacher
via two web cams. The Student’s workstation included a
keyboard, and a monitor displaying two windows with the
web cam video feed, a peer-to-peer chat terminal, and an-
other terminal indicating the robot’s RFID sensor status. In
this study, rather than allowing the Student to type free text,
we employed two conditions: one in which the Student was
not able to respond, and a second in which the
Student could respond by selecting from a set of canned
responses, e.g., “Hello!”, “I didn’t understand”, “Yes”, “No”,
and “I’m finished”.
The teaching task asked the Teacher to instruct Charlie to
“find treasure” amongst the blocks on the table. The rules
were that treasure could only be in one of the purple ob-
jects on the table. Charlie could see the color and shape
of the blocks. For the Teacher’s benefit, all of the purple
blocks had numbers indicating whether they had treasure or
not: 2 indicated a block with treasure. Charlie could not see
the numbers on the blocks and instead had to use the RFID
scanner to scan the blocks. Charlie’s RFID scanner would
display a “1” on the Student’s sensor if the block was la-
beled 1, or “2” for blocks with treasure, otherwise it would
display nothing. Once Charlie found the block with the trea-
sure, the Student had to direct him to turn the block upside
down in order to “bury” it. This task required that Charlie
scan all purple blocks until finding the treasure.
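The procedure the Teacher had to convey can be sketched in a few lines of Python. This is only an illustrative sketch of the task rules stated above, not the software used in the study (Charlie was driven by a human Student); the `Block` class, `scan`, and `find_treasure` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    color: str
    shape: str
    rfid: Optional[int] = None  # 1 = no treasure, 2 = treasure, None = no tag
    buried: bool = False


def scan(block: Block) -> Optional[int]:
    """Simulated RFID scan: the tag value, or None when nothing is displayed."""
    return block.rfid


def find_treasure(blocks: List[Block]) -> Tuple[Optional[Block], int]:
    """Scan purple blocks in order until one reads 2, then turn it upside down.

    Returns the treasure block (or None) and the number of scans performed.
    """
    scans = 0
    for block in blocks:
        if block.color != "purple":
            continue  # rule: treasure can only be in a purple object
        scans += 1
        if scan(block) == 2:
            block.buried = True  # "bury" the treasure by flipping the block
            return block, scans
    return None, scans


table = [
    Block("green", "cube"),
    Block("purple", "cylinder", rfid=1),
    Block("purple", "cube", rfid=2),
    Block("purple", "arch", rfid=1),
]
treasure, scans = find_treasure(table)
```

In this hypothetical arrangement the search stops after the second purple block, which mirrors the condition that Charlie scans purple blocks only until the treasure is found.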
We collected transcripts of 7 Teacher/Student pairs, 3 in
the “no Student response” condition, and 4 where the Stu-
dent could use the canned phrases. The participants were
again selected from graduate students and staff of the UA
Computer Science Department, but all participants had little
experience with robots or artificial agents. The main find-
ings from analysis of the transcripts were the following:
1. Similar to the WW findings, all of the different teaching
modes were observed and they were tightly interleaved.
2. In the condition where the Student was not able to re-
spond to the Teacher, we saw the same kind of Teacher
expressivity observed in the prior study with Mable (Perry
2008): Teachers reverted to a style that was more like
programming than instruction, simply walking Charlie
through the task and only occasionally indicating condi-
tions. In the canned response condition, however, while
the Teacher’s expressions tended to be simpler than in the
WW free-text condition, the Teacher tended to ask more
questions or conduct small tests of the Student’s state of
knowledge.
3. Under both conditions, there was a pattern in the com-
plexity of instructions provided by the Teacher: When
Charlie followed instructions with no mistakes, the com-
plexity of the instructions increased, whereas the com-
plexity decreased when Charlie failed to follow directions
correctly.
Study 3 - BLUI: Bootstrapped Learning
User Interface
From the prior two studies we have learned that indeed mul-
tiple instruction modes are natural to use, that they are in-
terleaved during instruction, and that allowing the Student
to respond to the Teacher with at least a limited set of response
types has a positive effect on Teacher expressiveness.
Given this experience, we constructed a prototype interface,
the Bootstrapped Learning User Interface (BLUI), to make
these instruction types available through a GUI.
We then conducted a WOZ experiment to see how human
users with little prior experience might use BLUI.
The UAV ISR Domain
BLUI has been designed to work in an Intelligence, Surveil-
lance and Reconnaissance (ISR) domain in which the Stu-
dent is the control system of a simulated UAV that will be
taught to carry out ISR missions. We use the X-Plane customizable
flight simulator environment (see footnote 3) to simulate a realistic
operating environment.
In BLUI, the UAV that the wizard/electronic student con-
trols is a small airplane with a flight control system that can
keep the plane flying in the direction and at the altitude spec-
ified by the teacher. (Note that the human teacher is not re-
quired to understand flight dynamics, which the Student is
handling). A set of flight waypoints can also be specified to
indicate the flight trajectory the UAV should follow.
A scenario file can be loaded into the X-Plane environment
at any point during the flight simulation. Each of these
scenarios generates a version of the world with a given set
of objects such as people, cars, trucks, boats, and buildings.
World objects have properties that can be sensed using the
UAV’s sensors. For instance, one can use the high resolution
camera to determine if a boat object has a cargo hold. The
wizard/Student knows about general geographical features
of the scenario, such as bodies of water and mountains, but
must be taught to distinguish between different world ob-
jects. Thus, world objects are used to define teaching tasks
(such as “fly to the boat but avoid the truck”).
Three UAV sensors (listed below) were made accessi-
ble to the teacher and student to provide information about
world objects. We note that these sensors are the only per-
cepts available for the electronic student to actually observe
the environment, but are generally under the control of the
teacher (unless the student has been asked by the teacher
to perform a procedure that uses these sensors). Therefore,
teaching the student how and when to use these sensors is an
important part of teaching a task.
Wide-area camera - provides a 360-degree view of the
ground around the UAV. The higher the UAV flies, the
wider the range of view. This camera can see objects
on the ground, but cannot acquire detailed information
about them. Objects will not come into view until they
are within range.
High resolution camera - provides detailed information
about objects when it is aimed at and set to track an object
within range. This includes many finer-grained properties
of objects (such as whether they have a cargo hold).
Radiation sensor - can detect the level of radiation of objects
in range. The range of this sensor is more limited and
requires the plane to fly down to the area it is scanning.
Footnote 3: Laminar Research: http://www.x-plane.com/
Figure 3: The BLUI Teaching Interface: (A) Teacher Instruction Interface; (B) Timeline; (C) Map Interface
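The qualitative behavior of the three sensors described above can be illustrated with a small Python sketch. Everything numeric here is an assumption made for illustration: the paper does not give range formulas, so the linear growth of the wide-area range with altitude, the fixed ranges, and the altitude ceiling for the radiation sensor are hypothetical, as are all class and function names.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WorldObject:
    name: str
    x: float
    y: float
    has_cargo_hold: bool
    radiation: float


@dataclass
class UAV:
    x: float
    y: float
    altitude: float


def ground_distance(uav: UAV, obj: WorldObject) -> float:
    return math.hypot(obj.x - uav.x, obj.y - uav.y)


def wide_area_view(uav: UAV, objects: List[WorldObject],
                   range_per_altitude: float = 2.0) -> List[str]:
    """Names only, no details; the visible radius grows with altitude (assumed linear)."""
    radius = range_per_altitude * uav.altitude
    return [o.name for o in objects if ground_distance(uav, o) <= radius]


def high_res_track(uav: UAV, obj: WorldObject,
                   max_range: float = 500.0) -> Optional[Dict[str, object]]:
    """Detailed properties of one tracked object, or None when out of range."""
    if ground_distance(uav, obj) > max_range:
        return None
    return {"name": obj.name, "has_cargo_hold": obj.has_cargo_hold}


def radiation_reading(uav: UAV, obj: WorldObject, max_range: float = 100.0,
                      max_altitude: float = 50.0) -> Optional[float]:
    """Short-range sensor: the plane must fly low and close (thresholds assumed)."""
    if uav.altitude > max_altitude or ground_distance(uav, obj) > max_range:
        return None
    return obj.radiation


boat = WorldObject("boat1", x=150.0, y=0.0, has_cargo_hold=True, radiation=0.7)
high = UAV(x=0.0, y=0.0, altitude=100.0)
low_and_close = UAV(x=100.0, y=0.0, altitude=40.0)
```

With these assumed values, `boat1` is visible to the wide-area camera from the high position, but a radiation reading is only returned after the UAV descends and approaches, which is why teaching when to use each sensor is part of the task.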
Teacher Interface
The teacher is provided with three tools to help teach the
electronic student (Fig. 3): (A) the Teacher Instruction Interface,
which is the main interface used to communicate
with the student; (B) a Timeline Display that shows a list of all
teacher instructions sent to the student; (C) a Map Display that
provides information about world objects, UAV sensors, and
range and shows the UAV flight path.
The Teacher Instruction Interface (Fig. 3-A) was specifically
designed to enable the three types of instruction
discussed earlier. Below we discuss the interface features
that may be classified under each teacher-student
instruction mode:
Teaching by demonstration: The teacher can group a set
of instruction commands into a procedure and demonstrate
good (positive) and bad (negative) examples of a procedure.
This can be done in one of two ways. The teacher can either
explicitly state at the beginning of a sequence of commands
that they are about to teach a procedure and later explicitly
end it, or the teacher can return to a sequence of previous
commands stored in the timeline (Fig. 3-B), highlight them,
and label them as a procedure. It is also possible to nest
procedures within one another. Note that this method for
teaching procedures has none of the formal semantics or advanced
programming interfaces of other teaching systems.
Instead, it simply allows the teacher to demonstrate
grounded commands in a natural manner and puts the onus
on the electronic student to determine a general policy for
enacting these procedures.
Teaching concepts by examples: The teacher can define
concepts (such as “cargo boat”) by pointing to and labeling
objects (that are visible in the current sensor range) on the
map (Fig. 3-C). Names can be re-used and again, the elec-
tronic student would need to build a general representation
of the concept being taught based on the sensed features of
the object (e.g. “Cargo boats are boats with cargo holds”).
Teaching by reinforcement: The teacher can give feedback
to the student, in the form of 1-3 “happy” faces or 1-3 “frowny”
faces, to indicate how satisfied he/she is with the student’s
performance. The teacher could also create new labels or use
existing ones to indicate when “goals” were achieved.
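One way to picture the timeline that underlies these instruction modes is as a list of typed records, where a run of commands can be wrapped into a (possibly nested) procedure after the fact and face feedback is mapped to a signed value. The sketch below is hypothetical: the record types, the command strings, and the mapping of faces to integers in the range -3 to 3 are assumptions for illustration, not the BLUI implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Command:
    text: str


@dataclass
class Feedback:
    faces: int   # 1 to 3 faces
    happy: bool  # True for "happy", False for "frowny"

    def value(self) -> int:
        """Signed feedback: +1..+3 for happy faces, -1..-3 for frowny faces."""
        if not 1 <= self.faces <= 3:
            raise ValueError("feedback uses 1 to 3 faces")
        return self.faces if self.happy else -self.faces


@dataclass
class Procedure:
    name: str
    positive: bool  # good (positive) or bad (negative) example
    steps: List[object] = field(default_factory=list)


def label_as_procedure(timeline: List[object], start: int, end: int,
                       name: str, positive: bool = True) -> List[object]:
    """Wrap timeline[start:end] into one Procedure, as if highlighted and labeled."""
    proc = Procedure(name, positive, timeline[start:end])
    return timeline[:start] + [proc] + timeline[end:]


def flatten(entry: object) -> List[str]:
    """Expand an entry into the grounded command strings it contains."""
    if isinstance(entry, Procedure):
        out: List[str] = []
        for step in entry.steps:
            out.extend(flatten(step))
        return out
    if isinstance(entry, Command):
        return [entry.text]
    return []  # feedback carries no command


timeline: List[object] = [
    Command("fly to boat1"),
    Command("track boat1 with high resolution camera"),
    Command("take radiation reading"),
    Feedback(faces=2, happy=True),
]
timeline = label_as_procedure(timeline, 1, 3, "inspect boat")
timeline = label_as_procedure(timeline, 0, 2, "check boat")  # nests "inspect boat"
```

Keeping the grounded commands intact inside each procedure reflects the point made above: the interface records what was demonstrated, and generalizing it into a policy is left to the electronic student.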
Teaching Task
In each of our trials, the human teacher is presented with the
following task:
Your task is to instruct the student to identify all cargo
boats in a specified body of water. There are two main
kinds of boats: cargo boats and fishing boats, and you
will need to teach the student how to tell the differ-
ence. Once a cargo boat has been identified, the student
needs to take its radiation sensor reading and generate
a report.
In order to ensure that the teacher has no doubt about the
correct answer, we provide the teacher printouts of all sce-
nario files where each boat has been labeled as cargo or fish-
ing. Additionally, he/she is informed of the property (visible
only through high-resolution camera tracking) that distin-
guishes a cargo boat from a fishing boat.
As before, the purpose of the teaching task with the
BLUI is to require that multiple concepts and procedures
are taught, some depending on first mastering others. We
also want the concepts taught to involve teaching of defini-
tions, procedures, conditions, and potentially sub-goals. In
this case, the teacher would first need to teach the student
the distinction between cargo and fishing boats using camera
tracking and assigning object labels. Then the teacher
may instruct the student to follow a set of actions/procedures
that need to be performed every time a cargo boat has been
identified. We do not share this information with the teacher
beforehand, though.
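To make the first step concrete, here is a minimal sketch of how a student might generalize from labeled objects to a concept such as "cargo boats are boats with cargo holds". It uses a simple Find-S style conjunctive learner, which is a technique chosen here for illustration and not a method described in the paper; the feature names and example boats are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Features = Dict[str, object]


def learn_concept(examples: List[Tuple[Features, bool]]) -> Optional[Features]:
    """Keep the feature values shared by every positively labeled example.

    Returns None when no positive example has been seen. Negative examples
    are ignored, as in the basic Find-S procedure.
    """
    hypothesis: Optional[Features] = None
    for features, is_positive in examples:
        if not is_positive:
            continue
        if hypothesis is None:
            hypothesis = dict(features)
        else:
            hypothesis = {k: v for k, v in hypothesis.items()
                          if features.get(k) == v}
    return hypothesis


def matches(hypothesis: Optional[Features], features: Features) -> bool:
    """True when the object has every feature value required by the hypothesis."""
    if hypothesis is None:
        return False
    return all(features.get(k) == v for k, v in hypothesis.items())


# Sensed features of objects the teacher labeled "cargo boat" (True) or not (False).
labeled = [
    ({"type": "boat", "has_cargo_hold": True, "color": "red"}, True),
    ({"type": "boat", "has_cargo_hold": True, "color": "blue"}, True),
    ({"type": "boat", "has_cargo_hold": False, "color": "red"}, False),
]
cargo_boat = learn_concept(labeled)
```

After two positive labels the incidental feature (color) drops out and only the shared properties remain, so an unlabeled boat with a cargo hold would match while a fishing boat would not.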
Empirical Results
Thus far, we have run the BLUI WOZ experiments with 12
participants. Each participant went through an introductory demo
of the teaching interface before beginning the teaching session.
On average, the participants spent 30 minutes teaching
the student the assigned task. After the teaching session,
each participant was asked whether he/she believed
they were able to successfully communicate with the electronic
student; 7 participants replied “Yes”, 3 replied “No”
and the rest replied “Mostly”. Most participants made use of
interface features to teach by demonstration, although some
exclusively tried to teach the task to the student by teaching
concepts (object labels) through examples (see Figure 4).
Two of the participants exclusively used procedure definitions
through demonstration to teach the student. While we
see that the majority of the participants taught with multiple
instruction modes, it is worth noting that the interface has
accommodated at least two distinct teaching styles. The feedback
feature was the least used of the three instruction modes. We
also analyzed each teaching session in fixed time windows to
detect whether a certain mode of instruction became more or
less popular as the session continued (see Table 1). Interestingly,
we did not find any change in preference for teaching
by demonstration or teaching by example; however, we did
notice a significant shift in the use of teaching through reinforcement,
from 7.50% of reinforcement commands in Phase 1 to 62.50% in Phase 3.
This observation indicates that reinforcement
feedback is the most useful in this task for fine tuning behavior
that has been bootstrapped with other instruction modes.
Table 1: Percentage of all commands of a certain mode in each phase.
Mode                   Phase 1   Phase 2   Phase 3
Demonstration          32.34     34.73     32.93
Concept by Example     33.90     38.81     37.29
Reinforcement           7.50     30.00     62.50
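The windowed analysis can be sketched as follows. The sketch assumes one reading of Table 1, namely that each row gives, for a single mode, the share of that mode's commands falling in each of three equal-length phases of the session; the function name, the event format, and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def mode_share_by_phase(events: List[Tuple[float, str]], session_length: float,
                        n_phases: int = 3) -> Dict[str, List[float]]:
    """For each mode, the percentage of its commands that fall in each phase.

    events: (timestamp, mode) pairs with 0 <= timestamp <= session_length.
    The session is split into n_phases equal time windows.
    """
    counts: Dict[str, List[int]] = {}
    for timestamp, mode in events:
        # A timestamp exactly at the end of the session belongs to the last phase.
        phase = min(int(timestamp / session_length * n_phases), n_phases - 1)
        counts.setdefault(mode, [0] * n_phases)[phase] += 1
    return {mode: [100.0 * c / sum(row) for c in row]
            for mode, row in counts.items()}


# Hypothetical 30-minute session (timestamps in minutes).
session = [
    (1.0, "reinforcement"), (15.0, "reinforcement"),
    (25.0, "reinforcement"), (28.0, "reinforcement"),
    (5.0, "demonstration"), (12.0, "demonstration"),
]
shares = mode_share_by_phase(session, session_length=30.0)
```

In this made-up session half of the reinforcement commands fall in the final phase, the same kind of late-session concentration that the Reinforcement row of Table 1 reports.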
[Figure 4 plot: Histogram of Teacher Participants: Usage of Different Modes of Instruction. x-axis: Teacher Participant IDs (Categorical); y-axis: Percentage of Instructions of Each Mode; legend: Teaching by Demonstration, Teaching Concepts by Example, Teaching through Reinforcement.]
Figure 4: The percentage of each mode of instruction for
each participant in the BLUI study.
Conclusion
To the best of our knowledge, our BLUI teacher interface is
the first interface to combine all three common modes of
teacher-student interaction over the agent’s lifetime: teach-
ing by demonstration, teaching concepts by example and
teaching through reinforcement. These results are prelim-
inary, and in general there is much more work to be done
to better understand human instruction requirements. How-
ever, so far it does look like the BLUI interface accommo-
dates multiple teaching modes, and we have initial evidence
that it supports at least two different teaching styles (mixed
demonstration or concepts by example only). Our next step
is to use these results to inform the design of the backend
of the instruction interface that will package teacher instructions
in an appropriate form for machine learning. This complements
the efforts already under way in multi-modal learning
systems such as Mable (Mailler et al. 2009).
Acknowledgments The work was supported in part by
DARPA contract HR0011-07-C-0060 as part of the Boot-
strapped Learning Program. We also thank SRI for support
of the X-Plane simulation environment.
References
Argall, B.; Chernova, S.; Veloso, M. M.; and Browning,
B. 2009. A survey of robot learning from demonstration.
Robotics and Autonomous Systems 57(5):469–483.
Chernova, S., and Veloso, M. 2009. Interactive policy learn-
ing through confidence-based autonomy. Journal of Artifi-
cial Intelligence Research 34(1):1–25.
Hewlett, D.; Hoversten, S.; Kerr, W.; Cohen, P. R.; and
Chang, Y.-H. 2007. Wubble world. In AIIDE-07.
Judah, K.; Roy, S.; Fern, A.; and Dietterich, T. G. 2010.
Reinforcement learning via practice and critique advice. In
AAAI, 481–486.
Knox, W. B., and Stone, P. 2010. Combining manual feedback
with subsequent MDP reward signals for reinforcement
learning. In AAMAS.
Mailler, R.; Bryce, D.; Shen, J.; and Oreilly, C. 2009.
MABLE: A modular architecture for bootstrapped learning.
In AAMAS09.
Morrison, C. T.; Fasel, I. R.; and Cohen, P. R.
2010. Fall 2009 human-instructable computing wizard
of oz studies. Technical Report TR10-05, University of
Arizona Department of Computer Science, Available at:
http://cs.arizona.edu/~clayton/TR10-05.pdf.
Natarajan, S.; Kunapuli, G.; Maclin, R.; Page, D.; O’Reilly, C.;
Walker, T.; and Shavlik, J. 2010. Learning from human teachers:
Issues and challenges for ILP in bootstrap learning. In AAMAS
Workshop on Agents Learning Interactively from Human Teachers.
Perry, D. E. 2008. Report for the human teacher study. Technical
Report TR-2317, BAE Systems.
Smart, W. D., and Kaelbling, L. P. 2002. Effective reinforcement
learning for mobile robots. In ICRA.
Stumpf, S.; Rajaram, V.; Li, L.; Wong, W.-K.; Burnett, M.; Diet-
terich, T.; Sullivan, E.; and Herlocker, J. 2009. Interacting mean-
ingfully with machine learning systems: Three experiments. Inter-
national Journal of Human-Computer Studies 67(8):639–662.
Thomaz, A. L., and Breazeal, C. 2008. Teachable robots: Un-
derstanding human teaching behavior to build more effective robot
learners. Artificial Intelligence 172(6-7):716–737.
Thomaz, A. L., and Cakmak, M. 2009. Learning about objects
with human teachers. In HRI.
Thomaz, A. L.; Hoffman, G.; and Breazeal, C. 2006. Reinforce-
ment learning with human teachers: Understanding how people
want to teach robots. In ROMAN.
Walsh, T. J.; Subramanian, K.; Littman, M. L.; and Diuk, C. 2010.
Generalizing apprenticeship learning across hypothesis classes. In
ICML.